Protecting paths in macro expansions by extending UTF-8 (nullprogram.com)
30 points by nalgeon 11 months ago | 13 comments



This article seems to assume paths will be valid UTF-8, which isn't true on Linux, certainly, nor on Windows as far as I know.

Of course we could say "paths must be valid UTF-8 for this program to work" (quite a few Rust programs do require this, as they store paths in standard Rust strings, which themselves must be valid UTF-8), but if your concern is dodgy paths breaking things, you probably need to check for that somewhere?
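For illustration, a minimal sketch of such a check in Rust: Path::to_str returns None exactly when a path has no valid Unicode representation, so it can serve as the validation point.

    use std::path::Path;

    fn main() {
        let p = Path::new("some file.txt");
        // to_str succeeds only if the path is valid Unicode; this is the
        // "check for that somewhere" step.
        match p.to_str() {
            Some(s) => println!("usable as a Rust string: {s}"),
            None => eprintln!("path is not valid UTF-8; refusing to proceed"),
        }
    }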


Yes, in Windows the actual implementation requires that paths consist of sequences of 16-bit values, and those values have some constraints, but the constraints don't include being valid UTF-16, so paths aren't necessarily valid Unicode text and may have no UTF-8 representation. In Unix they're just bytes, and the bytes may not be UTF-8.

In Rust terms, what Windows is doing is basically [u16] and what Unix does is basically [u8], and neither is necessarily meaningful human text.

Internally, Rust's OsString is probably like the hack in this blog post: all valid UTF-8 is stored as UTF-8, which means everything else must use byte values that never appear in valid UTF-8. But Rust is explicit that this representation is opaque and not guaranteed to stay the same across compiler or library versions.
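A small Unix-only sketch of why that escape hatch matters, using the standard OsStrExt extension trait: an OsStr can hold bytes that are never valid UTF-8, so conversion to &str can fail.

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    fn main() {
        // 0xFF never appears in valid UTF-8, yet it's a legal path byte on Unix.
        let raw: &[u8] = b"caf\xff.txt";
        let name = OsStr::from_bytes(raw);
        assert!(name.to_str().is_none()); // no lossless &str view exists
        println!("{}", name.to_string_lossy()); // lossy view: "caf\u{FFFD}.txt"
    }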


Afaik Rust (currently) uses WTF-8 [0] to store Windows (WTF-16) strings, which is a very useful encoding if you need to deal with such strings in programs written in other languages as well. The conversion is essentially the same as UTF-16 -> UTF-8, except you interpret unpaired surrogates as (reserved) Unicode code points of the same value and encode those to UTF-8 as you would any other code point. So this doesn't use exactly the same trick as TFA: instead of using invalid UTF-8 encodings, it uses "normal" UTF-8 encodings of invalid Unicode values (specifically, exactly those that were reserved when extending UCS-2 to UTF-16). Or in other words, WTF-16 <-> WTF-8 conversion is the same as UTF-16 <-> UTF-8 conversion but without (some of) the error handling.

[0] https://simonsapin.github.io/wtf-8/
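For illustration, a standalone sketch of that conversion (a hypothetical helper, not Rust's internal code): valid surrogate pairs combine exactly as in UTF-16, and whatever is left over, including unpaired surrogates, gets the generalized UTF-8 bit pattern.

    // Sketch of WTF-16 -> WTF-8: identical to UTF-16 -> UTF-8 except that
    // unpaired surrogates are encoded like ordinary code points, not rejected.
    fn wtf16_to_wtf8(units: &[u16]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < units.len() {
            let u = units[i] as u32;
            let is_high = (0xD800..0xDC00).contains(&u);
            let next_is_low =
                i + 1 < units.len() && (0xDC00..0xE000).contains(&(units[i + 1] as u32));
            let cp = if is_high && next_is_low {
                // A valid surrogate pair combines exactly as in UTF-16.
                let lo = units[i + 1] as u32;
                i += 2;
                0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            } else {
                // BMP code point *or* an unpaired surrogate: WTF-8 takes both.
                i += 1;
                u
            };
            // Generalized UTF-8 encoding (surrogate values allowed through).
            match cp {
                0..=0x7F => out.push(cp as u8),
                0x80..=0x7FF => out.extend([
                    0xC0 | (cp >> 6) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                0x800..=0xFFFF => out.extend([
                    0xE0 | (cp >> 12) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                _ => out.extend([
                    0xF0 | (cp >> 18) as u8,
                    0x80 | ((cp >> 12) & 0x3F) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
            }
        }
        out
    }

    fn main() {
        // A lone high surrogate becomes ED A0 80, which strict UTF-8 rejects.
        assert_eq!(wtf16_to_wtf8(&[0xD800]), vec![0xED, 0xA0, 0x80]);
        // Ordinary text round-trips exactly as UTF-8.
        assert_eq!(wtf16_to_wtf8(&[0x68, 0x69]), b"hi".to_vec());
    }

The bytes ED A0 80 produced for the lone surrogate are well-formed under the generalized pattern but rejected by strict UTF-8 validators; that is the entire difference between the two encodings.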


What Unix does is closer to [NonZeroU8], which is actually very helpful, as can be seen in shell commands like find -print0 | xargs -0, since a NUL character is guaranteed not to be part of the actual arguments. You could do the same here, but I suspect the program in question does not support NUL characters in its strings for the same reason Unix doesn't (because that's what C does).
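In Rust terms, that guarantee is the same one CString enforces: arbitrary bytes, but no interior NUL, which is exactly what makes NUL safe as a delimiter. A tiny sketch:

    use std::ffi::CString;

    fn main() {
        // CString accepts any bytes except an interior NUL, mirroring the
        // Unix rule that makes `find -print0 | xargs -0` unambiguous.
        assert!(CString::new("ok path").is_ok());
        assert!(CString::new(&b"bad\0path"[..]).is_err());
    }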


A good observation; I'd forgotten NonZeroU8 is a standard library type. I suspect (but haven't verified) that Windows likewise requires [NonZeroU16].


Even Python assumes Unicode, unless one passes “bytes” strings for filenames.

I was recently astounded when a small Python script I whipped up to hash and compare binary file content died with a Unicode-related exception, triggered by the filename itself!

(Walking a directory using “bytes” paths fixed it)


> In Unix they're just bytes and the bytes may not be UTF-8.

Depends on the Unix. I believe macOS enforces Unicode, or at least does some form of Unicode normalization.


This sounds like a perfect recipe for disaster. You have essentially made a separate character encoding that looks like, but in fact is unlike, UTF-8, so the two have to be kept strictly separate from each other. In most cases, of course, they will inevitably get mixed together.


This looks like a hack that will inevitably come back to bite you sometime in the future, for example if one of the involved programs starts validating UTF-8, or your system locale changes, or something similar.


Or a future update starts to use those reserved byte values. Some of the current encoding-space restrictions exist only because Unicode didn't reserve enough surrogates for UTF-16 to extend indefinitely. (UTF-8 can extend, in theory; UTF-32 has options, including some of those codepoints still reserved in UTF-8 but currently unused; UTF-16 is accidentally stuck, for now.) Sure, it is unlikely that we'll see another Unicode plane extension in our lifetimes, but many of the people who bet on UCS-2 when it looked like it covered everything, and who are consequently now stuck with the somewhat broken UTF-16, thought the same thing.


Seems like you might as well use Private Use Area characters[0] and keep things valid UTF-8.

(Yes, you will have problems with paths that contain PUA characters. But people have pointed out that paths aren't necessarily valid UTF-8, so you can't inline-encode your way out of this anyway. PUA characters are likely vanishingly rare compared to spaces, so you still mostly solve the problem.)

[0] https://en.wikipedia.org/wiki/Private_Use_Areas
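A minimal sketch of the PUA idea, arbitrarily picking U+E000 as the stand-in (any PUA code point would do): the protected string stays valid UTF-8 and round-trips, as long as the original path never contains U+E000 itself.

    // Map spaces to a Private Use Area code point before word-splitting,
    // then map them back. U+E000 is an arbitrary PUA choice.
    fn protect(path: &str) -> String {
        path.replace(' ', "\u{E000}")
    }

    fn unprotect(path: &str) -> String {
        path.replace('\u{E000}', " ")
    }

    fn main() {
        let original = "My Documents/report.txt";
        let protected = protect(original);
        assert!(!protected.contains(' ')); // now survives word splitting
        assert_eq!(unprotect(&protected), original);
    }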


If you are going to turn spaces into other things, Unicode already offers plenty of fun tools: non-breaking spaces, halfwidth spaces, the medium mathematical space. You could even go for a weird, rare ASCII-compatible character like "form feed".

https://en.wikipedia.org/wiki/Whitespace_character

Seems more fun to use something that exists, is rare, and is already weirdly space-like. (Though yes, you have to find a way to escape it if someone is crazy enough to do something like name a file with a "form feed" in the middle.)


If you insist on going that way, there's a perfectly cromulent "File Separator" ASCII control character. While it's still possible for file names on Linux to contain it, it's easy to detect and sanitize, or better, reject any such input.
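A sketch of that detect-and-reject approach, with FS = 0x1C as the delimiter (the helper name and error handling here are illustrative):

    // Join path byte strings with ASCII File Separator (0x1C), refusing any
    // input that already contains it.
    const FS: u8 = 0x1C;

    fn join_paths(paths: &[&[u8]]) -> Result<Vec<u8>, &'static str> {
        for p in paths {
            if p.contains(&FS) {
                return Err("path contains File Separator; rejecting");
            }
        }
        Ok(paths.join(&FS))
    }

    fn main() {
        assert!(join_paths(&[b"a b".as_slice(), b"c d".as_slice()]).is_ok());
        assert!(join_paths(&[b"bad\x1cname".as_slice()]).is_err());
    }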



