Protecting paths in macro expansions by extending UTF-8 (nullprogram.com)
30 points by nalgeon 11 months ago | 13 comments



This article seems to assume paths will be valid UTF-8, which isn't true on Linux, certainly, nor on Windows as far as I know.

Of course we could say "paths must be valid UTF-8 for this program to work" (quite a few Rust programs do require this, as they store paths in standard Rust strings, which themselves must be valid UTF-8), but if your concern is dodgy paths breaking things, you probably need to check for that somewhere?
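For illustration, a minimal sketch of such a check in Rust: Path::to_str returns None exactly when a path has no valid Unicode representation, so it can serve as the validation point.

    use std::path::Path;

    fn main() {
        let p = Path::new("some file.txt");
        // to_str succeeds only if the path is valid Unicode; this is the
        // "check for that somewhere" step.
        match p.to_str() {
            Some(s) => println!("usable as a Rust string: {s}"),
            None => eprintln!("path is not valid UTF-8; refusing to proceed"),
        }
    }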


Yes, in Windows the actual implementation requires that paths consist of sequences of 16-bit values, and those values have some constraints, but the constraints don't include being valid UTF-16, so paths aren't necessarily valid Unicode text and may have no UTF-8 representation. In Unix they're just bytes, and the bytes may not be UTF-8.

In Rust terms, what Windows is doing is basically [u16] and what Unix does is basically [u8], and neither is necessarily meaningful human text.

Internally, Rust's OsString is probably like the hack in this blog post: all valid UTF-8 is stored as UTF-8, which means everything else must use byte values that never appear in valid UTF-8. But Rust is explicit that this representation is opaque and not guaranteed to stay the same across compiler or library versions.
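A small Unix-only sketch of why that escape hatch matters, using the standard OsStrExt extension trait: an OsStr can hold bytes that are never valid UTF-8, so conversion to &str can fail.

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    fn main() {
        // 0xFF never appears in valid UTF-8, yet it's a legal path byte on Unix.
        let raw: &[u8] = b"caf\xff.txt";
        let name = OsStr::from_bytes(raw);
        assert!(name.to_str().is_none()); // no lossless &str view exists
        println!("{}", name.to_string_lossy()); // lossy view: "caf\u{FFFD}.txt"
    }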


Afaik Rust (currently) uses WTF-8 [0] to store Windows (WTF-16) strings, which is a very useful encoding if you need to deal with such strings in programs written in other languages as well. The conversion is essentially the same as UTF-16 -> UTF-8, except you interpret unpaired surrogates as (reserved) Unicode code points of the same value and encode those to UTF-8 as you would any other code point. So this doesn't use exactly the same trick as TFA: instead of using invalid UTF-8 encodings, it uses "normal" UTF-8 encodings of invalid Unicode values (specifically, exactly those that were reserved when extending UCS-2 to UTF-16). Or in other words, WTF-16 <-> WTF-8 conversion is the same as UTF-16 <-> UTF-8 conversion but without (some of) the error handling.

[0] https://simonsapin.github.io/wtf-8/
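For illustration, a standalone sketch of that conversion (a hypothetical helper, not Rust's internal code): valid surrogate pairs combine exactly as in UTF-16, and whatever is left over, including unpaired surrogates, gets the generalized UTF-8 bit pattern.

    // Sketch of WTF-16 -> WTF-8: identical to UTF-16 -> UTF-8 except that
    // unpaired surrogates are encoded like ordinary code points, not rejected.
    fn wtf16_to_wtf8(units: &[u16]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < units.len() {
            let u = units[i] as u32;
            let is_high = (0xD800..0xDC00).contains(&u);
            let next_is_low =
                i + 1 < units.len() && (0xDC00..0xE000).contains(&(units[i + 1] as u32));
            let cp = if is_high && next_is_low {
                // A valid surrogate pair combines exactly as in UTF-16.
                let lo = units[i + 1] as u32;
                i += 2;
                0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            } else {
                // BMP code point *or* an unpaired surrogate: WTF-8 takes both.
                i += 1;
                u
            };
            // Generalized UTF-8 encoding (surrogate values allowed through).
            match cp {
                0..=0x7F => out.push(cp as u8),
                0x80..=0x7FF => out.extend([
                    0xC0 | (cp >> 6) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                0x800..=0xFFFF => out.extend([
                    0xE0 | (cp >> 12) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                _ => out.extend([
                    0xF0 | (cp >> 18) as u8,
                    0x80 | ((cp >> 12) & 0x3F) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
            }
        }
        out
    }

    fn main() {
        // A lone high surrogate becomes ED A0 80, which strict UTF-8 rejects.
        assert_eq!(wtf16_to_wtf8(&[0xD800]), vec![0xED, 0xA0, 0x80]);
        // Ordinary text round-trips exactly as UTF-8.
        assert_eq!(wtf16_to_wtf8(&[0x68, 0x69]), b"hi".to_vec());
    }

The bytes ED A0 80 produced for the lone surrogate are well-formed under the generalized pattern but rejected by strict UTF-8 validators; that is the entire difference between the two encodings.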


What Unix does is closer to [NonZeroU8], which is actually very helpful, as can be seen in shell commands like find -print0 | xargs -0, since a NUL character is guaranteed not to be part of the actual arguments. You could do the same here, but I suspect the program in question does not support NUL characters in its strings for the same reason Unix doesn't (because that's what C does).
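In Rust terms, that guarantee is the same one CString enforces: arbitrary bytes, but no interior NUL, which is exactly what makes NUL safe as a delimiter. A tiny sketch:

    use std::ffi::CString;

    fn main() {
        // CString accepts any bytes except an interior NUL, mirroring the
        // Unix rule that makes `find -print0 | xargs -0` unambiguous.
        assert!(CString::new("ok path").is_ok());
        assert!(CString::new(&b"bad\0path"[..]).is_err());
    }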


A good observation; I'd forgotten NonZeroU8 is a standard library type. I suspect (but haven't verified) that Windows likewise requires [NonZeroU16].


Even Python assumes Unicode, unless one passes “bytes” strings for filenames.

I was recently astounded when a small Python script I whipped up to hash and compare binary file content died with a Unicode-related exception, triggered by the filename itself!

(Walking a directory using “bytes” paths fixed it)


> In Unix they're just bytes and the bytes may not be UTF-8.

Depends on the Unix. I believe macOS enforces Unicode, or at least does some form of Unicode normalization.


This sounds like a perfect recipe for disaster. You have essentially made a separate character encoding that looks like, but in fact is unlike, UTF-8, so the two have to be kept strictly separate from each other. In most cases, of course, they will inevitably get mixed together.


This looks like a hack that will inevitably come back to bite you sometime in the future, for example if one of the involved programs starts validating UTF-8, or your system locale changes, or something similar.


Or a future update starts to use those reserved byte values. Some of the current encoding-space restrictions exist only because Unicode didn't reserve enough surrogates for UTF-16 to extend indefinitely. (UTF-8 can extend, in theory; UTF-32 has options, including some of those codepoints still reserved in UTF-8 but currently unused; UTF-16 is accidentally stuck, for now.) Sure, it is unlikely that we'll see another Unicode plane extension in our lifetimes, but many of the people who bet on UCS-2 when it looked like it covered everything, and who are consequently now stuck with the somewhat broken UTF-16, thought the same thing.


Seems like you might as well use Private Use Area characters[0] and keep things valid UTF-8.

(Yes, you will have problems with paths that contain PUA characters. But people have pointed out that paths aren't necessarily valid UTF-8, so you can't inline-encode your way out of this anyway. PUA characters are likely vanishingly rare compared to spaces, so you still mostly solve the problem.)

[0] https://en.wikipedia.org/wiki/Private_Use_Areas
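A minimal sketch of the PUA idea, arbitrarily picking U+E000 as the stand-in (any PUA code point would do): the protected string stays valid UTF-8 and round-trips, as long as the original path never contains U+E000 itself.

    // Map spaces to a Private Use Area code point before word-splitting,
    // then map them back. U+E000 is an arbitrary PUA choice.
    fn protect(path: &str) -> String {
        path.replace(' ', "\u{E000}")
    }

    fn unprotect(path: &str) -> String {
        path.replace('\u{E000}', " ")
    }

    fn main() {
        let original = "My Documents/report.txt";
        let protected = protect(original);
        assert!(!protected.contains(' ')); // now survives word splitting
        assert_eq!(unprotect(&protected), original);
    }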


If you are going to turn spaces into other things, Unicode already offers plenty of fun tools: non-breaking spaces, halfwidth spaces, the medium mathematical space. You could even go for a weird, rare ASCII-compatible character like "form feed".

https://en.wikipedia.org/wiki/Whitespace_character

Seems more fun to use something that exists, is rare, and is already weirdly space-like. (Though yes, you have to find a way to escape it if someone is crazy enough to do something like name a file with a "form feed" in the middle.)


If you insist on going that way, there's a perfectly cromulent "File Separator" ASCII control character. While it's still possible for file names on Linux to contain it, it's easy to detect and sanitize, or better, reject any such input.
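A sketch of that detect-and-reject approach, with FS = 0x1C as the delimiter (the helper name and error handling here are illustrative):

    // Join path byte strings with ASCII File Separator (0x1C), refusing any
    // input that already contains it.
    const FS: u8 = 0x1C;

    fn join_paths(paths: &[&[u8]]) -> Result<Vec<u8>, &'static str> {
        for p in paths {
            if p.contains(&FS) {
                return Err("path contains File Separator; rejecting");
            }
        }
        Ok(paths.join(&FS))
    }

    fn main() {
        assert!(join_paths(&[b"a b".as_slice(), b"c d".as_slice()]).is_ok());
        assert!(join_paths(&[b"bad\x1cname".as_slice()]).is_err());
    }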



