If you wanted to prevent undefined behavior at compile time by having the compiler report an error, you'd actually need to prevent potentially undefined behavior, for the simple reason that the compiler doesn't know what your runtime inputs will be. You could come up with a subset of C, but it would be a restrictive and probably useless subset. C strings go out the window, for instance, because you don't know at compile time whether a given char * is indeed a NUL-terminated string. You also wouldn't be able to store that pointer anywhere, because it could be freed as soon as you return from the function - and that's assuming we're disallowing threads and signals in our C subset, or else you just couldn't use the pointer at all. (This, by the way, argues that it's a flaw in the language definition, not in the implementation of anything.)
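To make that concrete, here's a minimal sketch (the function name is made up for illustration). Both calls type-check identically, so a checker that only sees the source at compile time has nothing to distinguish them by:

    #include <stdio.h>

    /* The signature tells the compiler nothing about whether p points at a
       NUL-terminated string, how big the buffer is, or how long it lives. */
    static void log_message(const char *p) {
        printf("%s\n", p);   /* reads forward until it finds a NUL byte */
    }

    int main(void) {
        char text[] = "hello";                 /* NUL-terminated */
        char raw[4] = {'a', 'b', 'c', 'd'};    /* not NUL-terminated */
        log_message(text);   /* well-defined */
        log_message(raw);    /* same type, same shape of call, but printf reads
                                past the end of raw: undefined behaviour that
                                the types alone give the compiler no way to
                                reject */
        return 0;
    }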
There are a huge number of languages where you pass strings along with their length. There are also plenty of languages where you know (either via compile-time analysis and an insistence on writing code in an analyzable way, as in Rust, or via mild runtime overhead, as in Swift or Haskell or Java) that the memory is still valid when you reference it. But those languages are not a subset of C. They either need more information about a program than a C program contains, or they disallow operations that the C abstract machine can perform.
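As a hedged sketch of what "pass the length along" looks like if you bolt it onto C yourself (the struct and names here are hypothetical, not taken from any of those languages):

    #include <stddef.h>
    #include <stdio.h>

    /* A fat-pointer string: the length travels with the data, so every access
       can be bounds-checked. Note this is a convention layered on top of C,
       not a subset of C - nothing stops other code from handing you a bare
       char* anyway. */
    struct str_slice {
        const char *data;
        size_t len;
    };

    static void print_slice(struct str_slice s) {
        for (size_t i = 0; i < s.len; i++)   /* loop is bounded by construction */
            putchar(s.data[i]);
        putchar('\n');
    }

    int main(void) {
        struct str_slice s = { "hello", 5 };
        print_slice(s);
        return 0;
    }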
You're right that you can't define a safe subset of C without making it impractical. MISRA C defines a C subset intended to help avoid C's footguns, but it still isn't actually a safe language. There are alternative approaches though:
1. Compile a safe language to C (whether a new language or an existing one)
2. Formal analysis of C, or of some practical subset of C, to prove the absence of undefined behaviour
Work has been done on both approaches.
ZZ compiles to C. [0] Dafny can compile to C++, but it seems that's not its primary target. [1][2]
There are several projects on formal analysis of C. [3][4][5][6]
Is it better to write ZZ or Dafny code than to write Java, Python, Haskell, Rust, or Swift code?
(They look cool and I'm enjoying reading about them, but for the things I do in my day job where C is one of the obvious options, I'm not sure they're better choices - usually one of the languages above is.)
ZZ's whole goal is to transpile to C, which (as their website mentions) has some practical advantages. Consider exotic hardware platforms where there's a vendor-provided proprietary C compiler and no other choice.
Dafny has a slightly different emphasis, as it's about formal verification, so it's competing more with Ada SPARK than with mainstream languages.
I can't comment on the general ergonomics of ZZ and Dafny as I've not tried them out, but presumably they're going to be well behind a language like Java with its huge ecosystem of IDEs etc. Neither seems particularly mature, either, so I wouldn't bet the farm on them.
> Is it better to write ZZ or Dafny code than to write Java, Python, Haskell, Rust, or Swift code?
If you've got the choice and you're just trying to do 'normal day job programming' I imagine that yes you'd be much better off just using Java. If you want a language that's quite like C or C++ but much better regarding safety, there are Rust, Zig, and the too-often-overlooked Ada language.
I think you’re conflating undefined behavior with implementation-defined behavior.
How integers act at the extremes of their values, for example, is implementation-defined — when you’re using the integer types (rather than using IEEE754 floats for everything to get well-defined formal semantics), you’re implicitly telling the compiler that you don’t care what happens at the boundaries, and that you want whichever behavior the target uarch’s ISA entails.
Think of it like exposing an opaque native FFI type in a managed-runtime language. The semantics of that type aren’t “up to” the runtime; all the runtime does is ensure that only a safe set of interaction primitives on the type are exposed into the managed runtime. (“Safe” in the sense of not corrupting the runtime’s own abstract machine semantics; not in the sense of making any guarantees about the operations leaving the value in a well-defined state per the underlying machine.) In C, integers — and the operations on them — are exactly such “opaque foreign types.” As are pointers, for that matter. (And this is also why casting between integers and pointers is UB — since the internal representations of each aren’t part of the semantic model of the C abstract machine, it cannot make any guarantee about their interconvertibility.)
C compilers shouldn’t abort on implementation-defined behavior. Having that behavior be implementation-defined is the well-defined semantics that the C abstract machine applies to such things.
It’s only in cases where there are no clear semantics, even per implementation, that the C compiler should swerve and say “I don’t know what that should mean, so I give up.”
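To illustrate the distinction with a couple of examples of my own (not taken from this thread), here are operations the standard explicitly hands to the implementation rather than leaving undefined:

    #include <stdio.h>

    int main(void) {
        /* Right-shifting a negative value is implementation-defined: most
           targets do an arithmetic shift and give -4 here, but the standard
           doesn't require it. */
        int a = -8 >> 1;

        /* Converting an out-of-range value to a signed type is likewise
           implementation-defined (or raises an implementation-defined
           signal); on a typical 32-bit int it wraps, but that's the
           platform's choice, not the abstract machine's. */
        int b = (int)3000000000u;

        printf("%d %d\n", a, b);
        return 0;
    }

A compiler can accept both of these and simply document what they do, which is different from something like an out-of-bounds read, where there is no per-implementation answer to document.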
I agree that integer overflow is implementation-defined, but I wasn't talking about integer overflow, I was talking about char *.
Reading from a pointer past the bounds of the object being pointed to is undefined, not implementation-defined. How do you propose to write a size_t strlen(const char *s) that successfully compiles with your proposed compiler?
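For reference, a textbook sketch of what that function has to do (not any particular libc's code):

    #include <stddef.h>

    /* Reads s[len] until it hits a NUL byte. Whether those reads stay inside
       the object s actually points to depends entirely on data supplied at
       run time; nothing in this source lets a compile-time checker prove the
       loop never walks past the end of the buffer. */
    size_t my_strlen(const char *s) {
        size_t len = 0;
        while (s[len] != '\0')
            len++;
        return len;
    }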