Most of your comments are addressed by the fact that CSS2 was designed originally for documents, while the other widget systems never were and simply don't support that type of layout. Nowadays, with flexbox, CSS fully supports natural UI layouts too.
If we adopted your suggestion of using a legacy toolkit like Win32 as the basis of a new HTML/CSS and throwing out the old, it would be much worse than HTML and CSS as they exist today. Win32, Cocoa, Swing, Qt, wxWidgets, and GTK are much worse than CSS for rendering, as far as both performance and layout are concerned.
You didn't address the z-index issue either, as none of these toolkits have things like opacity, which is what causes stacking contexts. To the extent that they do, it's backed by Core Animation as in Cocoa, and that in fact does have the notion of stacking contexts, just as CSS does. The Web isn't significantly more complex in this area. (CSS 2.1 Appendix E is too complex, to be sure, but not in the area you describe.)
Yeah, I think a lot of the problems are because HTML was designed for documents. But it obviously wasn't designed well. I'm not sure how to do it better, but I know that I like using real UI toolkits, and I hate fighting with HTML.
I don't know what benchmarks say about Win32, Cocoa, or Qt rendering, but I'd be surprised if their performance were "much worse" than CSS. Layout isn't great with Win32 or Cocoa, but I love Qt's Layouts (at least once you figure out how to change the default margins), and Swing's seem okay from what little I've used.
Win32 has had opacity since Windows 2000, and Qt has had it since about then. Cocoa has had NSView.alphaValue since OS X 10.5.
They do have much worse performance when compared to an optimized rasterizer like WebRender (disclaimer: I work on WebRender). That's because they use legacy immediate mode APIs like GDI for painting, which are a poor match for modern GPUs. By contrast, CSS is fully declarative, which allows for much better batching, global culling, etc. Native toolkits cannot fix this without breaking backwards compatibility: their programmatic CPU drawing is fundamental to how they work. The best they can be is GPU-assisted rather than GPU-accelerated.
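To make the batching and culling point concrete, here is a minimal sketch of what a renderer can do when it sees the whole scene as data before drawing anything. It is not WebRender's actual algorithm; `Item`, `visible`, `build_batches`, and the "material" strings are hypothetical names made up for illustration, and the batching rule (merge only consecutive items with the same material, so paint order is preserved) is a deliberately conservative assumption.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Item:
    """One entry in a hypothetical declarative display list."""
    x: int
    y: int
    w: int
    h: int
    material: str  # stand-in for shader/texture state, e.g. "solid" or "image"


def visible(item, viewport_w, viewport_h):
    """True if the item's rectangle intersects the viewport at all."""
    return (item.x < viewport_w and item.y < viewport_h
            and item.x + item.w > 0 and item.y + item.h > 0)


def build_batches(display_list, viewport_w, viewport_h):
    """Cull off-screen items, then merge consecutive same-material items.

    Only consecutive items are merged, so back-to-front paint order is kept.
    """
    kept = [item for item in display_list if visible(item, viewport_w, viewport_h)]
    return [(material, list(group))
            for material, group in groupby(kept, key=lambda item: item.material)]


scene = [
    Item(0, 0, 100, 100, "solid"),
    Item(10, 10, 100, 100, "solid"),
    Item(5000, 0, 100, 100, "image"),  # entirely off-screen
    Item(20, 20, 100, 100, "solid"),
    Item(0, 50, 200, 20, "text"),
]
batches = build_batches(scene, 800, 600)
for material, items in batches:
    print(material, len(items))
```

Because the off-screen image is culled before batching, the three solid rectangles end up in a single batch. An immediate mode API that receives one draw call at a time never gets the chance to make that kind of whole-scene decision.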
HTML was designed well for documents overall. It's remarkable how it's "obvious" that HTML is terrible, but every single proposed document-based replacement for it has been significantly worse. Take LaTeX, for example (sometimes brought up as a system "designed right"): you basically can't do floats at all in the system. All packages that simulate them are hacks. If the Web used a system designed like that, the volume of complaints would be deafening at this point.
I'm the first to admit that CSS has a lot of problems. But they're not the problems people always cite. The core model of downward width dependencies and upward height dependencies, with floats as a first-class citizen positioned during line breaking, is sound. Where CSS went wrong is in all the complexity like margin-collapse, border-collapse, mismatched border styles, CSS 2.1 Appendix E painting order, etc. These are the biggest flaws in the design, but nobody cites them as problems: in fact, frequently authors want more and more features that don't make sense and would make CSS worse.
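As a rough illustration of "widths flow down, heights flow up", here is a toy block layout pass. It is a sketch under heavy assumptions, not how any browser implements it: floats, margins, and real line breaking are left out entirely, text is modeled as fixed-size characters, and `Box`, `layout`, `CHAR_W`, and `LINE_H` are hypothetical names.

```python
import math
from dataclasses import dataclass, field

CHAR_W = 10  # assumed fixed character width, in px
LINE_H = 20  # assumed fixed line height, in px


@dataclass
class Box:
    """A toy block box: optional text, optional children, uniform padding."""
    padding: int = 0
    text_chars: int = 0
    children: list = field(default_factory=list)
    width: int = 0
    height: int = 0


def layout(box, available_width):
    # Width is decided by the parent and passed DOWN the tree.
    box.width = available_width
    inner = available_width - 2 * box.padding

    content_height = 0
    if box.text_chars:
        chars_per_line = max(1, inner // CHAR_W)
        content_height += math.ceil(box.text_chars / chars_per_line) * LINE_H

    for child in box.children:
        layout(child, inner)            # child width depends on the parent
        content_height += child.height  # parent height depends on children

    # Height is only known once the content is laid out, and flows UP.
    box.height = content_height + 2 * box.padding
    return box


root = layout(Box(padding=10, children=[Box(text_chars=30), Box(text_chars=5)]), 220)
print(root.width, root.height)
```

The point of the sketch is that a single top-down traversal is enough: no box ever needs to know its children's widths to size itself horizontally, and no child needs its parent's height.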
I think several core HTML5 and CSS features actually come from Cocoa (via WebKit/Safari), notably Canvas, transforms and CSS animation.
The whole stacking context business basically seems to describe WebKit's rendering model (and maybe other browsers too, but did they have hardware acceleration before Safari?). I think the W3C spec was derived from the implementation, to a greater extent than the impl being driven by the spec.
Stacking contexts aren't about hardware acceleration. They're needed for things like opacity to work at all. (Think about the alpha blending functions when you have multiple objects stacked on top of one another.)
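To spell out the alpha blending argument, here is a small sketch with a single color channel and the standard source-over formula. It assumes one pixel where two opaque children overlap inside a parent with opacity 0.5; the function names are hypothetical. Opacity in CSS applies to the element as a group, so the children have to be composited together first, which is exactly what forces a stacking context.

```python
def over(src, alpha, dst):
    """Source-over blend of one color channel: src at coverage alpha onto dst."""
    return src * alpha + dst * (1.0 - alpha)


def grouped_opacity(backdrop, children, opacity):
    """Paint the opaque children into an isolated group, then blend the group once."""
    group = children[0]
    for child in children[1:]:
        group = over(child, 1.0, group)
    return over(group, opacity, backdrop)


def per_child_opacity(backdrop, children, opacity):
    """Naive alternative: blend each child onto the page at the parent's opacity."""
    result = backdrop
    for child in children:
        result = over(child, opacity, result)
    return result


backdrop = 1.0         # white page
children = [0.0, 0.5]  # opaque black box, then an opaque gray box on top of it
grouped = grouped_opacity(backdrop, children, 0.5)
naive = per_child_opacity(backdrop, children, 0.5)
print(grouped, naive)
```

The two results differ (0.75 versus 0.5): with the naive approach the black box shows through the gray one, even though the gray box is fully opaque inside its parent. Getting the grouped result requires treating the subtree as one atomic unit in z-order, with or without a GPU.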