I don't know compilers, but this seems like a lot more steps than I would have expected. Is it actually an unusually large number of transformations, or is that just how compilers are done?
Some are single-pass, like Turbo Pascal: they generate machine code directly while parsing (no AST!). Niklaus Wirth (the Pascal/Modula/Oberon guy) wrote a book following this approach [1].
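To make that concrete, here is a minimal sketch of the single-pass idea in OCaml (my own toy example, not Wirth's actual code): a recursive-descent parser for expressions like "1+2*3" that emits stack-machine instructions the moment it recognizes something, without ever building an AST.

    (* Toy single-pass compiler: parse and emit in one go, no AST. *)
    let compile src =
      let pos = ref 0 in
      let peek () = if !pos < String.length src then Some src.[!pos] else None in
      let advance () = incr pos in
      (* factor := digit *)
      let rec factor () =
        match peek () with
        | Some c when c >= '0' && c <= '9' ->
            advance ();
            Printf.printf "PUSH %c\n" c        (* code is emitted immediately *)
        | _ -> failwith "expected digit"
      (* term := factor { '*' factor } *)
      and term () =
        factor ();
        while peek () = Some '*' do
          advance (); factor (); print_endline "MUL"
        done
      (* expr := term { '+' term } *)
      and expr () =
        term ();
        while peek () = Some '+' do
          advance (); term (); print_endline "ADD"
        done
      in
      expr ()

    let () = compile "1+2*3"
    (* prints: PUSH 1, PUSH 2, PUSH 3, MUL, ADD *)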
Some have many passes, like Chez Scheme: lots of small, simple passes, each of which they call a "nanopass". Andrew Keep has a great talk on this approach [2].
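For flavor, here is what the nanopass style looks like as a toy OCaml sketch (the real framework is a Scheme macro library, so this is only an approximation): each pass does exactly one small transformation over an explicit IR, and the compiler is just the composition of many such passes.

    type expr =
      | Int of int
      | Neg of expr                   (* surface-level negation *)
      | Add of expr * expr
      | Mul of expr * expr

    (* Pass 1: eliminate Neg by rewriting -e as (-1) * e. *)
    let rec remove_neg = function
      | Int n -> Int n
      | Neg e -> Mul (Int (-1), remove_neg e)
      | Add (a, b) -> Add (remove_neg a, remove_neg b)
      | Mul (a, b) -> Mul (remove_neg a, remove_neg b)

    (* Pass 2: fold constant subexpressions. *)
    let rec const_fold = function
      | Int n -> Int n
      | Neg e -> Neg (const_fold e)   (* unreachable after pass 1; kept for totality *)
      | Add (a, b) ->
          (match const_fold a, const_fold b with
           | Int x, Int y -> Int (x + y)
           | a', b' -> Add (a', b'))
      | Mul (a, b) ->
          (match const_fold a, const_fold b with
           | Int x, Int y -> Int (x * y)
           | a', b' -> Mul (a', b'))

    (* The "compiler" is just a chain of tiny passes. *)
    let compile e = e |> remove_neg |> const_fold

    let () =
      match compile (Add (Neg (Int 2), Mul (Int 3, Int 4))) with
      | Int n -> Printf.printf "%d\n" n    (* prints 10 *)
      | _ -> print_endline "not a constant"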
In practice, most compilers today are multi-pass, though probably not with as many passes as Chez. Rust, for example, goes from AST -> HIR -> MIR -> LLVM IR -> machine code [3]. There is probably more going on between LLVM IR and machine code, but I'm not knowledgeable enough to comment on it.
I think the trade-off here is clear: fewer passes -> shorter compile times; more passes -> faster generated code and a more modular compiler. Martin Odersky (the Scala guy) has a paper attempting to get the best of both [4].
The unmentioned history here is memory use. In ye olden days, machines had much less memory, which forced more passes simply because there was no room to hold all the data at once. So one might see separate passes for preprocessing, lexing, AST generation, optimization, code generation, etc. Some of this survives today, e.g. gcc still hands off to a separate assembler (gas) and has flags to dump the generated assembly.
I see lots of comments about how the multiple passes have to do with limited resources like memory. In MLton's case (it's a whole-program optimizing compiler), each translation phase has nothing to do with resource contention. And in general, limited resources are typically handled by separate compilation, not by compiler passes.
The more important design choices here have to do with:
1) Typed assembly languages (rather than untyped assembly languages)
2) Type directed translation (rather than syntax directed translation)
3) The ability to reason about the correctness of each step of the translation as the program is gradually lowered from SML to assembly (a toy sketch of this idea follows below).
For more on the typed assembly language work, check out the work of Morrisett and Harper that came out of CMU (specifically the TIL / TAL parts of the ConCert project). The XML type system was developed a little bit before that by Harper and Lillibridge. The general idea of a language being a mathematical object, with an elaboration and a semantics, where you can reason about progress and preservation at each step, is a result of the vision of Robin Milner (there's a good overview here: http://homepages.inf.ed.ac.uk/stg/Milner2012/R_Harper-html4-...)
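To make points 1 and 3 a bit more concrete, here is a toy OCaml sketch of what a typed intermediate language buys you (illustrative only, nothing like MLton's or TIL's actual IRs): every pass produces a term you can re-typecheck, so a buggy pass shows up as a type error in its own output instead of a mysterious miscompilation several passes later.

    type ty = TInt | TArrow of ty * ty

    type expr =
      | Var of string
      | Int of int
      | Lam of string * ty * expr       (* binders carry their types *)
      | App of expr * expr

    (* A checker for the IR, runnable after every pass. *)
    let rec typecheck env e : ty =
      match e with
      | Var x -> List.assoc x env
      | Int _ -> TInt
      | Lam (x, t, body) -> TArrow (t, typecheck ((x, t) :: env) body)
      | App (f, a) ->
          (match typecheck env f with
           | TArrow (t1, t2) when t1 = typecheck env a -> t2
           | _ -> failwith "a pass produced an ill-typed term")

    (* Wrap each pass so its output is re-checked while debugging. *)
    let check_pass name pass env e =
      let e' = pass e in
      ignore (typecheck env e');
      Printf.printf "pass %s preserved typing\n" name;
      e'

    let () =
      let identity_pass e = e in
      ignore (check_pass "identity" identity_pass [] (Lam ("x", TInt, Var "x")))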
It depends on the language: how far away is it from machine code? (Among other things.)
C is very close to machine code, so it takes few passes to write a simple non-optimizing C compiler. I'm pretty sure that, after parsing, the original C compilers were one pass. We talked about them a bit here [1].
ML is further from the machine model, so compiling it takes more work. For example, it has closures, while C doesn't, and that's something extra that you have to deal with, probably in multiple stages of the compiler.
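For example, here is closure conversion worked by hand in OCaml (just the general idea, not any particular compiler's representation): a nested function that captures a free variable k becomes closed code plus an explicit environment record, which is a shape a C-like backend can deal with.

    (* Source-level: the inner function captures the free variable k. *)
    let add k = fun x -> x + k

    (* After closure conversion: code takes its environment explicitly. *)
    type 'a closure = { code : 'a -> int -> int; env : 'a }

    let add_code env x = x + env                 (* closed: no free variables *)
    let add_converted k = { code = add_code; env = k }

    let apply clo x = clo.code clo.env x         (* call sites pass the environment *)

    let () =
      Printf.printf "%d %d\n" (add 3 4) (apply (add_converted 3) 4)   (* prints 7 7 *)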
A common theme in functional languages like ML and Lisp is "desugaring passes" or "lowering passes", which basically means turning a language construct into a more basic one, so that you can treat things more uniformly at later stages. The more language features you have, the more potential for desugaring/lowering. OCaml and Haskell in particular have a ton of features that can be treated like this.
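A made-up example of such a lowering pass in OCaml: the surface language has a let form, the core IR doesn't, and the desugaring pass rewrites let x = e1 in e2 into the application (fun x -> e2) e1, so later passes only ever see variables, lambdas, and applications.

    (* Surface syntax, with sugar. *)
    type surface =
      | SVar of string
      | SLam of string * surface
      | SApp of surface * surface
      | SLet of string * surface * surface    (* the sugar *)

    (* Core IR: no let. *)
    type core =
      | CVar of string
      | CLam of string * core
      | CApp of core * core

    let rec desugar = function
      | SVar x -> CVar x
      | SLam (x, b) -> CLam (x, desugar b)
      | SApp (f, a) -> CApp (desugar f, desugar a)
      | SLet (x, e1, e2) -> CApp (CLam (x, desugar e2), desugar e1)

    (* desugar (SLet ("x", SLam ("y", SVar "y"), SApp (SVar "x", SVar "x")))
       = CApp (CLam ("x", CApp (CVar "x", CVar "x")), CLam ("y", CVar "y")) *)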
There are many different approaches. Chez Scheme does a gazillion steps using the nanopass framework, but is still fast enough to compile a metric shit-tonne of code without any noticeable delay.
Interesting -- I suppose it's analogous to CPU pipelining, where having a larger number of simple stages can outperform having a small number of complicated ones.