MLton Compiler Overview (mlton.org)
114 points by undecidabot on April 2, 2018 | 30 comments



This is a very nice wiki. I am implementing an ML dialect myself, and studying the implementation of MLKit, MosML, and MLton has been very enlightening. I work with the maintainers of the former two, but my compiler has ended up being more like MLton in its design.


So am I! What parser are you using? I'm using a recursive descent (i.e., hand-written) parser, and it has some trouble parsing ML completely.


Just out of interest (I worked on a table-driven parser for SML a couple of decades ago) - what bits of ML syntax are giving you trouble? I remember one bit that (theoretically) needed unbounded lookahead, and user-defined infix operators were fun.


There's a difficult problem with precedence and unary operators. For example:

    set_field !list_end 1 cons
cannot be parsed. Anyway, since then I've rewritten the lexer to work like the one in the sibling comment, and I'm still getting that working.
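
For context, here's a small Standard ML sketch of why that line is a trap (the declarations are hypothetical stand-ins, not the commenter's actual code). In SML, ! is not special syntax: it is an ordinary identifier bound to a function of type 'a ref -> 'a, and function application is left-associative.

    (* Hypothetical stand-ins so the snippet is self-contained. *)
    val list_end : int list ref = ref []
    fun set_field r i f = ()           (* placeholder: ignores its arguments *)
    fun cons x xs = x :: xs

    (* !list_end lexes as two tokens (! and list_end), so
         set_field !list_end 1 cons
       is read as ((((set_field !) list_end) 1) cons), i.e. set_field applied
       to the ! function itself, which is not what was meant.  The intended
       dereference needs explicit parentheses: *)
    val _ = set_field (!list_end) 1 cons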


I am using Happy, which is a yacc-alike for Haskell. I found it mostly straightforward (however, I use the OCaml/Haskell/F# style of lexically distinguishing infix operators from variable names).


Oh wow, I had Professor Fluet in college, and he taught me functional programming concepts in Haskell and Lisp. Cool to see his work on the front page of HN.

In addition to being a smart guy he was a really good professor. I really enjoyed the class (survey of many different paradigms), even if I did come away jaded and thinking that all the standard languages I have used in my career since are poor approximations of what they could be!


+1. Professor Fluet was one of my favorite professors; it was his first year at RIT, and he did an awesome job.


With work-in-progress support for RISC-V: https://github.com/agoode/mlton/commits/riscv-wip


Now if we could only get better threading support.


The highest-performing compiler for MLs is MLton. The highest-correctness compiler is CakeML. Anyone looking for an interesting project in compilers and verification should consider porting some MLton techniques to CakeML.

https://cakeml.org


In what ways is it more correct?


It uses formal methods to verify that the generated machine code has the same behavior as the source program.


That's not at all helpful - I meant specific examples...


They mean 'correct' in the technical PL sense.

So they don't mean there are specific things that are incorrect in MLton - there may not be any examples anyone can give you. They mean that we don't know mathematically how correct MLton is, because nobody has done that work, whereas we do know, to a much better extent and mathematically, that CakeML is correct, because it has been proven correct to a certain degree.


steinuil gave the right answer. I'll add that those methods create precise, mathematical descriptions of what the compiler is supposed to do and what the executable does, plus a proof of equivalence, all in a prover whose implementation is itself much smaller than most compilers. So, it should have way fewer errors than compilers verified with ad hoc methods.


Ok, but can you give any specific examples of something it got right that another didn't?


Do note I'm not being adversarial when I say I don't have to give examples, since it uses a known-good technique. The status quo for compilers is that they have plenty of bugs. The number of bugs goes up, like in most software, with complexity (e.g. lines of code or paths). Highly-optimizing compilers should be expected to have more bugs than barely-optimizing ones. Evidence on the application of formal methods in industry showed those programs had way fewer bugs. Cleanroom and Praxis's Correct by Construction with SPARK Ada regularly hit some of the lowest defect rates in existence.

http://infohost.nmt.edu/~al/cseet-paper.html

http://www.anthonyhall.org/c_by_c_secure_system.pdf

So, we have one trend saying complicated software will have lots of errors and another saying verified software will have fewer errors. The hypothesis we should have at that point is that a verified compiler will have either few errors in general (optimism) or far fewer than non-formally-verified compilers (pessimism). The empirical evidence confirms this, with bug trackers on most compilers but especially the Csmith testing pitting CompCert against other C compilers:

http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf

https://embed.cs.utah.edu/csmith/

It was the first time (IIRC) a program was custom-designed to throw computer-years' worth of input at compilers to find problems. It found hundreds of bugs in common compilers, despite their being analyzed and used so much that many bugs had already been found. When they applied it to CompCert, they got this result:

"The striking thing about our CompCert results is that the middleend bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users."

It did as expected: the verified parts had no errors in their code. CakeML uses an even stronger form of verification, if I understand their paper correctly. It's also for an easier-to-analyze language, since ML was straight-up designed for theorem provers. With decades of evidence, I'm confident in saying the use of formal methods by experts on a compiler for a language designed for easy verification will make that compiler have fewer errors than the unverified competition. It will also have feature or performance limitations from their picking simplicity over complexity. Still useful in many cases where ML would already be used, though, which aren't usually performance-focused.


Thanks!


I don't know compilers, but this seems like a lot more steps than I would have expected. Is it actually an unusually large number of transformations, or is that just how compilers are done?


Compilers come in all shapes and sizes.

Some are single pass like Turbo Pascal. They directly generate machine code while parsing (no AST!). Niklaus Wirth (Pascal/Modula/Oberon guy) wrote a book following this approach [1].
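
To make the single-pass idea concrete, here is a minimal sketch in Standard ML (illustrative only, not Wirth's or Turbo Pascal's actual code generation): a recursive-descent parser over single-digit expressions with + and * that emits stack-machine instructions while it parses, never building an AST.

    datatype instr = PUSH of int | ADD | MUL

    fun compile (s : string) : instr list =
      let
        val toks = ref (String.explode s)
        fun peek () = case !toks of c :: _ => SOME c | [] => NONE
        fun advance () = toks := tl (!toks)

        (* factor ::= digit  -- emit PUSH as soon as the digit is seen *)
        fun factor acc =
          case peek () of
            SOME c =>
              if Char.isDigit c
              then (advance (); acc @ [PUSH (Char.ord c - Char.ord #"0")])
              else raise Fail "expected a digit"
          | NONE => raise Fail "unexpected end of input"

        (* term ::= factor { "*" factor }  -- emit MUL after each operand *)
        fun term acc =
          let fun loop acc =
                case peek () of
                  SOME #"*" => (advance (); loop (factor acc @ [MUL]))
                | _ => acc
          in loop (factor acc) end

        (* expr ::= term { "+" term } *)
        fun expr acc =
          let fun loop acc =
                case peek () of
                  SOME #"+" => (advance (); loop (term acc @ [ADD]))
                | _ => acc
          in loop (term acc) end
      in
        expr []
      end

    (* compile "1+2*3" = [PUSH 1, PUSH 2, PUSH 3, MUL, ADD] *)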

Some have many passes, like Chez Scheme, which is built from lots of small, simple passes - the "nanopass" approach. Andrew Keep has a great talk on it [2].

In practice, most compilers today are multi-pass, though probably not as many as Chez. If we look at Rust, they go from AST -> HIR -> MIR -> LLVM IR -> machine code [3]. There are probably more things going on from LLVM to machine code, but I'm not knowledgeable enough to comment on it.

I think the trade-off here is clear: fewer passes -> shorter compile times; more passes -> faster generated code and a more modular compiler. Martin Odersky (Scala guy) has a paper attempting to get the best of both [4].

[1] http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

[2] https://www.youtube.com/watch?v=Os7FE3J-U5Q

[3] https://blog.rust-lang.org/2016/04/19/MIR.html

[4] https://infoscience.epfl.ch/record/228518/files/paper.pdf


The unmentioned history here is memory use. In ye olden days, machines had much less memory and required more passes, simply because there was no space for all the data. So one might see a pass for preprocessing, lexing, AST generation, optimizations, code generation, etc. Note that some of this exists today, e.g. gcc integrates with gas and has flags to dump the assembly.


Yep, implementing a compiler on systems with at most 512 KB of memory did not leave much room for clever optimizations.

Using compilers on 8-bit systems was even worse (64 KB max).

Many game studios used UNIX/VMS systems with cross compilers, uploading the builds to ZX and C64 computers as their development cycle.


Thank you for these links! The Scala paper especially makes it a lot easier to understand the tradeoffs in compiler design.


Andrew Keep's talk is very, very nice. Watch it, people.


I see lots of comments about how the multiple passes have to do with limited resources like memory. In MLton's case (it's a whole-program optimizing compiler), each translation phase has nothing to do with resource limits. And in general, limited resources are typically handled by separate compilation, not by more compiler passes.

The more important design choices here have to do with 1) typed assembly languages (rather than untyped assembly languages), 2) type-directed translation (rather than syntax-directed translation), and 3) the ability to reason about the correctness of each step of the translation as the program is gradually lowered from SML to assembly.

For more on the typed assembly language work, check out the work of Morrisett and Harper that came out of CMU (specifically the TIL / TAL parts of the ConCert project). The XML type system was developed a little bit before that by Harper and Lillibridge. The general idea of a language being a mathematical object with an elaboration and a semantics, where you can reason about progress and preservation at each step, is a result of the vision of Robin Milner (there's a good overview here http://homepages.inf.ed.ac.uk/stg/Milner2012/R_Harper-html4-...)
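
As a rough illustration of points 1 and 3 (a toy sketch, not MLton's actual intermediate languages): if the IR carries type annotations, a small checker can be re-run after every pass, so a pass that mangles the program is caught immediately rather than showing up later as bad machine code.

    datatype ty = INT | ARROW of ty * ty

    datatype exp =
        Num of int
      | Var of string * ty              (* variables carry their type *)
      | Lam of string * ty * exp        (* fn (x : ty) => body *)
      | App of exp * exp

    (* A toy checker: it trusts the annotations on variables rather than
       tracking an environment, which is enough to illustrate the idea. *)
    fun typeOf (Num _) = SOME INT
      | typeOf (Var (_, t)) = SOME t
      | typeOf (Lam (_, argTy, body)) =
          (case typeOf body of
             SOME resTy => SOME (ARROW (argTy, resTy))
           | NONE => NONE)
      | typeOf (App (f, arg)) =
          (case (typeOf f, typeOf arg) of
             (SOME (ARROW (dom, res)), SOME t) =>
               if dom = t then SOME res else NONE
           | _ => NONE)

    (* After any pass p : exp -> exp, one can assert
         typeOf (p e) = typeOf e
       as a cheap sanity check that the pass preserved typing. *)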


It depends on the language: how far away is it from machine code? (among other things)

C is very close to machine code, so a simple non-optimizing C compiler needs only a few passes. I'm pretty sure that after parsing, the original C compilers were one pass. We talked about them a bit here [1].

ML is further from the machine model, so compiling it takes more work. For example, it has closures, while C doesn't, and that's something extra that you have to deal with, probably in multiple stages of the compiler.

A common theme in functional languages like ML and Lisp is "desugaring passes" or "lowering passes", which basically means turning a language construct into a more basic one, so that you can treat things more uniformly at later stages. The more language features you have, the more potential for desugaring/lowering. OCaml and Haskell in particular have a ton of features that can be treated like this.
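
A tiny example of what such a lowering pass can look like (illustrative only, not MLton's actual IR): andalso, orelse, and if can all be rewritten into a single case-on-bool construct, so every later pass only has to understand one kind of conditional.

    datatype exp =
        Var of string
      | Bool of bool
      | If of exp * exp * exp
      | Andalso of exp * exp
      | Orelse of exp * exp
      | CaseBool of exp * exp * exp     (* scrutinee, true arm, false arm *)

    fun lower (Andalso (a, b)) = lower (If (a, b, Bool false))
      | lower (Orelse (a, b))  = lower (If (a, Bool true, b))
      | lower (If (c, t, e))   = CaseBool (lower c, lower t, lower e)
      | lower (CaseBool (c, t, e)) = CaseBool (lower c, lower t, lower e)
      | lower e = e

    (* lower (Andalso (Var "x", Var "y"))
         = CaseBool (Var "x", Var "y", Bool false) *)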

Slide 14 here has a (partial) diagram of the Scala compiler, which is extremely deep: https://www.slideshare.net/Odersky/compilers-are-databases

[1] https://news.ycombinator.com/item?id=16610938


There are many different approaches. Chez Scheme does a gazillion steps using the nanopass framework, but is still fast enough to compile a metric shit-tonne of code without any noticeable delay.


Interesting -- I suppose it's analogous to CPU pipelining, where having a larger number of simple stages can outperform having a small number of complicated ones.


Optimising compilers and compilers for high-level languages tend to have many passes. MLton is both.


I have seen some work regarding a real-time version of MLton. I would love to see more of this work done, because nothing would please me more than being able to do embedded systems with an ML language.



