MLton Compiler Overview (mlton.org)
114 points by undecidabot on April 2, 2018 | 30 comments



This is a very nice wiki. I am implementing an ML dialect myself, and studying the implementation of MLKit, MosML, and MLton has been very enlightening. I work with the maintainers of the former two, but my compiler has ended up being more like MLton in its design.


So am I! What parser are you using? I'm using a recursive descent (i.e., hand-written) parser, and it has some trouble parsing ML completely.


Just out of interest (I worked on a table-driven parser for SML a couple of decades ago) - what bits of ML syntax are giving you trouble? I remember one bit that (theoretically) needed unbounded lookahead, and user-defined infix operators were fun.


There's a difficult problem with precedence and unary operators. For example:

    set_field !list_end 1 cons
cannot be parsed. Anyway, since then I've rewritten the lexer to work like the one in the sibling comment, and I'm still getting that working.
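
For context, here's a small Standard ML sketch of why that line is a trap (the declarations are hypothetical stand-ins, not the commenter's actual code). In SML, ! is not special syntax: it is an ordinary identifier bound to a function of type 'a ref -> 'a, and function application is left-associative.

    (* Hypothetical stand-ins so the snippet is self-contained. *)
    val list_end : int list ref = ref []
    fun set_field r i f = ()           (* placeholder: ignores its arguments *)
    fun cons x xs = x :: xs

    (* !list_end lexes as two tokens (! and list_end), so
         set_field !list_end 1 cons
       is read as ((((set_field !) list_end) 1) cons), i.e. set_field applied
       to the ! function itself, which is not what was meant.  The intended
       dereference needs explicit parentheses: *)
    val _ = set_field (!list_end) 1 cons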


I am using Happy, which is a yacc-alike for Haskell. I found it mostly straightforward (however, I use the OCaml/Haskell/F# style of lexically distinguishing infix operators from variable names).


Oh wow, I had Professor Fluet in college, and he taught me functional programming concepts in Haskell and Lisp. Cool to see his work on the front page of HN.

In addition to being a smart guy he was a really good professor. I really enjoyed the class (survey of many different paradigms), even if I did come away jaded and thinking that all the standard languages I have used in my career since are poor approximations of what they could be!


+1. Professor Fluet was one of my favorite professors; it was his first year at RIT, and he did an awesome job.


With work-in-progress support for RISC-V: https://github.com/agoode/mlton/commits/riscv-wip


Now if we could only get better threading support.


The highest-performing compiler for MLs is MLton. The highest-correctness compiler is CakeML. Anyone looking for an interesting project in compilers and verification should consider porting some MLton techniques to CakeML.

https://cakeml.org


In what ways is it more correct?


It uses formal methods to verify that the generated machine code has the same behavior as the source program.


That's not at all helpful - I meant specific examples...


They mean 'correct' in the technical PL sense.

So they don't mean there are specific things that are incorrect in MLton - there may not be any examples anyone can give you. They mean that we don't know mathematically how correct MLton is, because nobody has done that work, whereas we do know, to a much better extent and mathematically, that CakeML is correct, because it has been proven correct to a certain degree.


steinuil gave the right answer. I'll add that those methods create precise, mathematical descriptions of what the compiler is supposed to do and what the executable does, plus a proof of equivalence, all in a prover whose implementation is itself much smaller than most compilers. So, it should have way fewer errors than compilers verified with ad hoc methods.


Ok, but can you give any specific examples of something it got right that another didn't?


Do note I'm not being adversarial when I say I don't have to give examples, since it uses a known-good technique. The status quo for compilers is that they have plenty of bugs. The number of bugs goes up, like in most software, with complexity (e.g. lines of code or paths). Highly-optimizing compilers should be expected to have more bugs than barely-optimizing ones. Evidence on the application of formal methods in industry showed those programs had way fewer bugs. Cleanroom and Praxis's Correct by Construction with SPARK Ada regularly hit some of the lowest defect rates in existence.

http://infohost.nmt.edu/~al/cseet-paper.html

http://www.anthonyhall.org/c_by_c_secure_system.pdf

So, we have one trend saying complicated software will have lots of errors and another saying verified software will have fewer errors. The hypothesis we should have at that point is that a verified compiler will have either few errors in general (optimism) or far fewer than non-formally-verified compilers (pessimism). The empirical evidence confirms this, with bug trackers on most compilers but especially the Csmith testing pitting CompCert against other C compilers:

http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf

https://embed.cs.utah.edu/csmith/

It was the first time (IIRC) a program was custom-designed to throw computer-years' worth of input at compilers to find problems. It found hundreds of bugs in common compilers, despite their being analyzed and used so much that many bugs had already been found. When they applied it to CompCert, they got this result:

"The striking thing about our CompCert results is that the middleend bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users."

It did as expected: the verified parts had no errors in their code. CakeML uses an even stronger form of verification, if I understand their paper correctly. It's also for an easier-to-analyze language, since ML was straight-up designed for theorem provers. With decades of evidence, I'm confident in saying the use of formal methods by experts on a compiler for a language designed for easy verification will make that compiler have fewer errors than the unverified competition. It will also have feature or performance limitations from their picking simplicity over complexity. Still useful in many cases where ML would already be used, though, which aren't usually performance-focused.


Thanks!


I don't know compilers, but this seems like a lot more steps than I would have expected. Is it actually an unusually large number of transformations, or is that just how compilers are done?


Compilers come in all shapes and sizes.

Some are single pass like Turbo Pascal. They directly generate machine code while parsing (no AST!). Niklaus Wirth (Pascal/Modula/Oberon guy) wrote a book following this approach [1].
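
To make the single-pass idea concrete, here is a minimal sketch in Standard ML (illustrative only, not Wirth's or Turbo Pascal's actual code generation): a recursive-descent parser over single-digit expressions with + and * that emits stack-machine instructions while it parses, never building an AST.

    datatype instr = PUSH of int | ADD | MUL

    fun compile (s : string) : instr list =
      let
        val toks = ref (String.explode s)
        fun peek () = case !toks of c :: _ => SOME c | [] => NONE
        fun advance () = toks := tl (!toks)

        (* factor ::= digit  -- emit PUSH as soon as the digit is seen *)
        fun factor acc =
          case peek () of
            SOME c =>
              if Char.isDigit c
              then (advance (); acc @ [PUSH (Char.ord c - Char.ord #"0")])
              else raise Fail "expected a digit"
          | NONE => raise Fail "unexpected end of input"

        (* term ::= factor { "*" factor }  -- emit MUL after each operand *)
        fun term acc =
          let fun loop acc =
                case peek () of
                  SOME #"*" => (advance (); loop (factor acc @ [MUL]))
                | _ => acc
          in loop (factor acc) end

        (* expr ::= term { "+" term } *)
        fun expr acc =
          let fun loop acc =
                case peek () of
                  SOME #"+" => (advance (); loop (term acc @ [ADD]))
                | _ => acc
          in loop (term acc) end
      in
        expr []
      end

    (* compile "1+2*3" = [PUSH 1, PUSH 2, PUSH 3, MUL, ADD] *)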

Some have many passes, like Chez Scheme, which is built from lots of small, simple passes - the "nanopass" approach. Andrew Keep has a great talk on it [2].

In practice, most compilers today are multi-pass, though probably not as many as Chez. If we look at Rust, they go from AST -> HIR -> MIR -> LLVM IR -> machine code [3]. There are probably more things going on from LLVM to machine code, but I'm not knowledgeable enough to comment on it.

I think the trade-off here is clear: fewer passes -> shorter compile times; more passes -> faster generated code and a more modular compiler. Martin Odersky (Scala guy) has a paper attempting to get the best of both [4].

[1] http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

[2] https://www.youtube.com/watch?v=Os7FE3J-U5Q

[3] https://blog.rust-lang.org/2016/04/19/MIR.html

[4] https://infoscience.epfl.ch/record/228518/files/paper.pdf


The unmentioned history here is memory use. In ye olden days, machines had much less memory and required more passes, simply because there was no space for all the data. So one might see a pass for preprocessing, lexing, AST generation, optimizations, code generation, etc. Note that some of this exists today, e.g. gcc integrates with gas and has flags to dump the assembly.


Yep, implementing a compiler on systems with at most 512 KB of memory did not leave much room for clever optimizations.

Using compilers on 8-bit systems was even worse (64 KB max).

Many game studios used UNIX/VMS systems with cross compilers, uploading the builds to ZX and C64 computers as their development cycle.


Thank you for these links! The Scala paper especially makes it a lot easier to understand the tradeoffs in compiler design.


Andrew Keep's talk is very, very nice. Watch it, people.


I see lots of comments about how the multiple passes have to do with limited resources like memory. In MLton's case (it's a whole-program optimizing compiler), each translation phase has nothing to do with resource limits. And in general, limited resources are typically handled by separate compilation, not by more compiler passes.

The more important design choices here have to do with 1) typed assembly languages (rather than untyped assembly languages), 2) type-directed translation (rather than syntax-directed translation), and 3) the ability to reason about the correctness of each step of the translation as the program is gradually lowered from SML to assembly.

For more on the typed assembly language work, check out the work of Morrisett and Harper that came out of CMU (specifically the TIL / TAL parts of the ConCert project). The XML type system was developed a little bit before that by Harper and Lillibridge. The general idea of a language being a mathematical object with an elaboration and a semantics, where you can reason about progress and preservation at each step, is a result of the vision of Robin Milner (there's a good overview here http://homepages.inf.ed.ac.uk/stg/Milner2012/R_Harper-html4-...)
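
As a rough illustration of points 1 and 3 (a toy sketch, not MLton's actual intermediate languages): if the IR carries type annotations, a small checker can be re-run after every pass, so a pass that mangles the program is caught immediately rather than showing up later as bad machine code.

    datatype ty = INT | ARROW of ty * ty

    datatype exp =
        Num of int
      | Var of string * ty              (* variables carry their type *)
      | Lam of string * ty * exp        (* fn (x : ty) => body *)
      | App of exp * exp

    (* A toy checker: it trusts the annotations on variables rather than
       tracking an environment, which is enough to illustrate the idea. *)
    fun typeOf (Num _) = SOME INT
      | typeOf (Var (_, t)) = SOME t
      | typeOf (Lam (_, argTy, body)) =
          (case typeOf body of
             SOME resTy => SOME (ARROW (argTy, resTy))
           | NONE => NONE)
      | typeOf (App (f, arg)) =
          (case (typeOf f, typeOf arg) of
             (SOME (ARROW (dom, res)), SOME t) =>
               if dom = t then SOME res else NONE
           | _ => NONE)

    (* After any pass p : exp -> exp, one can assert
         typeOf (p e) = typeOf e
       as a cheap sanity check that the pass preserved typing. *)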


It depends on the language: how far away is it from machine code? (among other things)

C is very close to machine code, so a simple non-optimizing C compiler needs only a few passes. I'm pretty sure that after parsing, the original C compilers were one pass. We talked about them a bit here [1].

ML is further from the machine model, so compiling it takes more work. For example, it has closures, while C doesn't, and that's something extra that you have to deal with, probably in multiple stages of the compiler.

A common theme in functional languages like ML and Lisp is "desugaring passes" or "lowering passes", which basically means turning a language construct into a more basic one, so that you can treat things more uniformly at later stages. The more language features you have, the more potential for desugaring/lowering. OCaml and Haskell in particular have a ton of features that can be treated like this.
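
A tiny example of what such a lowering pass can look like (illustrative only, not MLton's actual IR): andalso, orelse, and if can all be rewritten into a single case-on-bool construct, so every later pass only has to understand one kind of conditional.

    datatype exp =
        Var of string
      | Bool of bool
      | If of exp * exp * exp
      | Andalso of exp * exp
      | Orelse of exp * exp
      | CaseBool of exp * exp * exp     (* scrutinee, true arm, false arm *)

    fun lower (Andalso (a, b)) = lower (If (a, b, Bool false))
      | lower (Orelse (a, b))  = lower (If (a, Bool true, b))
      | lower (If (c, t, e))   = CaseBool (lower c, lower t, lower e)
      | lower (CaseBool (c, t, e)) = CaseBool (lower c, lower t, lower e)
      | lower e = e

    (* lower (Andalso (Var "x", Var "y"))
         = CaseBool (Var "x", Var "y", Bool false) *)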

Slide 14 here has a (partial) diagram of the Scala compiler, which is extremely deep: https://www.slideshare.net/Odersky/compilers-are-databases

[1] https://news.ycombinator.com/item?id=16610938


There are many different approaches. Chez Scheme does a gazillion steps using the nanopass framework, but is still fast enough to compile a metric shit-tonne of code without any noticeable delay.


Interesting -- I suppose it's analogous to CPU pipelining, where having a larger number of simple stages can outperform having a small number of complicated ones.


Optimising compilers and compilers for high-level languages tend to have many passes. MLton is both.


I have seen some work regarding a real-time version of MLton. I would love to see more of this work done, because nothing would please me more than being able to do embedded systems with an ML language.



