So they don't mean there are specific things that are incorrect in MLton - there may be no examples anyone can give you. They mean we don't know, mathematically, how correct MLton is, because nobody has done that work, whereas we know to a much greater extent that CakeML is correct, because a large part of it has been mathematically proven correct.
steinuil gave the right answer. I'll add that those methods create precise, mathematical descriptions of what the compiler is supposed to do and of what the executable actually does, plus a proof of equivalence between the two, all checked in a prover whose implementation is itself much smaller than most compilers. So, it should have far fewer errors than compilers checked with ad hoc methods.
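To make that concrete, here's the whole shape of it in miniature: a spec for what expressions mean, a compiler to a little stack machine, a spec for what the machine does, and a machine-checked proof that compiling then running always agrees with the spec. This is just a textbook-style toy I'm sketching in Lean 4 to show the shape - it has nothing to do with CakeML's actual development (which is done in HOL4), and every name in it is made up:

```lean
-- Toy illustration only: a source language, its intended meaning (the spec),
-- a compiler to a stack machine, the machine's semantics, and a proof that
-- they agree. Real verified compilers do this for far richer languages.

inductive Expr where
  | const : Nat → Expr
  | plus  : Expr → Expr → Expr

-- Specification: what an expression is supposed to evaluate to.
def eval : Expr → Nat
  | .const n  => n
  | .plus a b => eval a + eval b

inductive Instr where
  | push : Nat → Instr
  | add  : Instr

-- The compiler whose correctness we care about.
def compile : Expr → List Instr
  | .const n  => [.push n]
  | .plus a b => compile a ++ compile b ++ [.add]

-- Semantics of the stack machine the compiled code runs on.
def run : List Instr → List Nat → List Nat
  | [],              s           => s
  | .push n :: rest, s           => run rest (n :: s)
  | .add :: rest,    a :: b :: s => run rest ((b + a) :: s)
  | .add :: _,       [a]         => [a]   -- stuck: too few operands
  | .add :: _,       []          => []    -- stuck: too few operands

-- The equivalence proof: running compiled code pushes exactly the value
-- the spec says, for any starting stack and any code that follows.
theorem compile_correct (e : Expr) (rest : List Instr) (s : List Nat) :
    run (compile e ++ rest) s = run rest (eval e :: s) := by
  induction e generalizing rest s with
  | const n => simp [compile, eval, run]
  | plus a b iha ihb =>
      simp [compile, eval, run, List.append_assoc, iha, ihb]
```

As I understand it, CakeML's real theorem has the same structure, just scaled up enormously: the "specs" are the operational semantics of the source language and of the target machine code, and the proof covers every pass in between.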
Do note I'm not being adversarial when I say I don't have to give examples since it uses a known-good technique. The status quo for compilers is that they have plenty of bugs. The number of bugs goes up, as with most software, with complexity (e.g. lines of code or paths). Highly-optimizing compilers should be expected to have more bugs than barely-optimizing ones. The evidence from applying formal methods in industry showed those programs had far fewer bugs: Cleanroom and Praxis's Correct-by-Construction process with SPARK Ada regularly hit some of the lowest defect rates in existence.
So, we have one trend saying complicated software will have lots of errors and another saying verified software will have fewer errors. The hypothesis we should form at that point is that a verified compiler will have either few errors in general (optimism) or far fewer than non-formally-verified compilers (pessimism). The empirical evidence confirms this in the bug trackers of most compilers, but especially in the Csmith testing that pitted CompCert against other C compilers:
It was the first time (IIRC) a program was custom-designed to throw computer-years' worth of input at compilers to find problems (I've sketched the basic loop at the bottom of this comment). It found hundreds of bugs in common compilers, despite those compilers being analyzed and used so heavily that many bugs had already been found. When they applied it to CompCert, they got this result:
"The striking thing about our CompCert results is that the middleend bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users."
It did as expected: the verified parts had no errors in them. CakeML uses an even stronger form of verification, if I understand their paper correctly, and it targets an easier-to-analyze language, since ML was straight-up designed for building theorem provers. With decades of evidence behind the approach, I'm confident in saying that formal methods applied by experts to a compiler for a language designed for easy verification will leave that compiler with fewer errors than its unverified competition. It will also have feature or performance limitations from their picking simplicity over complexity. Still useful in the many cases where ML would already be used, though, which aren't usually performance-focused.
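And for anyone who hasn't seen how Csmith-style testing works, the loop is roughly the one below. This is a from-memory sketch in Python, not the project's actual harness - the compiler list, flags, and the csmith runtime include path are guesses you'd have to adjust - but it shows why any disagreement between compilers on a well-defined program is automatically evidence of a bug in one of them:

```python
#!/usr/bin/env python3
# Rough sketch of a Csmith-style differential-testing loop. Invocations and
# paths are assumptions for illustration, not the real Csmith harness.
import subprocess

# Compilers/flag combinations under test; any pair that disagrees exposes a bug.
COMPILERS = [
    ("gcc-O0",   ["gcc",   "-O0"]),
    ("gcc-O2",   ["gcc",   "-O2"]),
    ("clang-O2", ["clang", "-O2"]),
]
CSMITH_INCLUDE = "/usr/local/include/csmith"   # assumption: adjust for your install

def run(cmd, timeout):
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)

def one_iteration(i):
    # 1. Generate a random, well-defined C program (it prints a checksum).
    src = f"test_{i}.c"
    with open(src, "w") as f:
        f.write(run(["csmith"], timeout=60).stdout)

    # 2. Build and run it with every compiler under test.
    checksums = {}
    for name, cc in COMPILERS:
        exe = f"./test_{i}_{name}"
        build = run(cc + ["-I", CSMITH_INCLUDE, src, "-o", exe], timeout=300)
        if build.returncode != 0:
            return f"{name} crashed on or rejected {src}"   # compile-time bug
        try:
            checksums[name] = run([exe], timeout=10).stdout
        except subprocess.TimeoutExpired:
            return None   # slow or looping case; skip it, not a bug by itself

    # 3. The program has exactly one defined meaning, so any disagreement
    #    means at least one compiler produced wrong code.
    if len(set(checksums.values())) > 1:
        return f"MISCOMPILE on {src}: {checksums}"
    return None

if __name__ == "__main__":
    for i in range(1000):
        report = one_iteration(i)
        if report:
            print(report)
```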