The C++ code here is exactly the kind of thing that game developers complain about when they say C++ encourages inefficiency. Calling Haskell fast because it merely draws even with code like that doesn't really make sense in this context. Yes, it's as fast as C++ code that doesn't care in the slightest about efficiency, but people use C++ for the times when they do.
E: As some quick examples, peeking an istream in a loop is extremely inefficient[1], yet is done pervasively. The Derivation type has a bunch of separately allocated strings, and a lot of types like std::map<string, DerivationOutput> are pointer-filled. To check whether a string starts with "r:", they allocate a _new string_ and compare it, rather than just doing a normal string comparison.

[1] https://godbolt.org/g/27o9od
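A minimal sketch of that last pattern and the allocation-free alternative (my own toy functions, not the actual Nix source):

```cpp
#include <string>

// Allocating: substr() constructs a brand-new std::string just to
// compare it and throw it away.
bool hasPrefixSlow(const std::string &s) {
    return s.substr(0, 2) == "r:";
}

// Allocation-free: compare() inspects the prefix in place.
bool hasPrefixFast(const std::string &s) {
    return s.compare(0, 2, "r:") == 0;
}
```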
Nope; people use C++ for all the times ... when what they know best is C++. Which is all the time for a lot of C++ developers, I suspect.
Then, for a whole bunch of other times not covered by that case, they choose C++ because of tooling and surrounding code into which that code must integrate. If you need a parser, and it has to integrate into a C++ program, you probably choose C++ for that reason, and then write the parser in whatever way you are able while trying to keep it understandable and maintainable, which may not be the most efficient approach. You do that even if you can write the parser in six other languages.
The idea that C++ is always selected for speed and used accordingly strikes me as fallacious. It's not even the way it should be; choosing the ideal language (and the ideal way of using that language) for every little subtask in a large project is going to be counterproductive, unless the language count can be kept to no more than two or three.
I think this is more a matter of me expressing myself poorly than a real disagreement; we seem to be saying the same thing.
What I meant is more that the initial choice of C++ tends to come from a requirement somewhere in the program for the low-level control that C++ offers, rather than from the program as a whole, and an argument that Haskell is performant enough to replace C++ only makes sense if it's either fast enough for that case too, or you're willing to have a heterogeneous codebase and don't really care about Haskell's performance anyway.
> the initial choice of C++ tends to come from a requirement somewhere in the program for the low-level control that C++ offers
Even that doesn't always come from such a requirement. Maybe it's just, "the common denominator of expertise on this team is C++ so we will use that".
That original requirement for low-level control is now 5% of the code base, 95% of which is still C++ because it's easier to just add more C++ than to balkanize the project with new languages. The parser adds 1%, and scans a config file only about every 100 days, when the C++ server is restarted; why would it have to be fast ...
The virtual call alone is a lot more expensive than a simple pointer dereference, and though proper inlining might alleviate some of this, you're not particularly likely to see it come close to something that's efficient by design.
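To make the peek() point concrete, here's a hypothetical contrast (my own toy code, not the Nix parser):

```cpp
#include <cstddef>
#include <istream>
#include <string>

// Via peek(): every call constructs a sentry, checks stream state,
// and may hit a virtual underflow() -- and this loop even calls it
// twice per character.
std::size_t countLeadingDigits(std::istream &in) {
    std::size_t n = 0;
    while (in.peek() >= '0' && in.peek() <= '9') {
        in.get();
        ++n;
    }
    return n;
}

// Reading the input into a string first reduces the scan to plain
// indexing over contiguous bytes.
std::size_t countLeadingDigits(const std::string &line) {
    std::size_t n = 0;
    while (n < line.size() && line[n] >= '0' && line[n] <= '9')
        ++n;
    return n;
}
```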
Game developers always need to be dragged by platform owners to adopt modern concepts and programming languages.
I am old enough to have heard the same complaint, but with Assembly, C and Turbo Pascal as the protagonists.
One should write idiomatic, productive code, and only bother spending time optimizing if the profiler shows a bottleneck that is actually relevant for the use case at hand.
To add my voice to the other comments: there are many reasons we end up using C++, not only bleeding-edge performance.
Portable code across mobile OSes, access to the JVM and CLR monitoring APIs, UWP APIs not exposed to .NET, and medical device DLL and COM libraries are just a few examples that come to mind.
I'm not sure why I'm struggling to express myself here. I'm not saying Nix was wrong not to focus on optimising this; it's not likely to be their bottleneck. I'm saying that the article's argument doesn't hold.
Consider a case where somebody is using C++ for mobile OS portability or to hook into another large project, rather than the performance it offers. In that case, the argument that the parser would be equally fast in Haskell is irrelevant because that was never the use-case.
Consider a case where somebody is using C++ for performance. In that case, the argument that the parser would be equally fast in Haskell is irrelevant because that's not the part that needs to be fast.
The only time "X in Haskell is equally fast to X in C++" is an argument that you care about is when you care about the performance of X, and in that case the code wouldn't look like this.
> Note that Haskell type synonyms reverse the order of the types compared to C++.
This is true in the context of the article when compared with the presented typedef definitions; however, the modern way to write type synonyms in C++ is via using declarations, which are very similar to the Haskell ones.
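A small illustration (my own example, not lifted from the article):

```cpp
#include <map>
#include <string>

struct DerivationOutput { std::string path; };

// typedef puts the new name last -- reversed relative to Haskell's
//   type DerivationOutputs = Map Text DerivationOutput
typedef std::map<std::string, DerivationOutput> DerivationOutputs;

// A C++11 using declaration puts the new name first, matching the
// order of the Haskell type synonym.
using DerivationOutputsAlias = std::map<std::string, DerivationOutput>;
```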
Well, of course, there is ONE reason: Having to use old compilers or old standards for other reasons.
I imagine a lot of people are in that place with C++. Personally I've followed C++ for a long time as a hobby and it seems like it's finally getting to where it wants to be. But I'll talk to someone who works writing C++ and the idea of being able to use C++17 is a distant dream.
For Open Source developers and hobbyists though, you might as well take advantage of whatever features Clang+GCC+MSVC offer. I wish someone would standardize `#pragma once` so I don't have to feel guilty using it.
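For anyone who hasn't run into it, the non-standard one-liner replaces the classic guard (toy header of my own):

```cpp
// widget.hpp, guarded the portable way: the macro name has to be
// unique across the whole project.
#ifndef WIDGET_HPP
#define WIDGET_HPP
void frobnicate();
#endif // WIDGET_HPP

// The same header with the pragma, which Clang, GCC, and MSVC all
// accept despite it not being in the standard:
//
//   #pragma once
//   void frobnicate();
```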
Yeah so the only reason you wouldn't use 'using' is when you can't :)
Modern C++ is awesome. I definitely appreciate being able to use C++14 where I work (as well as `#pragma once`). I really look forward to being able to use C++17.
The trick is to use the `parsers` library, which lets you switch out parsing backends. You can prototype with the `trifecta` library (which has good error messages) and then switch to `attoparsec` when you're done.
For this specific post, the main thing I do is ignore the error messages and just look at where parsing fails by retrieving the leftovers when it fails (using the lower-level `parse` function). Usually that gave me enough of a clue to figure out what was going wrong.
In my experience using Parsec, it was extremely slow.
Took around 20 seconds to parse a few hundred thousand lines (almost no backtracking, too).
Attoparsec and parsec have very similar APIs. I wonder why one couldn't parse optimistically with attoparsec and, in case of failure, re-parse with parsec to get a nice error message. That should yield the best of both worlds?
Trifecta is best used via the parsers library, which is fairly well documented. This also (in theory) allows you to switch which parser you use between trifecta, parsec, attoparsec, and even ReadP. I say "in theory" because the semantics of these libraries differ.
I recommend the much smaller project by Merijn, lambda-except [1]. I learned Trifecta from it, and it's really not all that difficult, just embarrassingly poorly documented. I then wrote the parser for my STGi project based on Trifecta, so that may also be worth a look.
Note to the author, I found this post extremely hard to read on mobile. I had to scroll horizontally as the text wouldn't all fit on screen, and zooming seemed broken. When attempting to zoom or horizontally scroll it would sometimes randomly click some invisible popout that would take me to another page.
Blogger is actively hostile to phone browsers. For horizontal scrolling, you can try dragging at an angle (to fool the page-change JS) and then vertically scrolling to negate the vertical component of the first motion. It becomes pretty natural after a while :/
Author here: yes, this bothers me too (especially the horizontal scroll switching to another post). I'm not sure how to disable that, since Blogger doesn't provide an option for it.
However, I may be able to fix the zooming. This was triggered by the switch to syntax highlighting using Pandoc which wraps each code block in a scrollable frame if it exceeds the given size. I can probably fix that with appropriate CSS although it will take me some time to figure it out.
It would be good to see a fix for this from upstream (though the poster notes that this isn't necessarily simple), of course, but...
To propose another hack, I've found that 'Request Desktop Site' in your mobile browser's menu works well for articles like these (it just reverts to desktop layout).