
I use gron a lot, because I can never remember how to use jq to do anything fancy but can usually make awk work. (I may be unusual in that department, in that I actually like awk)

One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).

https://github.com/adamritter/fastgron as an alternative has been pretty good to me in terms of performance, I think both in speed and RAM usage.




Thanks for mentioning my project (fastgron).

It reads the file into memory once, then just goes through it only once, so it shouldn't need much more memory than the file size.

Also I put a lot of work into making fastgron -u fast, but you can grep the file directly as well.


Thank you for making fastgron! I use it daily and also have it aliased to 'gron' in my shell so I don't accidentally forget to use it.


Thanks! It would be great to make it the official gron 2.0; I tried to achieve full feature parity with the original. Also, a serious buffer overflow bug was just fixed, so make sure to upgrade to 0.7.

I'm thinking of doing some marketing (for example, a blog post just to share the main lessons in I/O and memory management that made this speed possible).


Maybe I'm misunderstanding, but why does it need to read the file into memory at all? Can't it just parse directly as the data streams in? It should be possible to gron-ify a JSON file that is far bigger than the memory available - the only part that needs to stay in memory is the key you are currently working on.
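
Something along these lines seems like it ought to work with Go's streaming json.Decoder (a rough sketch only, certainly not what fastgron actually does, with escaping of keys and error handling glossed over):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // walk emits gron-style statements while reading tokens one at a time,
    // so only the current key path lives in memory, not the whole document.
    func walk(dec *json.Decoder, path string) error {
        tok, err := dec.Token()
        if err != nil {
            return err
        }
        if d, ok := tok.(json.Delim); ok {
            switch d {
            case '{':
                fmt.Printf("%s = {};\n", path)
                for dec.More() {
                    keyTok, err := dec.Token() // object key
                    if err != nil {
                        return err
                    }
                    if err := walk(dec, path+"."+keyTok.(string)); err != nil {
                        return err
                    }
                }
                _, err := dec.Token() // consume '}'
                return err
            case '[':
                fmt.Printf("%s = [];\n", path)
                for i := 0; dec.More(); i++ {
                    if err := walk(dec, fmt.Sprintf("%s[%d]", path, i)); err != nil {
                        return err
                    }
                }
                _, err := dec.Token() // consume ']'
                return err
            }
            return nil
        }
        v, _ := json.Marshal(tok) // string, number, bool or null leaf
        fmt.Printf("%s = %s;\n", path, v)
        return nil
    }

    func main() {
        if err := walk(json.NewDecoder(os.Stdin), "json"); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }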


> One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files.

That sounds seriously like there is something wrong with the tool


I think it's because Go's encoding/json package doesn't support incremental parsing.


It does - I patched that into my local fork of `gron` two years ago.


Just done a test with my 800MB stress test file.

`jq`: 1m26s 21G resident

`mygron -e --no-sort`: 18m14s 19M resident

`gron --no-sort`: 1m51s OOM killed at 54G resident


Can you try https://github.com/adamritter/fastgron as a comparison?


`fastgron`: 8.5s 2.2G resident

edit: Interestingly, whilst doing this test, I piped the output into `fastgron -u` (39.5G resident) and `jq` rejected the result. Will have to investigate further, but it's a bit of a flaw if it can't rehydrate its own output into valid JSON.


I released fastgron v0.7.5, which contains a fix for string escaping. Could you please take another look?


Already commented on the github issue but will have another look, yep.


Sure, I saw it, thanks!

I fixed the semicolon bug, but of course correctness is more important.


Update for anyone following - 0.7.6 recreates my 800MB input JSON correctly after a round trip through `fastgron | fastgron -u`, which is good work.


Thanks, it's a clear bug. I created a new issue for it: https://github.com/adamritter/fastgron/issues/19


> `gron --no-sort`: 1m51s OOM killed at 54G resident

Oh dear


If I remember correctly, it took a 128GB AWS EC2 instance to parse that file without OOMing. Go is not that efficient at deep, multi-level data structures of unknown size and type.


Thanks for the follow up. Is your fork public?



I think https://pkg.go.dev/encoding/json#Decoder does support streaming, at least. Here is gojq's stream mode: https://github.com/itchyny/gojq/blob/main/cli/stream.go


Is there an issue on this?


I have a version of `gron` which uses almost no RAM to parse files (uses the streaming JSON parser rather than loading the file.) Processed a 4GB JSON file on a Pi using it (admittedly, it took forever) taking, IIRC, about 64MB RAM tops.

`gron -u` is basically impossible to optimise unless you know the input is in "sorted" order (ie the order it comes out of `gron`, including the `json.a = {};` bits) in which case my code can handle that in almost no RAM also. But if it's not sorted or you're missing the `json.a = {};` lines, there's not a lot you can do since you have to hold the whole data structure in RAM.
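
For reference, "sorted" order here just means the order gron itself emits, something like:

    json = {};
    json.a = {};
    json.a.b = [];
    json.a.b[0] = 1;
    json.a.b[1] = 2;
    json.c = "x";

With input in that order you can write the JSON back out as you read, closing braces and brackets whenever the path prefix stops matching, so nothing beyond the current path has to be kept around.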


> you have to hold the whole data structure in RAM

Sure, but something is seriously wrong if a 15 MB JSON data structure uses more than 32 GB of RAM.


That 15MB JSON expands when piped through `gron` - my 7MB pathological test file is 143MB and 2M lines after going through `gron` (which is lines like `json[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][1][1][0][0] = "x";`)

Which is 20 levels of unknown-sized and unknown-typed slices of slices of `any` in Go and that is not super-efficient, alas. It gets worse when you have maps of slices of maps etc. `fastgron` gets around this by being able to manage its own memory.
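
If you want to see the overhead for yourself, here's a small self-contained demo (numbers scaled down, nothing to do with gron's actual code) of how much heap 20-deep `[]any` nesting costs:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        // 100k leaves, each wrapped in 20 levels of []any - roughly the shape
        // that a pathological gron input forces onto encoding/json.
        leaves := make([]any, 0, 100_000)
        for i := 0; i < 100_000; i++ {
            var v any = "x"
            for depth := 0; depth < 20; depth++ {
                v = []any{v} // one slice header + backing array per level
            }
            leaves = append(leaves, v)
        }

        runtime.GC()
        runtime.ReadMemStats(&after)
        fmt.Printf("%d MiB of live heap\n", (after.HeapAlloc-before.HeapAlloc)/(1<<20))
        runtime.KeepAlive(leaves)
    }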

(`gron` can, however, reconstruct the output correctly if you shuffle the input. `fastgron` cannot. Which suggests to me it's maybe using the same 'output as we go' trick that my `gron` fork uses for its "input is sorted" mode which uses almost no RAM but cannot deal with disordered input.)

(`gron` could/should maybe indicate the maximum size of the slices and whether they're a single type, which would make things more efficient; I might add that to my fork.)


What could possibly be so memory-intensive about gron? I suppose it could make sense for ungronning, but not in the forward direction.


It buffers all of its output statements in memory before writing to stdout:

https://github.com/tomnomnom/gron/blob/master/main.go#L204
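
For comparison, the streaming alternative is tiny; a sketch (the channel here just stands in for wherever the formatted statements come from, it's not gron's real structure):

    package main

    import (
        "bufio"
        "os"
    )

    // emit writes each statement as soon as it is produced instead of
    // collecting everything into a slice so it can be sorted at the end.
    func emit(statements <-chan string) error {
        w := bufio.NewWriter(os.Stdout)
        for s := range statements {
            if _, err := w.WriteString(s + "\n"); err != nil {
                return err
            }
        }
        return w.Flush()
    }

    func main() {
        ch := make(chan string, 3)
        ch <- `json = {};`
        ch <- `json.name = "gron";`
        ch <- `json.streaming = true;`
        close(ch)
        emit(ch)
    }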


Here's the fastgron printer if you want to compare:

https://github.com/adamritter/fastgron/blob/main/src/print_g...


Why?


It appears to be so it can sort the lines. Not sure how useful that is however.


An ironic near-miss on the UNIX philosophy. There's a great UNIX tool that will handle sorting arbitrarily large files!


It will mess up array indices, though.
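
For example, a plain lexical sort puts these the wrong way round:

    json.a[10] = "ten";
    json.a[2] = "two";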


Wouldn’t “sort -n” work with indices?


It's tricky to specify the sorting criterion: you have to indicate the column. Gron's output looks like this:

    a.b[0].c.d[0]: ...
    a.b[0].e[0].f: ...


It shouldn't need to buffer the output to do that, right?


Correct.


I remembered where some of my old files are and re-tested; forward-gron was "only" about 7GB for the 15MB file. gron -u was the real killer, clocking in around 53GB.


Yeah, that's a bug; no amount of buffering can justify that amount of memory.

And gron -u in theory should use less memory than gron-ifying a JSON, as you just have to fill a data structure in memory as you go.


> as you just have to fill a data structure in memory as you go

You don't know the size, shape, or type of any of the levels in the data structure until you get to a line specifying one part of it. If you did, yep, it would be trivial!


But you do: if the first line is `users[15].name.family_name = "Foo"`, all you need to know is there: there is an array of users, each is a map containing a field called name, which is a map with a field called family_name.

If users[14] is a string, or there are 1500 users, the amount of memory needed to ungron that line is exactly the same. Prove me wrong: I can't think of any way it would not be trivial, provided one uses the correct data structures.


> there is an array of users

How big is it? All you know at this point is that it's at least 16 entries long. If the next line starts with `users[150]`, now it's 151 entries. Next line might make it 2000 entries long. You have no idea until you see the line.

> each is a map

But a map of what? `string -> object`? Ok, the next line is `users[15].flange[15]` which means your map is now `string -> (object|array)`.

Then the next line is `users[15].age = 15` and you've got `string -> (object|array|int)`. Each line can change what you've got and in Go this isn't a trivial thing to handle without resorting to `interface{}` (or `any`) all over the show and reflection to handle the management of the data structures.

> Prove me wrong, I can't think of any way it would not be trivial, provided one uses the correct datastructures.

All I can suggest is that you try to build `ungron` in Go and have it correctly handle disordered input. If you find a better way of doing it, I'd be happy to hear about it because I spent several months fighting Go in 2021-22 trying to optimise this without success.
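
To make the pain concrete, here's a minimal sketch of the `any`-plus-type-assertions bookkeeping for a single, already-parsed path (not gron's actual code; path parsing, the `= {};` lines and type-conflict handling are all glossed over):

    package main

    import "fmt"

    // insert grows the structure to fit whatever one gron line demands.
    // Every level is `any`, so each step needs a type assertion, and slices
    // must be grown on demand because their final length is only known once
    // the last line mentioning them has been seen.
    func insert(root any, path []any, value any) any {
        if len(path) == 0 {
            return value
        }
        switch key := path[0].(type) {
        case string: // object member
            m, ok := root.(map[string]any)
            if !ok {
                m = map[string]any{}
            }
            m[key] = insert(m[key], path[1:], value)
            return m
        case int: // array index
            s, ok := root.([]any)
            if !ok {
                s = nil
            }
            for len(s) <= key { // grow to whatever index this line demands
                s = append(s, nil)
            }
            s[key] = insert(s[key], path[1:], value)
            return s
        }
        return root
    }

    func main() {
        var root any
        root = insert(root, []any{"users", 15, "name", "family_name"}, "Foo")
        root = insert(root, []any{"users", 3}, "just a string")
        fmt.Println(len(root.(map[string]any)["users"].([]any))) // 16
    }

Even this toy version has to re-check (and potentially replace) the container type at every level on every line, which is where the allocations pile up.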


What happens when you ignore software engineering.


Deeply nested structures would get expanded a lot



