I use gron a lot, because I can never remember how to use jq to do anything fancy but can usually make awk work. (I may be unusual in that department: I actually like awk.)
One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).
Thanks! It would be great to make it the official gron 2.0; I tried to achieve full feature parity with the original. Also, a serious buffer overflow bug was just fixed, so make sure to upgrade to 0.7.
I'm thinking of doing some marketing (for example, a blog post showing the main lessons in I/O and memory management that made this speed possible).
Maybe I'm misunderstanding, but why does it need to read the file into memory at all? Can't it just parse directly as the data streams in? It should be possible to gron-ify a JSON file that is far bigger than the memory available - the only part that needs to stay in memory is the key you are currently working on.
edit: Interestingly, whilst doing this test I piped the output through `fastgron -u` (39.5G resident) and `jq` rejected the result. I'll have to investigate further, but it's a bit of a flaw if it can't rehydrate its own output into valid JSON.
If I remember correctly, it took a 128GB AWS EC2 instance to parse that file without OOMing. Go is not that efficient with deeply nested data structures of unknown size and type.
I have a version of `gron` which uses almost no RAM to parse files (it uses a streaming JSON parser rather than loading the whole file). Processed a 4GB JSON file on a Pi using it (admittedly, it took forever), taking, IIRC, about 64MB RAM tops.
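A minimal sketch of the idea (illustrative only, not the fork's actual code), built on `encoding/json`'s streaming tokenizer: memory stays proportional to nesting depth rather than file size. Key quoting and error handling are simplified, and statements come out in document order rather than sorted.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// frame is one open container on the current path.
type frame struct {
	isArray bool
	index   int    // next index to assign (arrays)
	key     string // most recent key (objects)
	needKey bool   // true when the next token must be an object key
}

func main() {
	dec := json.NewDecoder(bufio.NewReader(os.Stdin))
	dec.UseNumber() // keep numbers verbatim rather than as float64
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var stack []frame

	// path renders the current path, e.g. json.users[15].name
	path := func() string {
		p := "json"
		for _, f := range stack {
			if f.isArray {
				p += "[" + strconv.Itoa(f.index) + "]"
			} else {
				p += "." + f.key // real gron quotes keys that aren't identifiers
			}
		}
		return p
	}

	// afterValue advances the enclosing container once a value is complete.
	afterValue := func() {
		if len(stack) == 0 {
			return
		}
		top := &stack[len(stack)-1]
		if top.isArray {
			top.index++
		} else {
			top.needKey = true
		}
	}

	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF on a clean end; a real tool would report other errors
		}
		switch t := tok.(type) {
		case json.Delim:
			switch t {
			case '{':
				fmt.Fprintf(out, "%s = {};\n", path())
				stack = append(stack, frame{needKey: true})
			case '[':
				fmt.Fprintf(out, "%s = [];\n", path())
				stack = append(stack, frame{isArray: true})
			default: // '}' or ']'
				stack = stack[:len(stack)-1]
				afterValue()
			}
		case string:
			if n := len(stack); n > 0 && !stack[n-1].isArray && stack[n-1].needKey {
				stack[n-1].key = t // an object key, not a value
				stack[n-1].needKey = false
				continue
			}
			quoted, _ := json.Marshal(t) // JSON-escape the string value
			fmt.Fprintf(out, "%s = %s;\n", path(), quoted)
			afterValue()
		case bool:
			fmt.Fprintf(out, "%s = %t;\n", path(), t)
			afterValue()
		case json.Number:
			fmt.Fprintf(out, "%s = %s;\n", path(), t.String())
			afterValue()
		case nil:
			fmt.Fprintf(out, "%s = null;\n", path())
			afterValue()
		}
	}
}
```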
`gron -u` is basically impossible to optimise unless you know the input is in "sorted" order (ie the order it comes out of `gron`, including the `json.a = {};` bits) in which case my code can handle that in almost no RAM also. But if it's not sorted or you're missing the `json.a = {};` lines, there's not a lot you can do since you have to hold the whole data structure in RAM.
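And a rough sketch of that sorted-input mode (again illustrative, not the fork's code): it assumes identifier-only keys, numeric indexes, one statement per line, and exact gron ordering, and keeps only the currently open containers so it can emit JSON as the lines stream past.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// frame is one container we have opened in the output but not yet closed.
type frame struct {
	path     string // rendered gron path, e.g. "json.users[15]"
	closer   byte   // '}' or ']'
	children int    // children emitted so far, for comma placement
}

// isAncestor reports whether frame path f encloses path p.
func isAncestor(f, p string) bool {
	return strings.HasPrefix(p, f) && len(p) > len(f) &&
		(p[len(f)] == '.' || p[len(f)] == '[')
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	in.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long statements
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var stack []frame
	for in.Scan() {
		line := strings.TrimSuffix(strings.TrimSpace(in.Text()), ";")
		parts := strings.SplitN(line, " = ", 2)
		if len(parts) != 2 {
			continue
		}
		path, value := parts[0], parts[1]

		// Close any containers that do not enclose this path.
		for len(stack) > 0 && !isAncestor(stack[len(stack)-1].path, path) {
			out.WriteByte(stack[len(stack)-1].closer)
			stack = stack[:len(stack)-1]
		}

		// Is the last path segment an object key or an array index?
		key, inArray := "", false
		if strings.HasSuffix(path, "]") {
			inArray = true
		} else if i := strings.LastIndexByte(path, '.'); i >= 0 {
			key = path[i+1:]
		}

		// Comma and key relative to the enclosing container.
		if len(stack) > 0 {
			top := &stack[len(stack)-1]
			if top.children > 0 {
				out.WriteByte(',')
			}
			top.children++
			if !inArray {
				fmt.Fprintf(out, "%q:", key)
			}
		}

		// Scalars are already JSON literals; {} and [] open a new container.
		switch value {
		case "{}":
			out.WriteByte('{')
			stack = append(stack, frame{path: path, closer: '}'})
		case "[]":
			out.WriteByte('[')
			stack = append(stack, frame{path: path, closer: ']'})
		default:
			out.WriteString(value)
		}
	}
	for len(stack) > 0 { // close whatever is still open
		out.WriteByte(stack[len(stack)-1].closer)
		stack = stack[:len(stack)-1]
	}
	out.WriteByte('\n')
}
```

The moment a line arrives out of that order (or the `= {};` / `= [];` declarations are missing), this falls apart, which is exactly why the general case has to build the whole structure in memory.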
That 15MB JSON expands when piped through `gron` - my 7MB pathological test file is 143MB and 2M lines after going through `gron` (which is lines like `json[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][1][1][0][0] = "x";`)
That's 20 levels of unknown-sized, unknown-typed slices of slices of `any` in Go, and that is not super-efficient, alas. It gets worse when you have maps of slices of maps, etc. `fastgron` gets around this by being able to manage its own memory.
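To make that concrete, here's a toy illustration of what even a three-level version of that line becomes once it's decoded into Go's generic containers; the real line repeats this twenty deep.

```go
package main

import "fmt"

// Each nesting level is its own []any: a separate backing array on the heap,
// plus the interface box that wraps the slice when it's stored in the parent.
// A 20-deep path pays that overhead at every level before "x" is even stored.
var v any = []any{ // json[0]
	[]any{ // json[0][0]
		[]any{"x"}, // json[0][0][0] = "x" (three levels shown; the test file has twenty)
	},
}

func main() { fmt.Println(v) } // prints [[[x]]]
```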
(`gron` can, however, reconstruct the JSON correctly if you shuffle the input; `fastgron` cannot. That suggests to me it's maybe using the same 'output as we go' trick that my `gron` fork uses for its "input is sorted" mode, which uses almost no RAM but cannot deal with disordered input.)
(`gron` could/should maybe indicate the maximum size of the slices, and whether they're a single type, which would make things more efficient; I might add that to my fork.)
I remembered where some of my old files are and re-tested; forward-gron was "only" about 7GB for the 15MB file. gron -u was the real killer, clocking in around 53GB.
> as you just have to fill a data structure in memory as you go
You don't know the size, shape, or type of any of the levels in the data structure until you get to a line specifying one part of it. If you did, yep, it would be trivial!
But you do: if the first line is `users[15].name.family_name = "Foo"`, everything you need to know is right there. There is an array of users, each one a map containing a field called `name`, which is itself a map with a field called `family_name`.
If `users[14]` is a string, or there are 1500 users, the amount of memory needed to ungron that line is exactly the same. Prove me wrong: I can't think of any way it would not be trivial, provided one uses the correct data structures.
How big is it? All you know at this point is that it's at least 16 entries long. If the next line starts with `users[150]`, now it's 151 entries. Next line might make it 2000 entries long. You have no idea until you see the line.
> each is a map
But a map of what? `string -> object`? Ok, the next line is `users[15].flange[15]` which means your map is now `string -> (object|array)`.
Then the next line is `users[15].age = 15` and you've got `string -> (object|array|int)`. Each line can change what you've got, and in Go that isn't trivial to handle without resorting to `interface{}` (or `any`) all over the show, plus reflection to manage the data structures.
> Prove me wrong: I can't think of any way it would not be trivial, provided one uses the correct data structures.
All I can suggest is that you try to build `ungron` in Go and have it correctly handle disordered input. If you find a better way of doing it, I'd be happy to hear about it because I spent several months fighting Go in 2021-22 trying to optimise this without success.
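For what it's worth, here's a minimal sketch (my own illustration, not gron's or the fork's actual code) of the shape that insertion takes once everything is `any`: a type assertion at every level, containers created or replaced on the fly, and `users[15]` forcing the slice out to 16 slots even if the rest stay nil.

```go
package main

import "fmt"

// seg is one parsed path segment: either an object key or an array index.
type seg struct {
	key   string
	index int
	isIdx bool
}

// insert places value at path under root, creating or growing generic
// containers as needed, and returns the (possibly replaced) root.
func insert(root any, path []seg, value any) any {
	if len(path) == 0 {
		return value
	}
	s := path[0]
	if s.isIdx {
		arr, ok := root.([]any)
		if !ok {
			arr = nil // missing or previously a different type: start over as an array
		}
		for len(arr) <= s.index { // grow until the index exists
			arr = append(arr, nil)
		}
		arr[s.index] = insert(arr[s.index], path[1:], value)
		return arr
	}
	obj, ok := root.(map[string]any)
	if !ok {
		obj = map[string]any{} // missing or previously a different type: start over as a map
	}
	obj[s.key] = insert(obj[s.key], path[1:], value)
	return obj
}

func main() {
	var root any
	// json.users[15].name.family_name = "Foo"
	root = insert(root, []seg{{key: "users"}, {index: 15, isIdx: true}, {key: "name"}, {key: "family_name"}}, "Foo")
	// json.users[15].age = 15 (same subtree, now an int value)
	root = insert(root, []seg{{key: "users"}, {index: 15, isIdx: true}, {key: "age"}}, 15)
	fmt.Println(root)
}
```

A real ungron on top of this still has to parse the paths (quoted keys included), cope with a scalar later turning out to be a container, and serialise the finished structure, which is where the memory ends up going for big inputs.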
> One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).
https://github.com/adamritter/fastgron has been a pretty good alternative for me in terms of performance, I think in both speed and RAM usage.