I use gron a lot, because I can never remember how to use jq to do anything fancy but can usually make awk work. (I may be unusual in that department: I actually like awk.)
One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).
Thanks! It would be great to make it the official gron 2.0; I tried to achieve full feature parity with the original. Also, a serious buffer overflow bug was just fixed, so make sure to upgrade to 0.7.
I'm thinking of doing some marketing (for example, a blog post showing the main lessons in I/O and memory management that made this speed possible).
Maybe I'm misunderstanding, but why does it need to read the file into memory at all? Can't it just parse directly as the data streams in? It should be possible to gron-ify a JSON file that is far bigger than the memory available - the only part that needs to stay in memory is the key you are currently working on.
edit: Interestingly, whilst doing this test I piped the output through `fastgron -u` (39.5G resident) and `jq` rejected the result. I'll have to investigate further, but it's a bit of a flaw if it can't rehydrate its own output into valid JSON.
If I remember correctly, it took a 128GB AWS EC2 instance to parse that file without OOMing. Go is not that efficient with deeply nested data structures of unknown size and type.
I have a version of `gron` which uses almost no RAM to parse files (it uses a streaming JSON parser rather than loading the whole file). Processed a 4GB JSON file on a Pi using it (admittedly, it took forever), taking, IIRC, about 64MB RAM tops.
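A minimal sketch of the idea (illustrative only, not the fork's actual code), built on `encoding/json`'s streaming tokenizer: memory stays proportional to nesting depth rather than file size. Key quoting and error handling are simplified, and statements come out in document order rather than sorted.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// frame is one open container on the current path.
type frame struct {
	isArray bool
	index   int    // next index to assign (arrays)
	key     string // most recent key (objects)
	needKey bool   // true when the next token must be an object key
}

func main() {
	dec := json.NewDecoder(bufio.NewReader(os.Stdin))
	dec.UseNumber() // keep numbers verbatim rather than as float64
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var stack []frame

	// path renders the current path, e.g. json.users[15].name
	path := func() string {
		p := "json"
		for _, f := range stack {
			if f.isArray {
				p += "[" + strconv.Itoa(f.index) + "]"
			} else {
				p += "." + f.key // real gron quotes keys that aren't identifiers
			}
		}
		return p
	}

	// afterValue advances the enclosing container once a value is complete.
	afterValue := func() {
		if len(stack) == 0 {
			return
		}
		top := &stack[len(stack)-1]
		if top.isArray {
			top.index++
		} else {
			top.needKey = true
		}
	}

	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF on a clean end; a real tool would report other errors
		}
		switch t := tok.(type) {
		case json.Delim:
			switch t {
			case '{':
				fmt.Fprintf(out, "%s = {};\n", path())
				stack = append(stack, frame{needKey: true})
			case '[':
				fmt.Fprintf(out, "%s = [];\n", path())
				stack = append(stack, frame{isArray: true})
			default: // '}' or ']'
				stack = stack[:len(stack)-1]
				afterValue()
			}
		case string:
			if n := len(stack); n > 0 && !stack[n-1].isArray && stack[n-1].needKey {
				stack[n-1].key = t // an object key, not a value
				stack[n-1].needKey = false
				continue
			}
			quoted, _ := json.Marshal(t) // JSON-escape the string value
			fmt.Fprintf(out, "%s = %s;\n", path(), quoted)
			afterValue()
		case bool:
			fmt.Fprintf(out, "%s = %t;\n", path(), t)
			afterValue()
		case json.Number:
			fmt.Fprintf(out, "%s = %s;\n", path(), t.String())
			afterValue()
		case nil:
			fmt.Fprintf(out, "%s = null;\n", path())
			afterValue()
		}
	}
}
```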
`gron -u` is basically impossible to optimise unless you know the input is in "sorted" order (ie the order it comes out of `gron`, including the `json.a = {};` bits) in which case my code can handle that in almost no RAM also. But if it's not sorted or you're missing the `json.a = {};` lines, there's not a lot you can do since you have to hold the whole data structure in RAM.
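And a rough sketch of that sorted-input mode (again illustrative, not the fork's code): it assumes identifier-only keys, numeric indexes, one statement per line, and exact gron ordering, and keeps only the currently open containers so it can emit JSON as the lines stream past.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// frame is one container we have opened in the output but not yet closed.
type frame struct {
	path     string // rendered gron path, e.g. "json.users[15]"
	closer   byte   // '}' or ']'
	children int    // children emitted so far, for comma placement
}

// isAncestor reports whether frame path f encloses path p.
func isAncestor(f, p string) bool {
	return strings.HasPrefix(p, f) && len(p) > len(f) &&
		(p[len(f)] == '.' || p[len(f)] == '[')
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	in.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long statements
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var stack []frame
	for in.Scan() {
		line := strings.TrimSuffix(strings.TrimSpace(in.Text()), ";")
		parts := strings.SplitN(line, " = ", 2)
		if len(parts) != 2 {
			continue
		}
		path, value := parts[0], parts[1]

		// Close any containers that do not enclose this path.
		for len(stack) > 0 && !isAncestor(stack[len(stack)-1].path, path) {
			out.WriteByte(stack[len(stack)-1].closer)
			stack = stack[:len(stack)-1]
		}

		// Is the last path segment an object key or an array index?
		key, inArray := "", false
		if strings.HasSuffix(path, "]") {
			inArray = true
		} else if i := strings.LastIndexByte(path, '.'); i >= 0 {
			key = path[i+1:]
		}

		// Comma and key relative to the enclosing container.
		if len(stack) > 0 {
			top := &stack[len(stack)-1]
			if top.children > 0 {
				out.WriteByte(',')
			}
			top.children++
			if !inArray {
				fmt.Fprintf(out, "%q:", key)
			}
		}

		// Scalars are already JSON literals; {} and [] open a new container.
		switch value {
		case "{}":
			out.WriteByte('{')
			stack = append(stack, frame{path: path, closer: '}'})
		case "[]":
			out.WriteByte('[')
			stack = append(stack, frame{path: path, closer: ']'})
		default:
			out.WriteString(value)
		}
	}
	for len(stack) > 0 { // close whatever is still open
		out.WriteByte(stack[len(stack)-1].closer)
		stack = stack[:len(stack)-1]
	}
	out.WriteByte('\n')
}
```

The moment a line arrives out of that order (or the `= {};` / `= [];` declarations are missing), this falls apart, which is exactly why the general case has to build the whole structure in memory.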
That 15MB JSON expands when piped through `gron` - my 7MB pathological test file is 143MB and 2M lines after going through `gron` (which is lines like `json[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][1][1][0][0] = "x";`)
That's 20 levels of unknown-sized, unknown-typed slices of slices of `any` in Go, and that is not super-efficient, alas. It gets worse when you have maps of slices of maps, etc. `fastgron` gets around this by being able to manage its own memory.
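To make that concrete, here's a toy illustration of what even a three-level version of that line becomes once it's decoded into Go's generic containers; the real line repeats this twenty deep.

```go
package main

import "fmt"

// Each nesting level is its own []any: a separate backing array on the heap,
// plus the interface box that wraps the slice when it's stored in the parent.
// A 20-deep path pays that overhead at every level before "x" is even stored.
var v any = []any{ // json[0]
	[]any{ // json[0][0]
		[]any{"x"}, // json[0][0][0] = "x" (three levels shown; the test file has twenty)
	},
}

func main() { fmt.Println(v) } // prints [[[x]]]
```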
(`gron` can, however, reconstruct the JSON correctly if you shuffle the input; `fastgron` cannot. That suggests to me it's maybe using the same 'output as we go' trick that my `gron` fork uses for its "input is sorted" mode, which uses almost no RAM but cannot deal with disordered input.)
(`gron` could/should maybe indicate the maximum size of the slices, and whether they're a single type, which would make things more efficient; I might add that to my fork.)
I remembered where some of my old files are and re-tested; forward-gron was "only" about 7GB for the 15MB file. gron -u was the real killer, clocking in around 53GB.
> as you just have to fill a data structure in memory as you go
You don't know the size, shape, or type of any of the levels in the data structure until you get to a line specifying one part of it. If you did, yep, it would be trivial!
But you do: if the first line is `users[15].name.family_name = "Foo"`, everything you need to know is right there. There is an array of users, each one a map containing a field called `name`, which is itself a map with a field called `family_name`.
If `users[14]` is a string, or there are 1500 users, the amount of memory needed to ungron that line is exactly the same. Prove me wrong: I can't think of any way it would not be trivial, provided one uses the correct data structures.
How big is it? All you know at this point is that it's at least 16 entries long. If the next line starts with `users[150]`, now it's 151 entries. Next line might make it 2000 entries long. You have no idea until you see the line.
> each is a map
But a map of what? `string -> object`? Ok, the next line is `users[15].flange[15]` which means your map is now `string -> (object|array)`.
Then the next line is `users[15].age = 15` and you've got `string -> (object|array|int)`. Each line can change what you've got, and in Go that isn't trivial to handle without resorting to `interface{}` (or `any`) all over the show, plus reflection to manage the data structures.
> Prove me wrong: I can't think of any way it would not be trivial, provided one uses the correct data structures.
All I can suggest is that you try to build `ungron` in Go and have it correctly handle disordered input. If you find a better way of doing it, I'd be happy to hear about it because I spent several months fighting Go in 2021-22 trying to optimise this without success.
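For what it's worth, here's a minimal sketch (my own illustration, not gron's or the fork's actual code) of the shape that insertion takes once everything is `any`: a type assertion at every level, containers created or replaced on the fly, and `users[15]` forcing the slice out to 16 slots even if the rest stay nil.

```go
package main

import "fmt"

// seg is one parsed path segment: either an object key or an array index.
type seg struct {
	key   string
	index int
	isIdx bool
}

// insert places value at path under root, creating or growing generic
// containers as needed, and returns the (possibly replaced) root.
func insert(root any, path []seg, value any) any {
	if len(path) == 0 {
		return value
	}
	s := path[0]
	if s.isIdx {
		arr, ok := root.([]any)
		if !ok {
			arr = nil // missing or previously a different type: start over as an array
		}
		for len(arr) <= s.index { // grow until the index exists
			arr = append(arr, nil)
		}
		arr[s.index] = insert(arr[s.index], path[1:], value)
		return arr
	}
	obj, ok := root.(map[string]any)
	if !ok {
		obj = map[string]any{} // missing or previously a different type: start over as a map
	}
	obj[s.key] = insert(obj[s.key], path[1:], value)
	return obj
}

func main() {
	var root any
	// json.users[15].name.family_name = "Foo"
	root = insert(root, []seg{{key: "users"}, {index: 15, isIdx: true}, {key: "name"}, {key: "family_name"}}, "Foo")
	// json.users[15].age = 15 (same subtree, now an int value)
	root = insert(root, []seg{{key: "users"}, {index: 15, isIdx: true}, {key: "age"}}, 15)
	fmt.Println(root)
}
```

A real ungron on top of this still has to parse the paths (quoted keys included), cope with a scalar later turning out to be a container, and serialise the finished structure, which is where the memory ends up going for big inputs.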
> One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).
https://github.com/adamritter/fastgron has been a pretty good alternative for me in terms of performance, I think in both speed and RAM usage.