Gron: Make JSON greppable (github.com/tomnomnom)
249 points by mooreds on Dec 3, 2023 | 73 comments



I use gron a lot, because I can never remember how to use jq to do anything fancy but can usually make awk work. (I may be unusual in that department, in that I actually like awk)

One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files. (I think gron -u is even worse, but my memory is a bit fuzzy here).

https://github.com/adamritter/fastgron as an alternative has been pretty good to me in terms of performance, I think both in speed and RAM usage.


Thanks for mentioning my project (fastgron).

It reads the file into memory once and then goes through it in a single pass, so it shouldn't need much more memory than the file size.

Also I put a lot of work into making fastgron -u fast, but you can grep the file directly as well.


Thank you for making fastgron! I use it daily and also have it aliased to 'gron' in my shell so I don't accidentally forget to use it.


Thanks, it would be great to make it the official gron 2.0; I tried to achieve full feature parity with the original. Also, a serious buffer overflow bug was just fixed, so make sure to upgrade to 0.7.

I'm thinking of doing some marketing (for example, a blog post showing the main lessons in I/O and memory management that made this speed possible).


Maybe I'm misunderstanding, but why does it need to read the file into memory at all? Can't it just parse directly as the data streams in? It should be possible to gron-ify a JSON file that is far bigger than the memory available - the only part that needs to stay in memory is the key you are currently working on.
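For the forward direction, something like this minimal sketch with Go's encoding/json Decoder (not gron's actual code; quoting of keys and statements is simplified) only ever holds the current key path:

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // walk prints one gron-style statement per value, recursing into objects
    // and arrays; only the current path string is held in memory.
    func walk(dec *json.Decoder, path string) error {
        tok, err := dec.Token()
        if err != nil {
            return err
        }
        switch t := tok.(type) {
        case json.Delim:
            switch t {
            case '{':
                fmt.Printf("%s = {};\n", path)
                for dec.More() {
                    key, err := dec.Token() // object key
                    if err != nil {
                        return err
                    }
                    if err := walk(dec, fmt.Sprintf("%s.%v", path, key)); err != nil {
                        return err
                    }
                }
                _, err = dec.Token() // consume '}'
                return err
            case '[':
                fmt.Printf("%s = [];\n", path)
                for i := 0; dec.More(); i++ {
                    if err := walk(dec, fmt.Sprintf("%s[%d]", path, i)); err != nil {
                        return err
                    }
                }
                _, err = dec.Token() // consume ']'
                return err
            }
        case string:
            fmt.Printf("%s = %q;\n", path, t)
        case nil:
            fmt.Printf("%s = null;\n", path)
        default: // numbers and booleans
            fmt.Printf("%s = %v;\n", path, t)
        }
        return nil
    }

    func main() {
        if err := walk(json.NewDecoder(os.Stdin), "json"); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }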


> One warning to note is that gron burns RAM. I've killed 32GB servers working with 15MB JSON files.

That sounds like there is something seriously wrong with the tool


I think it's because Go's encoding/json package doesn't support incremental parsing.


It does - I patched that into my local fork of `gron` two years ago.


Just done a test with my 800MB stress test file.

`jq`: 1m26s 21G resident

`mygron -e --no-sort`: 18m14s 19M resident

`gron --no-sort`: 1m51s OOM killed at 54G resident


Can you try https://github.com/adamritter/fastgron as a comparison?


`fastgron`: 8.5s 2.2G resident

edit: Interestingly, whilst doing this test, I piped the output into `fastgron -u` (39.5G resident) and `jq` rejected the result. Will have to investigate further, but it's a bit of a flaw if it can't rehydrate its own output into valid JSON.


I released fastgron v0.7.5, which contains fixes for string escaping. Could you please take another look?


Already commented on the github issue but will have another look, yep.


Sure, I saw it, thanks!

I fixed the semicolon bug, but of course correctness is more important.


Update for anyone following - 0.7.6 recreates my 800MB input JSON correctly after a round trip through `fastgron | fastgron -u`, which is good work.


Thanks, it's a clear bug. I created a new issue for it: https://github.com/adamritter/fastgron/issues/19


> `gron --no-sort`: 1m51s OOM killed at 54G resident

Oh dear


If I remember correctly, it took a 128GB AWS EC2 instance to parse that file without OOMing. Go is not that efficient with deep, multi-level data structures of unknown size and type.


Thanks for the follow up. Is your fork public?



I think https://pkg.go.dev/encoding/json#Decoder does support streaming at least. Here is gojq's stream mode https://github.com/itchyny/gojq/blob/main/cli/stream.go


Is there an issue on this?


I have a version of `gron` which uses almost no RAM to parse files (it uses a streaming JSON parser rather than loading the whole file). Processed a 4GB JSON file on a Pi using it (admittedly, it took forever), taking, IIRC, about 64MB RAM tops.

`gron -u` is basically impossible to optimise unless you know the input is in "sorted" order (ie the order it comes out of `gron`, including the `json.a = {};` bits) in which case my code can handle that in almost no RAM also. But if it's not sorted or you're missing the `json.a = {};` lines, there's not a lot you can do since you have to hold the whole data structure in RAM.


> you have to hold the whole data structure in RAM

Sure, but something is seriously wrong if a 15 MB JSON data structure uses more than 32 GB of RAM.


That 15MB JSON expands when piped through `gron` - my 7MB pathological test file is 143MB and 2M lines after going through `gron` (which is lines like `json[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][1][1][0][0] = "x";`)

Which is 20 levels of unknown-sized and unknown-typed slices of slices of `any` in Go and that is not super-efficient, alas. It gets worse when you have maps of slices of maps etc. `fastgron` gets around this by being able to manage its own memory.

(`gron` can, however, reconstruct the output correctly if you shuffle the input. `fastgron` cannot. Which suggests to me it's maybe using the same 'output as we go' trick that my `gron` fork uses for its "input is sorted" mode which uses almost no RAM but cannot deal with disordered input.)

(`gron` could/should maybe indicate the maximum size of the slices, and whether they're a single type, which would make things more efficient; I might add that to my fork.)
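(Rough illustration of the per-level cost - my own toy snippet, not anything from gron: wrapping a scalar in 20 levels of single-element `[]any`, the way that pathological line decodes, means one heap allocation per level plus a slice header and interface value each time.)

    package main

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        // Mimic json[0][0]...[0] = "x"; as decoded into Go's generic types.
        var v any = "x"
        for i := 0; i < 20; i++ {
            v = []any{v} // one more heap-allocated single-element slice per level
        }
        // On 64-bit: a slice header is 24 bytes and an interface value is 16,
        // before allocator overhead and all the pointer chasing.
        fmt.Println(unsafe.Sizeof([]any{}), unsafe.Sizeof(v)) // 24 16
    }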


What could possibly be so memory-intensive about gron? I suppose it could make sense for ungronning, but not in the forward direction.


It buffers all of its output statements in memory before writing to stdout:

https://github.com/tomnomnom/gron/blob/master/main.go#L204
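For contrast, a minimal sketch (not a patch against gron) of the streaming alternative: push each statement through a buffered writer as it's produced instead of collecting every line first. The `statements` channel here is a hypothetical stand-in for whatever walks the JSON and generates the lines.

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    // printStatements streams gron statements straight to stdout instead of
    // accumulating every line in memory before printing.
    func printStatements(statements <-chan string) {
        w := bufio.NewWriter(os.Stdout)
        defer w.Flush()
        for s := range statements {
            fmt.Fprintln(w, s)
        }
    }

    func main() {
        ch := make(chan string, 2)
        ch <- `json = {};`
        ch <- `json.name = "example";`
        close(ch)
        printStatements(ch)
    }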


Here's the fastgron printer if you want to compare:

https://github.com/adamritter/fastgron/blob/main/src/print_g...


Why?


It appears to be so it can sort the lines. Not sure how useful that is however.


An ironic near-miss on the UNIX philosophy. There's a great UNIX tool that will handle sorting arbitrarily large files!


It will mess up array indices, though.


Wouldn’t “sort -n” work with indices?


It's tricky to specify the sorting criterion: you have to indicate the column. Gron's output looks like this:

    a.b[0].c.d[0]: ...
    a.b[0].e[0].f: ...


It shouldn't need to buffer the output to do that, right?


Correct.


I remembered where some of my old files are and re-tested; forward-gron was "only" about 7GB for the 15MB file. gron -u was the real killer, clocking in around 53GB.


Yeah that's a bug, no amount of buffering can justify that amount of memory.

And gron -u in theory should use less memory than gron-ifying a JSON, as you just have to fill a data structure in memory as you go.


> as you just have to fill a data structure in memory as you go

You don't know the size, shape, or type of any of the levels in the data structure until you get to a line specifying one part of it. If you did, yep, it would be trivial!


But you do: if the first line is `users[15].name.family_name = "Foo"`, all you need to know is there: there is an array of users, each is a map containing a field called name, which is a map with a field called family_name.

If users[14] is a string, or there are 1500 users, the amount of memory needed to ungron that line is exactly the same. Prove me wrong; I can't think of any way it would not be trivial, provided one uses the correct data structures.


> there is an array of users

How big is it? All you know at this point is that it's at least 16 entries long. If the next line starts with `users[150]`, now it's 151 entries. Next line might make it 2000 entries long. You have no idea until you see the line.

> each is a map

But a map of what? `string -> object`? Ok, the next line is `users[15].flange[15]` which means your map is now `string -> (object|array)`.

Then the next line is `users[15].age = 15` and you've got `string -> (object|array|int)`. Each line can change what you've got and in Go this isn't a trivial thing to handle without resorting to `interface{}` (or `any`) all over the show and reflection to handle the management of the data structures.

> Prove me wrong, I can't think of any way it would not be trivial, provided one uses the correct datastructures.

All I can suggest is that you try to build `ungron` in Go and have it correctly handle disordered input. If you find a better way of doing it, I'd be happy to hear about it because I spent several months fighting Go in 2021-22 trying to optimise this without success.
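To make the mechanics concrete, here's a toy sketch (mine, not code from either tool) of inserting a single parsed gron path into a structure whose shape is only discovered line by line - everything ends up as `any`, slices grow on demand, and this still ignores the type-conflict cases described above:

    package main

    import "fmt"

    // step is one parsed path element: either an object key or an array index.
    type step struct {
        key string
        idx int
        arr bool
    }

    // set grows/replaces containers so that root[path...] == val and returns the
    // (possibly new) root. A real ungron also has to handle type conflicts, e.g.
    // a later line turning users[15] from an object into a string.
    func set(root any, path []step, val any) any {
        if len(path) == 0 {
            return val
        }
        s := path[0]
        if s.arr {
            slice, _ := root.([]any) // nil (or a different type) starts empty
            for len(slice) <= s.idx { // grow to the index we just learned about
                slice = append(slice, nil)
            }
            slice[s.idx] = set(slice[s.idx], path[1:], val)
            return slice
        }
        m, ok := root.(map[string]any)
        if !ok {
            m = map[string]any{}
        }
        m[s.key] = set(m[s.key], path[1:], val)
        return m
    }

    func main() {
        var root any
        // users[15].name.family_name = "Foo";
        root = set(root, []step{
            {key: "users"},
            {arr: true, idx: 15},
            {key: "name"},
            {key: "family_name"},
        }, "Foo")
        fmt.Printf("%v\n", root)
    }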


What happens when you ignore software engineering.


Deeply nested structures would get expanded a lot


As someone who has been using jq for years, my first instinct was: why not jq? The README answers this as well, but it's not very clear until you compare the output to jq's.

   $ gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | fgrep "commit.author"
   json[0].commit.author = {};
   json[0].commit.author.date = "2016-07-02T10:51:21Z";
   json[0].commit.author.email = "mail@tomnomnom.com";
   json[0].commit.author.name = "Tom Hudson";
And with jq:

    $ curl "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | jq ".[].commit.author" 
    {
      "name": "Tom Hudson",
      "email": "mail@tomnomnom.com",
      "date": "2022-04-13T14:23:37Z"
    }
The jq version isn't greppable, as you can't do `| grep '.author.email'` for example.


Theoretically jq can be coerced into printing similar output. Realistically, though, if you've already written that jq query and you wanted the email field, you'd just append .email.

https://unix.stackexchange.com/questions/561460/how-to-print...


`gron` is super useful when you don't directly know the structure of some JSON - it gives you a nice simple path to locate things, which you can then use to construct a `jq` query (since e.g. dealing with multiple items in the same list can be a faff with `gron`).


Another option: jless.

You can then copy the path and use it in jq or "yq".


I think of gron and jq as siblings: gron + grep for discovery, then you use gron's output (for the desired line) as jq's input, which shows you what you were after.

With jq alone, you have to already understand the structure, which isn't always a given if you're combing through k8s manifests for instance.


Flattening facilitates arbitrary ad hoc transformation.


> The jq version isn't greppable, as you can't do `| grep '.author.email'` for example.

Truth be told, with jq you don't need to grep it; you can grab just the emails directly. I find gron a lot more useful for grep -v, that is, for filtering out the parts that you don't need. Super easy to clean up data.


I use this in my `~/.jq` when I have a problem like this.

    def flat_json_keys:
        [leaf_paths as $path |
            {"key": $path
                | map(if (type=="string")
                      then (if (test("([?:\\W]+)")) then "['"+.+"']" else . end)
                      else "["+tostring+"]" end)
                | join(".")
                | gsub(".\\[";"["),
             "value": getpath($path)}]
        | from_entries;

    def ukeys:
        keys_unsorted;
Use like so:

    cat wat.json | jq flat_json_keys


The idea is pretty brilliant, but it is not a new one. I remember a similar program about 20 years ago when XML was all the rage. I don't recall the name but something like "py*" I think.

Edit: I found it. It was called pyx. https://www.xml.com/pub/2000/03/15/feature/index.html https://xmlstar.sourceforge.net/doc/UG/ch04s07.html


Thumbs up for gron. Been using it for a couple of years to get the jsonpath to a property I need. It's super handy with kubectl and other ctls of the sort.


Hmmm… at first glance, this feels like I’d use it for the same sorts of things I’d use jq for, only easier to use but also way less powerful. Jq does have a little bit of a learning curve necessary to get good use out of it, so I could see this being a nice quick tool for people who don’t want to make that investment. Having already learned jq, I’m not sure why I would reach for gron, but maybe I’m missing something.


Not missing, retaining something: the details of jq. Many developers find this difficult, for both everyday and rarely used tools. See also bash.


My pure-AWK variant: https://github.com/xonixx/gron.awk (features a complete JSON parser in AWK).


Has this been tested against Mawk? I often see performance benefits with Mawk when the Awk script does not need any Gawk-specific functions.

I'm looking at the test suite, trying to figure out how to get it to emit test failure details:

  ./makesure test_suite
  mawk -f run_json_test_suit.awk
  ...
  Successes: 186
  Fails:     152


It has: https://github.com/xonixx/gron.awk/blob/e5040a7b1384c5839dca...

Strange, there should be far fewer failures: https://github.com/xonixx/gron.awk/blob/e5040a7b1384c5839dca...

Could you please file a bug with the details?


Looks like an interesting tool. I use jq for any JSON-related task, but it can often be finicky and complex when I just need to get at a value or search for something.

Looks like gron would be a nice addition to my workflow with JSON tasks.


A plug for one of my own projects, something gron-like (but more general) for Python:

https://pypi.org/project/jsonmason/

"maybe doing something depending on context for some nodes in a nested structure by using transforms or side-effects while iterating over the regularized representation of the structure's nodes"

It's a library but also includes CLI utilities that do the same as gron. Well, I hope without that memory ballooning problem described in sibling comments.


This is awesome!

I am going to try to use it to make surgical edits to the terraform state file, in rare cases when I have to.

Some terraform providers would rather delete and recreate a resource, while a simple edit would do the trick for me.


Related:

Gron – Make JSON Greppable - https://news.ycombinator.com/item?id=25006277 - Nov 2020 (91 comments)

Gron: A command line tool that makes JSON greppable - https://news.ycombinator.com/item?id=16727665 - April 2018 (51 comments)


Can see this being useful for pentesting; jq is fine, but I always have to look at the documentation and relearn it every few weeks/months.


> Why shouldn't I just use jq?

> jq is awesome, and a lot more powerful than gron, but with that power comes complexity. gron aims to make it easier to use the tools you already know, like grep and sed.

I know grep, but sed is one of those I always have to look up whenever I have to escape a weird character or something.


I wonder if you could accomplish the same with structural search tools like Comby (https://comby.dev/) - with the bonus that you can target specific levels of nesting since it's syntax-aware.


gron is one of my favorite tools because I don't do this type of searching as much anymore and can't remember the options for more advanced tools (jq, etc). I can easily and confidently compose from gron -> grep though.


This is brilliant, I can't wait to have a json problem to try it out.


Great idea, definitely makes sense when you have that kind of problem.

Also the username of the author made me chuckle, bonus points for that.


I like how the description is just 3 words and clearly communicates what this tool is about.


Could jq have a more C-like syntax?

JSON is already C-like (https://www.json.org/json-en.html), and jq uses dot-separated paths for chaining name/value accesses.


So what's the issue with jq?


You can't grep a deep keypath using jq.


Shows how much I have used jq. Traversing a JSON object seems like a nice-to-have, though it also seems CPU-intensive to some degree.



