I wonder how this compares to open-source models (which might be less accurate, but even cheaper if self-hosted), e.g. Llama 3.2. I'll see if I can run the benchmark.
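For a self-hosted run, this is roughly what I'd try first. Just a minimal sketch assuming Ollama serving a vision-capable Llama 3.2 variant locally; the model tag and prompt are placeholders, not something from the article:

```python
# Sketch only: assumes a local Ollama server with a vision-capable
# Llama 3.2 model pulled (the exact model tag may differ on your setup).
import base64
import requests


def extract_markdown_table(page_png: str) -> str:
    """Send one rendered PDF page to a local model and ask for Markdown tables."""
    with open(page_png, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2-vision",  # assumption: whichever vision model you have pulled
            "prompt": "Extract every table on this page as GitHub-flavored Markdown.",
            "images": [img_b64],
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```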
Everything I tried previously had very disappointing results. I was trying to get rid of Azure's Document Intelligence, which is kind of expensive at scale. The models could often output a portion of a table, but it was nearly impossible to get them to produce a structured output of a large table on a single page; they'd insert "...rest of table follows" and similar terminations no matter what kind of prompting I tried.
Maybe processing the table incrementally in chunks and stitching the pieces back together would have worked, but if Gemini can just handle the whole thing in one pass, that would be pretty good.
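Something like this is what I mean by chunking and stitching. It's only a sketch: `call_llm` is a stand-in for whatever model API you're using, and cropping the table into row strips is assumed to happen upstream.

```python
# Hypothetical chunk-and-stitch approach; `call_llm` and the row-strip
# cropping are assumptions, not part of the original benchmark.
from typing import Callable, List


def extract_table_in_chunks(
    row_strips: List[bytes],                 # pre-cropped image strips, a few rows each
    call_llm: Callable[[str, bytes], str],   # (prompt, image bytes) -> model output
) -> str:
    """Send the table to the model a slice at a time, then stitch the rows."""
    prompt = (
        "Extract ONLY the table rows visible in this image as Markdown table "
        "rows. No header, no commentary."
    )
    all_rows: List[str] = []
    for strip in row_strips:
        output = call_llm(prompt, strip)
        # Keep only lines that look like Markdown table rows.
        all_rows += [ln for ln in output.splitlines() if ln.strip().startswith("|")]
    # Caller prepends the header and separator row once at the end.
    return "\n".join(all_rows)
```

The stitching step is where I'd expect it to get fiddly (duplicated rows at chunk boundaries, cells that span strips), which is why a model that just takes the whole page is appealing.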
Also, regarding the failure case in the footnote: I think Gemini actually got that right (or at least outperformed Reducto). The original document seems to have what I'd call a "3D" table, where the third axis is rows within each cell, and having multiple headers is probably the best approximation you can get in Markdown.