Ingesting PDFs and why Gemini 2.0 changes everything

lazypenguin · 2025-02-05T19:19:29 1738783169

I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate a multi-modal, large-context-window model in terms of ease of use. Ironically, this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor and price was significantly cheaper. For the 4% inaccuracies, a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC", which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with a weirdly high context window. Multi-modal so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
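
For illustration only (this is not the commenter's actual code), here is a minimal sketch of what adding a file "part" plus a JSON schema to a prompt can look like as a raw request body. The field names (`inline_data`, `generationConfig`, `response_mime_type`, `response_schema`) follow the public Gemini REST docs as I understand them and should be checked against the current API reference; the invoice schema and the helper name `build_ocr_request` are hypothetical.

```python
import base64
import json

# Hypothetical target schema; a real one depends on the documents being ingested.
INVOICE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "vendor_name": {"type": "STRING"},
        "total": {"type": "NUMBER"},
    },
    "required": ["vendor_name", "total"],
}


def build_ocr_request(pdf_bytes: bytes, schema: dict) -> dict:
    """Build a generateContent-style body: one PDF part plus one text part."""
    return {
        "contents": [{
            "parts": [
                # The PDF travels inline as base64 next to the instruction text.
                {"inline_data": {
                    "mime_type": "application/pdf",
                    "data": base64.b64encode(pdf_bytes).decode("ascii"),
                }},
                {"text": "OCR this PDF into the format specified by the JSON schema."},
            ],
        }],
        # Assumed config keys asking for JSON output constrained by the schema.
        "generationConfig": {
            "response_mime_type": "application/json",
            "response_schema": schema,
        },
    }


body = build_ocr_request(b"%PDF-1.4 placeholder bytes", INVOICE_SCHEMA)
print(json.dumps(body, indent=2)[:300])
```

The body would then be POSTed to the model's `generateContent` endpoint with an API key; that part is left out so the sketch runs offline.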

kbyatnal · 2025-02-06T00:46:48 1738802808

This is spot on, any legacy vendor focusing on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is, you get stuck with their data schema. With an LLM, you have full control over the schema meaning you can parse and extract much more unique data.

The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
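
A rough sketch of the `reasoning` / `citations` idea, assuming a plain JSON-Schema-style object and assuming the provider emits fields in the order the schema lists them (worth verifying for whichever API is used). The helper name `add_reasoning_fields` and the `<field>_reasoning` naming are made up for illustration.

```python
import copy


def add_reasoning_fields(schema: dict, with_citations: bool = False) -> dict:
    """Return a copy of an object schema where helper fields precede each field."""
    out = copy.deepcopy(schema)
    new_props = {}
    for name, spec in schema["properties"].items():
        # Helper fields go first so the model writes its rationale before the value.
        new_props[f"{name}_reasoning"] = {"type": "string"}
        if with_citations:
            new_props[f"{name}_citations"] = {
                "type": "array",
                "items": {"type": "string"},
            }
        new_props[name] = copy.deepcopy(spec)
    out["properties"] = new_props
    return out


base = {
    "type": "object",
    "properties": {"company_name": {"type": "string"}, "total": {"type": "number"}},
}
print(list(add_reasoning_fields(base, with_citations=True)["properties"]))
```

The extra fields can be dropped after parsing, or kept and shown to reviewers in human-in-the-loop tooling.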

Disclaimer: I started an LLM doc processing infra company (https://extend.app/)

TeMPOraL · 2025-02-06T09:40:17 1738834817

> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.

wraptile · 2025-02-06T18:59:32 1738868372

That's what we did with our web scraping saas - with Extraction API¹ we shifted web scraped data parsing to support both predefined models for common objects like products, reviews etc. and direct LLM prompts that we further optimize for flexible extraction.

There's definitely space here to help the customer realize their extraction vision because it's still hard to scale this effectively on your own!

1 - https://scrapfly.io/extraction-api

sitkack · 2025-02-06T13:38:23 1738849103

Software is dead, if it isn't a prompt now, it will be a prompt in 6 months.

Most of what we think software is today, will just be a UI. But UIs are also dead.

SketchySeaBeast · 2025-02-06T15:16:39 1738854999

I wonder about these takes. Have you never worked in a complex system in a large org before?

OK, sure, we can parse a PDF reliably now, but now we need to act on that data. We need to store it, make sure it ends up with the right people who need to be notified that the data is available for their review. They then need to make decisions upon that data, possibly requiring input from multiple stakeholders.

All that back and forth needs to be recorded and stored, along with the eventual decision and all the supporting documents, and that whole bundle needs to be made available across multiple systems, which requires a bunch of ETLs and governance.

An LLM with a prompt doesn't replace all that.

sitkack · 2025-02-06T20:38:16 1738874296

We need to think in terms of light cones, not dog and pony take downs of whatever system you are currently running. See where things are going.

I have worked in large systems, both in code and people, compilers, massive data processing systems, 10k business units.

collingreen · 2025-02-06T20:49:26 1738874966

I don't know what light cones or dog and pony mean here but I'm interested in your take - would you care to expand a bit on how the future can reshape that very complicated set of steps and humans described in the parent?

SketchySeaBeast · 2025-02-07T00:34:15 1738888455

I think collingreen followed-up better than I ever could, so I'm hoping you can respond to them with more details.

victorbjorklund · 2025-02-06T16:46:51 1738860411

Can you prompt a salesforce replacement for an org with 100 000 employees?

mrbungie · 2025-02-06T20:33:42 1738874022

Yesterday I read an /r/singularity post in awe because a screenshot of a lead management platform from OAI at a Japan convention supposedly meant a direct threat to Salesforce. Like, yeah sure buddy.

I would say most accelerationists/AI bulls/etc. don't really understand the true essential complexity in software development. LLMs are being seen as a software development silver bullet, and we know what happens with silver bullets.

sitkack · 2025-02-06T20:35:55 1738874155

Come back to your comment in 18 months.

collingreen · 2025-02-06T20:51:55 1738875115

I assume this is a slap intended to imply that ai actually IS a silver bullet answer to the parent's described problem and in just 18 months they will look back and realize how wrong they are.

Is that what you mean and, if so, is there anything in particular you've seen that leads you to see these problems being solved well or on the 18 month timeline? That sounds interesting to look at to me and I'd love to know more.

sitkack · 2025-02-06T23:41:05 1738885265

It isn't a silver bullet in that it can just "make software" but it is changing the entire dynamic.

You can't do point sampling to figure out where things are going. We have to look at the slope. People see a paper come out, look at the results and say, "this fails for x, y and z. doesn't work", that is not how scientific research works. This is why Two Minute Papers has the tag line, "hold on to your papers ... two papers down the line ..."

Copy and paste the whole thread into a SOTA model and have meta me explain it.

ethbr1 · 2025-02-08T04:23:25 1738988605

That's not why more experienced people are doubting you.

They're doubting you because the non-digital portions of processes change at people/org speed.

Which is to say that changing a core business process is a year political consensus, rearchitecture, and change management effort, because you also have to coordinate all the cascading and interfacing changes.

sitkack · 2025-02-08T19:38:00 1739043480

> changing a core business process is a year political consensus, rearchitecture, and change management effort

You are thinking within the existing structures, those structures will evaporate. All along the software supply chain, processes will get upended, not just because of how technical assets will be created, but also how organizations themselves are structured and react and in turn how software is created and consumed.

This is as big as the invention of the corporation, the printing press and the industrial revolution.

I am not here to tutor people on this viewpoint or defend it, I offer it and everyone can do with it what they will.

ethbr1 · 2025-02-08T20:51:59 1739047919

Ha. Look back on this comment in a few years.

cpursley · 2025-02-06T14:02:05 1738850525

Software without data moats, vendor lock-in, etc. sure will. All the low-hanging-fruit SaaS is going to get totally obliterated by LLM-built software.

fragmede · 2025-02-06T18:42:16 1738867336

If I'm an autobody shop or some other well-served niche, how unhappy with them do I have to be to decide to find a replacement, either a competitor of theirs that used an LLM, or bring it in house and go off and find a developer to LLM-acceleratedly make me a better shopmonkey? And there are the integrations. I don't own a low hanging fruit SaaS company, but it seems very sticky, and since the established company already exists, they can just lower prices to meet their competitors.

B2B is different from B2C, so if one vendor has a handful of clients and they won't switch away, there's no obliterating happening.

What's opened up is even lower hanging fruit, on more trees. A SaaS company charging $3/month for the left-handed underwater basket weaver niche now becomes viable as a lifestyle business. The shovels in this could be supabase/similar, since clients can keep access to their data there even if they change frontends.

sitkack · 2025-02-06T20:35:08 1738874108

Which means that the current vc-software-ecosystem is the walking dead. The front end webdev is now going to do things that previously took a 10 person startup.

cpursley · 2025-02-07T12:22:54 1738930974

Integrations are part of the data moat I mentioned.

Vrondi · 2025-02-08T15:19:04 1739027944

The only thing that will be different for most is vendor lock-in will be to LLM vendors.

sitkack · 2025-02-06T18:14:48 1738865688

Totally agree.

Cumpiler69 · 2025-02-06T10:37:29 1738838249

>A smart vendor will shift into that space - they'll use that LLM themselves

It's a bit late to start shifting now since it takes time. Ideally they should already have a product on the market.

TeMPOraL · 2025-02-06T11:01:25 1738839685

There's still time. The situation in which you can effectively replace your OCR vendor with hitting LLM APIs via a half-assed Python script ChatGPT wrote for you, has existed for maybe a few months. People are only beginning to realize LLMs got good enough that this is an option. An OCR vendor that starts working on the shift today, should easily be able to develop, tune, test and productize an LLM-based OCR pipeline way before most of their customers realize what's been happening.

But it is a good opportunity for a fast-moving OCR service to steal some customers from their competition. If I were working in this space, I'd be worried about that, and also about the possibility some of the LLM companies realize they could actually break into this market themselves right now, and secure some additional income.

EDIT:

I get the feeling that the main LLM suppliers are purposefully sticking to general-purpose APIs and refraining from competing with anyone on specific services, and that this goes beyond just staying focused. Some potential applications, like OCR, could turn into money printers if they moved on them now, and they all could use some more cash to offset what they burn on compute. Is it because they're trying to avoid starting an "us vs. them" war until after they made everyone else dependent on them?

anon84873628 · 2025-02-06T19:11:40 1738869100

To the point after your edit, I view it like the cloud shift from IaaS to PaaS / SaaS. Start with a neutral infrastructure platform that attracts lots of service providers. Then take your pick of which ones to replicate with a vertically integrated competitor or managed offering once you are too big for anyone to really complain.

bayindirh · 2025-02-06T12:47:50 1738846070

Never underestimate the power of the second mover. Since the development is happening in the open, someone can quickly cobble up the information and cut directly to the 90% of the work.

Then your secret sauce will be your fine tunes, etc.

Like it or not AI/LLM will be a commodity, and this bubble will burst. Moats are hard to build when you have at least one open source copy of what you just did.

SoftTalker · 2025-02-06T17:26:08 1738862768

And next year your secret sauce will be worthless because the LLMs are that much better again.

Businesses that are just "today's LLM + our bespoke improvements" won't have legs.

raghavsb · 2025-02-10T12:33:23 1739190803

Great, I landed on the reasoning and citations bit through trial and error and the outputs improved for sure.

MajorData · 2025-02-09T20:50:29 1739134229

How did you add bounding boxes, especially if it is a variety of files?

montecruiseto · 2025-02-07T11:44:33 1738928673

So why should I still use Extend instead of Gemini?

panta · 2025-02-06T17:35:55 1738863355

How do you handle the privacy of the scanned documents?

kbyatnal · 2025-02-06T17:48:00 1738864080

We work with Fortune 500s in sensitive industries (healthcare, fintech, etc). Our policies are:

- data is never shared between customers

- data never gets used for training

- we also configure data retention policies to auto-purge after a time period

panta · 2025-02-06T18:41:52 1738867312

But how to get these guarantees from the upstream vendors? Or do you run the LLMs on premises?

Karrot_Kream · 2025-02-06T19:20:11 1738869611

If you're using LLM APIs there are SLAs from the vendors to make sure your inputs are not used as training data and other guarantees. Generally these endpoints cost more to use (the compliance fee essentially) but they solve the problem.

makeitdouble · 2025-02-05T23:35:29 1738798529

> After trial and error with different models

As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.

To note, the docs are bilingual so it could affect the results, but what struck me is the lack of consistency, and even with the same model, running it two or three times in a row gives different results.

That's fine for my usage, but that sounds like a nightmare if every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.

And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
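
One way to put a number on the run-to-run drift described above is to extract the same document several times and measure per-field agreement. This is only a sketch: `field_agreement` is a hypothetical helper and the sample runs are invented.

```python
from collections import Counter


def field_agreement(runs: list[dict]) -> dict:
    """For each field, return the share of runs matching the most common value."""
    fields = sorted({key for run in runs for key in run})
    report = {}
    for field in fields:
        # repr() lets unhashable values (lists, dicts) be counted too.
        counts = Counter(repr(run.get(field)) for run in runs)
        report[field] = counts.most_common(1)[0][1] / len(runs)
    return report


# Three invented extractions of the same page.
runs = [
    {"name": "ACME LLC", "total": "120.00"},
    {"name": "ACME LLC", "total": "120.00"},
    {"name": "ACME IIC", "total": "120.00"},
]
print(field_agreement(runs))
```

Tracking this per model version would show whether a vendor-side model change actually moved the results or whether the noise was there all along.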

tomrod · 2025-02-06T00:25:03 1738801503

Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.

Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/

mejutoco · 2025-02-06T08:51:06 1738831866

> and every single week the results were slightly different.

This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.

rafaelmn · 2025-02-06T09:27:04 1738834024

Inconsistency comes from scaling - if you are optimizing your infra to be cost effective you will arrive at the same tradeoffs. Not saying it's not nice to be able to make some of those decisions on your own - but if you're picking LLMs for simplicity - we are years away from running your own being in the same league for most people.

mejutoco · 2025-02-06T12:53:25 1738846405

And if you are not, you won't.

You can decide if you change your local setup or not. You cannot decide the same of a service.

There is nothing inevitable about inconsistency in a local setup.

iandanforth · 2025-02-06T00:32:01 1738801921

At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.

pigscantfly · 2025-02-06T02:58:17 1738810697

This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.

brookst · 2025-02-06T04:51:06 1738817466

If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?

wodenokoto · 2025-02-06T08:48:48 1738831728

Temperature changes the distribution that is sampled, not if a distribution is sampled.

Temperature changes the softmax equation[1], not whether or not you are sampling from the softmax result or choosing the highest probability. IBM's documentation corroborates this, saying you need to set do_sample to True in order for the temperature to have any effect, i.e., T changes how we sample, not if we sample [2].

A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].

[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html

[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...

[3] https://community.openai.com/t/clarifications-on-setting-tem...
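
The distinction between reshaping the distribution and deciding whether to sample can be sketched in plain Python. This is a toy, not any vendor's decoder; `pick_token` and its `do_sample` flag are names chosen here to mirror the documentation mentioned above.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() from overflowing."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=1.0, do_sample=True, rng=random):
    """Sample from the distribution, or take the argmax when do_sample is False."""
    if not do_sample:
        # Temperature is irrelevant here: dividing by T > 0 never changes the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.5, 0.1):
    print(t, [round(p, 4) for p in softmax(logits, t)])
```

Lower temperatures concentrate probability on the top token, but as long as sampling is on, the other tokens keep a nonzero chance of being picked.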

zelphirkalt · 2025-02-06T09:19:35 1738833575

I have dealt with traditional ML models in the past and things like tensorflow non-reproducibility. Managed to make them behave reproducibly. This is a very basic requirement. If we cannot even have that, or people who deal with Gemini or similar models do not even know why they don't deliver reproducible results ... This seems very bad. It becomes outright unusable for anyone wanting to do research with reliable results. We already have a reproducibility crisis, because researchers often do not have the required knowledge to properly handle their tooling and would need a knowledgeable engineer to set it up. Only that most engineers don't know either and don't pay enough attention to detail to make reproducible software.

sidkshatriya · 2025-02-06T14:17:37 1738851457

Your response is correct. However, you can choose to not sample from the distribution. You can have a rule to always choose the token with the highest probability generated by the softmax layer.

This approach should make the LLM deterministic regardless of the temperature chosen.

P.S. Choosing lower and lower temperatures will make the LLM more deterministic but it will never be totally deterministic because there will always be some probability in other tokens. Also it is not possible to use temperature as exactly 0 due to exp(1/T) blowup. Like I mentioned above, you could avoid fiddling with temperature and just decide to always choose token with highest probability for full determinism.

There are probably other more subtle things that might make the LLM non-deterministic from run to run though. It could be due to some non-determinism in the GPU/CPU hardware. Floating point is very sensitive to ordering.

TL;DR for as much determinism as possible just choose token with highest probability (i.e. don't sample the distribution).
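
A small sketch of why greedy selection is deterministic only if the logits themselves are bit-identical. The logit values are invented; the perturbation is meant to be roughly the size of a single-precision rounding error at this magnitude.

```python
def greedy_token(logits):
    """Greedy decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Two candidate tokens whose logits are almost tied.
run_a = [3.1400000, 3.1400001, 0.5]
# The same position after a tiny numerical wobble in the first logit.
run_b = [3.1400002, 3.1400001, 0.5]

print(greedy_token(run_a), greedy_token(run_b))
```

No sampling happens in either run, yet the chosen token differs, and every later token is then conditioned on a different prefix.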

TeMPOraL · 2025-02-06T07:26:09 1738826769

Here probably routing would be dominating, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed precision math, floating points are still non-commutative therefore results are non-deterministic wrt. randomness introduced by parallelising the calculations.

zelphirkalt · 2025-02-06T09:24:01 1738833841

Of course, which part of the calculations happens where should also be specifiable and be able to be made deterministic, or should not have an effect on the result. A map-reduce process's reduce step, merging results from various places, also should be able to be made to give reproducible results, regardless of which results arrive first or from where.

Is our tooling too bad for this?

TeMPOraL · 2025-02-06T10:45:37 1738838737

> Is our tooling too bad for this?

Floating points are fundamentally too bad for this. We use them because they're fast, which usually more than compensates for inaccuracies FP math introduces.

(One, dealing with FP errors is mostly a fixed cost - there's a branch of CS/mathematics specializing in it, producing formally proven recipes for computing specific things in way that minimize or at least give specific bounds on errors. That's work that can be done once, and reused forever. Two, most programmers are oblivious to those issues anyway, and we've learned to live with the bugs :).)

When your parallel map-reduce is just doing matrix additions and multiplications, guaranteeing order of execution comes with serious overhead. For one, you need to have all partial results available together before reducing, so either the reduction step needs to have enough memory to store a copy of all the inputs, or it needs to block the units computing those inputs until all of them finish. Meanwhile, if you drop the order guarantee, then the reduction step just needs one fixed-size accumulator, and every parallel unit computing the inputs is free to go and do something else as soon as it's done.

So the price you pay for deterministic order is either a reduction of throughput or increase in on-chip memory, both of which end up translating to slower and more expensive hardware. The incentives strongly point towards not giving such guarantees if it can be avoided - keep in mind that GPUs have been designed for videogames (and graphics in general), and for this, floating point inaccuracies only matter when they become noticeable to the user.
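
The tradeoff can be simulated on a CPU. This is a toy model of the two reduction strategies, not how any GPU runtime is implemented; the partial results are picked so that rounding visibly depends on the order of addition.

```python
import itertools


def eager_reduce(arrivals):
    """Add each partial result to one accumulator in whatever order it arrives."""
    acc = 0.0
    for _index, value in arrivals:
        acc += value
    return acc


def ordered_reduce(arrivals):
    """Buffer every partial result first, then add them in index order."""
    acc = 0.0
    for _index, value in sorted(arrivals):
        acc += value
    return acc


# (index, partial result) pairs standing in for outputs of parallel units.
partials = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]

# Try every possible arrival order.
eager = {eager_reduce(p) for p in itertools.permutations(partials)}
ordered = {ordered_reduce(p) for p in itertools.permutations(partials)}
print("eager results:", sorted(eager))
print("ordered results:", sorted(ordered))
```

The ordered version always returns one value but has to hold all inputs before it starts, which is the memory or stalling cost described above. Note that its single answer is reproducible, not necessarily the most accurate one.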

Dylan16807 · 2025-02-06T09:22:49 1738833769

Why would the same software on the same GPU architecture use different commutations from run to run?

Also if you're even considering fixed point math, you can use integer accumulators to add up your parallel chunks.
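
A sketch of the integer-accumulator idea: quantize each value to fixed point once, then add integers, which is exact and therefore independent of order. The scale of 24 fractional bits is an arbitrary assumption, and Python integers never overflow, whereas a hardware accumulator would need enough bits for the worst-case sum.

```python
import random

SCALE = 1 << 24  # hypothetical fixed-point format: 24 fractional bits


def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer (this step loses precision once)."""
    return round(x * SCALE)


def fixed_sum(values) -> int:
    """Integer addition is exact, so the total does not depend on the order."""
    total = 0
    for v in values:
        total += to_fixed(v)
    return total


rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

totals = set()
for _ in range(20):
    rng.shuffle(values)  # simulate a different arrival order each time
    totals.add(fixed_sum(values))

print(len(totals), "distinct total(s):", next(iter(totals)) / SCALE)
```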

TeMPOraL · 2025-02-06T10:08:04 1738836484

Why would the same multithreaded software run on the same CPU (not just architecture - the same physical chip) have its instructions execute in different order from run to run? Performance. Want things deterministic? You have to explicitly keep them in sync yourself. GPUs sport tens of thousands of parallel processors these days, which are themselves complex, and are linked together with more complexity, both hardware and software. They're designed to calculate fast, not to ensure every subprocessor is always in lock step with every other one.

Model inference on GPU is mostly doing a lot of GPU equivalent of parallelized product on (X1, X2, X3, ... Xn), where each X is itself some matrix computed by a parallelized product of other matrices. Unless there's some explicit guarantee somewhere that the reduction step will pause until it gets all results so it can guarantee order, instead of reducing eagerly, each such step is a non-determinism transducer, turning undetermined execution order into floating point errors via commutation.

I'm not a GPU engineer so I don't know for sure, especially about the new cards designed for AI, but since reducing eagerly allows more memory-efficient design and improves throughput, and GPUs until recently were optimized for games (where FP accuracy doesn't matter that much), and I don't recall any vendor making determinism a marketing point recently, I don't believe GPUs suddenly started to guarantee determinism at expense of performance.

Dylan16807 · 2025-02-06T16:32:39 1738859559

Each thread on a CPU will go in the same order.

Why would the reduction step of a single neuron be split across multiple threads? That sounds slower and more complex than the naive method. And if you do decide to write code doing that, then just the code that reduces across multiple blocks needs to use integers, so pretty much no extra effort is needed.

Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?

TeMPOraL · 2025-02-07T09:52:33 1738921953

> Each thread on a CPU will go in the same order.

Not unless you control the underlying scheduler and force deterministic order; knowledge of all the code running isn't sufficient, as some factors affecting threading order are correlated with physical environment. For example, minute temperature gradient differences on the chip between two runs could affect how threads are allocated to CPU cores and order in which they finish.

> Why would the reduction step of a single neuron be split across multiple threads?

Doesn't have to, but can, depending on how many inputs it has. Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).

> Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?

No. There's just no dot-product instruction baked into GPU at low level that could handle vectors of arbitrary length. You need to write a loop, and that usually becomes some kind of parallel reduce.

Dylan16807 · 2025-02-07T10:31:25 1738924285

> could affect how threads are allocated to CPU cores and order in which they finish

I'm very confused by how you're interpreting the word "each" here.

> Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).

Splitting up a single neuron seems like something that would only increase overhead. Can you please explain how you get "a lot" of flexibility?

> You need to write a loop, and that usually becomes some kind of parallel reduce.

Processing a layer is a loop within a loop.

The outer loop is across neurons and needs to be parallel.

The inner loop processes every weight for a single neuron and making it parallel sounds like extra effort just to increase instruction count and mess up memory locality and make your numbers less consistent.

TeMPOraL · 2025-02-07T11:35:59 1738928159

I feel like you're imagining a toy network with couple dozen neurons in few layers, done on a CPU. But consider a more typical case of dozens of layers with hundreds (or thousands) of neurons each. That's some thousand numbers to reduce per each neuron.

Then, remember that GPUs are built around thousands of tiny parallel processors, each able to process a bunch (e.g. 16) parallel threads, but then the threads have to run in larger batches (SIMD-like), and there's a complex memory management architecture built-in, over which you only have so much control. Specific numbers of cores, threads, buffer sizes, as well as access patterns, differ between GPU models, and for optimal performance, you have to break down your computation to maximize utilization. Or rather, have the runtime do it for you.

This ain't an FPGA, you don't get to organize hardware to match your network. If you have 1000 neurons per hidden layer, then individual neurons likely won't fit on a single CUDA core, so you will have to split them down the middle, at least if you're using full-float math. Speaking of, the precision of the numbers you use is another parameter that adds to the complexity.

On the one hand, you have a bunch of mostly-linear matrix algebra, where you can tune precision. On the other hand, you have a GPU-model-specific number of parallel processors (~thousands), that can fit only so much memory, can run some specific number of SIMD-like threads in parallel, and most of those numbers are powers of two (or a multiple of), so you have also alignment to take into account, on top of memory access patterns.

By default, your network will in no way align to any of that.

It shouldn't be hard to see that assuming commutativity gives you (or rather the CUDA compiler) much more flexibility to parallelize your calculations by splitting it whichever way it likes to maximize utilization.

Dylan16807 · 2025-02-07T18:55:59 1738954559

I'm not imagining toy sizes. Quite the opposite. I'm saying that layers are so big that splitting per neuron already gives you a ton of individual calculations to schedule and that's plenty to get full usage out of the hardware.

You can do very wide calculations on a single neuron if you want; throwing an entire SM (64 or 128 CUDA cores) at a single neuron is trivial to do in a deterministic way. And if you have a calculation so big you benefit from splitting it across SMs, doing a deterministic sum at the end will use an unmeasurably small fraction of your runtime.

And I'll remind you that I wasn't even talking about determinism across architectures, just within an architecture, so go ahead and optimize your memory layouts and block sizes to your exact card.

michalsustr · 2025-02-06T07:48:21 1738828101

I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is they need to temperature control the cores and the floating-point ops may be reordered during that process. (By temperature I mean physical temperature, not some nn sampling parameter). On such a large scale of computation these small differences can show up as actually different tokens.

fancyfredbot · 2025-02-06T08:09:21 1738829361

I can assure you this isn't true. Having worked with GPUs for many years in an application where consistent results are important, it's not only possible but actually quite easy to ensure consistent inputs produce consistent results. The temperature and clock speed do not affect the order of operations, only the speed, and this doesn't affect the results. This is the same as with any modern CPU which will also adjust clock for temperature.

petesergeant · 2025-02-06T04:55:06 1738817706

The parent is suggesting that temperature only applies at the generation step, but the choice of backend “expert model” that a request is given to (and then performs the generation) is non-deterministic. Rather than being a single set of weights, there are a few different sets of weights that constitute the “expert” in MoE. I have no idea if that’s true, but that’s the assertion

brookst · 2025-02-06T05:27:05 1738819625

I don't think it makes sense? Somewhere there has to be a RNG for that to be true. MOE itself doesn't introduce randomness, and the routing to experts is part of the model weights, not (I think) a separate model.

pigscantfly · 2025-02-06T06:01:20 1738821680

The samples your input is batched with on the provider's backend vary between calls and sparse mixture of experts routing when implemented for efficient utilization induces competition among tokens with either encouraged or enforced balance of expert usage among tokens in the same fixed-size group. I think it's unknown or at least undisclosed exactly why sequence non-determinism at zero temperature occurs in these proprietary implementations, but I think this is a good theory.

[1] https://arxiv.org/abs/2308.00951 pg. 4 [2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
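
To make the theory concrete, here is a toy top-1 router with a per-expert capacity limit. It is a deliberately simplified sketch and not a description of any production MoE; `route_with_capacity` and the affinity scores are invented.

```python
def route_with_capacity(scores, capacity):
    """Assign each token to its best expert that still has room.

    scores[t][e] is token t's affinity for expert e. Tokens with the strongest
    preference claim slots first; a token is left as None if every expert is full.
    """
    n_experts = len(scores[0])
    load = [0] * n_experts
    assignment = [None] * len(scores)
    token_order = sorted(range(len(scores)), key=lambda t: -max(scores[t]))
    for t in token_order:
        for e in sorted(range(n_experts), key=lambda e: -scores[t][e]):
            if load[e] < capacity:
                assignment[t] = e
                load[e] += 1
                break
    return assignment


my_token = [0.6, 0.4]  # mildly prefers expert 0

# Same token, two different batch mates supplied by "other users".
print(route_with_capacity([my_token, [0.1, 0.9]], capacity=1)[0])
print(route_with_capacity([my_token, [0.9, 0.1]], capacity=1)[0])
```

The token's own scores never change, but which expert processes it depends on what else landed in the batch, so identical requests can take different numerical paths without any random number generator involved.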

kettleballroll · 2025-02-06T06:25:09 1738823109

I thought the temperature only affects randomness at the end of the network (when turning embeddings back into words using the softmax). It cannot influence routing, which is inherently influenced by which examples get batched together (i.e., it might depend on other users of the system).

menaerus · 2025-02-06T07:40:10 1738827610

You don't need RNG since the whole transformer is an extremely large floating-point arithmetic unit. A wild guess - how about the source of non-determinism is coming from the fact that, on the HW level, tensor execution order is not guaranteed and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?
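
The grouping effect is easy to reproduce with ordinary Python floats (IEEE 754 doubles), shown here on scalars rather than tensors. The relevant property is associativity: swapping the two operands of one operation is harmless, regrouping a chain of operations is not.

```python
import random

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right, left, right)

# For ordinary (non-NaN) values, swapping the operands of a single operation is safe.
assert a + b == b + a and a * b == b * a

# Regrouping is not: count random triples whose product depends on the grouping.
rng = random.Random(0)
triples = [(rng.random(), rng.random(), rng.random()) for _ in range(10_000)]
mismatches = sum((x * y) * z != x * (y * z) for x, y, z in triples)
print(mismatches, "of", len(triples), "products differ by grouping")
```

Each difference is a last-bit rounding effect, but across billions of operations it can be enough to reorder two nearly tied logits.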

daralthus · 2025-02-06T12:28:44 1738844924

I have seen numbers come differently in JAX just depending on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.

kiratp · 2025-02-06T09:50:17 1738835417

Quantized floating point math can, under certain scenarios, be non-associative.

When you combine that fact with being part of a diverse batch of requests over an MoE model, outputs are non-deterministic.

bushbaba · 2025-02-06T14:33:28 1738852408

That’s why you have azure openAI APIs which give a lot more consistency

itissid · 2025-02-05T21:56:35 1738792595

Wait, isn't there at least a two-step process here: semantic segmentation followed by a method like Textract for text, to avoid hallucinations?

One cannot possibly say that "Text extracted by a multimodal model cannot hallucinate"?

> accuracy was like 96% of that of the vendor and price was significantly cheaper.

I would like to know how this 96% was tested. If you use a human to do random-sample-based testing, how do you adjust the sample for an uneven distribution of errors? For example, a small set of documents could hold 90% of the errors and yet make up only 1% of the docs.

themanmaran · 2025-02-05T22:04:06 1738793046

One thing people always forget about traditional OCR providers (azure, tesseract, aws textract, etc.) is that they're ~85% accurate.

They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?

kapitalx · 2025-02-05T22:22:30 1738794150

I'm the founder of https://doctly.ai, also pdf extraction.

The hallucination in LLM extraction is much more subtle as it will rewrite full sentences sometimes. It is much harder to spot when reading the document and sounds very plausible.

We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence. That way you have the option of trading compute and cost for accuracy.
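
A rough sketch of the two-plus-tiebreaker idea described above, assuming each model is wrapped in a plain Python callable that returns a string. The function name, the status labels, and the stub models are hypothetical, not Doctly's implementation.

```python
from collections import Counter


def extract_with_tiebreak(document, primary, secondary, tiebreaker):
    """Run two extractors; call a third only when they disagree.

    Returns (text, status) where status is "agreed", "majority" or "unresolved".
    """
    first = primary(document)
    second = secondary(document)
    if first == second:
        return first, "agreed"
    third = tiebreaker(document)
    winner, votes = Counter([first, second, third]).most_common(1)[0]
    if votes >= 2:
        return winner, "majority"
    return None, "unresolved"  # all three differ: route to human review


# Stub "models" standing in for real API calls.
model_a = lambda doc: "Total: $30.28"
model_b = lambda doc: "Total: $80.23"
model_c = lambda doc: "Total: $30.28"

result = extract_with_tiebreak("invoice.pdf", model_a, model_b, model_c)
print(result)
```

The third call is only paid for on disagreement, which is the compute-for-accuracy trade mentioned above. Exact string equality is the simplest comparison; a real pipeline would likely compare field by field.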

LeafItAlone · 2025-02-06T04:38:53 1738816733

>We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence.

I’m interested to hear more about the validation process here. In my limited experience, I’ve sent the same “document” to multiple LLMs and gotten differing results. But sometimes the “right” answer was in the minority of responses. But over a large sample (same general intent of document, but very different possible formats of the information within), there was no definitive winner. We’re still working on this.

nnurmanov · 2025-02-06T04:06:34 1738814794

What if you use a different prompt to check the result, did this work? I am thinking to use this approach, but now I think maybe it is better to use two different LLM like you do.

anon373839 · 2025-02-05T22:17:07 1738793827

It’s a question of scale. When a traditional OCR system makes an error, it’s confined to a relatively small part of the overall text. (Think of “Plastics” becoming “PIastics”.) When an LLM hallucinates, there is no limit to how much text can be made up. Entire sentences can be rewritten because the model thinks they’re more plausible than the sentences that were actually printed. And because the bias is always toward plausibility, it’s an especially insidious problem.

themanmaran · 2025-02-05T23:17:45 1738797465

It's a bit of a pick your poison situation. You're right that traditional OCR mistakes are usually easy to catch (except when you get $30.28 vs $80.23). Compared to LLM hallucinations that are always plausibly correct.

But on the flip side, layout is often times the biggest determinant of accuracy, and that's something LLMs do a way better job on. It doesn't matter if you have 100% accurate text from a table, but all that text is balled into one big paragraph.

Also the "pick the most plausible" approach is a blessing and a curse. A good example is the handwritten form here [1]. GPT 4o gets all the email addresses correct because it can reasonably guess these people are all from the same company. Whereas AWS treats them all independently and returns three different emails.

[1] https://getomni.ai/ocr-demo

miki123211 · 2025-02-06T07:14:18 1738826058

The difference is the kind of hallucinations you get.

Traditional OCR is more likely to skip characters, or replace them with similar-looking ones, so you often get AL or A1 instead of AI for example. In other words, traditional spelling mistakes. LLMs can do anything from hallucinating new paragraphs to slightly changing the meaning of a sentence. The text is still grammatically correct, it makes sense in the context, except that it's not what the document actually said.

I once gave it a hand-written list of words and their definitions and asked it to turn that into flashcards (a json array with "word" and "definition"). Traditional OCR struggled with this text, the results were extremely low-quality, badly formatted but still somewhat understandable. The few LLMs I've tried either straight up refused to do it, or gave me the correct list of words, but entirely hallucinated the definitions.

Scoundreller · 2025-02-05T22:27:52 1738794472

> You literally get back characters + confidence intervals.

Oh god, I wish speech to text engines would colour code the whole thing like a heat map to focus your attention to review where it may have over-enthusiastically guessed at what was said.

You no knot.

gioazzi · 2025-02-05T23:09:30 1738796970

We did this for a speech to text solution in healthcare. Doctors would always review everything that was transcribed manually (you don’t want hallucinations in your prescription), and using a heatmap it was trivial to identify e.g. drugs that were pretty much always misunderstood by STT
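
A text-only sketch of the heatmap idea, assuming the engine returns (token, confidence) pairs. The threshold, the bracket markers, and the sample values are invented for illustration; a real UI would use colour instead.

```python
def mark_low_confidence(tokens, threshold=0.8):
    """Wrap tokens whose confidence is below the threshold in [[...]] for review."""
    return " ".join(
        f"[[{text}]]" if confidence < threshold else text
        for text, confidence in tokens
    )


# Invented transcript with per-token confidence scores.
transcript = [("take", 0.98), ("metoprolol", 0.41), ("twice", 0.95), ("daily", 0.97)]
marked = mark_low_confidence(transcript)
print(marked)
```

The reviewer's attention goes straight to the bracketed token instead of being spread evenly across the whole transcript.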

somebehemoth · 2025-02-05T22:16:15 1738793775

I know nothing about OCR providers. It seems like OCR failure would result in gibberish or awkward wording that might be easy to spot. Doesn't the LLM failure mode assert made up truths eloquently that are more difficult to spot?

nyarlathotep_ · 2025-02-06T01:47:34 1738806454

> is that they're ~85% accurate.

Speaking from experience, you need to double check "I" and "l" and "1" "0" and "O" all the time, accuracy seems to depend on the font and some other factors.

have a util script I use locally to copy some token values out of screenshots from a VMWare client (long story) and I have to manually adjust 9/times.

How relevant that is or isn't depends on the use case.

itissid · 2025-02-05T21:59:53 1738792793

For an OCR company I imagine it is unconscionable to do this, because if you did OCR for, say, an oral history project for a library and made hallucination errors, you would have replaced facts with fiction. Rewriting history? What the actual F.

phatfish · 2025-02-05T23:22:25 1738797745

Probably totally fine for a "fintech" (Crypto?) though. Most of them are just burning VC money anyway. Maybe a lucky customer gets a windfall because Gemini added some zeros.

rapind · 2025-02-06T04:49:34 1738817374

I think you can just ask DeepSeek to create a coin for you at this point, and with the recent elimination of any oversight, you can automate your rug pulls...

threecheese · 2025-02-06T13:31:50 1738848710

Normal OCR (like Tesseract) can be wrong as well (and IMO this happens frequently). It won’t hallucinate/straight make shit up like an LLM, but a human needs to review OCR results if the workload requires accuracy. Even across multiple runs of the same image an OCR can give different results (in some scenarios). No OCR system is perfectly accurate, they all use some kind of machine learning/floating point/potentially nondeterministic tech.

nthingtohide · 2025-02-06T03:58:38 1738814318

Can confirm using gemini, some figure numbers were hallucinated. I had to cross-check each row to make sure data extracted is correct.

godapi · 2025-02-06T06:27:03 1738823223

use different models to extract the page and cross check against each other. generally reduces issues a lot

basch · 2025-02-05T22:30:18 1738794618

Wouldn’t the temperature on something like OCR be very low. You want the same result every time. Isn’t some part of hallucination the randomness of temperature?

manmal · 2025-02-05T23:44:10 1738799050

I can imagine reducing temp too much will lead to garbage results in situations where glyphs are unreadable.

2rsf · 2025-02-06T11:49:35 1738842575

Isn't it a good thing in this case? This is fintech, so if in doubt get a human to look at it

basch · 2025-02-10T18:14:23 1739211263

so you want every time you scan something illegible, for it to return a different result.

serjester · 2025-02-05T23:08:51 1738796931

The LLMs are near perfect (maybe parsing I instead of 1) - if you're using the outputs in the context of RAG, your errors are likely much much higher in the other parts of your system. Spending a ton of time and money chasing 9's when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).

j_timberlake · 2025-02-06T00:07:16 1738800436

This sounds extremely like my old tax accounting job. OCR existed and "worked" but it was faster to just enter the numbers manually than fix all the errors.

Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.

eternauta3k · 2025-02-06T05:52:06 1738821126

Germany (not exactly the cradle of digitalization) already auto-fills salary tax fields with data from the employer.

Andrex · 2025-02-06T00:24:47 1738801487

They finally made filing free.

So, maybe this century?

kennyloginz · 2025-02-06T02:59:19 1738810759

Check again, Elon and his Doge team killed that.

happyopossum · 2025-02-06T06:20:40 1738822840

No they didn’t, that claim is ridiculously easy to debunk but it has been going around because it fits the narrative.

djeastm · 2025-02-06T19:12:33 1738869153

It'd be nicer if you wouldn't presume to know the reasons people might believe erroneous information.

In this case, the reason for the misinformation is due to the lack of communication from the DOGE entity regarding their actions. Mr. Musk wrote via Tweet that he had "deleted" the digital services agency "18F" that develops the IRS Free File program and also deleted their X account.

https://apnews.com/article/irs-direct-file-musk-18f-6a4dc35a...

If indeed he did cut the agency, it remains to be seen how long the application will be operational.

panarky · 2025-02-05T19:24:43 1738783483

This is a big aha moment for me.

If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.

wussboy · 2025-02-05T23:19:50 1738797590

Could it do exactly the same with a web page? Would this replace something like beautiful soup?

eitally · 2025-02-06T05:27:04 1738819624

I don't know exactly how or what it's doing behind the scenes, but I've been massively impressed with the results Gemini's Deep Research mode has generated, including both traditional LLM freeform & bulleted output, but also tabular data that had to come from somewhere. I haven't tried cross-checking for accuracy but the reports do come with linked sources; my current estimation is that they're at least as good as a typical analyst at a consulting firm would create as a first draft.

fallinditch · 2025-02-05T20:53:51 1738788831

If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?

jhoechtl · 2025-02-05T21:55:28 1738792528

Your local model has little more to do but stitch the already meaningful pieces together.

The pre-step, chunking and semantic understanding is all that counts.

yeahwhatever10 · 2025-02-06T00:36:22 1738802182

Do you get meaningful insights with current RAG solutions?

fallinditch · 2025-02-06T02:47:24 1738810044

Yes. For example, to create AI agent 'assistants' that can leverage a local RAG in order to assist with specialist content creation or operational activities.

potatoman22 · 2025-02-05T19:38:30 1738784310

Small point but is it doing semantic chunking, or loading the entire pdf into context? I've heard mixed results on semantic chunking.

panarky · 2025-02-05T19:52:37 1738785157

It loads the entire PDF into context, but then it would be my job to chunk the output for RAG, and just doing arbitrary fixed-size blocks, or breaking on sentences or paragraphs is not ideal.

So I can ask Gemini to return chunks of variable size, where each chunk is a one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
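
To make the complaint about fixed-size blocks concrete, here is a small sketch with a made-up two-paragraph text. It does not call Gemini; it only contrasts fixed-width slicing with a split on blank lines, which is itself one of the simple heuristics described above as not ideal.

```python
def fixed_size_chunks(text, size):
    """Slice text into consecutive blocks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text):
    """Split on blank lines, dropping empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


text = "Revenue rose 12% in Q3.\n\nCosts were flat year over year."

fixed = fixed_size_chunks(text, 20)
paragraphs = paragraph_chunks(text)
print(fixed)
print(paragraphs)
```

The second fixed-size block straddles the paragraph break, so one chunk mixes the end of the revenue statement with the start of the cost statement. Semantic chunking aims to avoid exactly that kind of split.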

thelittleone · 2025-02-05T21:26:38 1738790798

Fixed size chunks are holding back a bunch of RAG projects on my backlog. Will be extremely pleased if this semantic chunking solves the issue. Currently we're getting around a 78-82% success rate on fixed-size chunked RAG, which is far too low. Users assume zero results on a RAG search equates to zero results in the source data.

refulgentis · 2025-02-05T21:46:34 1738791994

FWIW, you might be doing it / ruled it out already:

- BM25 to eliminate the 0 results in source data problem

- Longer term, a peek at Gwern's recent hierarchical embedding article. Got decent early returns even with fixed size chunks
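
For anyone who has not used BM25, a minimal pure-Python sketch follows. It uses whitespace tokenisation and the common `k1=1.5`, `b=0.75` defaults with a Lucene-style IDF; the documents and query are invented. A production setup would more likely use a search engine or an existing library.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Project Falcon kickoff notes",
    "Quarterly revenue summary",
    "Falcon budget and staffing plan",
]
scores = bm25_scores("falcon budget", docs)
print(scores)
```

Because scoring is driven by literal term matches, a project name or acronym that appears in the source will surface even when the embedding search misses it.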

thelittleone · 2025-02-05T21:53:56 1738792436

Much appreciated.

For others interested in BM25 for the use case above, I found this thread informative.

https://news.ycombinator.com/item?id=41034297

mediaman · 2025-02-05T22:10:22 1738793422

Agree, BM25 honestly does an amazing job on its own sometimes, especially if content is technical.

We use it in combination with semantic but sometimes turn off the semantic part to see what happens and are surprised with the robustness of the results.

This would work less well for cross-language or less technical content, however. It's great for acronyms, company or industry specific terms, project names, people, technical phrases, and so on.

jacobr1 · 2025-02-05T23:43:54 1738799034

Also consider methods that are using reasoning to potentially dispatch additional searches based on analysis of the returned data

nnurmanov · 2025-02-06T04:09:13 1738814953

This is my problem as well; do you have lots of documents?

Tostino · 2025-02-05T22:54:17 1738796057

I wish we had a local model for semantic chunking. I've been wanting one for ages, but haven't had the time to make a dataset and finetune that task =/.

hattmall · 2025-02-06T02:32:29 1738809149

It's cheap now because Google is subsidizing it, no?

vrosas · 2025-02-06T02:45:22 1738809922

Spoiler: every model is deeply, deeply subsidized. At least Google's is subsidized by a real business with revenue, not VC's staring at the clock.

panarky · 2025-02-06T04:05:52 1738814752

It's cheap because it's a Flash model, far smaller and much less compute for inference, runs on TPUs instead of GPUs.

ForHackernews · 2025-02-05T23:05:32 1738796732

This is great, I just want to highlight how nuts it is that we have spun up whole industries around extracting text that was typically printed from a computer, back into a computer.

There should be laws that mandate that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.

faxmeyourcode · 2025-02-05T19:29:09 1738783749

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.

anirudhb99 · 2025-02-05T19:49:47 1738784987

(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.

otoburb · 2025-02-05T19:36:23 1738784183

>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1] which is a library that was relatively recently released that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.

[1] https://github.com/dgunning/edgartools

faxmeyourcode · 2025-02-05T23:00:20 1738796420

I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.

barrenko · 2025-02-05T19:42:50 1738784570

If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.

jgalt212 · 2025-02-05T20:15:59 1738786559

isn't everyone on iXBRL now? Or are you struggling with historical filings?

faxmeyourcode · 2025-02-05T22:31:19 1738794679

XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.

yzydserd · 2025-02-05T19:49:07 1738784947

How do today’s LLMs like Gemini compare with the Document Understanding services Google/AWS/Azure have offered for a few years, particularly when dealing with known forms? I think Google’s is Document AI.

zacmps · 2025-02-05T20:05:09 1738785909

I've found the highest accuracy solution is to OCR with one of the dedicated models then feed that text and the original image into an LLM with a prompt like:

"Correct errors in this OCR transcription".

therein · 2025-02-05T20:55:27 1738788927

How does it behave if the body of text is offensive or what if it is talking about a recipe to purify UF-6 gas at home? Will it stop doing what it is doing and enter lecturing mode?

I am asking not to be cynical but because, in my limited experience, using LLMs for any task that may operate on offensive or unknown input seems to trigger all sorts of unpredictable moral judgements, and the model gets dragged into generating output that is not at all what I wanted.

If I am asking this black box to give me a JSON output containing keywords for a certain text, if it happens to be offensive, it refuses to do that.

How does one tackle that?

devjab · 2025-02-06T04:20:04 1738815604

We use the Azure models and there isn't an issue with safety filters as such for enterprise customers. The one time we had an issue microsoft changed the safety measures. Of course the safety measures we might meet are the sort of engineering which could be interpreted as weapons manufacturing, and not "political" as such. Basically the safety guard rails seem to be added on top of all these models, which means they can also be removed without impacting the model. I could be wrong on that, but it seems that way.

xnx · 2025-02-05T21:37:54 1738791474

There are many settings for changing the safety level in Gemini API calls: https://ai.google.dev/gemini-api/docs/safety-settings

shijithpk · 2025-02-06T12:05:28 1738843528

This is for anyone coming across this link later. In their latest SDKs, if you want to completely switch off their safety settings, the flag to use is 'OFF' and not 'BLOCK_NONE' as mentioned in the docs in the link above.

The Gemini docs don't reflect that change yet. https://discuss.ai.google.dev/t/safety-settings-2025-update-...

sumedh · 2025-02-05T22:02:34 1738792954

Try setting the safety params to none and see if that makes any difference.

zacmps · 2025-02-05T21:31:56 1738791116

It's not something I've needed to deal with personally.

We have run into added content filters in Azure OpenAI on a different application, but we just put in a request to tune them down for us.

bradfox2 · 2025-02-05T20:17:01 1738786621

This is what we do today. Have you tried it against Gemini 2.0?

anirudhb99 · 2025-02-05T23:34:12 1738798452

member of the gemini team here -- personally, i'd recommend directly using gemini vs the document understanding services for OCR & general docs understanding tasks. From our internal evals gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.

joelhaus · 2025-02-06T03:00:37 1738810837

Could we connect offline about using Gemini instead of the doc ai custom extractor we currently use in production?

This sounds amazing & I'd love your input on our specific use case.

joelatoutboundin.com

ajcp · 2025-02-05T21:59:45 1738792785

GCP's Document AI service is now literally just a UI layer specific to document parsing use-cases backed by Gemini models. When we realized that we dumped it and just use Gemini directly.

xnx · 2025-02-05T21:35:34 1738791334

Your OCR vendor would be smart to replace their own system with Gemini.

TeMPOraL · 2025-02-06T09:30:32 1738834232

They will, and they'll still have a solid product to sell, because their value proposition isn't accurate OCR per se, but putting an SLA on it.

Reaching reliability with LLM OCR might involve some combination of multiple LLMs (and keeping track of how they change), perhaps mixed with old-school algorithms, and random sample reviews by humans. They can tune this pipeline however they need at their leisure to eke out extra accuracy, and then put written guarantees on top, and still be cheaper for you long-term.

_the_inflator · 2025-02-05T21:46:30 1738791990

With “Next generation, extremely sophisticated AI” to be precise, I would say. ;)

Marketing joke aside, maybe a hybrid approach could serve the vendor well. Best of both worlds if it reaps benefits or even have a look at hugging face for even more specialized aka better LLMs.

martingoodson · 2025-02-06T10:39:10 1738838350

I work in financial data and our customers would not accept 96% accuracy in the data points we supply. Maybe 99.96%.

For most use cases in financial services, accurate data is very important.

riku_iki · 2025-02-06T22:50:49 1738882249

so, what solution are you using to extract data with 99.96% accuracy?

llm_bot · 2025-02-10T14:26:20 1739197580

I'm curious to hear about your experience with this. Which solution were you using before (the one that took 12 minutes)? If it was a self-hosted solution, what hardware were you using? How does Gemini handle PDFs with an unknown schema, and how does it compare to other general PDF parsing tools like Amazon Textract or Azure Document Intelligence? In my initial test, tables and checkboxes weren't well recognized.

chewkokwah · 2025-02-09T07:11:11 1739085071

How about the comparison with traditional proprietary on-premise software like OmniPage or ABBYY or those listed below: https://en.wikipedia.org/wiki/Comparison_of_optical_characte...

eru · 2025-02-06T03:53:09 1738813989

> For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair".

I'm actually somewhat surprised Gemini didn't guess from context that LLC is much more likely?

I guess the OCR subsystem is intentionally conservative? (Though I'm sure you could do a second step on your end: take the output from the conservative OCR pass, send it through Gemini, and ask it to flag potential OCR problems? I bet that would flag most of them with very few false positives and false negatives.)

PaulHoule · 2025-02-06T16:48:08 1738860488

Where I work we've had great success at using LLMs to OCR paper documents that look like

https://static.foxnews.com/foxnews.com/content/uploads/2023/...

but were often written with typewriters long ago to get nice structured tabular output. Deals with text being split across lines and across pages just fine.

chickenWing · 2025-02-06T20:16:21 1738872981

It is cheaper now, but I wonder if it will continue to be cheaper when companies like Google and OpenAI decide they want to make a profit off of AI, instead of pouring billions of dollars of investment funds into it. By the time that happens, many of the specialized service providers will be out of business and Google will be free to jack up the price.

kbaker · 2025-02-06T20:23:04 1738873384

I use Claude through OpenRouter (with Aider), and was pretty amazed to see that it routes the requests during the same session almost round-robin through Amazon Bedrock, sometimes through Google Vertex, sometimes through Anthropic themselves, all of course using the same underlying model.

Literally whoever has the cheapest compute.

With the speed that AI models are improving these days, it seems like the 'moat' of a better model is only a few months before it is commoditized and goes to the cheapest provider.

mnky9800n · 2025-02-06T02:10:30 1738807830

What are the pdfs containing?

I’ve been wanting to build a system that ingests PDF reports that reference other types of data like images, CSV, etc., which can also be ingested, to ultimately build an analytics database from the stack of unsorted data and its metadata, but I have not found any time to do anything like that yet. What kind of tooling do you use to build your data pipelines?

eitally · 2025-02-06T05:23:40 1738819420

It's great to hear it's this good, and it makes sense since Google has had several years of experience creating document-type-specific OCR extractors as components of their Document AI product in Cloud. What's most heartening is to hear that the legwork they did for that set of solutions has made it into Gemini for consumers (and businesses).

RamblingCTO · 2025-02-06T06:37:00 1738823820

Successful document processing vendors do use LLMs already. I know this at least of Klippa. They have (apparently) fine-tuned models, prompts etc. The biggest issue with using LLMs directly is error handling, validation and "parameter drift"/randomness. This is the typical "I'll build it myself but worse" thing.

octonaut · 2025-02-06T03:09:28 1738811368

I'm interested to hear what your experience has been dealing with optional data. For example if the input pdf has fields which are sometimes not populated or nonexistent, is Gemini smart enough to leave those fields blank in the output schema? Usually the LLM tries to please you and makes up values here.

surfingdino · 2025-02-06T11:59:19 1738843159

You could ingest them with AWS Textract and have predictability and formatting in the format of your choice. Using LLMs for this is lazy and generates unpredictable and non-deterministic results.

faebi · 2025-02-06T07:09:36 1738825776

Did you try other vision models such as ChatGPT and Grok? I'm doing something similar but struggled to find good comparisons between the vision models in terms of OCR and document understanding.

andai · 2025-02-06T10:24:06 1738837446

If the documents have the same format, maybe you could include an example document in the prompt, so the boilerplate stuff (like LLC) gets handled properly.

lebimas · 2025-02-06T19:05:17 1738868717

You could probably take this a step further and pipe the OCR'ed text into Claude 3.5 Sonnet and get it to fix any OCR errors

bdd8f1df777b · 2025-02-06T01:40:19 1738806019

What if you prompt Gemini that mistaking LLC for IIC is a common mistake? Will Gemini auto correct it?

tomrod · 2025-02-06T01:41:53 1738806113

With lower temperature, it seems to work okay for me.

A _killer_ awesome thing it does too is allow code specification in the config instead of through repeated attempts at prompts.

ensocode · 2025-02-06T10:02:07 1738836127

Just to make sure: you are talking about your experiences with Gemini 1.5 Flash here, right?

mring33621 · 2025-02-06T16:29:44 1738859384

Hi! Any guesstimate for pages/minute from your Gemini OCR experience? Thanks!

depr · 2025-02-05T20:14:27 1738786467

So are you mostly processing PDFs with data? Or PDFs with just text, or images, graphs?

thelittleone · 2025-02-05T21:28:30 1738790910

Not the parent, but we process PDFs with text, tables, diagrams. Works well if the schema is properly defined.

_blk · 2025-02-06T00:59:11 1738803551

Is privacy a concern?

surfingdino · 2025-02-06T12:00:44 1738843244

Why would it be? Their only concern is IPO.

jack_pp · 2025-02-06T04:17:55 1738815475

In fintech I'd suspect the PDFs are public knowledge

cess11 · 2025-02-05T19:20:46 1738783246

What hardware are you using to run it?

kccqzy · 2025-02-05T19:47:57 1738784877

The Gemini model isn't open so it does not matter what hardware you have. You might have confused Gemini with Gemma.

cess11 · 2025-02-05T22:09:03 1738793343

OK, I see, pity. I'm interested in similar applications but in contexts where the material is proprietary and might contain PII.

rco8786 · 2025-02-06T13:17:08 1738847828

“LLC” to “IIC” is one thing. But wouldn’t that also make it just as easy to mistake something like “$100” for “$700”?

sensecall · 2025-02-05T21:03:13 1738789393

Out of interest, did you parse into any sort of defined schema/structure?

gnat · 2025-02-05T21:09:03 1738789743

Parent literally said so …

> Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

bionhoward · 2025-02-05T21:31:52 1738791112

The Gemini api has a customer noncompete, so it’s not an option for AI, what are you working on that doesn’t compete with AI?

B-Con · 2025-02-05T21:36:13 1738791373

You do realize most people aren't working on AI, right?

Also, OP mentioned fintech at the outset.

novaleaf · 2025-02-05T21:39:29 1738791569

what doesn't compete with ai?

llm_trw · 2025-02-05T21:27:31 1738790851

This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunking with location data and confidence scores of every part of the document to put as meta data into the RAG store.

I've built a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.
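
The stitching step described above might look roughly like this, using only the standard library. The `Region` fields, tag names, and attribute names are assumptions for illustration; the detector and OCR outputs are hard-coded stand-ins, not output from any real model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Region:
    kind: str              # category from the layout detector, e.g. "paragraph"
    bbox: tuple            # (x0, y0, x1, y1) in page pixels
    detection_conf: float  # confidence of the bounding box
    text: str              # OCR or table-model output for the box
    text_conf: float       # confidence of the extracted text


def regions_to_xml(page_number, regions):
    """Serialise detected regions as flat XML with confidence metadata."""
    page = ET.Element("page", number=str(page_number))
    for region in regions:
        node = ET.SubElement(
            page,
            region.kind,
            bbox=",".join(str(v) for v in region.bbox),
            detection_conf=f"{region.detection_conf:.2f}",
            text_conf=f"{region.text_conf:.2f}",
        )
        node.text = region.text
    return ET.tostring(page, encoding="unicode")


# Hard-coded stand-ins for detector and OCR results.
regions = [
    Region("paragraph", (10, 20, 300, 80), 0.97, "Invoice total: $30.28", 0.91),
    Region("table", (10, 100, 300, 240), 0.88, "Item | Qty", 0.76),
]
xml_text = regions_to_xml(1, regions)
print(xml_text)
```

Because every element carries its bounding box and both confidence values, a later step can select only the elements it wants to send to an LLM, or route low-confidence ones to review.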

ajcp · 2025-02-05T22:17:32 1738793852

Not sure what service you're basing your calculation on but with Gemini I've processed 10,000,000+ shipping documents (PDF and PNGs) of every conceivable layout in one month at under $1000 and an accuracy rate of between 80-82% (humans were at 66%).

The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database

Just to get sick with it we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating to rewrite its own instructions on how to extract the information and then feed it back into itself. We didn't see any improvement in accuracy, but it was still fun to do.

llm_trw · 2025-02-05T23:34:58 1738798498

>Not sure what service you're basing your calculation on but with Gemini

The table of costs in the blog post. At 500,000 pages per day the hardware fixed cost overcomes the software variable cost at day 240 and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed to. Compute utilization was between 5 and 10%, which means that it's future-proof for the next 5 years at the rate at which the data source was growing.

    | Model                       | Pages per Dollar |
    |-----------------------------+------------------|
    | Gemini 2.0 Flash            | ≈ 6,000          |
    | Gemini 2.0 Flash Lite       | ≈ 12,000*        |
    | Gemini 1.5 Flash            | ≈ 10,000         |
    | AWS Textract                | ≈ 1,000          |
    | Gemini 1.5 Pro              | ≈ 700            |
    | OpenAI 4o-mini              | ≈ 450            |
    | LlamaParse                  | ≈ 300            |
    | OpenAI 4o                   | ≈ 200            |
    | Anthropic claude-3-5-sonnet | ≈ 100            |
    | Reducto                     | ≈ 100            |
    | Chunkr                      | ≈ 100            |
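
The break-even figure can be checked with a few lines of Python. This sketch assumes the comparison is against the Gemini 2.0 Flash row (≈ 6,000 pages per dollar), which appears to be the row the day-240 figure is based on, and it ignores power, maintenance, and staffing costs for the local machine.

```python
def daily_cloud_cost(pages_per_day, pages_per_dollar):
    """Variable cost per day of sending every page to a hosted model."""
    return pages_per_day / pages_per_dollar


def break_even_days(hardware_cost, pages_per_day, pages_per_dollar):
    """Days until cumulative cloud spend equals the one-off hardware cost."""
    return hardware_cost / daily_cloud_cost(pages_per_day, pages_per_dollar)


cloud_per_day = daily_cloud_cost(500_000, 6_000)
days = break_even_days(20_000, 500_000, 6_000)
print(f"cloud cost per day: ${cloud_per_day:.2f}")
print(f"break-even after about {days:.0f} days")
```

With these inputs the cloud cost comes to roughly $83 per day and the $20k machine pays for itself at about day 240.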

There is also the fact that it's _completely_ local. Which meant we could throw in every proprietary data source that couldn't leave the company at it.

>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database

Each company should build tools which match the skill level of their developers. If you're not comfortable training models locally, with all that entails, off-the-shelf solutions allow companies to punch way above their weight class in their industry.

serjester · 2025-02-05T23:55:02 1738799702

That assumes that you're able to find a model that can match Gemini's performance - I haven't come across anything that comes close (although hopefully that changes).

jeswin · 2025-02-06T14:08:37 1738850917

Nice article, mirrors my experience. Last year (around when multimodal 3.5 Sonnet launched), I had run a sizeable number of PDFs through it. Accuracy was remarkably high (99%-ish), whereas GPT was just unusable for this purpose.

cpursley · 2025-02-05T22:27:00 1738794420

Very cool! How are you storing it to a database - vectors? What do you do with the extracted data (in terms of being able to pull it up via some query system)?

ajcp · 2025-02-05T22:51:15 1738795875

In this use-case the customer just wanted data not currently in the warehouse inventory management system captured, so here we converted a JSON response to a classic table row schema (where 1 row = 1 document) and now boom, shipping data!

However we do very much recommend storing the raw model responses for audit and then at least as vector embeddings to orient the expectation that the data will need to be utilized for vector search and RAG. Kind of like "while we're here why don't we do what you're going to want to do at some point, even if it's not your use-case now..."
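
A minimal sketch of the "1 row = 1 document" idea, using SQLite from the Python standard library. The table name, the column names, and the shape of the JSON response are hypothetical; the raw response is stored alongside the row for audit, as suggested above. Vector embeddings are left out of the sketch.

```python
import json
import sqlite3

# Hypothetical fields expected in the model's JSON response.
FIELDS = ("shipper", "consignee", "weight_kg")


def store_response(conn, document_id, response_text):
    """Flatten one JSON response into one table row and keep the raw text."""
    data = json.loads(response_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shipping_docs ("
        "document_id TEXT PRIMARY KEY, shipper TEXT, consignee TEXT, "
        "weight_kg REAL, raw_response TEXT)"
    )
    # Missing fields become NULL rather than raising.
    row = (document_id, *(data.get(f) for f in FIELDS), response_text)
    conn.execute("INSERT INTO shipping_docs VALUES (?, ?, ?, ?, ?)", row)
    conn.commit()
```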

rofl123 · 2025-02-06T15:00:05 1738854005

> Kind of like "while we're here why don't we do what you're going to want to do at some point, even if it's not your use-case now..."

wow, this is so bad. why do it now and introduce complexity and debt if you can do it later when you actually need it? you are just riding the hype wave and trying to get most out of it but that's fine.

ajcp · 2025-02-06T18:13:15 1738865595

> why do it now and introduce complexity and debt if you can do it later when you actually need it?

The same reason I don't wait until it snows to buy snowboots. I know my environment, topography, scale, risk-profile, and costs, and can conceive of innumerable use-cases for when they will be necessary, even if it's only May, when snowboots happen to be on sale ;) What's a little closet space and the burden of locking my door when I leave the house in the interim?

svieira · 2025-02-06T19:42:36 1738870956

> [with] an accuracy rate of between 80-82% (humans were at 66%)

Was this human-verified in some way? If not, how did you establish the facts-on-the-ground about accuracy?

ajcp · 2025-02-06T22:12:35 1738879955

Yup, unfortunately the only way to know how good an AI is at anything is to do it the same way you'd do with a human: build a test that you know the answers to already. That's also why the accuracy evaluation was by far the most time-intensive part of the development pipeline, as we had to manually build a "ground-truth" dataset that we could evaluate the AI against.
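
Scoring against a ground-truth set can be as simple as the sketch below. It assumes accuracy means exact match per field, which is only one possible definition and not necessarily the one used here; the document IDs and field names in the usage are made up.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of ground-truth fields the model reproduced exactly.

    Both arguments map document id -> {field name: value}.
    A missing document or missing field counts as wrong.
    """
    correct = 0
    total = 0
    for doc_id, expected in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field, value in expected.items():
            total += 1
            if predicted.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


truth = {"a": {"name": "Acme LLC", "total": "10.00"},
         "b": {"name": "Globex", "total": "5.00"}}
preds = {"a": {"name": "Acme IIC", "total": "10.00"},
         "b": {"name": "Globex", "total": "5.00"}}
print(field_accuracy(preds, truth))  # 3 of 4 fields match
```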

jeswin · 2025-02-06T14:02:08 1738850528

I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.

In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.

Recently someone posted this on HN, it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/

> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.

> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1000, one might prefer Gemini/other over spending engineering time. There are many apps where processing a single doc is say $10 in revenue. You don't care about OCR costs.

> I've built a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.

The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.

metadat · 2025-02-06T16:49:01 1738860541

Related discussion:

AI founders will learn the bitter lesson

https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments

The HN discussion contains a lot of interesting ideas, thanks for the pointer!

llm_trw · 2025-02-06T22:56:54 1738882614

You're making an even less charitable set of assumptions:

1). I'm incompetent enough to ignore publicly available table benchmarks.

2). I'm incompetent enough to never look at poor quality data.

3). I'm incompetent enough to not create a validation dataset for all models that were available.

Needless to say you're wrong on all three.

My day rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.

pkkkzip · 2025-02-06T23:04:04 1738883044

whoa, this is a really aggressive response. No one is calling you incompetent, but rather criticizing your assumptions.

> My day rate is $400 + taxes per hour if you want to be run through each point

Great, thanks for sharing.

danielparsons · 2025-02-13T02:07:03 1739412423

bragging about billing $400 an hour LOL

vikp · 2025-02-05T23:01:51 1738796511

Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.

I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.

llm_trw · 2025-02-05T23:46:21 1738799181

I used sxml [0] unironically in this project extensively.

The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.

[0] https://en.wikipedia.org/wiki/SXML

cma · 2025-02-06T08:55:13 1738832113

Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?

hackernewds · 2025-02-06T08:43:37 1738831417

It's funny you astroturf your own project in a thread where another is presenting tangential info about their own

alemos · 2025-02-06T09:28:24 1738834104

what does marker add on top of docling?

vikp · 2025-02-06T16:34:40 1738859680

Docling is a great project, happy to see more people building in the space.

Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:

  - We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
  - We run an ordering model, so ordering is better for docs where the PDF order is bad
  - OCR is a lot better, we train our own model, surya - https://github.com/VikParuchuri/surya
  - References and links
  - Better equation conversion (soon including inline)

anon373839 · 2025-02-06T04:07:11 1738814831

This is a great comment. I will mention another benefit to this approach: the same pipeline works for PDFs that are digital-native and don't require OCR. After the object detection step, you collect the text directly from within the bounding boxes, and the text is error-free. Using Gemini means that you give this up.
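
For the digital-native case, collecting text inside a detected box might look like the sketch below. It assumes a PDF library has already returned the embedded words as `(x0, y0, x1, y1, text)` tuples; that tuple layout and the reading-order sort are simplifications for illustration.

```python
def text_in_box(words, box):
    """Join the embedded words whose centre falls inside a bounding box.

    words: iterable of (x0, y0, x1, y1, text) tuples from the PDF text layer
    box:   (x0, y0, x1, y1) from the object detection step
    """
    bx0, by0, bx1, by1 = box
    inside = [
        w for w in words
        if bx0 <= (w[0] + w[2]) / 2 <= bx1 and by0 <= (w[1] + w[3]) / 2 <= by1
    ]
    # Naive reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w[1], w[0]))
    return " ".join(w[4] for w in inside)
```

Because the words come straight from the file's text layer, nothing is recognized from pixels, so there is no OCR error to propagate.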

siva7 · 2025-02-06T03:40:43 1738813243

You're describing yesterday's world. With the advancement of AI, there is no need for any of these many steps and stages of OCR anymore. There is no need for XML in your pipeline because Markdown is now equally suited for machine consumption by AI models.

llm_trw · 2025-02-06T06:01:23 1738821683

The results we got 18 months ago are still better than the current gemini benchmarks at a fraction of the cost.

As for markdown, great. Now how do you encode the metadata about the confidence of the model that the text says what it thinks it says? Because xml has this lovely thing called attributes that lets you keep a provenance record, without a second database, that's also readable by the llm.
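
To make the point about attributes concrete, here is a small Python sketch that reads confidence back out of the markup and flags doubtful spans. The sample document, the tag name, and the `content_conf` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline output: same text region, two OCR readings.
SAMPLE = """<page number="1">
  <text bbox="40,60,300,80" content_conf="0.97">Acme Holdings LLC</text>
  <text bbox="40,90,300,110" content_conf="0.41">Acme Holdings IIC</text>
</page>"""


def low_confidence_spans(xml_text, threshold):
    """Return (bbox, text) for every <text> element below the threshold."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("bbox"), el.text)
        for el in root.iter("text")
        if float(el.get("content_conf")) < threshold
    ]


print(low_confidence_spans(SAMPLE, 0.8))
```

The same attributes travel with the text when the XML is passed to an LLM, so the provenance stays attached to each span.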

JohnKemeny · 2025-02-06T07:39:26 1738827566

Just commenting here so that I can find my way back to this comment later. You perfectly captured the AI hype in one small paragraph.

fransje26 · 2025-02-06T10:04:35 1738836275

Hey, why settle for yesteryear's world, with better accuracy, lower costs and local deployment, if you can use today's new shiny tool, reinvent the wheel, put everything in the cloud, and get hallucination for free..

BenGosub · 2025-02-07T09:40:33 1738921233

What are the tools from yesterday's world you are referring to? I've had issues with the base Python library in PDF parsing, only some state of the art tools were able to parse the information correctly.

raincole · 2025-02-06T11:12:38 1738840358

Just commenting here to say the GP is spot on.

If you already have a highly optimized pipeline built yesterday, then sure keep using it.

But if you start dealing with PDF today, just use Gemini. Use the most human readable formats you can find because we know AI will be optimized on understanding that. Don't even think about "stitching XML files" blahblah.