
Before commenting to ask why they don't just use LLMs, please note that the article specifically calls out that they do, but it's not always a viable solution:

> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.

The document at the top is likely an especially easy document to read precisely because it's meant to be the hook to get people to sign up and get started. It isn't going to be representative of the full breadth of documents that the National Archives want people to go through.




OK, fair enough, but can you find one in this article that's hard for an LLM? The gnarliest one I saw, 4o handled instantly, and I went back and looked carefully at the image and the text and I'm sold.

Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

Later

I signed up, went to the current missions, and they all seem to be post-1900 and all typeset. They're blurry, but 4o cuts through them like a hot knife through butter.


My parents have saved letters from their parents which are written in cursive but in two perpendicular layers. Meaning the writing goes horizontally in rows and then when they got to the end of the page it was turned 90 degrees and continued right on top of what was already there for the whole page. This was apparently to save paper and postage. It looks like an unintelligible jumble but my mother can actually decipher it. Maybe that’s what the LLMs are having trouble with?

Edit: apparently it’s called cross writing [1]

1: https://highshrink.com/2018/01/02/criss-cross-letters/


Are they having trouble? You can sign up right now and get tasks from the archive that seem trivial for 4o (by which I mean: feed a screenshot to 4o, get a transcription, and spot check it).


Did you actually check it? Sonnet 3.5 generates text that seems legitimate and generally correct, but misreads important details. LLMs are particularly deceptive because they will be internally consistent - they'll reuse the same incorrect name in both places and will hallucinate information that seems legit, but in fact is just made-up.

You don't use an LLM but other transformer-based OCR models like TrOCR, which have very low CER and WER rates.
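For context, CER and WER (character and word error rate) are edit-distance metrics against a ground-truth transcription. A minimal sketch of how they're computed (illustrative only; TrOCR itself is a HuggingFace model and isn't shown here):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance; works on
    # strings (for CER) or lists of words (for WER).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    # Character error rate: edits needed per reference character.
    return edit_distance(hypothesis, reference) / len(reference)

def wer(hypothesis, reference):
    # Word error rate: same idea at the word level.
    return edit_distance(hypothesis.split(), reference.split()) / len(reference.split())
```

So a model that reads "kitten" where the page says "sitting" scores a CER of 3/7 on that word: three edits over seven reference characters.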

Just have version control, and allow randomized spot checks with experts to have a known error rate.
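A sketch of what that spot-check error rate could look like, assuming experts mark each randomly sampled transcription as correct or flawed (the 95% interval uses a normal approximation; all numbers are illustrative):

```python
import math

def estimate_error_rate(expert_verdicts):
    # expert_verdicts: booleans, True = expert found an error in that sample.
    n = len(expert_verdicts)
    p = sum(expert_verdicts) / n                  # observed error fraction
    margin = 1.96 * math.sqrt(p * (1 - p) / n)    # ~95% normal-approx interval
    return p, margin

# e.g. 500 randomly sampled transcriptions, 20 flagged by experts
p, m = estimate_error_rate([True] * 20 + [False] * 480)
print(f"estimated error rate: {p:.1%} +/- {m:.1%}")
```

Publishing that number alongside the archive would give readers a calibrated sense of how much to trust any one transcription.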

> Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

Possibly for the reason that came up in your other post: you mentioned that you spot checked the result.

Back when I was in historical research, and occasionally involved in transcription projects, the standard was 2-3 independent transcriptions per document.

Maybe the National Archive will pass documents to an LLM and use the output as 1 of their 2-3 transcriptions. It could reduce how many duplicate transcriptions are done by humans. But I'll be surprised if they jump to accepting spot checked LLM output anytime soon.


You get that I'm not saying they should just commit LLM outputs as transcriptions, right?

My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative. And maybe some AI-skeptic protectionist sentiments from the professional archivists. Seems like it could change with time though.


> My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative

I expect that's a common theme from companies like that, yet I don't think they understand the issue they think they have there.

Why not have the LLMs do as much work as possible and have humans review and put their own name on it? Do you think they need to just trust and publish the output of the LLM wholeheartedly?

I think too many people saw what a few idiot lawyers did last year and closed the book on LLM usage.


> Why not have the LLMs do as much work as possible and have humans review and put their own name on it?

That's not a good way to improve on the accuracy of the LLM. Humans reviewing work that is 95% accurate are mostly just going to rubber-stamp whatever you show them. This is equally a problem for humans reviewing the work of other humans.

What you actually want, if you're worried about accuracy, is to do the same work multiple times independently and then compare results.
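That comparison step can be largely mechanical: diff the independent transcriptions and send only the disagreements to an adjudicator. A rough sketch (the sample strings are hypothetical):

```python
import difflib

def disagreements(t1, t2):
    # Word-level diff of two independent transcriptions; returns the
    # spans where they differ, for a third reader to adjudicate.
    w1, w2 = t1.split(), t2.split()
    sm = difflib.SequenceMatcher(None, w1, w2)
    return [(" ".join(w1[i1:i2]), " ".join(w2[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

a = "the said John Hopper is in reduced and indigent circumstances"
b = "the said John Hooper is in reduced and indigent circumstances"
print(disagreements(a, b))  # only "Hopper" vs "Hooper" needs review
```

Everything the transcribers agree on is accepted; the expert's time goes only to the contested words.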


The incident with the lawyers just highlighted the fundamental problem with LLMs and AI in general. They can't be trusted for anything serious. Worse, they give the appearance of being correct, which lulls human "checkers" into complacency. Total dumpster fire.

Instead of thinking about this as an all-or-nothing outcome, consider how this might work if they were made accessible with LLMs, and then you used randomized spot checks with experts to create a clear and public error rate. Then, when people see mistakes they can fix them.

I’m trying to do this for old Latin books at the Embassy of the Free Mind in Amsterdam. So many of the books have never been digitized, let alone OCRd or translated. There is a huge amount of work to be done to make these works accessible.

LLMs won’t make it perfect. But isn’t perfect the enemy of the good? If we make it an ongoing project where the source image material is easily accessible (unlike in a normal published translation, where you just have to trust the translator), then the knowledge and understanding can improve over time.

This approach also has the benefit of training readers not to believe everything they read — but to question it and try to get directly at the source. I think that’s a beautiful outcome.


These kinds of ideas just sound to me like "Suppose you had to use broken technology X. How do you make it work?"

I don't think you're wrong, but that's because there are no alternative technologies. The only alternative is leaving much more of the archive inaccessible for a much longer period, possibly forever.

> The only alternative is leaving much more of the archive inaccessible for a much longer period, possibly forever.

No, the alternative is volunteers transcribing. Like this project.

Not every problem needs a computer.


In a world that keeps people working, entertained, or sleeping, the total number of volunteers out there is likely pretty small and easily burned out.

Volunteers transcribing leaves much more of the archive inaccessible for a much longer period.

The article is from The Smithsonian. The actual project is with the National Archives.

I'm doing some genealogy work right now on my family's old papers covering the time period from recent years back to the late 17th century. Handwriting styles changed a lot over the centuries and individuals can definitely be identified by their personal cursive style of writing and you can see their handwriting change as they aged.

Then you have the problem that some of these ancestors not only had terrible penmanship but also spelled multi-syllabic words phonetically since they likely were barely educated kids who spent more time when they were young working on the farm or ranch instead of attending school where they would've learned how to spell correctly.

I don't know whether your LLM can handle English words spelled phonetically, written in cursive by an individual who had no consistency in forming letters. It is clear after reading a lot of correspondence from this person that they ignored things that didn't seem important in the moment, like dotting i's, crossing t's, or forming tails on g's, p's, and j's, or even beginning letters consistently, since they switched between cursive and block letters within a sentence, maybe while they paused to clarify their thoughts. I don't know, but it is fascinating to take a walk through life with someone you'll never meet, and to discover that many of the things that seemed awesome to you as a kid were also awesome to them, and that their life had so many challenges that our generations will never need to endure.

Some of my people have the most beautiful flowing cursive handwriting that looks like the cursive that I was taught in grade school. Others have the most beautiful flowing cursive with custom flourishes and adornments that make their handwriting instantly recognizable and easy to read once you understand their style.

I think there are plenty of edge cases where LLMs will take a drunkard's walk through the scribble and spit out gibberish.

I'm reminded of an old joke though.

Ronald Reagan woke up one snowy Washington, DC morning and took a look out of the window to admire the new-fallen snow. He enjoyed the beautiful scene laid out before him until he saw tracks in the snow below his window and a message, obviously written in piss, that said: "Reagan sucks".

He dispatched the Secret Service to the site where samples were taken of the affected snow and photos of the tracks of two people were made.

After an investigation he receives a call from the Secret Service agent in charge who tells him he has some good news and some bad news for him.

The good news is that they know who pissed the message. It was George HW Bush, his Vice President. The bad news is that it was Nancy's handwriting.


I don't know about this project, but I can easily find thousands of images that gpt-4o can't read, but a human expert can. It can do typed text excellently, antika-style cursive if it's very neat, and kurrent-style cursive never.

For straightforward reasons, I am commenting on this project, not the space of all possible projects. I did try, once, to get 4o to decode the Zodiac Killer's message. It didn't work.

The point is that gpt-4o can't read cursive very well. The chain of thought reasoning doesn't seem to help at all. It's a thing the model doesn't generalize to, that's relevant to this discussion.

I'm sure you can find something it does a bad job on (have you, though?) but it's done swimmingly on some really horrendous 18th century examples I've found. Among other obvious things, it translates the Revolutionary War cursive that leads off this article.

I think people may be kidding themselves a little bit about this stuff.


I wish! I do genealogy, I would love automated transcription of historical documents. Have I? Oh yes! It wasn't rhetorical that I could find you thousands of pages. Here's one book if you want a link: https://media.digitalarkivet.no/view/166325/4

But every time I find something that I think "This one is surely easy enough!" it's usually wrong. US Revolutionary War cursive is on the easy end; it's quite similar to modern cursive, and the training data in the datasets probably includes a lot like it, if not those very documents - historians who test this complain about overfitting. Which I believe, because as I said, I see that it's bad at generalizing.


Real quick, how long do you think chatgpto4 has existed? How long do you think the National Archive has been archiving?

It's 4o. The crowdsourced transcription project dates back to 2012. My comment is mostly on this article.

One that requires additional work beyond simply feeding the image into the model would be this example, which is a mix of barely legible handwritten cursive and an easy-to-read typed form. [0] Initially 4o just transcribes (successfully) the bottom half of the text and has to be prompted to attempt the top half, at which point it seems to at best summarize the text instead of giving a direct transcription. [1] In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of its "transcription" about "reduced and indigent circumstances".

[0] https://catalog.archives.gov/id/54921817?objectPage=8&object...

[1] Reproducing here since I cannot share the chat, as it has user-uploaded images. " The text in the top half of the image is handwritten and partially difficult to read due to its cursive style and some smudging. Here's my best transcription attempt for the top section:

...resident within four? years, swears and says that the name of the John Hopper mentioned in the foregoing declaration is the same person, and he verily believes the facts as stated in the declaration are true.

He further swears that the said John Hopper is in reduced and indigent circumstances and requires the aid of his country.

The declarant further swears he has no evidence now in his power of service, except the statement of Capt. (illegible name), as to his reduced circumstances ...

Sworn to before me, this day...

Some parts remain unclear due to the handwriting, but let me know if you'd like me to attempt further clarification on specific sections!"


> this example, which is a mix of barely legible handwritten cursive and an easy-to-read typed form.

> In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of its "transcription" about "reduced and indigent circumstances".

What typed form? What typed text? That image is a single handwritten page, and the writing is quite clean, not "barely legible".† The file related to John Hopper appears to be 59 pages, and some of them are typed, but they're all separate images.

Are you trying to process all 59 pages at once? Why?

I should note that transcription is an excellent use of an LLM in the sense of a language model, as opposed to an "LLM" in the sense of several different pieces of software hooked together in cryptic ways. It would be a lot more useful, for this task, to have direct access to the language model backing 4o than to have access to a chatbot prompt that intermediates between you and the model.

† My biggest problems in reading the page: Cursive n and u are often identical glyphs (both written и), leading me to read "Ind." as "Jud."; and I had trouble with the "roster" at the bottom of the page. What felt weirdest about that was that the crossbar of the "t" is positioned well above the top of the stem, but that can't actually be what tripped me up, because on further review it's a common feature of the author's handwriting that I didn't even notice until I got to the very end of the letter. It's even true in the earlier instance of "Roster" higher up on the page. So my best guess is that the "os" doesn't look right to me.

I misread 1758 as 1958, too, but hopefully (a) that kind of thing wears off as you get used to reading documents about the Revolutionary War; and (b) it's a red flag when someone who died in 1838 was born in 1958 according to a letter written in 1935.


What? I pulled one page out of the image set and tried to get GPT 4o to transcribe it. I wasn't just using the easy example from the original article, it's an easy example to draw people into the idea of participating in the volunteer effort. If it were one of the inscrutable documents people would be more likely to be put off the effort.

Did the link in my comment not take you to a single page (I just tested it in incognito mode too..)? For me it's this image [0] and no, I tried just this one page and it didn't do well. If you can get it to work, let me know the prompt; it was late for me.

[0] https://s3.amazonaws.com/NARAprodstorage/opastorage/live/17/...


No, if I follow the link in your comment, I get a very different image, this one: https://s3.amazonaws.com/NARAprodstorage/opastorage/live/17/...

(page 8 of "Revolutionary War Pension and Bounty Land Warrant Application File W. 7785, John Hopper, N.C.")

I agree that your description of the image the link shows you, which appears to be page 52 of the same file, makes sense. I can read ... some of the handwritten words. None of the long ones.


Very very strange. It's been giving me the same image for the whole time including over multiple devices. The one you link is #52 for me.

Anyways yes, that handwritten text is an example where LLMs just can't hack it and people seem to be able to. There's a pretty thorough transcript of the upper handwritten portion of the page I was referencing available from a user. It's a great example of why you can't just throw an LLM at problems like this. At best they're a tool people can use to transcribe loads of them quickly, but it still needs to be hand-checked for accuracy, completeness, and relevance.


> Like if this is a crowdsourcing project...

I'm confused by what you're asking. Are you asking me to like (upvote) your comment if this is a crowdsourcing project? Don't we already know it is a crowdsourcing project?


The use of the word “like” here could be replaced with the word “so”

“So if this is a crowdsourcing project…”

"Like" is serving as an indication that someone else approximately said the phrase it introduced, in a way often associated with the "Valley Girl" social dialect but regularly seen outside of it.

https://en.wikipedia.org/wiki/Like#As_a_colloquial_quotative


> The use of the word “like” here could be replaced with the word “so”

Correct, but that's not a quotative use of the word. It's a discourse particle. You want to link one subsection down, like as a discourse particle.

https://en.wikipedia.org/wiki/Like#As_a_discourse_particle,_...


Determining whether the latest off-the-shelf LLMs are good enough should be straightforward because of this:

“Some participants have dedicated years of their lives to the program—like Alex Smith, a retiree from Pennsylvania. Over nine years, he transcribed more than 100,000 documents”

Have different LLMs transcribe those same documents and compare to see whether the human or the machine is more accurate, and by how much.


This is not an LLM problem. It was solved years ago via OCR. Worldwide, postal services long ago deployed OCR to read handwritten addresses. And there was an entire industry of OCR-based data entry services, much of it translating the chicken scratch of doctors' handwriting on medical forms, long before LLMs were a thing.

It was never “solved” unless you can point me to OCR software that is 100% accurate. You can take 5 seconds to google “ocr with llm” and find tons of articles explaining how LLMs can enhance OCR. Here’s an example:

https://trustdecision.com/resources/blog/revolutionizing-ocr...


By that standard, no problem has ever been solved by anyone. I prefer to believe that a great many everyday tech issues were in fact tackled and solved in the past by people who had never even heard of LLMs. So too many things were done in finance long before blockchains solved everything for us.

OCR is very bad.

As an example look at subtitle rips for DVD and Blu-ray. The discs store them as images of rendered computer text. A popular format for rippers is SRT, where it will be stored as utf-8 and rendered by the player. So when you rip subtitles, there's an OCR step.

These are computer rendered text in a small handful of fonts. And decent OCR still chokes on it often.


From the article I linked:

“Our internal tests reveal a leap in accuracy from 98.97% to 99.56%, while customer test sets have shown an increase from 95.61% to 98.02%. In some cases where the document photos are unclear or poorly formatted, the accuracy could be improved by over 20% to 30%.”


In my experience the chatbots have bumped transcription accuracy quite a bit. (Of course, it's possible I just don't have access to the best-in-class OCR software I should be comparing against).

(I always go over the transcript by hand, but I'd have to do that with OCR anyway).


OCR is not perfect. And therefore it is not "solved".

That definition, solved=perfect, is not what sandworm meant and it's an irrelevant definition to this conversation because it's an impossible standard.

Insisting we switch to that definition is just being unproductive and unhelpful. And it's pure semantics because you know what they meant.


Not really, because this entire post is about that last fraction of a %.

It's not, because then they wouldn't want humans, because humans can't do 100% either.

That's only true if the x% humans can't do is the same x% that OCR can't do.

I know it matters what percent humans can do. But specifically "that last fraction of a percent" is in comparison to 100, not to humans. The argument I was replying to was about perfection, and rejecting anything short of it. Comparing to humans is a much better idea, and removes the entire argument of "OCR literally can't be that good so the problem isn't solved".

point me to handwriting that is 100% legible...

If 100% is your standard, good luck solving anything ever.


Most handwriting is legible to its owner. This would indicate that there is enough consistency within a person's writing style to differentiate letters, etc., even if certain assumptions about resemblance to any standard may not hold. I wonder if there are modern OCR methods that incorporate old code-breaking techniques like frequency analysis.

> Most handwriting is legible to its owner.

Not necessarily, I'd be surprised if I could fully understand my old handwritten notes from when I was in school (years ago), since I've always had messy handwriting and no longer have the context in each subject matter to guess.

LLMs could help in some of those cases, since it would have knowledge of history/chemistry/etc. and could fill in the blanks better than I could at this point. Though the hallucinations would no doubt outweigh it.


I think OP is saying there is always scope for improvement until it is 100%, not that it's 100% or bust.

LLMs improve significantly on state of the art OCR. LLMs can do contextual analysis. If I were transcribing these by hand, I would probably feed them through OCR + an LLM, then ask an LLM to compare my transcription to its transcription and comment on any discrepancies. I wouldn't be surprised if I offered minimal improvement over just having the LLM do it though.

Why assume that OCR does not involve context? OCR systems regularly use context. It doesn't require an LLM for a machine reading medical forms to generate and use a list of the hundred most common drugs appearing in a particular place on a specific form. And an OCR system reading envelopes can be directed to prefer numbers or letters depending on what it expects.

Even if LLMs can push a 99.9% accuracy to 99.99, at least an OCR-based system can be audited. Ask an OCR vendor why the machine confused "Vancouver WA" and "Vancouver CA" and one can get a solid answer based in repeated testing. Ask an LLM vendor why and, at best, you'll get a shrug and some line citing how much better they were in all the other situations.


Are you guessing, or are there results somewhere that demonstrate how LLMs improve OCR in practical applications?

Someone linked this above

https://trustdecision.com/resources/blog/revolutionizing-ocr...

> Our internal tests reveal a leap in accuracy from 98.97% to 99.56%, while customer test sets have shown an increase from 95.61% to 98.02%. In some cases where the document photos are unclear or poorly formatted, the accuracy could be improved by over 20% to 30%.

While a small percentage increase, when applied to massive amounts of text it’s a big deal.


It's not a small percentage. The moment you OCR a book, you'll end up with hundreds to thousands of errors.
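Back-of-the-envelope version of that claim, assuming a book on the order of 500,000 characters (the book length is an assumption; the accuracy figures are the ones quoted above):

```python
chars_per_book = 500_000  # assumed typical book length in characters

for accuracy in (0.9561, 0.9802, 0.9897, 0.9956):
    errors = round(chars_per_book * (1 - accuracy))
    print(f"{accuracy:.2%} accurate -> ~{errors:,} character errors per book")
```

Even the improved 99.56% figure leaves a couple of thousand characters to correct per book; at 95.61% it's over twenty thousand.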

For the addresses it might be a bit easier because they are a lot more structured and, in theory, the vocabulary is a lot more limited. I'm less sure about medical notes, although I'd suspect that there are fairly common things they are likely to say.

Looking at the (admittedly single) example from the National Archives, it seems a bit more open-ended than perhaps the other two examples. It's not impossible that LLMs could help with this.


Yes, but there was usually a fall-back mechanism where an unrecognized address would be shown on a screen to an employee who would type it so that it could then be inkjetted with a barcode.

Fun fact: convolutional neural networks developed by Yann LeCun were instrumental in that rollout!

Agree. Sounds like not wanting to let go of a legacy.

Something about extraordinary claims and extraordinary evidence? The evidence presented, a seemingly easily transcribed image, is hardly persuasive.


Some are significantly harder to read. I took the page below and tried to get GPT 4o to transcribe it, and it basically couldn't. I'm not going to sit and prompt-hack for ages to see if it can, but it seems unable to tackle the handwritten text at the top. When I first just fed it the image and asked for a transcription, it only (but successfully) read the bottom portion; prompted for a transcription of the top, it dropped into more of a summary of the whole document, mainly pulling some phrases from the bottom text. (Sadly I can't share the chat, but I copied its reply out in a comment upthread.) [0]

It was more successful at a few others I tried but it's still a task that requires manual processing like a lot of LLM output to check for accuracy and prompt modification to get it to output what you need for some documents.

https://catalog.archives.gov/id/54921817?objectPage=8&object...

[0] https://news.ycombinator.com/item?id=42746490


Drives me crazy that they are saying "AI and OCR". It sucks that charlatans have occupied the field of "AI" so thoroughly now that OCR is considered something separate.

Still, the fact that they’re combining AI and human effort makes sense

High quality human transcriptions are the most valuable kind of training data


