Nice PDF Scrapers for Code

anon92994695 · October 30, 2020, 1:51am

Problem statement: I have a *.pdf of ~1k lines of public domain FORTRAN code from the 80s in a journal article. It’s in a split page format, with a noticeable slant from someone hand scanning it for digitization. I want that code in plain text for archiving and maybe translation to Julia :).

Anyone know of “good” *.pdf converters, readers, or reliable free OCR tools for these tasks? I know a bit about OCR, having made one myself at some point, but there are dozens of tools out there now, from APIs to standalone doohickies. Anyone have a recommendation?

Oscar_Smith · October 30, 2020, 2:08am

If it was scanned, your best bet is converting it to PNG (by screenshot) and using OCR. The PDF spec is one of the worst specifications ever. I would be very surprised if there are any tools that correctly implement all of the PDF spec (and yes, I’m including Adobe in this).

anon92994695 · October 30, 2020, 6:12pm

Was a little worried this is where I’d end up - but yeah, it’s probably the way to go. Does anyone know what the best free OCR tool or OCR API has been over the past year or so? I used to use Tesseract, etc. I’ll do a little digging and report back if I find anything.

Tamas_Papp · October 31, 2020, 9:00am

Don’t. You can extract bitmaps directly — on Linux it’s pdfimages.

If it’s just 1K lines, I would do a rough OCR and then a manual review. OCR is fine for prose, but would make code unusable with a high probability. (FWIW, if it’s just 1K LOC and I need to read the code anyway, I would just sit down and type it in, or hire someone to do it).
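A minimal sketch of that workflow (extract the embedded scans, run a rough OCR pass, then review by hand), assuming Poppler's `pdfimages` and the `tesseract` command-line program are installed. The file names, the `scan` prefix, and the page segmentation mode are placeholders and guesses, not details from this thread. The two helper functions only build the command lines; nothing runs until `rough_ocr` is called.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, prefix, first_page=None, last_page=None):
    """Build a pdfimages call that dumps the embedded scans as PNG files."""
    cmd = ["pdfimages", "-png"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [str(pdf), str(prefix)]


def tesseract_cmd(image, out_base, psm=6):
    """Build a tesseract call; --psm 6 assumes one uniform block of text (a guess)."""
    return ["tesseract", str(image), str(out_base), "--psm", str(psm)]


def rough_ocr(pdf, workdir, first_page=None, last_page=None):
    """Extract page images, OCR each one, and return the .txt paths to review by hand."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf, workdir / "scan", first_page, last_page), check=True)
    outputs = []
    for image in sorted(workdir.glob("scan-*.png")):
        out_base = image.with_suffix("")  # tesseract appends .txt to this base name
        subprocess.run(tesseract_cmd(image, out_base), check=True)
        outputs.append(out_base.with_suffix(".txt"))
    return outputs
```

Calling `rough_ocr("article.pdf", "ocr_out")` would leave one text file per extracted image in the hypothetical `ocr_out` directory. For a split-page scan, a different segmentation mode or cropping each half first may work better; that would need experimenting.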

anon92994695 · October 31, 2020, 10:32am

I tried OCRFeeder (available on Ubuntu). Basically it’s a GUI wrapper for Tesseract. The results are pretty bad. It picks out the right characters reasonably well, but it garbles all the syntax. It’s also a fair bit of work to find the “sweet spot” for the PDF resolution that gives a decent OCR result.
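One way to take some of the trial and error out of the resolution hunt is to render the same pages at a few candidate resolutions and compare the OCR output side by side. A sketch, assuming Poppler's `pdftoppm` is available; the DPI values and file names are illustrative guesses, not settings anyone in this thread reported.

```python
def pdftoppm_cmd(pdf, prefix, dpi):
    """Build a pdftoppm call that renders the pages to grayscale PNG at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def resolution_sweep(pdf, dpis=(150, 200, 300, 400, 600)):
    """Return one render command per candidate DPI, each with its own output prefix."""
    return [pdftoppm_cmd(pdf, f"page-{dpi}dpi", dpi) for dpi in dpis]


# Print the commands so they can be inspected or pasted into a shell.
for cmd in resolution_sweep("article.pdf"):
    print(" ".join(cmd))
```

Each rendering can then go through the same OCR step, and whichever resolution produces the fewest garbled lines on a sample page is the one to use for the rest.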

I think in this case, Tamas, you are right. The best bet is just to use your hands/brain/$$$. It’s only 8 pages of code or something.

It does go to show that there is huge room for improvement in OCR, either to identify code and handle it appropriately or to treat code as a niche of its own.

Tamas_Papp · October 31, 2020, 10:44am

FWIW, I think that almost all old Fortran code that was not kept around and up to date is worse than useless.

I know it feels really nice when you discover that someone in the 1980s solved a problem for you, but it’s a trap. You dare not use it without understanding what it does, which is frequently more effort than it is worth, and then you might as well just reimplement from scratch. You will waste 90% of the time on some trick about reusing some array which the author thought was clever, but is probably irrelevant for your purposes and may even be a real bugfest. And of course it rarely (if ever) comes with unit tests, which are apparently for wimps: a real Fortran coder does not need them.

anon92994695 · October 31, 2020, 11:24am

Yeah, I grokked a few subroutines in the code. All I can say is yikes: 50-60 lines to see if a vector matches a row in an array to some tolerance, with triple-nested loops and wonky conditionals, for what could be 2 lines of idiomatic Julia. Fortran is a different beast entirely, that’s for sure.
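For comparison, the check described above (does a vector match some row of an array to within a tolerance?) fits in a few lines of a modern language. A sketch in Python rather than Julia, using an absolute tolerance as an assumption, since the thread does not say how the Fortran code measured closeness; `matching_rows` and `tol` are illustrative names.

```python
import math


def matching_rows(rows, vector, tol=1e-8):
    """Return the indices of rows whose entries all lie within tol of vector."""
    return [
        i
        for i, row in enumerate(rows)
        if len(row) == len(vector)
        and all(math.isclose(a, b, rel_tol=0.0, abs_tol=tol) for a, b in zip(row, vector))
    ]


table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 2.0, 3.0000001]]
print(matching_rows(table, [1.0, 2.0, 3.0], tol=1e-6))
```

Returning the indices rather than a yes/no answer keeps the function useful both for membership tests (`bool(matching_rows(...))`) and for finding which row matched.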

Tamas_Papp · October 31, 2020, 2:58pm

To be fair, code from the 1980s is probably Fortran 77, which precedes Julia by about 3.5 decades.

Which is also somewhat reassuring: scientific programming has come a long way, and our tools are much better than in the early days.

anon92994695 · November 2, 2020, 7:17pm

Fortran is still the king of what it is. I mean no disrespect.

But it turns out my solution to the problem was to not use OCR, and instead ask the community whether similar functionality already existed in our ecosystem. It does, so I got to save my brain from translating code and learn some stuff :).

That said - if any progress gets made in the area of image → code I’d love to hear about it. Tesseract was not suitable for these purposes.

ericphanson · November 2, 2020, 8:20pm

You can train Tesseract on custom corpora (there’s even a trained model for math called equ), so maybe one could train it for a particular language or to match a particular output style, like that of the journal, and it would do a bit better?

Tamas_Papp · November 3, 2020, 11:45am

I don’t think this is an application where people will invest resources — 99% of code that is not available electronically is probably dead for a good reason, and the rest you can just type in if really necessary.

As a sidenote, this topic recalls fond memories of “computer magazines” from my childhood, which had BASIC code for typing into your ZX-Spectrum or C64. You would spend an afternoon doing it, then it would do something super-simple. But it was an interesting learning experience.

Bardo · September 23, 2021, 6:09am

I can recommend Mathpix.

Export images and PDFs to LaTeX, DOCX, Overleaf, Markdown, Excel, ChemDraw and more, with our AI powered document conversion technology.
Save time preparing scientific documents with Mathpix Snip.

Used it to copy linear algebra examples from articles.

liuyxpp · September 23, 2021, 6:28am

I’d say 1k lines of code is not that much. Just type it in yourself and review the code at the same time. It is a good way to learn things. A couple of hours is worth it!
