Effective Text Extraction from Documents (PDFs)

Hi all,

Question for those involved with text extraction pipelines (or ETL pipelines for that matter):

In Julia, what works best for you when doing text extraction from PDFs?

In my case, I am simplifying the problem to PDFs that are single-column, written in English, and contain few or no images. I am currently trying out both Taro.jl and PDFIO.jl; so far I have found Taro.jl a bit easier to work with for extracting content. However, both packages struggle with whitespace, whether actual spaces between sentences or newline characters, so I end up with long strings of words concatenated together…

Any tips/tricks on text extraction that might help better with my processing?

Thank you!

~ tcp :deciduous_tree:

Not Julia, but a friend of mine recently did this and found that using Inkscape to convert to SVG and then parsing the SVG worked well. I think this is where they started: the pdf2svg.sh script in the scraperwiki/pdf2svg repository on GitHub.
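For what it's worth, the reason the SVG route can help with the missing-whitespace problem is that each text fragment keeps its x/y position, so spaces and line breaks can be rebuilt from coordinates. Here is a rough sketch of that idea in Python. It is not my friend's code: the function names `first_number` and `svg_text_lines` are made up for illustration, and it assumes the converter writes text as `<tspan>` elements with `x` and `y` attributes, which may not match what Inkscape produces for every PDF.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

SVG_NS = "{http://www.w3.org/2000/svg}"


def first_number(value, default=0.0):
    """Return the first number in an SVG coordinate list such as '10 15' or '10,15'."""
    parts = re.split(r"[,\s]+", (value or "").strip())
    return float(parts[0]) if parts[0] else default


def svg_text_lines(svg_source, y_tolerance=1.0):
    """Rebuild lines of text from <tspan> fragments using their positions.

    Assumption: every fragment carries its own x/y attributes. Fragments whose
    y values fall into the same bucket are treated as one line.
    """
    root = ET.fromstring(svg_source)
    rows = defaultdict(list)  # y bucket -> list of (x, text)
    for tspan in root.iter(SVG_NS + "tspan"):
        text = (tspan.text or "").strip()
        if not text:
            continue
        x = first_number(tspan.get("x"))
        y = first_number(tspan.get("y"))
        rows[round(y / y_tolerance)].append((x, text))

    lines = []
    for bucket in sorted(rows):          # SVG y grows downward: top line first
        fragments = sorted(rows[bucket])  # left to right within the line
        lines.append(" ".join(text for _, text in fragments))
    return lines
```

Calling `svg_text_lines(open("page.svg").read())` on one converted page (the file name is a placeholder) would give a list of lines with spaces restored between fragments. Real output may need a larger `y_tolerance` or per-glyph handling.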

There’s also pdftotext.
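A minimal sketch of calling it from a script, assuming the Poppler build of `pdftotext` is installed and on your PATH. The helper names `pdftotext_command` and `extract_text` are just for illustration. Passing `-` as the output file sends the text to stdout, and `-layout` asks it to preserve the physical layout, which may or may not help for single-column documents.

```python
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; '-' writes the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the original physical layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string.

    Raises CalledProcessError if pdftotext exits with an error, and
    FileNotFoundError if the executable is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

The same command line can be run from Julia with a command literal, so the Python wrapper is only one way to drive it.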