Effective Text Extraction from Documents (PDFs)

TheCedarPrince · February 9, 2021, 2:24am

Hi all,

Question for those involved with text extraction pipelines (or ETL pipelines for that matter):

In Julia, what works best for you when doing text extraction from PDFs?

In my case, I am simplifying the problem to look at PDFs that are single-column, written in English, and contain few or no images. I am currently trying out both Taro.jl and PDFIO.jl, and so far I have found Taro.jl a bit easier to work with for extracting content. However, both packages struggle with whitespace, whether actual spaces between sentences or newline characters, so I end up with long, undelimited strings of concatenated words…

Any tips/tricks on text extraction that might help better with my processing?

Thank you!

~ tcp

contradict · February 9, 2021, 7:28pm

Not Julia, but a friend of mine recently did this and found that using Inkscape to convert to SVG and then parsing the SVG worked well. I think this is where they started: https://github.com/scraperwiki/pdf2svg/blob/master/pdf2svg.sh

mbaz · February 9, 2021, 10:48pm

There’s also pdftotext.
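
For anyone who wants to try this from Julia, here is a minimal sketch that shells out to pdftotext. It assumes pdftotext (from Poppler or Xpdf) is installed and on your PATH. The helper names `pdftotext_cmd`, `extract_text`, and `split_pages` are made up for illustration and are not part of any package.

```julia
# Hypothetical helper: builds the pdftotext command without running it.
# `-layout` asks pdftotext to keep the physical layout of the page, which
# tends to preserve spacing; a trailing `-` writes the text to stdout.
pdftotext_cmd(path::AbstractString; layout::Bool=true) =
    layout ? `pdftotext -layout $path -` : `pdftotext $path -`

# Hypothetical helper: runs pdftotext and returns the extracted text.
# Throws an error if pdftotext is missing or the command fails.
extract_text(path::AbstractString; layout::Bool=true) =
    read(pdftotext_cmd(path; layout=layout), String)

# pdftotext ends each page with a form feed ('\f'), so splitting on it
# gives one chunk of text per page.
split_pages(text::AbstractString) = split(text, '\f'; keepempty=false)
```

Something like `pages = split_pages(extract_text("paper.pdf"))` (with `paper.pdf` as a placeholder file name) should then give the text page by page. Whether `-layout` helps with the spacing issue probably depends on the PDF, so it may be worth comparing against `layout=false`.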
