Hi all,
Question for those involved with text extraction pipelines (or ETL pipelines for that matter):
In Julia, what works best for you when doing text extraction from PDFs?
In my case, I am simplifying the problem to single-column PDFs written in English with few or no images. I am currently trying out both Taro.jl and PDFIO.jl; so far I have found Taro.jl a bit easier to work with for extracting content. However, both packages struggle with whitespace, whether actual spaces between sentences or newline characters, so I end up with long, concatenated strings of words…
Any tips/tricks on text extraction that might help with my processing?
Thank you!
~ tcp