How to extract data from a PDF with two columns

I am trying to extract text from a PDF using PDFIO.jl.

The problem is that some slides have text in two columns, and extraction with the provided example fails to reconstruct the text in the proper order. For example, using this:

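using PDFIO

doc = pdDocOpen(src)            # src is the path to the PDF file
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page) # append this page's text to the buffer
end
text = String(take!(io))        # collect everything extracted so far
pdDocClose(doc)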

on a slide with two columns of text, it gives a poor result:
● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC firms, half of them on growth

So I am wondering if it would be possible to query the page for all the objects it contains and read them one by one, from left to right. But tbh I am quite lost… I know I can get the objects like this:

page = pdDocGetPage(doc, 5)
elements = pdPageGetContentObjects(page)

Any suggestions on how I can improve this to iterate over the objects and extract text? From the docs it looks like the only text extraction function wants a page and not an object, but maybe I missed something…

I would try using GPT4 or a similar tool. I have had pretty good success giving it PDFs and then having it carry out tasks related to the files.

Do you mean asking GPT4 to write Julia code for the task, or just ditching PDFIO and using GPT4 instead?

Just using GPT itself. I have given it unstructured text before and it works just fine to extract stuff. This was in ChatGPT, but I think the file upload API + GPT4 would probably work just as well. There is a thread floating around on Discourse here about Julia + AI packages, might be worth searching for in your case.

Thanks for your advice and for taking the time, but IMO ditching programming in favour of a paid service that does it for me kind of defeats the very purpose of these forums.

might be me being old school… :slight_smile:

There seems to be a similar open issue in the package repo.
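
For the "read them column by column" idea in the question, here is a minimal sketch of the ordering step only. It assumes you can somehow get each text fragment together with its position on the page; I am not claiming PDFIO exposes that, and nothing below is a PDFIO call. `Fragment`, `reading_order` and `split_x` are hypothetical names made up for illustration, and the coordinates in the example are invented.

```julia
# Hypothetical container for one piece of text and its position.
# x, y are assumed to be in PDF user space (origin at the bottom-left corner).
struct Fragment
    x::Float64
    y::Float64
    text::String
end

# Order fragments column by column: the whole left column from top to bottom,
# then the whole right column. `split_x` is the x coordinate assumed to
# separate the two columns (for example, the middle of the page).
function reading_order(frags::Vector{Fragment}, split_x::Real)
    left  = filter(f -> f.x <  split_x, frags)
    right = filter(f -> f.x >= split_x, frags)
    # y grows upward in PDF coordinates, so a larger y is higher on the page.
    top_first = f -> -f.y
    return vcat(sort(left; by = top_first), sort(right; by = top_first))
end

# Invented example data: two lines in each column, given in row-major order,
# which is roughly the interleaving seen in the extracted text above.
frags = [
    Fragment(50.0, 700.0, "left, line 1"),
    Fragment(400.0, 700.0, "right, line 1"),
    Fragment(50.0, 680.0, "left, line 2"),
    Fragment(400.0, 680.0, "right, line 2"),
]

ordered = reading_order(frags, 300.0)
println(join((f.text for f in ordered), "\n"))
```

A fixed `split_x` is the simplest choice and only works for a clean two-column layout; slides with a full-width title or more columns would need something smarter, such as clustering the x coordinates.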

ahaha, your original message was a good burn…

thanks for the pointer, I missed that one
