I am trying to extract text from a PDF using PDFIO.jl.
The problem is that some slides have text laid out in two columns, and the extraction from the provided example fails to reconstruct the text in the right reading order. For example, running this:
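using PDFIO

# `src` is the path to the PDF file
doc = pdDocOpen(src)
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page)  # appends this page's text to `io`
end
pdDocClose(doc)
text = String(take!(io))  # collect everything written to the buffer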
on the slide below:
gives a poor result, with the two columns interleaved:
● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC firms, half of them on growth giano.rocks
So I am wondering whether it is possible to query the page for all the objects it contains and read them one by one, from left to right. But tbh I am quite lost… I know I can get the objects like this:
page = pdDocGetPage(doc, 5)
elements = pdPageGetContentObjects(page)
Any suggestions on how I can improve this to iterate over the objects and extract the text? From the docs, it looks like the only text extraction function takes a page and not an object, but maybe I missed something…
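
To make the goal concrete, here is a rough sketch of the reordering I have in mind. It assumes I could somehow get each piece of text together with its position on the page. The `Fragment` type, its fields, the `column_order` helper and the `split_x` threshold are all hypothetical names of mine and not part of PDFIO.jl, and the coordinates are made up for illustration (only the strings come from the slide).

```julia
# Hypothetical container for one piece of text and its position.
# This is NOT a PDFIO.jl type; it only illustrates the idea.
struct Fragment
    x::Float64    # left edge of the fragment, in page units
    y::Float64    # vertical position; larger y is higher on the page
    text::String
end

# Read the fragments column by column: everything left of `split_x` first,
# then everything at or right of it, each column from top to bottom.
function column_order(frags::Vector{Fragment}, split_x::Float64)
    left  = filter(f -> f.x <  split_x, frags)
    right = filter(f -> f.x >= split_x, frags)
    # Sort by descending y (top of page first), then by x within a line.
    top_down(col) = sort(col; by = f -> (-f.y, f.x))
    return [f.text for f in vcat(top_down(left), top_down(right))]
end

# Made-up positions, listed in an interleaved order like the extracted text.
frags = [
    Fragment(300.0, 700.0, "SAM:"),
    Fragment(20.0,  700.0, "TAM:"),
    Fragment(300.0, 680.0, "EU VC: ~500"),
    Fragment(20.0,  680.0, "1.35M tech startups worldwide"),
]

println(join(column_order(frags, 250.0), "\n"))
```

With positions like these the left column would come out in full before the right one. What I am missing is how to get the text and coordinates out of the page objects in the first place, and a fixed `split_x` is obviously too naive for slides with a different layout.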