How to extract data from pdf with two columns

Simone_Gabbriellini · December 24, 2023, 4:05pm

I am trying to extract text from a PDF using PDFIO.jl.

The problem is that some slides have text in two columns, and the extraction example from the package docs fails to reconstruct the text in the proper reading order. For example, using this:

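using PDFIO

src = "slides.pdf"  # hypothetical path; replace with the actual file
doc = pdDocOpen(src)
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page)  # appends this page's text to io
end
pdDocClose(doc)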

on a two-column slide gives poorly ordered output:
● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC ﬁrms, half of them on growth giano.rocks

So I am wondering if it would be possible to query the page for all the objects it contains and read them one by one, from left to right. But to be honest I am quite lost… I know I can get the objects like this:

page = pdDocGetPage(doc, 5)
elements = pdPageGetContentObjects(page)

Any suggestions on how I can improve this to iterate over the objects and extract their text? From the docs it looks like the only text extraction function takes a page and not an object, but maybe I missed something…
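
One way to frame the "read the objects left to right" idea is as a sorting problem: if each piece of text came with its position on the page, the pieces could be split into a left and a right column and each column read top to bottom. The sketch below shows only that ordering step. It does not call PDFIO at all, since whether PDFIO exposes per-object coordinates is the open question here; `Fragment`, `column_order`, `columns_to_text`, and the `split_x` boundary are hypothetical names, and the coordinates are made up.

```julia
# Hypothetical fragment type; PDFIO is not assumed to provide this.
struct Fragment
    x::Float64    # left edge of the fragment, in page units
    y::Float64    # vertical position; PDF y grows upward, so larger y is higher on the page
    text::String
end

# Order fragments column by column: everything left of `split_x` first,
# then everything at or right of it, each column read top to bottom.
function column_order(frags::Vector{Fragment}, split_x::Real)
    left  = filter(f -> f.x <  split_x, frags)
    right = filter(f -> f.x >= split_x, frags)
    # Sort by descending y (top first), breaking ties by ascending x.
    reading_key(f) = (-f.y, f.x)
    return vcat(sort(left; by = reading_key), sort(right; by = reading_key))
end

# Join the ordered fragments into one string, one fragment per line.
columns_to_text(frags, split_x) =
    join((f.text for f in column_order(frags, split_x)), "\n")

# Made-up fragments for a two-column page, listed in interleaved order.
frags = [
    Fragment(320.0, 400.0, "right A"),
    Fragment(40.0, 400.0, "left A"),
    Fragment(320.0, 380.0, "right B"),
    Fragment(40.0, 380.0, "left B"),
]
println(columns_to_text(frags, 300.0))
```

A fixed `split_x` is the simplest possible choice and only works when the column boundary is known in advance; slides with a mix of one-column and two-column layouts would need the boundary estimated per page.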

tbeason · December 24, 2023, 4:17pm

I would try using GPT-4 or similar tools. I have had pretty good success giving it PDFs and then having it carry out tasks on the files.

Simone_Gabbriellini · December 24, 2023, 6:30pm

Do you mean asking GPT-4 to write Julia code for the task, or just ditching PDFIO and using GPT-4 directly?

tbeason · December 24, 2023, 7:03pm

Just using GPT itself. I have given it unstructured text before and it works just fine for extracting things. This was in ChatGPT, but I think the file upload API + GPT-4 would probably work just as well. There is a thread floating around here on Discourse about Julia + AI packages; it might be worth searching for in your case.

Simone_Gabbriellini · December 26, 2023, 9:49am

Thanks for your advice and for taking the time, but IMO ditching programming in favour of a paid service that does it for me kind of defeats the very purpose of these forums.

Might just be me being old school…

rafael.guerra · December 26, 2023, 10:31am

There seems to be a similar open issue in the package repo.

Simone_Gabbriellini · December 26, 2023, 11:59am

Ahaha, your original message was a good burn…

Thanks for the pointer, I missed that one.
