PDFIO pdPageExtractText

Does anybody work with PDFIO.jl?
I need to extract text from a PDF. Because of my limited knowledge, I don't understand how to make pdPageExtractText save the text to a string instead of printing it to the REPL.

 using PDFIO
 doc = pdDocOpen("test.pdf")
 page = pdDocGetPage(doc, 1)
 pdPageExtractText(stdout, page)

The first argument to pdPageExtractText is stdout, which means the output is printed to the terminal. If you want to capture the output instead, you could try something like this:

io = IOBuffer()
pdPageExtractText(io, page)
String(take!(io))

Note that I have not tried this, but something like that should work.
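The capture pattern above can be wrapped in a small helper. This is only a sketch: capture_text is a made-up name, not part of PDFIO.jl, and it relies on nothing beyond IOBuffer, take! and String from Base. It accepts any function that writes to an IO and returns whatever was written as a String.

```julia
# capture_text is a hypothetical helper (not part of PDFIO.jl).
# It calls write_fn with an in-memory buffer and returns what was written.
function capture_text(write_fn)
    io = IOBuffer()
    write_fn(io)
    # take! empties the buffer and returns its bytes; String wraps them.
    return String(take!(io))
end

# Stand-in writer so the sketch runs without a PDF file.
greeting = capture_text(io -> print(io, "hello"))

# With PDFIO loaded and `page` obtained from pdDocGetPage, the call
# would presumably look like this:
# text = capture_text(io -> pdPageExtractText(io, page))
```

Passing the writer as a function keeps the buffer handling in one place, so the same helper works for any function that takes an IO as its first argument.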


Thank you.
Only, it should be String(), not string().


I use this function:

using PDFIO

function getPDFText(src, out)
    # Handle that can be used for subsequent operations on the document.
    doc = pdDocOpen(src)

    # Metadata extracted from the PDF document.
    # This value is retained and returned as the return from the function.
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io
        # Number of pages in the document.
        npage = pdDocGetPageCount(doc)
        for i in 1:npage
            # Handle to the specific page given the number index.
            page = pdDocGetPage(doc, i)
            # Extract text from the page and write it to the output file.
            pdPageExtractText(io, page)
        end
    end
    # Close the document handle and release its resources.
    pdDocClose(doc)
    return docinfo
end

I found it on the internet. My problem is that it always takes the first line of the next page. PDFtk does a better job.
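One possible reading of this problem is that the first line of each page ends up glued to the end of the previous page because nothing is written between pages. Under that assumption (I have not checked it against PDFIO's actual output), a separator could be written after every page. In the sketch below, write_pages and its sep keyword are made-up names, and extract_page! stands in for the PDFIO call.

```julia
# write_pages is a hypothetical helper (not part of PDFIO.jl).
# extract_page!(io, i) should write the text of page i to io;
# sep is written after every page so that pages cannot run together.
function write_pages(extract_page!, io::IO, npage::Integer; sep = "\n")
    for i in 1:npage
        extract_page!(io, i)
        print(io, sep)
    end
    return io
end

# Stand-in extractor so the sketch runs without a PDF file.
buf = IOBuffer()
write_pages((out, i) -> print(out, "page", i), buf, 2)
result = String(take!(buf))

# With PDFIO loaded and `doc` opened via pdDocOpen, the call would
# presumably look like this:
# write_pages(io, pdDocGetPageCount(doc)) do out, i
#     pdPageExtractText(out, pdDocGetPage(doc, i))
# end
```

If the page text is glued together for some other reason, this will not help, but a visible separator such as sep = "\n----\n" at least makes it easy to see where each page begins in the output.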
