PDF Parser and Reading API

sambitdash · July 12, 2017, 8:34am

Hi All,

I am developing a PDF Library for doing some simple tasks like extracting texts and PDF file contents and attributes. The library is written in pure Julia (save some dependencies on some filter libraries). I am open to anyone interested in reviewing and contributing to it:

https://github.com/sambitdash/PDFIO.jl

regards,

Sambit

avik · July 12, 2017, 11:00am

Hey Sambit,

Thanks for tackling this, a pure julia PDF parser will be very useful. I hope this can eventually lead to a writing library as well (but I’ll understand if that is not of particular interest to you). I’m also hoping that we can get some higher level tools on top, things like Tabula for example. I am very interested in this, and will play around with it, but unfortunately not sure how much time i’ll have to contribute seriously.

Regards

Avik

sambitdash · July 12, 2017, 11:32am

Thanks Avik.

Tabula should not be as difficult, although every structured text processing in PDF is some form of heuristic as document structure is not mandated in PDFs. Acrobat had a table picker way back in 2002. So I am assuming it may not be hard to implement. Extraction of all the forms of text is definitely of my interest and will ensure APIs for the same are available. However, I may leave the subsequent heuristic development for table picking for someone to invest focused time and effort in that direction.

I will add an issue in the project for tracking this requirement.

regards,

Sambit

sambitdash · July 12, 2017, 11:45am

https://github.com/sambitdash/PDFIO.jl/issues/2

oxinabox · July 12, 2017, 3:01pm

PDF parsing is a hideous problem.
I wish you all the luck in the world.

sambitdash · July 12, 2017, 3:21pm

@oxinabox

Very true!!! The biggest issue is PDF creators generate files that are non-compliant with the spec. Many a times you have to give the creator higher precedence over the spec based on your customer.

http://www.stillhq.com/pdfdb/db.html

has a lot of such examples.

regards,

Sambit

sambitdash · July 19, 2017, 9:58am

Hi All,

I am now kind of finalizing the v1 of the APIs for the PDF library or the core of the PDF reader library. Here are the initial benefits of the library.

It will allow you to read through a PDF file and create objects which can be used for further access to the document.
It will also provide you the details of the content in every page and create a tree like data structure of PDF page contents which can be used know what is there in the PDF document.
The library has been tested with about 800+ text based files (12000+ pages) so fairly robust in text objects. And the parser is fairly robust and a bit non-tolerant as a standards based file is given higher emphasis.

However, next steps to extend the library requires specific domain where it will be used. For example, in the text extraction itself here are some standard challenges:

PDF text do not have reading order of character appearance. So text may appear as “aliuJ” with each character location in such a way printed such that the visual output is “Julia”.
Text and graphics directives can be interspersed. So you may get 5 different text objects as each character.
Since, fonts can be sub-setted “Julia” may be printed as (uvwxy) with gyph code of embedded font-51. One needs to query these judiciously with several logical smart reasoning to get the actual text.

Every such reasoning is subjective to the needs and interpretation of the developer/user and can be challenged with an alternate viewpoint. Hence, it’s important to keep the low level APIs simple and minimal such that any advanced development can be carried out on top of the minimal API set.

After some thoughts I realized I will rather keep the base APIs simple and minimal. Thus providing more flexibility to developers to develop more advanced solutions they need.

Of course there are a few areas in the basic APIs that are missing currently:

Enhancing the documentation of the library.
Support for encrypted PDF
Support for image filters. This has been knowingly avoided as most people may be using a third party API to render the final graphics. They could send the encrypted image in JPEG or JPX or LZW (TIFF, PNG, GIF) formats than decompressing and sending raw image to the rendering API.
Standardize the tree iterator with AbstractTrees APIs.
Develop what is needed as the adoption of APIs increase.

If you are all in agreement with my approach, I will register the PDFIO to Julia Package so that it’s available for general usage and testing.

Looking to hear from you soon.

regards,

Sambit

sambitdash · September 8, 2017, 8:12am

Update on the PDFIO API. It now has:

A full documentation of the APIs : API Structure and Design · PDFIO
Has a text extraction API pdPageExtractText(page)
Complex page number support.
Supports unicode code extraction from font encoding as well as unicode CMap. (does not read into the font’s internal encoding)
Supports Adobe’s encoding for latin fonts.
Does not do any special handling for tagged PDFs but tagged PDFs may behave better as the creation order and reading order of document objects are similar.

regards,

Sambit

sambitdash · October 4, 2017, 6:57am

Update on the PDFIO API v.0.0.8:

Changes this release:

A new pdPageExtractText method is introduced which does a cleaner text conversion for complex PDFs including non-tagged PDFs.
Bug fixes
Text conversions carried out on 25,000+ files.

The master untagged version has also some heuristics for text extraction when space character is simulated through text positioning. A few documents of 1000+ pages have been used for text extraction testing as well.

sambitdash · November 3, 2017, 2:38pm

Update on the PDFIO API v.0.0.9:

Changes this release:

pdPageExtractText handles superscripts with enhanced heuristics.
2, Space can be simulated from text positions.

regards,

Sambit

Nosferican · December 6, 2018, 8:50pm

Is this working in Julia 1.0?
Could I get the text out of a page of a PDF file?

jandehaan · December 6, 2018, 10:40pm

Thank you for providing this PDF package.

I noticed that the function names that are part of the API don’t follow the Julia naming conventions, but other functions do. For example you chose a name like pdPageExtractText instead of something like pd_page_extract_text. Was this motivated by wanting to be consistent with PDF API naming conventions used elsewhere, or is this merely a historical accident?

sambitdash · December 7, 2018, 12:34am

The convention is very similar to what is used by Adobe’s PDF Library and many other libraries used in the industry in general.

Secondly, Julia does not have a convention for exported methods. Only exported methods in PDFIO follow this convention. Internal methods follow the underscore notation.

Regards,

Sambit

sambitdash · December 7, 2018, 12:35am

Yes. It works with 1.0.

Nosferican · December 7, 2018, 12:44am

I am trying to port an application from R and just need something like,

library(pdftools)
txt <- pdf_text(pdf = 'data.pdf') # Vector{String} for each page

How would that be with PDFIO.jl?

using PDFIO
doc = pdDocOpen("data.pdf")
page = pdDocGetPage(doc, 1) |>
    (doc -> IOBuffer() |>
        (io -> pdPageExtractText(io, page)))
read(page) # empty no data
pdDocClose(doc)

sambitdash · December 7, 2018, 1:47am

This link can give you some ideas.

https://github.com/sambitdash/PDFIO.jl/issues/38

Nosferican · December 7, 2018, 5:18am

Perfect! Thanks.

Ended up using,

function getPDFText(pdf)
    doc = pdDocOpen(pdf)
    npage = pdDocGetPageCount(doc)
    for i ∈ 1:npage
        lines = pdDocGetPage(doc, i) |>
            (page -> pdPageExtractText(IOBuffer(), page)) |>
            (io -> readlines(IOBuffer(String(take!(io))))) |>
            magic
    end
end

Nosferican · April 1, 2019, 1:54pm

I am getting non-boolean (CosName) used in boolean context with this file.

sambitdash · April 1, 2019, 2:26pm

Give me a few days to investigate and get back. If you isolate the issue and feel free to submit a PR.

regards,

Sambit

Nosferican · April 1, 2019, 2:37pm

I think it might require some more advanced PDF familiarity than what I have. R’s pdftools::pdf_text had no problem with the file. I was able to “fix it”, by running ILovePDF compress on the file and then it works just fine. The file was generated on a Macbook Pro using save as from a cfm file. Do reach out for more details or if you need the exact process to help track down the issue.

Topic		Replies	Views
Read pdf as image New to Julia images	1	670	January 19, 2021
How to extract data from pdf with two columns General Usage	6	655	December 26, 2023
Effective Text Extraction from Documents (PDFs) General Usage question , strings , data , nlp , etl	2	1145	February 9, 2021
Viewer for PDF images General Usage question , images , cairomakie-plotting	34	1922	March 12, 2023
PDFIO pdPageExtractText New to Julia	3	483	February 4, 2025

PDF Parser and Reading API

Regards

Related topics