PDF Parser and Reading API

sambitdash · April 1, 2019, 3:18pm

I just checked in the fix. Please go ahead and take the latest version. Some assumptions were missed in the implementation of PDFIO. If you are OK with me adding the file to the test cases, I would love to do so. PDF creators sometimes use the PDF spec in ways that stretch it to its limits.

Ignore the build breaks; they are due to a download repository that is no longer accessible. I am looking into uploading the files to the repository if the original hosting site has no objection.

Nosferican · April 1, 2019, 6:43pm

Worked like a charm! Feel free to use the pdf. It is produced by the PR state senate so it is public domain. Thanks!

sambitdash · April 23, 2019, 10:06am

Release 0.1.3

1. Documentation is updated to the current release and can be automatically updated with every release.
2. pdDocGetOutline method added to give access to the document bookmarks (called outlines in PDF terminology).
   a. PDOutline and PDOutlineItem provide access structures to traverse the outlines. You can use the AbstractTrees module interfaces to traverse these structures.
   b. pdOutlineItemGetAttrs enables you to query a PDOutlineItem to get its details.
3. Page number related APIs:
   a. pdPageGetPageNumber - gets the physical page number for the current page.
   b. pdDocGetPageLabel - gets the logical page label given an absolute page number.
4. Font related methods:
   a. pdFontIsBold, pdFontIsItalic, pdFontIsFixedW, pdFontIsAllCap, pdFontIsSmallCap - provide the attributes of the font. However, the bold attribute is only an estimate: in PDF, bold can be simulated by changing font weights or by overprinting, so the attribute may not be very accurate.
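
As a rough illustration of the outline APIs listed above, here is a minimal sketch that collects the attributes of every bookmark in a document. The helper names `collect_outline!` and `outline_attrs` are hypothetical. The sketch assumes that AbstractTrees is installed alongside PDFIO, that `pdDocGetOutline` takes the document handle, and that it returns `nothing` when a document has no bookmarks; check the PDFIO documentation for the exact behavior.

```julia
using PDFIO
using AbstractTrees: children

# Hypothetical helper: walk the outline tree depth-first and collect
# whatever pdOutlineItemGetAttrs returns for each item below `node`.
function collect_outline!(acc::Vector, node)
    for item in children(node)
        push!(acc, pdOutlineItemGetAttrs(item))
        collect_outline!(acc, item)   # descend into nested bookmarks
    end
    return acc
end

# Hypothetical helper: open the file, gather all outline item attributes,
# and always close the document handle afterwards.
function outline_attrs(path::AbstractString)
    doc = pdDocOpen(path)
    try
        outline = pdDocGetOutline(doc)
        # Assumption: a document without bookmarks yields `nothing`.
        outline === nothing && return Any[]
        return collect_outline!(Any[], outline)
    finally
        pdDocClose(doc)
    end
end
```

Calling `outline_attrs("sample.pdf")` (the file name is a placeholder) should give one attribute collection per bookmark, in document order.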

chakravala · April 23, 2019, 10:22am

Would be great if DjVu docs with optical character recognition were supported.

The djvulibre library has a tool for converting the djvu docs to text called djvutxt.

sambitdash · April 23, 2019, 10:47am

djvulibre, or any other GPL-licensed library, is not compatible with the MIT licensing of PDFIO. So unfortunately, we cannot integrate it with PDFIO.

o314 · May 10, 2019, 12:17am

If you want to OCR some stuff, Tesseract (Apache license; from HP, then Google) may be the way to go.
It was ported to JS by MIT folk here: https://tesseract.projectnaptha.com

Tabula (MIT license) and Camelot (MIT license) are great projects for extracting tables from PDFs too.

EDIT: add link to camelot

stevengj · May 10, 2019, 2:49am

The GPL is compatible with the MIT (aka expat/X11) license; it’s just that the combined work falls under the GPL.

sambitdash · May 10, 2019, 5:26am

We have no intention of changing the license of PDFIO to GPL at this time.

sambitdash · June 21, 2019, 11:55am

v0.1.4 - Release Notes

julia-tagbot released this 4 days ago

v0.1.4 (2019-06-17)

This release has the following enhancements:

Support for validation of Digital Signatures in a PDF document.
Performance improvement of pdPageExtractText.
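
For readers who have not used `pdPageExtractText` yet, a minimal sketch of typical usage follows. The function name `extract_text` and the file names are placeholders chosen for illustration; the PDFIO calls follow the pattern of the sample in the package README, so verify them against the version you have installed.

```julia
using PDFIO

# Write the text of every page of the PDF at `src` into the text file `out`.
function extract_text(src::AbstractString, out::AbstractString)
    doc = pdDocOpen(src)
    try
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i)    # pages are numbered from 1
                pdPageExtractText(io, page)    # appends this page's text to `io`
            end
        end
    finally
        pdDocClose(doc)                        # release the file handle even on error
    end
    return out
end
```

Usage would look like `extract_text("input.pdf", "output.txt")`.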

sambitdash · June 26, 2019, 5:11pm

v0.1.5 - Documentation Update

julia-tagbot released this 3 days ago

Documentation has been significantly updated and sample code is added to most methods.

sambitdash · July 9, 2019, 11:02pm

v0.1.6 - Support for password protected PDF files

julia-tagbot released this 3 minutes ago

v0.1.6 (2019-07-09)

Diff since v0.1.5

Merged pull requests:

Support for encrypted PDF files with standard crypto handler (#67) (sambitdash)

sambitdash · July 14, 2019, 5:42pm

Hi All,

With certificate-based encrypted files now handled, almost all PDF file types can be read by the APIs, as long as you have the required access passwords or recipient certificates. I am inclined to call this the 1.0 version once any stability issues are reported and handled.

regards,

Sambit

v0.1.7

julia-tagbot released this 3 hours ago

v0.1.7 (2019-07-12)

Diff since v0.1.6

Merged pull requests:

PKI Security Handler implementation (#69) (sambitdash)

hasanOryx · September 7, 2019, 6:53pm

Well done, just used it, appreciate your efforts and time

eatbot · October 2, 2019, 6:27am

Hi Sambitdash,
I am looking for a stable and scalable solution to read/parse complex PDFs and present them as JSON or in a structured database.

How can I use your solution? Please check eat.bot for what I’m trying to do. Thanks

sambitdash · October 2, 2019, 7:21am

Hi @eatbot,

PDFIO is a PDF reading library. It can read a PDF file and present it in terms of low-level PDF objects. It is not a machine learning library that understands the internal representation of text or image artifacts. You can pick up the low-level PDF objects and extract the PDF elements that are useful to you.

The complexity of a PDF document is introduced purely by its creator. With a good-quality creator, even a complex PDF document can be produced as a well-tagged representation, much like XML. So with the information you have shared, it is hard to tell what you are looking for.

While extracting text is one of the things PDFIO implements in enough detail, you will need to understand the PDF specification well to do any significant PDF extraction task. Again, the representation of a PDF as JSON depends on your data model and on how you need the representation for your consumption. Once you have the object hierarchy, you should be able to convert it to any hierarchical format of your choice, including JSON.
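
To make the last point concrete, here is a minimal sketch of turning a hierarchy into JSON. It is plain Julia and does not call PDFIO at all: the `doc_model` layout (file name, pages, labels, text) is an invented data model for illustration, and `json_escape` and `to_json` are hypothetical helpers written here to keep the example dependency-free. In practice a JSON package would likely be a better choice.

```julia
# Escape the characters that must not appear raw inside a JSON string.
function json_escape(s::AbstractString)
    io = IOBuffer()
    for c in s
        if c == '"'
            print(io, "\\\"")
        elseif c == '\\'
            print(io, "\\\\")
        elseif c == '\n'
            print(io, "\\n")
        elseif c == '\r'
            print(io, "\\r")
        elseif c == '\t'
            print(io, "\\t")
        elseif c < ' '
            # Remaining control characters become \uXXXX escapes.
            print(io, "\\u", string(UInt16(c), base=16, pad=4))
        else
            print(io, c)
        end
    end
    return String(take!(io))
end

# One method per kind of node in the hierarchy.
to_json(x::AbstractString) = "\"" * json_escape(x) * "\""
to_json(x::Symbol) = to_json(String(x))
to_json(x::Bool) = x ? "true" : "false"
to_json(x::Integer) = string(x)
to_json(x::AbstractFloat) = isfinite(x) ? string(x) : "null"
to_json(::Nothing) = "null"
to_json(xs::Union{AbstractVector,Tuple}) =
    "[" * join((to_json(x) for x in xs), ",") * "]"

function to_json(d::AbstractDict)
    # Sort keys so that the output is deterministic.
    ks = sort!(collect(keys(d)); by=string)
    return "{" * join((to_json(string(k)) * ":" * to_json(d[k]) for k in ks), ",") * "}"
end

# Invented data model: fill it from whatever you extract from the PDF.
doc_model = Dict(
    :file => "sample.pdf",
    :pages => [Dict(:number => 1, :label => "i", :text => "Hello \"PDF\"")],
)

println(to_json(doc_model))
```

The design choice is simply one `to_json` method per node type, so supporting another kind of object in your own model means adding one more method.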

regards,

Sambit

sambitdash · October 6, 2019, 8:28am

Hi All,

I am planning to move PDFIO to Julia 1.3 so that it can use the pre-built packages under JuliaBinaryWrappers, which seems to me to give a very consistent experience with binary packages. However, if you stay on an older Julia version, the last PDFIO version available to you will be 0.1.7, which supports Julia 1.1.

If you have concerns about moving to Julia 1.3, please let me know.

https://github.com/sambitdash/PDFIO.jl/issues/73

regards,

Sambit

sambitdash · November 3, 2019, 7:14pm

The changes are already in place at: https://github.com/sambitdash/PDFIO.jl/pull/75

It’ll be merged when Julia 1.3.0 is generally available. There is a bug in the RC build due to 7z, which has been addressed and may be released as part of the GA build.

regards,

Sambit

sambitdash · November 15, 2019, 7:54am

PDFIO is now published in the Journal of Open Source Software.

regards,

Sambit

ludwig-austermann · November 15, 2019, 11:47am

Hey @sambitdash,

I recently wanted to port my Python PDF tool to Julia. It needs to merge, split, rotate, etc. PDFs, and especially to insert blank pages into a PDF.

Are such operations supported by PDFIO and if so, in which way?

Thanks, ludwig

sambitdash · November 15, 2019, 12:01pm

Hi @ludwig-austermann,

PDFIO is a Reader API. There is no writer functionality.

It has a full PDF specification object model, so extending it with PDF writing functionality should not be difficult, though a good understanding of the PDF specification may be needed.

If you want to extend it, please go ahead and submit a PR; I will be happy to support it as the package owner.

regards,

Sambit
