PDF Parser and Reading API

I just checked in the fix. Please go ahead and take the latest. Certain assumptions were missed in the implementation of PDFIO. If you are OK with me adding the file to the test cases, I would love to do so. PDF creators sometimes use the PDF spec in ways that stretch it to its limits.

Ignore the build breaks; they are due to a download repository that is no longer accessible. I am looking into uploading the files to the repository, provided the original hosting site has no objection.


Worked like a charm! Feel free to use the pdf. It is produced by the PR state senate so it is public domain. Thanks!

Release 0.1.3

  1. Documentation is updated to the current release and can now be updated automatically with every release.
  2. pdDocGetOutline method added to get access to the document bookmarks (called outlines in PDF terminology).
    a. PDOutline and PDOutlineItem provide access structures to traverse the outlines. You can use the AbstractTrees module interfaces to traverse these structures.
    b. pdOutlineItemGetAttrs enables you to query the PDOutlineItems to get the details.
  3. Page number related APIs:
    a. pdPageGetPageNumber - gets the physical page number for the current page.
    b. pdDocGetPageLabel - gets the logical page label given an absolute page number.
  4. Font related methods:
    a. pdFontIsBold, pdFontIsItalic, pdFontIsFixedW, pdFontIsAllCap, pdFontIsSmallCap - provide the attributes of the font. However, the bold attribute is only an estimate: in PDF, bold can be simulated by changing font weights or by overprinting, so the attribute may not be very accurate.

Would be great if DjVu docs with optical character recognition were supported.

The djvulibre library has a tool for converting the djvu docs to text called djvutxt.

djvulibre, or any GPL-licensed library, is not compatible with the MIT licensing of PDFIO. So unfortunately, we cannot integrate it with PDFIO.

If you want to OCR some stuff, Tesseract (Apache license; from HP, later Google) may be the way to go.
It was ported to JavaScript by MIT folks: https://tesseract.projectnaptha.com

Tabula and Camelot, both MIT licensed, are great projects for pulling tables out of PDFs too.

EDIT: add link to camelot

The GPL is compatible with the MIT (aka expat/X11) license, it’s just that the combined work falls under the GPL.

We have no intention of changing the license of PDFIO to GPL at this time.

v0.1.4 - Release Notes


@julia-tagbot julia-tagbot released this 4 days ago

v0.1.4 (2019-06-17)

This release has the following enhancements:

  1. Support for validation of Digital Signatures in a PDF document.
  2. Performance improvement of pdPageExtractText.

v0.1.5 - Documentation Update


@julia-tagbot julia-tagbot released this 3 days ago

The documentation has been significantly updated, and sample code has been added to most methods.


v0.1.6 - Support for password protected PDF files

@julia-tagbot julia-tagbot released this 3 minutes ago

v0.1.6 (2019-07-09)

Diff since v0.1.5

Merged pull requests:

  • Support for encrypted PDF files with standard crypto handler (#67) (sambitdash)

Hi All,

With certificate-based encrypted files now handled, almost all PDF file types can be read by the APIs, as long as you have the required access passwords or recipient certificates. I am inclined to call this the 1.0 version once any reported stability issues are handled.

regards,

Sambit

v0.1.7

@julia-tagbot julia-tagbot released this 3 hours ago

v0.1.7 (2019-07-12)

Diff since v0.1.6


Well done, just used it, appreciate your efforts and time :kissing_heart:


Hi Sambitdash,
I am looking for a stable and scalable solution to read/parse complex PDFs and present them as JSON or in a structured database.

How can I use your solution? Please check eat.bot for what I’m trying to do. Thanks

Hi @eatbot,

PDFIO is a PDF reading library. It can read a PDF file and present it in terms of low-level PDF objects. It is not a machine learning library that understands the internal representation of text or image artifacts. You can pick up the low-level PDF objects and extract the PDF elements that are useful to you.

The complexity of a PDF document is introduced purely by its creator. With a good-quality creator, even a complex PDF document can be a well-tagged representation, much like XML. So with the information you have shared, it is hard to tell what you are looking for.

While text extraction is one of the things PDFIO implements in reasonable detail, you will need to understand the PDF specification well to do any significant PDF extraction task. Again, how a PDF is represented as JSON depends on your data model and the form in which you need to consume it. Once you have the object hierarchy, you should be able to convert it to any hierarchical format of your choice, including JSON.
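
To illustrate only that last point, here is a minimal sketch. It is written in Python rather than Julia and does not use PDFIO at all: the `document` dictionary is a made-up stand-in for an object hierarchy you have already extracted, and `to_jsonable` is a hypothetical helper name, not part of any library. The real shape of the output will depend on your own data model.

```python
import json


def to_jsonable(obj):
    """Recursively convert a nested hierarchy into JSON-compatible values."""
    if isinstance(obj, dict):
        # JSON object keys must be strings.
        return {str(key): to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    if isinstance(obj, bytes):
        # Assumption: raw byte strings are shown as Latin-1 text.
        return obj.decode("latin-1")
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    # Anything else (custom object types) falls back to its repr.
    return repr(obj)


# Hypothetical hierarchy, loosely shaped like a PDF catalog with two pages.
document = {
    "Type": "Catalog",
    "Pages": {
        "Type": "Pages",
        "Count": 2,
        "Kids": (
            {"Type": "Page", "Rotate": 0, "Text": b"Hello"},
            {"Type": "Page", "Rotate": 90, "Text": b"World"},
        ),
    },
}

json_text = json.dumps(to_jsonable(document), indent=2, sort_keys=True)
print(json_text)
```

The same recursive walk works for any other hierarchical target format; only the leaf conversions and the final serializer change.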

regards,

Sambit


Hi All,

I am planning to move PDFIO to Julia 1.3 so that it can use the pre-built packages under JuliaBinaryWrappers, which to me seems like a very consistent experience with binary packages. However, if you stay on an older Julia version, the last PDFIO version available to you will be 0.1.7, which supports Julia 1.1.

If you have concerns about moving to Julia 1.3, please let me know.

https://github.com/sambitdash/PDFIO.jl/issues/73

regards,

Sambit


The changes are already in place at: https://github.com/sambitdash/PDFIO.jl/pull/75

It’ll be merged when Julia 1.3.0 is generally available. There is a bug in the RC build due to 7z, which has been addressed and may be released as part of the GA build.

regards,

Sambit


PDFIO is now published in the Journal of Open Source Software.

regards,

Sambit


Hey @sambitdash,

I recently wanted to port my Python PDF tool to Julia. It needs to merge, split, rotate, etc. PDFs, and especially to insert blank pages into a PDF.

Are such operations supported by PDFIO and if so, in which way?

Thanks, ludwig

Hi @ludwig-austermann,

PDFIO is a reader API. There is no writer functionality.

It has a full object model of the PDF specification, so extending it with PDF writing functionality should not be difficult, but a good understanding of the PDF specification may be needed.

If you want to extend it, please go ahead and submit a PR. I will be happy to support it as the package owner.

regards,

Sambit
