I just checked in the fix. Please go ahead and take the latest. Certain assumptions were missed in the PDFIO implementation. If you are OK with me adding the file to the test cases, I would love to do so. PDF creators sometimes use the PDF spec in ways that stretch it to its limits.
Ignore the build breaks, as they are due to a download repository that is no longer accessible. I am looking into uploading the files to the repository, provided the original hosting site has no objection.
The documentation is updated to the current release and can now be updated automatically with every release.
The pdDocGetOutline method has been added to give access to the document bookmarks (called outlines in PDF terminology).
a. PDOutline and PDOutlineItem provide access structures to traverse the outlines. You can use the AbstractTrees module interfaces to traverse these structures.
b. pdOutlineItemGetAttrs lets you query a PDOutlineItem for its details.
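As a rough sketch of how the outline methods above might be combined: the function name `print_outline_titles` is hypothetical, and the code assumes that pdDocGetOutline may return `nothing` for a document without bookmarks and that the attributes returned by pdOutlineItemGetAttrs expose the bookmark text under a `:Title` key. Please check the package documentation for the exact attribute names before relying on this.

```julia
using PDFIO
using AbstractTrees

# Hypothetical helper: print every bookmark title in a PDF, one per line.
# Assumption: pdOutlineItemGetAttrs returns a dictionary-like value that
# stores the bookmark text under the key :Title.
function print_outline_titles(path::AbstractString)
    doc = pdDocOpen(path)
    try
        outline = pdDocGetOutline(doc)
        # Assumption: a document without bookmarks yields `nothing`.
        outline === nothing && return
        # PreOrderDFS (from AbstractTrees) visits parents before children.
        for node in PreOrderDFS(outline)
            node === outline && continue  # skip the PDOutline root itself
            attrs = pdOutlineItemGetAttrs(node)
            println(get(attrs, :Title, "(untitled)"))
        end
    finally
        # The doc handle should not be used after it is closed.
        pdDocClose(doc)
    end
end
```

Calling `print_outline_titles("sample.pdf")` (the file name is only a placeholder) would then list the bookmarks in reading order.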
Page number related APIs:
a. pdPageGetPageNumber - gets the physical page number for the current page.
b. pdDocGetPageLabel - gets the logical page label given an absolute page number.
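A small sketch of how the two page number methods could be used together, assuming pdDocGetPageLabel takes the document handle followed by the absolute page number. The helper name `list_page_labels` is hypothetical.

```julia
using PDFIO

# Hypothetical helper: print the physical page number of every page next
# to its logical label (for example "iv" or "A-3").
# Assumption: pdDocGetPageLabel is called as (doc, absolute_page_number).
function list_page_labels(path::AbstractString)
    doc = pdDocOpen(path)
    try
        for i in 1:pdDocGetPageCount(doc)
            page = pdDocGetPage(doc, i)
            number = pdPageGetPageNumber(page)       # physical page number
            label = pdDocGetPageLabel(doc, number)   # logical page label
            println(number, " => ", label)
        end
    finally
        pdDocClose(doc)
    end
end
```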
Font-related methods:
a. pdFontIsBold, pdFontIsItalic, pdFontIsFixedW, pdFontIsAllCap, pdFontIsSmallCap - provide the attributes of the font. However, the bold attribute is only an estimate: in PDF, bold can be simulated by changing font weights or by overprinting, so the attribute may not be very accurate.
With certificate-based encrypted files now handled, almost all PDF file types can be read by the APIs, as long as you have the required access passwords or recipient certificates. I am inclined to call this the 1.0 version once any reported stability issues have been handled.
PDFIO is a PDF reading library. It can read a PDF file and present it in terms of low-level PDF objects. It is not a machine learning library that understands the internal representation of text or image artifacts. You can pick up the low-level PDF objects and extract the PDF elements that are useful to you.
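For orientation, here is a minimal sketch of reading a file through the API, following the usage pattern shown in the PDFIO README. The function name `extract_text` and both file arguments are placeholders chosen for illustration.

```julia
using PDFIO

# Write the text of every page of `src` to the file `out` and return the
# document metadata reported by pdDocGetInfo.
function extract_text(src::AbstractString, out::AbstractString)
    doc = pdDocOpen(src)
    try
        info = pdDocGetInfo(doc)
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i)
                # Append the text found on this page to the output stream.
                pdPageExtractText(io, page)
            end
        end
        return info
    finally
        # The doc handle should not be used after it is closed.
        pdDocClose(doc)
    end
end
```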
The complexity of a PDF document is introduced purely by its creator. With a good-quality creator, even a complex PDF document can be produced as a well-tagged representation, much like XML. So with the information you have shared, it is hard to tell what you are looking for.
While text extraction is one of the things PDFIO implements in enough detail, you will need to understand the PDF specification well to do any significant PDF extraction task. Again, the representation of a PDF as JSON depends on your data model and on how you need the representation for your own consumption. Once you have the object hierarchy, you should be able to convert it to any hierarchical format of your choice, including JSON.
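To illustrate that last step only, here is a sketch that serializes a nested hierarchy to JSON using nothing but Base Julia. It is independent of PDFIO: `to_json`, `escape_json`, and the sample `doc_tree` are hypothetical, and the hierarchy is assumed to be already reduced to Dicts, Vectors, strings, numbers, booleans, and `nothing`. How you map PDF objects into such a hierarchy depends on your data model.

```julia
# Quote a string for JSON, escaping quotes, backslashes and control characters.
function escape_json(s::AbstractString)::String
    io = IOBuffer()
    print(io, '"')
    for c in s
        if c == '"'
            print(io, "\\\"")
        elseif c == '\\'
            print(io, "\\\\")
        elseif c == '\n'
            print(io, "\\n")
        elseif c == '\r'
            print(io, "\\r")
        elseif c == '\t'
            print(io, "\\t")
        elseif c < ' '
            # Remaining control characters use the \uXXXX form.
            print(io, "\\u", string(UInt32(c), base = 16, pad = 4))
        else
            print(io, c)
        end
    end
    print(io, '"')
    return String(take!(io))
end

# Hypothetical minimal JSON writer for a nested hierarchy.
function to_json(x)::String
    if x === nothing
        return "null"
    elseif x isa Bool                 # checked before Integer: Bool <: Integer
        return x ? "true" : "false"
    elseif x isa Integer
        return string(x)
    elseif x isa AbstractFloat
        # JSON has no NaN or Inf; this sketch maps them to null.
        return isfinite(x) ? string(x) : "null"
    elseif x isa AbstractString || x isa Symbol
        return escape_json(string(x))
    elseif x isa AbstractVector
        return "[" * join((to_json(v) for v in x), ",") * "]"
    elseif x isa AbstractDict
        # Sort keys so the output is deterministic.
        ks = sort!(collect(keys(x)); by = string)
        return "{" * join((escape_json(string(k)) * ":" * to_json(x[k]) for k in ks), ",") * "}"
    else
        throw(ArgumentError("unsupported type $(typeof(x))"))
    end
end

# Made-up hierarchy standing in for whatever data model you build.
doc_tree = Dict(
    "title" => "Sample",
    "pages" => [
        Dict("number" => 1, "text" => "Hello"),
        Dict("number" => 2, "text" => "World"),
    ],
)
println(to_json(doc_tree))
```

In practice a JSON package would replace the hand-written writer; the point is only that once the hierarchy exists, the conversion itself is mechanical.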
I am planning to move PDFIO to Julia 1.3 so that it can use the pre-built packages under JuliaBinaryWrappers, which seems to me a very consistent experience with binary packages. However, for anyone staying with the older PDFIO packages, the last version will be 0.1.7, which supports Julia 1.1.
If you have concerns about moving to Julia 1.3, please let me know.
It’ll be merged when Julia 1.3.0 is generally available. There is a bug in the RC build due to 7z, which has been addressed and may be released as part of the GA build.
I recently wanted to port my Python PDF tool to Julia. It needs to merge, split, rotate, etc. PDFs, and especially to insert blank pages inside a PDF.
Are such operations supported by PDFIO and if so, in which way?
PDFIO is a Reader API. There is no writer functionality.
It has a full PDF specification object model, so extending it with PDF writing functionality should not be difficult, but a good understanding of the PDF specification may be needed.
If you want to extend it, please go ahead and submit a PR. I will be happy to support it as the package owner.