Renumbering PDF files - can this be implemented with PDFIO.jl instead?

I often find myself with PDFs in which the physical numbering of the pages does not match the numbering of the PDF.

For example, suppose that you have a PDF in which the first twelve pages of the PDF are front matter that should be numbered with the Roman numerals i, ii, iii, ..., xi, xii. And then, starting with the thirteenth page of the PDF, the pages should be numbered with Arabic numerals 1, 2, 3, ....

Basically, it’s the task described in this answer: metadata - How to change internal page numbers in the meta data of a PDF? - Super User

I’ve written this function:

function renumber(p::Pair{<:AbstractString, <:AbstractString},
                  n::Integer)
    input_filename = p[1]
    output_filename = p[2]
    a = read(input_filename, String)
    r = r"(<<\/Type[\s]?\/Catalog[\s]?[\s\S]*?)(>>)"
    s = SubstitutionString(
        string(
            "\\1",
            "/PageLabels << /Nums [ 0 << /S /r >>\n",
            "                       % labels pages 1 to $(n-1) in small Roman numerals\n",
            "                       $(n-1) << /S /D >>\n",
            "                       % numbers pages $(n) to the end in Arabic numerals\n",
            "                     ]\n",
            "            >>\n",
            "\\2\n",
        )
    )
    b = replace(a, r => s)
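    # open(..., "w") truncates an existing file, so no prior rm is needed
    # (rm with recursive = true would also delete a directory of that name).
    open(output_filename, "w") do io
        write(io, b)  # write the bytes as-is; println would append an extra newline
    end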
    return output_filename
end

It can be run like this:

julia> renumber("inputfile.pdf" => "outputfile.pdf", 13)

This is pretty hacky. Is there a better way? For example, could we implement this using the PDFIO.jl package instead?

I just want to say that I like the Pair-based syntax. LOL.

cc: @sambitdash

Inspired by your recent range PR!

PDF editing is not currently supported in PDFIO. While I am open to any PRs that extend the scope of PDFIO, PDFIO will always comply with the PDF specification. The code you have suggested applies a quick fix to PDF files, but it does not ensure that the resulting files are consistent and specification-compliant. If you can propose an approach that does, and are willing to work towards making the generated PDFs specification-compliant, feel free to submit a PR for review. I believe it would need significant design changes to the existing PDFIO core.

Thanking you.

regards,

Sambit

Note: The fact that some readers, or even Adobe Reader, can open a file does not make the file specification-compliant. Many non-compliant PDF files are generated, so readers are lenient and try to handle them gracefully.

Thanks Sambit! I definitely agree that we must comply with the PDF spec.

I don’t know very much about the internals of PDF files. Do you know of a way we can accomplish the desired task (adding PageLabels to the catalog dictionary, as described in table 28 on page 73 here) while staying in compliance with the PDF spec?
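For reference, here is a minimal sketch of how the PageLabels entry itself could be generated from a list of ranges. The helper name `pagelabels` is made up for illustration and is not part of PDFIO or any other package; it only builds the text of the entry and says nothing about how to write it into the file in a compliant way.

```julia
# Hypothetical helper: renders a /PageLabels entry from `first_page => style` pairs.
# `first_page` is 1-based (as counted in the file); keys in the PDF number tree are 0-based.
# Numbering styles defined by the spec: D (decimal), R / r (upper/lower-case Roman),
# A / a (upper/lower-case letters).
function pagelabels(ranges::Pair{<:Integer, <:AbstractString}...)
    isempty(ranges) && throw(ArgumentError("at least one range is required"))
    # The number tree must contain a value for page index 0.
    ranges[1].first == 1 || throw(ArgumentError("the first range must start at page 1"))
    # Keys must appear in ascending order.
    issorted([r.first for r in ranges]) || throw(ArgumentError("ranges must be sorted"))
    entries = ["$(page - 1) << /S /$style >>" for (page, style) in ranges]
    return "/PageLabels << /Nums [ " * join(entries, " ") * " ] >>"
end

# Twelve pages of front matter in lower-case Roman numerals, then Arabic numerals.
pagelabels(1 => "r", 13 => "D")
```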

I think the main point here is this: your string replacement changes the length of a PDF object, so the byte offsets recorded in the xref table at the end of the .pdf (or wherever it sits) no longer match, and the table has to be regenerated from the new offsets. PDF also supports incremental updates, so you could instead append your modified object to the end of the original file and write a new end-of-file structure after it. Both things are easy if the underlying library deals with independent objects.
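To make the incremental-update idea concrete, here is a rough sketch of the bytes that would be appended. It assumes a classic xref table (not a cross-reference stream), a catalog object with generation 0, and that you already know the catalog's object number, the old trailer's /Size, and the old startxref offset. The function name and all of its parameters are hypothetical. A real implementation would also need to carry over the other entries of the old trailer, such as /Info and /ID.

```julia
using Printf

# Hypothetical helper: builds the section to append to an existing PDF so that the
# catalog (object `objnum`, generation 0) is superseded by `new_body`, leaving every
# existing byte, and therefore every existing offset, untouched.
function catalog_update(file_len::Integer, objnum::Integer, new_body::AbstractString,
                        nobjects::Integer, prev_xref::Integer)
    head = "\n"                                  # in case the file lacks a final newline
    obj_offset = file_len + ncodeunits(head)     # 0-based byte offset of the new object
    obj = "$objnum 0 obj\n$new_body\nendobj\n"
    xref_offset = obj_offset + ncodeunits(obj)   # 0-based byte offset of the new xref section
    # One subsection with one entry; each entry is exactly 20 bytes long.
    xref = "xref\n$objnum 1\n" * @sprintf("%010d 00000 n \n", obj_offset)
    # /Prev points back to the previous xref section so older objects are still found.
    trailer = "trailer\n<< /Size $nobjects /Root $objnum 0 R /Prev $prev_xref >>\n"
    tail = "startxref\n$xref_offset\n%%EOF\n"
    return head * obj * xref * trailer * tail
end

# Made-up numbers: a 1000-byte file, catalog is object 1, old /Size 5, old xref at byte 900.
new_catalog = "<< /Type /Catalog /Pages 2 0 R " *
              "/PageLabels << /Nums [ 0 << /S /r >> 12 << /S /D >> ] >> >>"
update = catalog_update(1000, 1, new_catalog, 5, 900)
```

The output file would then be the original bytes followed by `update`. Since nothing before the appended section moves, the old xref table stays valid and is reached through /Prev.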

Having a clear, unambiguous argument order that will not get confused with a similarly named Python API has its benefits.

To your original question, have you considered Taro.jl by @avik? Use it with Java 8. It provides a Julia interface to Apache Tika and FOP.

PDFIO does not support PDF editing. So if you expect to create a new PDF or modify an existing PDF file, the PDFIO library cannot do that today.

regards,

Sambit
