I’m working with folders of PDFs and .docx documents in Julia. I’ve been trying to find some packages to help me deal with this, but I only really found Docx (I’ve taken care of the PDFs). While I can read the documents with Docx, I’d like to be able to get more than plain text, like word count, page count, etc. Are there any better packages for handling Word documents in Julia, or a way of getting this info with Docx? Thanks.
I wrote WriteDocx.jl, so I know a bit about these files now. However, that package only handles writing them, not reading them.
They are just zip files with a couple of XML files in them, and these XML files can be opened with packages such as EzXML.jl.
While a word count could be relatively easy (go through all w:t tags and count the words in their content strings), you cannot get page counts this way because there are no pages in docx files. The pages are the result of feeding the content through Word’s layout algorithm, but you don’t have access to that.
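A minimal sketch of that word count, assuming ZipFile.jl and EzXML.jl are installed. The function names `count_words_in_xml` and `docx_word_count` are made up for illustration. One deviation from simply counting per w:t tag: the text runs are joined per paragraph (w:p) first, so a word that Word has split across two runs is counted once.

```julia
using ZipFile  # assumed installed: reads the .docx zip archive
using EzXML    # assumed installed: XML parsing with XPath support

# Namespace bound to the w: prefix in word/document.xml
const W_NS = ["w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"]

# Count whitespace-separated words in the XML of the main document part.
# Runs are joined per paragraph so a word split across runs counts once.
function count_words_in_xml(xml::AbstractString)
    doc = parsexml(xml)
    total = 0
    for p in findall("//w:p", root(doc), W_NS)
        text = join(nodecontent(t) for t in findall(".//w:t", p, W_NS))
        total += length(split(text))
    end
    return total
end

# Hypothetical helper: pull word/document.xml out of the archive and count.
function docx_word_count(path::AbstractString)
    reader = ZipFile.Reader(path)
    try
        for f in reader.files
            f.name == "word/document.xml" && return count_words_in_xml(read(f, String))
        end
        error("word/document.xml not found in $path")
    finally
        close(reader)
    end
end
```

Something like `docx_word_count("report.docx")` would then give a rough count for the main body only. Headers, footers and footnotes live in other XML parts of the archive, so the number will not necessarily match what Word itself reports.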
I think somebody in the community may try to wrap Apache POI (https://poi.apache.org/) using Julia’s JavaCall package for full control over Office files.
There is already Taro.jl that wraps various Apache libraries in this area.
Thinking a bit outside the box: if you can handle the PDFs already, wouldn’t it be easiest to just convert the .docx files to PDF and then analyse the PDF versions?
You can probably do this automatically via pandoc (there is a Julia binding, Pandoc.jl, but I don’t know its status).
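A rough sketch of the conversion step, calling the pandoc command-line tool directly from Julia rather than going through Pandoc.jl. It assumes `pandoc` is on the PATH together with a PDF engine (pandoc uses LaTeX for PDF output by default). The helper names `pdf_path` and `convert_docx_folder` are made up for illustration.

```julia
# Hypothetical helper: same path with the extension swapped to .pdf
pdf_path(docx::AbstractString) = splitext(docx)[1] * ".pdf"

# Hypothetical helper: convert every .docx file in a folder to PDF with pandoc.
# Returns the paths of the PDFs it asked pandoc to write.
function convert_docx_folder(dir::AbstractString)
    pdfs = String[]
    for name in readdir(dir)
        endswith(lowercase(name), ".docx") || continue
        docx = joinpath(dir, name)
        pdf = pdf_path(docx)
        run(`pandoc $docx -o $pdf`)  # throws if pandoc exits with an error
        push!(pdfs, pdf)
    end
    return pdfs
end
```

Keep in mind that the resulting pages come from pandoc’s own layout, not Word’s, so a page count taken from these PDFs is only an approximation of what Word would show.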