Word Documents and PDFs in Julia

John_Finseth · August 14, 2024, 10:56am

I’m working with folders of PDFs and .docx documents in Julia. I’ve been trying to find some packages to help me deal with this, but I only really found Docx (I’ve taken care of the PDFs). While I can read the documents with Docx, I’d like to be able to get more than plain text, like word count, page count, etc. Are there any better packages for handling Word documents in Julia, or a way of getting this info with Docx? Thanks.

jules · August 14, 2024, 6:30pm

I wrote WriteDocx.jl, so I know a bit about these files now. However, that package only writes them; it does not read them.

They are just zip files with a couple of XML files in them, and these XML files can be opened with packages such as EzXML.jl.

While a word count could be relatively easy (go through all w:t tags and count the words in their content strings), you cannot get page counts this way because there are no pages in docx files. The pages are the result of feeding the content through Word’s layout algorithm, and you don’t have access to that.
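
The word-count idea can be sketched as follows. This is only a rough illustration under a few assumptions: it is written in Python with nothing but the standard library so it can be run as is, and the function name `docx_word_count` is made up for this example. The same steps (open the zip, read `word/document.xml`, collect the `w:t` text) should carry over to Julia with a zip reader and EzXML.jl. The text of each paragraph is joined before splitting, because Word may break a single word across several `w:t` tags.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace behind the "w:" prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_word_count(source):
    """Count whitespace-separated words in the main body of a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper for illustration, not part of any package.)
    """
    # A .docx file is a zip archive; the main text lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    total = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Join all w:t fragments of the paragraph before splitting into words.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        total += len(text.split())
    return total
```

The result will probably not match Word’s own count exactly: headers, footers and footnotes live in other XML parts of the archive and are ignored here, tabs and line breaks between runs are not treated as separators, and text inside text boxes may be counted more than once.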

jbytecode · August 14, 2024, 6:50pm

I think somebody in the community could try wrapping Apache POI (https://poi.apache.org/) with Julia’s JavaCall package to get full control over Office files.

dawbarton · August 14, 2024, 7:15pm

There is already Taro.jl that wraps various Apache libraries in this area.

abraemer · August 15, 2024, 4:06am

Thinking a bit outside the box: If you can handle the pdfs already, wouldn’t it be easiest to just convert the .docx to pdf and then analyse the pdf version?
You can probably do this automatically via pandoc (there is a Julia binding, Pandoc.jl, but I don’t know its status).
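
A minimal sketch of such a batch conversion, assuming the `pandoc` command-line tool is installed and on the PATH. It is written in Python with only the standard library, and the helper names `pandoc_command` and `convert_folder` are made up for this example. The only pandoc usage relied on is `pandoc input.docx -o output.pdf`; note that pandoc needs a PDF engine (a LaTeX installation by default) to produce PDF output.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(docx_path, pdf_path=None):
    """Build the pandoc command line that converts one .docx file to PDF.

    (Hypothetical helper; by default the PDF is written next to the input.)
    """
    docx_path = Path(docx_path)
    pdf_path = Path(pdf_path) if pdf_path else docx_path.with_suffix(".pdf")
    return ["pandoc", str(docx_path), "-o", str(pdf_path)]


def convert_folder(folder):
    """Convert every .docx file in `folder` and return the PDF paths written."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    written = []
    for docx in sorted(Path(folder).glob("*.docx")):
        cmd = pandoc_command(docx)
        # check=True raises CalledProcessError if pandoc exits with an error.
        subprocess.run(cmd, check=True)
        written.append(Path(cmd[-1]))
    return written
```

Keep in mind that the resulting PDF is laid out by pandoc’s PDF engine rather than by Word, so its page count can differ from what Word would show for the same document.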
