XLSX.jl cannot handle customXml elements correctly when opening files as templates

This is a cross post from an issue on github.

My application uses XLSX.jl to open xlsx files as templates, add data and then write them out again. It stopped working when I moved the template files to a new directory. I had the issue once before back in Feb and have managed to work around it for a while, but it’s got me again and now it’s a bigger problem for me.

I’ve done a little digging. I don’t understand much, but here is what I’ve found:

ERROR: LoadError: AssertionError: Some internal files were not loaded into memory. Did you use `XLSX.open_xlsx_template` to open this file?
Stacktrace:
 [1] writexlsx(output_source::String, xf::XLSX.XLSXFile; overwrite::Bool)
   @ XLSX C:\Users\TGebbels\.julia\packages\XLSX\U2Bcm\src\write.jl:64

Using some different .xlsx files works just fine.

Line 64 in write.jl says

@assert all(values(xf.files)) "Some internal files were not loaded into memory. Did you use `XLSX.open_xlsx_template` to open this file?"

So I tried to see what was going on with

println(keys(template.files))
println(values(template.files))

which showed:

["xl/sharedStrings.xml", "xl/drawings/drawing1.xml", "customXml/item2.xml", "customXml/itemProps1.xml", "customXml/itemProps2.xml", "xl/worksheets/sheet1.xml", "xl/ctrlProps/ctrlProp1.xml", "_rels/.rels", "xl/workbook.xml", "xl/styles.xml", "[Content_Types].xml", "docProps/core.xml", "xl/_rels/workbook.xml.rels", "xl/worksheets/sheet2.xml", "customXml/_rels/item1.xml.rels", "xl/worksheets/_rels/sheet1.xml.rels", "customXml/item1.xml", "xl/theme/theme1.xml", "customXml/_rels/item2.xml.rels", "docProps/app.xml"]
Bool[1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]

for a file that fails whereas, for a working file, it shows

["xl/sharedStrings.xml", "xl/drawings/drawing1.xml", "xl/worksheets/sheet1.xml", "xl/ctrlProps/ctrlProp1.xml", "_rels/.rels", "xl/workbook.xml", "xl/styles.xml", "[Content_Types].xml", "docProps/core.xml", "xl/_rels/workbook.xml.rels", "xl/worksheets/sheet2.xml", "xl/worksheets/_rels/sheet1.xml.rels", "xl/theme/theme1.xml", "docProps/app.xml"]
Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The only differences are the customXml items. So it looks like XLSX.jl doesn’t correctly handle the customXml content.

I can manually delete the customXml folder if I open the xlsx file as a zip. If I do that, the file will work again with XLSX.jl.

["xl/sharedStrings.xml", "xl/drawings/drawing1.xml", "xl/worksheets/sheet1.xml", "xl/ctrlProps/ctrlProp1.xml", "_rels/.rels", "xl/workbook.xml", "xl/styles.xml", "[Content_Types].xml", "docProps/core.xml", "xl/_rels/workbook.xml.rels", "xl/worksheets/sheet2.xml", "xl/worksheets/_rels/sheet1.xml.rels", "xl/theme/theme1.xml", "docProps/app.xml"]
Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

However, it will no longer work with Excel, which offers to try to repair the file. Once repaired by Excel, the file fails again but now with even more customXml elements!

["customXml/_rels/item3.xml.rels", "xl/sharedStrings.xml", "xl/drawings/drawing1.xml", "customXml/item2.xml", "customXml/itemProps1.xml", "customXml/item3.xml", "customXml/itemProps2.xml", "xl/worksheets/sheet1.xml", "xl/ctrlProps/ctrlProp1.xml", "_rels/.rels", "xl/workbook.xml", "xl/styles.xml", "[Content_Types].xml", "docProps/core.xml", "xl/_rels/workbook.xml.rels", "customXml/itemProps3.xml", "xl/worksheets/sheet2.xml", "customXml/_rels/item1.xml.rels", "xl/worksheets/_rels/sheet1.xml.rels", "customXml/item1.xml", "xl/theme/theme1.xml", "docProps/custom.xml", "customXml/_rels/item2.xml.rels", "docProps/app.xml"]
Bool[0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1]

I don’t understand the internals of Excel and I don’t understand how XLSX.jl works. However, I found the lines:

# ignore custom XML internal files
if startswith(f.name, "customXml")
    continue
end

in the code to read the template file in. Does the code ignore these elements on read but then require them on write? Any hints or suggestions gratefully received. Thanks!

I’m having a go here but I’m struggling. I don’t know github and wouldn’t know how to go about finding, fixing and testing an issue. Even so, I’m going to stick my neck out naively because no-one else has yet. With those caveats, here are some thoughts:

In read.jl, the function open_or_read_xlsx begins at line 190.
Inside a for loop over xf.io.files (at line 199), it has two conditional blocks.
If certain conditions are met, it reads the io.file with internal_xml_file_read
at line 222.
Otherwise (at line 224), if the condition is not met, it reads the io.file as a binary object - # we only read binary files to save the Excel file later.

It looks to me like the customXml io.files meet the condition at line 222, but then get excluded from the call internal_xml_file_read by the test at line 218.

If they are not going to be used, it might fix the issue I’ve encountered if the customXml were instead made to fail the condition at line 190. Then they would be included as binary files ready to save later.

It is very likely that I’ve misunderstood what’s going on and I’ve grossly over-simplified. Despite this, I’ve set this out in case anyone can help!

Thanks,

You can try out your theory if you add XLSX as a dev dependency, i.e. do

] dev XLSX

You will then have a local copy at ~/.julia/dev/XLSX where you can edit the source code. If you then start a fresh REPL, and do

using Revise, XLSX

You can see if your changes improved the situation. Using Revise will allow you to keep changing the source code of XLSX in the background and have changes propagate to your REPL session when you save them.

1 Like

Thank you, @nilshg. Firmly into new territory here, but will have a go…

1 Like

Appreciate that it can be daunting, and despite developing my own packages on occasion as well as helping out on others where I can for years now I still run into hiccups every now and then.

That said the tl;dr for what you’re doing here is really:

  • ]dev will create a copy of the package in your ~/.julia/dev folder
  • You can then change whatever function in the package you want to work by simply opening the relevant file in your favourite editor, just like when you work on your own scripts
  • using Revise, XLSX in an environment where you have added XLSX via ]dev XLSX will use your local, patched-up version of the package, and Revise allows any changes you make to be visible directly in your live session.

That is, you can have a REPL open where you do using Revise, XLSX, call the function you’re working on to see if it works after your changes, and if it doesn’t do some more changes to the source file, save it, and call the function in your REPL session again. This allows for relatively swift iteration.

2 Likes

I’ve had a go at this with some success.

To start with, I just removed the exclusion of the customXml internal files from being processed by internal_xml_file_read by commenting out the lines:

# ignore custom XML internal files
#if startswith(f.name, "customXml")
#    continue
#end

This resulted in the following warnings:

┌ Warning: XMLError: xmlns: URI ed0bb2e3-83d5-46a8-9e95-92c1ca9cc599 is not absolute from XML Namespace module (code: 100, line: 1)
└ @ EzXML C:\Users\TGebbels\.julia\packages\EzXML\DL8na\src\error.jl:97
┌ Warning: XMLError: xmlns: URI 8fe8bdaf-491a-4d2e-89dd-756120b60898 is not absolute from XML Namespace module (code: 100, line: 1)
└ @ EzXML C:\Users\TGebbels\.julia\packages\EzXML\DL8na\src\error.jl:97

Besides these, the process seemed to work for my purposes. I don’t really know how significant these warnings are but I suppose Excel knows best what it puts in these elements of a .xlsx file.

Second, I changed the initial condition so that the customXml internal files were treated as binary files and simply “passed through” from read to write without being touched.

if !startswith(f.name, "customXml") && (endswith(f.name, ".xml") || endswith(f.name, ".rels"))
#if endswith(f.name, ".xml") || endswith(f.name, ".rels")

This also worked for my purpose and without producing any errors or warnings!

My use case is simply to add data from julia into one of several pre-existing Excel templates, so I’m only writing simple cell values. I have no idea whether this approach would have adverse consequences for more complex use cases. But for me, it seems to have worked!

1 Like

The next step is then to do (in an environment in which your dev version of XLSX is added):

] test XLSX

If all tests pass, your fix might not have any adverse consequences for more complex use cases (to the extent they are covered by XLSX’s test suite) and you can move up to the next level, which is making a pull request on GitHub do upstream your fix.

1 Like
     Testing Running tests...
Test Summary:   | Pass  Total  Time
read test files |   23     23  3.2s
Test Summary: | Pass  Total  Time
Cell names    |  326    326  0.0s
Test Summary: | Pass  Total  Time
getindex      |   12     12  1.2s
Test Summary:     | Pass  Total  Time
Time and DateTime |    6      6  0.1s
Test Summary:  | Pass  Total  Time
number formats |   16     16  1.3s
Test Summary: | Pass  Total  Time
Defined Names |   24     24  0.5s
Test Summary: | Pass  Total  Time
Book1.xlsx    |   32     32  0.3s
Test Summary:       | Pass  Total  Time
book_1904_ptbr.xlsx |    9      9  0.3s
Test Summary: | Pass  Total  Time
numbers.xlsx  |   75     75  0.0s
Test Summary: | Pass  Total  Time
Column Range  |    9      9  0.1s
Test Summary:      | Pass  Total  Time
CellRange iterator |    1      1  0.1s
Test Summary: | Pass  Total  Time
Table         |  598    598  2.8s
Test Summary:    | Pass  Total  Time
Helper functions |   32     32  0.5s
Test Summary: | Pass  Total  Time
Write         |   73     73  0.3s
Test Summary: | Pass  Total  Time
Edit Template |    3      3  0.2s
Test Summary: | Pass  Total  Time
addsheet!     |   10     10  0.1s
Test Summary: | Pass  Total  Time
Edit          |   13     13  0.3s
Test Summary: | Pass  Total  Time
writetable    |  183    183  1.3s
Test Summary: | Pass  Total  Time
Styles        |   74     74  0.2s
Test Summary: | Pass  Total  Time
filemodes     |  200    200  2.8s
Test Summary: | Pass  Total  Time
escape        |   84     84  0.2s
Test Summary: | Pass  Total  Time
row_index     |    1      1  0.1s
Test Summary: |Time
show xlsx     | None  0.0s
Test Summary:  | Pass  Total  Time
relative paths |    5      5  0.2s
Test Summary:         | Pass  Total  Time
windows compatibility |    3      3  0.0s
Test Summary:    | Pass  Total  Time
whitespace nodes |    8      8  0.0s
Test Summary: | Pass  Total  Time
inlineStr     |   20     20  0.0s
Test Summary:         | Pass  Total  Time
Tables.jl integration |   27     27  3.4s
     Testing XLSX tests passed 

!!! :smiley:

edit: Also, just to confirm, the status lists

  [fdbf4ff8] XLSX v0.10.2-dev `C:\Users\TGebbels\.julia\dev\XLSX`
2 Likes

Update read.jl to pass through Custom XML internal files by TimG1964 · Pull Request #261 · felipenoris/XLSX.jl (github.com)

Hope I did this right!

Big thanks to @nilshg. Without your patient hand-holding I wouldn’t have had the knowledge or confidence to get here.

1 Like

That’s great, thanks for trying and pushing this through. XLSX.jl has largely been a one-man job since its inception, partly I think because most people in the Julia community only ever have to work with Excel once in a blue moon. I’m sure Felipe would love a co-maintainer so if your work requires loads of Excel interaction it might be worthwhile digging into the package a bit more deeply and helping to drive development forward!

1 Like

This pull request was merged a few days ago. The latest version of XLSX.jl now incorporates the fix.

3 Likes