[ANN] Write Excel files with XLSX.jl v0.2.0

I have not experience with hosting docs on github, but maybe you could add another (temporary ?) shield for your docs if you already build them ?

The changes documented above and a host of others have now been merged into the #master branch of XLSX.jl. The revised docs are also linked as the Dev docs, too.

I have one further PR that is waiting to merge which adds a setFormula function, allowing you to add any Excel function to a cell or cell range from XLSX.jl and properly respects absolute and relative cell addressing.

julia> setFormula(sheet, "B2:J10", "=A2+B1") # Adds simple formula to a range of cells.

julia> setFormula(sheet, "E1:G1", "=sort(unique(A2:A7),,-1)") # adds a dynamic array function with a spill range.

julia> setFormula(sheet, "C1", " = max(B1#)") # Applies max() to the collected values in the spill range of B1.

@felipenoris has tagged the #master branch as v0.11.0 so I’m hoping a new release is coming soon!

19 Likes

I’ve been looking at memory usage in XLSX.jl and have now made a PR to reduce the memory footprint by about 40% compared with the current release (v0.10.4).

I illustrate with a simple spreadsheet of 10,000 rows x 21 columns containing a subset of data downloaded from this site (it downloads as a CSV, which I cut back and convert to an .xlsx file in Excel). This file has no formulas but it does have a large number of large shared strings and many numbers and dates - it has no magical properties! If I use About.jl to compare the memory footprint of the XLSXFile created when this file is loaded in "rw" mode, I see:

v0.10.4
about(f)             - 41MB
about(wb.sst)        - 13MB (included in f)
about(s.cache.cells) - 27MB (included in f)

However, v0.10.4 uses EzXML.jl to access the .xlsx file, and then stores the xml data it loads into the XLSXFile inside a C wrapper. I believe this is not seen by About.jl but, using simple code suggested by Claude, I calculate this to be an additional 10MB.

This brings the total memory footprint in v0.10.4 to 51MB!

In comparison, v0.11.0 uses XML.jl, so all of its memory use is seen by About.jl. For v0.11.0 + this PR, I see a total memory footprint of 30MB, as follows:

v0.11.0 + PR
about(f)             - 30MB
about(wb.sst)        -  9MB (included in f)
about(s.cache.cells) - 15MB (included in f)

There are three principal changes that have brought about this reduction. The first was already implemented (in v0.11.0, the current dev branch of XLSX.jl), and reduces the 10MB used previously by EzXML.jl to few hundred KB. The second tackles the sst cache while the third tackles the cell cache and is the biggest and most significant in terms of changes to the code base. These last two are the subject of the latest PR.

  • In v0.10.4 (in "rw" mode), all the internal .xml files are loaded and stored. Mostly these are small but, where there are many cells and/or many ssts, this results in large files which substantially duplicate the information held in the cell cache and sst cache respectively. In v0.11.0, the XML nodes holding the ssts and cell data are stripped out of the respective xml files as they are read in to their caches. On write, these nodes are created from the data in the cache anyway and always have been, so the duplicate data in v0.10.4 serve no purpose.
  • In v0.10.4 when sharedStrings are read in, they are eagerly converted to plain text strings and both the XML representation and the plain text string representation are stored in the sst cache. The XML representation may contain complex rich text formatting information, so can’t simply be discarded. In v0.11.0+PR, only the XML representation is stored in the sst cache and plain text strings are lazily generated when needed. Since, for most time-critical use cases (eg reading to a DataFrame), this still results in only one conversion per sst, there is no time penalty, but the memory saving is very significant.
  • Cells are the largest users of memory in an XLSXFile. In v0.10.4, the cell cache stores information as strings, so that a number is stored as it’s string representation, as are dates and datetimes (using their internal Excel representation). An XLSX.Cell structure would typically be around 100B in size, and this is obviously repeated for every non-empty cell. In v0.11.0+PR, I’ve converted an XLSX.Cell to always include only isbits types and to always and consistently take only 24B. The price for this has been the need to create a separate cache of cell formulas outside the cell (it is held in the Workbook in the same way as the sst cache). Since most cells do not contain formulas, the net result is a considerable saving.

Although these changes have required a major overhaul of the code, I don’t believe there are any changes to any aspect of the public API. Moreover, I don’t think there is any regression in terms of performance compared with v0.11.0-dev, either (no gains though. :cry:)

I’ll be very glad to hear your experiences of using this new version if you want to give it a go, and I will endeavour to fix any issues you uncover!

13 Likes