[ANN] Write Excel files with XLSX.jl v0.2.0

I have not experience with hosting docs on github, but maybe you could add another (temporary ?) shield for your docs if you already build them ?

The changes documented above and a host of others have now been merged into the #master branch of XLSX.jl. The revised docs are also linked as the Dev docs, too.

I have one further PR that is waiting to merge which adds a setFormula function, allowing you to add any Excel function to a cell or cell range from XLSX.jl and properly respects absolute and relative cell addressing.

julia> setFormula(sheet, "B2:J10", "=A2+B1") # Adds simple formula to a range of cells.

julia> setFormula(sheet, "E1:G1", "=sort(unique(A2:A7),,-1)") # adds a dynamic array function with a spill range.

julia> setFormula(sheet, "C1", " = max(B1#)") # Applies max() to the collected values in the spill range of B1.

@felipenoris has tagged the #master branch as v0.11.0 so I’m hoping a new release is coming soon!

I’ve been looking at memory usage in XLSX.jl and have now made a PR to reduce the memory footprint by about 40% compared with the current release (v0.10.4).

I illustrate with a simple spreadsheet of 10,000 rows x 21 columns containing a subset of data downloaded from this site (it downloads as a CSV, which I cut back and convert to an .xlsx file in Excel). This file has no formulas but it does have a large number of large shared strings and many numbers and dates - it has no magical properties! If I use About.jl to compare the memory footprint of the XLSXFile created when this file is loaded in "rw" mode, I see:

v0.10.4
about(f)             - 41MB
about(wb.sst)        - 13MB (included in f)
about(s.cache.cells) - 27MB (included in f)

However, v0.10.4 uses EzXML.jl to access the .xlsx file, and then stores the xml data it loads into the XLSXFile inside a C wrapper. I believe this is not seen by About.jl but, using simple code suggested by Claude, I calculate this to be an additional 10MB.

This brings the total memory footprint in v0.10.4 to 51MB!

In comparison, v0.11.0 uses XML.jl, so all of its memory use is seen by About.jl. For v0.11.0 + this PR, I see a total memory footprint of 30MB, as follows:

v0.11.0 + PR
about(f)             - 30MB
about(wb.sst)        -  9MB (included in f)
about(s.cache.cells) - 15MB (included in f)

There are three principal changes that have brought about this reduction. The first was already implemented (in v0.11.0, the current dev branch of XLSX.jl), and reduces the 10MB used previously by EzXML.jl to few hundred KB. The second tackles the sst cache while the third tackles the cell cache and is the biggest and most significant in terms of changes to the code base. These last two are the subject of the latest PR.

  • In v0.10.4 (in "rw" mode), all the internal .xml files are loaded and stored. Mostly these are small but, where there are many cells and/or many ssts, this results in large files which substantially duplicate the information held in the cell cache and sst cache respectively. In v0.11.0, the XML nodes holding the ssts and cell data are stripped out of the respective xml files as they are read in to their caches. On write, these nodes are created from the data in the cache anyway and always have been, so the duplicate data in v0.10.4 serve no purpose.
  • In v0.10.4 when sharedStrings are read in, they are eagerly converted to plain text strings and both the XML representation and the plain text string representation are stored in the sst cache. The XML representation may contain complex rich text formatting information, so can’t simply be discarded. In v0.11.0+PR, only the XML representation is stored in the sst cache and plain text strings are lazily generated when needed. Since, for most time-critical use cases (eg reading to a DataFrame), this still results in only one conversion per sst, there is no time penalty, but the memory saving is very significant.
  • Cells are the largest users of memory in an XLSXFile. In v0.10.4, the cell cache stores information as strings, so that a number is stored as it’s string representation, as are dates and datetimes (using their internal Excel representation). An XLSX.Cell structure would typically be around 100B in size, and this is obviously repeated for every non-empty cell. In v0.11.0+PR, I’ve converted an XLSX.Cell to always include only isbits types and to always and consistently take only 24B. The price for this has been the need to create a separate cache of cell formulas outside the cell (it is held in the Workbook in the same way as the sst cache). Since most cells do not contain formulas, the net result is a considerable saving.

Although these changes have required a major overhaul of the code, I don’t believe there are any changes to any aspect of the public API. Moreover, I don’t think there is any regression in terms of performance compared with v0.11.0-dev, either (no gains though. :cry:)

I’ll be very glad to hear your experiences of using this new version if you want to give it a go, and I will endeavour to fix any issues you uncover!

Here are a couple of small updates.

  • I’ve added support for AnnotatedStrings via a small package extension. Using the code in this PR, you can now create rich text cell values from annotated strings directly:
julia> using XLSX, StyledStrings

julia> f=XLSX.newxlsx()                                                                                       
XLSXFile("blank.xlsx") containing 1 Worksheet
            sheetname size          range
-------------------------------------------------
               Sheet1 1x1           A1:A1        


julia> s=f[1]                                                                                                 
1Ɨ1 Worksheet: ["Sheet1"](A1:A1) 

julia> s["A1"] = styled"{yellow:hello} {blue:there}"                                                          
"hello there"

julia> s["A2"] = styled"The {bold:{italic:quick {(foreground=#cd853f):brown} fox} jumps over the {(foreground=#FFC000):lazy} dog}"
"The quick brown fox jumps over the lazy dog"

image

  • I’ve created an (unregistered update: now registered) experimental package to decrypt password protected Excel files so they can be read by XLSX.jl. This package was almost entirely created by Claude with very little input from me but it does seem to work OK. Please treat with due caution.
julia> using XLSXDecrypt, XLSX

julia> buf = decrypt_xlsx("protected.xlsx", "password")

julia> f=openxlsx(buf, mode="rw")
XLSXFile("IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=8554, maxsize=Inf, ptr=8555, mark=-1)") containing 1 Worksheet
            sheetname size          range
-------------------------------------------------
               Sheet1 3x1           A1:A3

Edit: I’m not allowed to make three replies in succession without any intermediate replies from others, so I’m resorting to this edit. Maybe this is a sign that if/when I make a new release I should start a new topic!

XLSX.jl has now successfully moved to the JuliaData org. Thanks to @felipenoris and @quinnj for making that happen.

I’ve merged the latest PRs into the main dev branch on the repo. This includes support for AnnotatedStrings and support to set Excel formulas in cells using a setFormula function.

I’ll wait a few days at least before I attempt a new release, but I don’t think it will be very long unless someone reports a bunch of issues on the dev branch. If/when I make a new release there will be a bunch of issues, some very old, that I will be able to close at the same time.