16000 cases lost in UK Covid reporting

This is being reported by the media as a failure with Excel.
It seems more likely it is a problem with CSV Format files being used for data exchange.
Should have been using Julia of course!

How does a file format (CSV) relate to the use of a programming language (Julia) in this case?

Edit: in case your point was that CSV is a horrible way to store and exchange data then I fully agree :slight_smile: But given that Excel is apparently used in the story (and lots of other places) I’m genuinely curious what role you see for Julia?

Hi Paul. There is a lot of fuss about this in the media in the UK (I don't know if you are located there). I guess I am just flagging this up - first we saw epidemiological models being front-page news, and now data analysis.
According to the Register article, the problem was an old version of Excel which could not cope with many rows of data.

I guess my remark about Julia was a bit flippant. Again, in the UK there is an outcry about why more modern methods are not being used. I guess Julia is modern!

Before you say it, I am well aware that every company in the world runs on Excel and could not function without it.

I definitely noticed this particular bit of news here in The Netherlands :slight_smile: Technological progress is usually hampered by the least “modern” link in the chain that can’t (or won’t) be upgraded. But then again, folks also choose a particular solution because it works for them right now, instead of one that is more modern but that they would have to wait for until it is more widely adopted.

I find the posts on this forum about multi-threaded CSV reading and writing a nice symptom of that situation. I recently worked on a project that used text-encoded files (basically CSV) of several GB each, which could easily be stored in a binary-encoded format (Parquet) at around 325 MB apiece. That is a great step forward in terms of storage, but also in I/O performance and in easier querying of subsets. But I’m afraid CSV and the like will be around for a long time.
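To give a rough feel for why a binary encoding is smaller than text, here is a minimal sketch using only the Python standard library. It is not Parquet: Parquet additionally uses a columnar layout, encodings and compression, none of which is modelled here. The data is made up (random floats), and the only point is that a fixed-width 8-byte encoding of a float takes less space than its decimal text form while still round-tripping exactly.

```python
import random
import struct

# Hypothetical data: 10,000 random floats standing in for one numeric column.
random.seed(0)
values = [random.random() for _ in range(10_000)]

# Text encoding: one decimal representation per line, as a CSV column would store it.
text_bytes = "\n".join(repr(v) for v in values).encode("ascii")

# Binary encoding: each value as a little-endian 8-byte IEEE 754 double.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

# The binary form can be decoded back to exactly the same values.
restored = list(struct.unpack(f"<{len(values)}d", binary_bytes))

print("text bytes:  ", len(text_bytes))
print("binary bytes:", len(binary_bytes))
print("round trip ok:", restored == values)
```

The ratio you get in practice depends heavily on the data; the several-GB-to-325-MB figure above comes from real files and a real Parquet writer, not from this toy comparison.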

It’s also partly about training people to use different (more powerful) tools, like Julia and the things built on top of it. But Excel, or spreadsheets in general, are actually a good fit for certain types of data and workflows, plus they are fairly easy to learn. So there’s also a tradeoff between tool complexity, benefit and skills.

Going a bit off topic… I lived in Eindhoven and worked with ASML.
Sadly I could not introduce Julia there.
Regarding file formats, they had lots of images stored in directories… hundreds or thousands. Again I would like to see that in HDF5 or a similar format… but ahh well…

Please contribute to my thread in Community and tell us what interesting things you are doing with Julia

I would be willing to bet that in 99% of all scenarios like this, someone at some point questioned the use of Excel, and was told that it will not be replaced because “it works”.

@Tamas_Papp yes indeed. From the comment in the Register article:
From my understanding it was reported right up to the top so that there would be absolutely no misunderstanding that Excel should not be used. The heads of development at NHSX told PHE exactly what would happen if they used a spreadsheet system and sure enough it happened.

This quote from the article is even better (or worse):

A Reg source confirmed widespread use of the spreadsheet software as “human middleware” in the sector, scathingly describing it as the “default for all tech in all of the NHS and related quangos and other bodies… to bridge all the gaps that the ‘proper’ tech hasn’t been designed to cope with.”

I think this quote from this BBC article sums it up nicely:

“Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like,” commented Prof Jon Crowcroft from the University of Cambridge.

“And then when you need to do something more serious, you build something bespoke that works - there’s dozens of other things you could do.

“But you wouldn’t use XLS. Nobody would start with that.”

Also:

To handle the problem, PHE is now breaking down the test result data into smaller batches to create a larger number of Excel templates. That should ensure none hit their cap.

But insiders acknowledge that the current clunky system needs to be replaced by something more advanced that excludes Excel, as soon as possible.
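As an illustration of the batching workaround described in that quote, here is a minimal sketch in Python. It is not how PHE actually implemented it (the articles do not say); the function name `batches` and the choice to reserve one row per sheet for a header are my own assumptions. It splits any sequence of rows into chunks that each fit under the legacy .xls row limit.

```python
from itertools import islice

XLS_MAX_ROWS = 65_536  # legacy .xls row limit discussed in this thread


def batches(rows, batch_size=XLS_MAX_ROWS - 1):
    """Yield lists of at most batch_size rows.

    The default leaves one row per sheet free for a header (an assumption).
    """
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# Usage with made-up data: 150,000 dummy rows end up in three batches.
parts = list(batches(range(150_000)))
print(len(parts), [len(p) for p in parts])
```

Every row ends up in exactly one batch, so nothing is dropped, but as the insiders say, this only works around the cap rather than removing it.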

Which cap?
Was Excel’s row limit (roughly 1 million rows) exceeded?

No, the news says the 65,000-row limit of the older Excel file format was exceeded. I’m guessing the actual limit is 65536…

That seems strange to me.
In Excel 2007 that row limit was increased to 1,048,576.

So either they use incredibly old software (which I doubt), or they used an *.xls file (which makes me wonder how that could happen…?).
Excel actually gives you a warning when you enter more than 65536 rows in an xls file and try to save it.
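Both limits mentioned here are powers of two: 65536 is 2^16 and 1,048,576 is 2^20. A pipeline that writes the old format without Excel in the loop would not show that warning, so it would need its own check. A minimal sketch in Python (the function name `check_fits` is hypothetical, not from any library):

```python
# Worksheet row limits as quoted in this thread.
XLS_MAX_ROWS = 2 ** 16   # 65,536 rows in the legacy .xls format
XLSX_MAX_ROWS = 2 ** 20  # 1,048,576 rows since Excel 2007


def check_fits(row_count, max_rows=XLS_MAX_ROWS):
    """Raise an error instead of silently dropping rows beyond the sheet limit."""
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows exceed the sheet limit of {max_rows}; "
            f"{row_count - max_rows} rows would be lost"
        )
    return row_count


check_fits(60_000)  # fits in an .xls sheet
try:
    check_fits(70_000)  # does not fit
except ValueError as err:
    print(err)
```

Failing loudly like this is the whole point: the data loss in the story went unnoticed because nothing in the chain complained.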

I guess somewhere down the line there was one supervisor who didn’t have an updated version of MS Office and didn’t want to update their setup, so everyone else was forced to use an old format.

They have a CSV-to-Excel “file format” pipeline, and they might not have used Excel at all for it (I’m speculating; they could have used a software library). The export would specify the old format, which is possible even with new Excel, not just with software libraries. Being restricted to the old format (because of some software) doesn’t mean you can’t open the file in newer Excel.

They may well simply have used an end-of-life Excel version… for exporting or importing (or they weren’t sure everyone had upgraded, and used that as justification for sticking with the old format).

Going a bit off topic… I lived in Eindhoven and worked with ASML.
Sadly I could not introduce Julia there.

Also off-topic, but regarding this remark, have you noticed this presentation at JuliaCon 2020?
https://live.juliacon.org/talk/FHEGUA

@Klaas_Pauly I did not see that presentation. I shall watch it!

If anyone knows Jorge’s handle on here or his contact details, please let me know.

Staying off topic, Matlab, when run in parallel, has a bad habit of leaving processes running on the worker systems. You either have to regularly terminate orphaned processes, or do what I did and implement cgroups so that they are killed when the job terminates.