More control over writedlm formatting

mcopik · March 23, 2017, 5:45pm

Hi,

I’ve recently discovered that writedlm() tends to output floating-point numbers in a very strange format that removes the fractional part by decreasing the exponent. Not only is this terribly confusing, but the data is no longer easy for a human to read and understand.

An example:
6.0444e-5 -> 60444e-9
2.15234e-5 -> 215234e-10
3.3253e-5 -> 33253e-9
It is not a bug: print_shortest has done its job by reducing the output by exactly one character, the decimal mark. But the result is no longer human readable, and it can take a lot of time to notice that the decimal mark is gone and that the exponent is indeed correct.
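One possible workaround is to convert each number to a string in a fixed notation before handing the array to writedlm, so that writedlm no longer picks the notation itself. The sketch below assumes a recent Julia 1.x, where writedlm lives in the DelimitedFiles standard library and @sprintf in Printf (in the 0.5/0.6 versions discussed in this thread both were in Base). The `%.5e` format and the sample matrix are illustrative choices only.

```julia
using DelimitedFiles  # assumption: Julia 1.x, where writedlm is in this stdlib
using Printf          # provides @sprintf

A = [6.0444e-5 2.15234e-5; 3.3253e-5 1.0]

# Format every element as a string with an explicit decimal mark.
formatted = map(x -> @sprintf("%.5e", x), A)

io = IOBuffer()
writedlm(io, formatted)        # tab-delimited by default
output = String(take!(io))
print(output)
```

Each value keeps its decimal mark (for instance `6.04440e-05`), at the cost of building an intermediate array of strings.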

I have seen that a similar problem with the print function was resolved by no longer using print_shortest, and it looks like I’m neither the first nor the second to ask questions about the lack of configurability in this function. The source code suggests that at some point (at least two years ago) there was an idea to extend the capabilities of writedlm. I have to implement new functionality because the current situation is sadly not acceptable for us, and I might just try to contribute to Julia and extend writedlm(). The questions here are: is there some solution which I have not been able to find? Is someone already working on that? Do you already have an idea of how to extend the writedlm signature, or do you have plans for more generic IO formatting?

Another option here is to modify print_shortest, but I guess that there is a reason why it is being used.

Best regards,
Marcin

StefanKarpinski · March 24, 2017, 1:31pm

Use the CSV package instead: Home · CSV.jl. We should really delete readdlm and writedlm.

giordano · March 24, 2017, 2:12pm

Please, don’t. CSV can probably do a lot more than readdlm, but the latter can easily do very simple things without the need for a DataFrame. CSV.read is not even able to read a UTF-encoded file correctly[1]; I have no such problem with readdlm.

I remember you also expressed similar “disdain” for QuadGK, which is actually a very nice piece of code

Note:
[1] This seems specific to CSV.jl; DataFrames.jl works flawlessly.

Edit: the problem with CSV.jl can actually be solved by setting the keyword weakrefstrings=false.

mkborregaard · March 24, 2017, 3:10pm

Isn’t readdlm for reading matrices directly? I think that functionality is distinct from CSV’s? (I don’t think a Matrix defines a Data.Sink)

StefanKarpinski · March 24, 2017, 8:35pm

It has nothing to do with disdain. Numerical integration just doesn’t belong in a language’s standard library. It’s fine to include it in a standard distribution of packages, but there’s just no reason to have integration in the base language. In the case of CSV reading and writing, it’s for a different reason: the current state of affairs is perfectly demonstrated by this thread. People naturally try the built-in functionality, which lags behind the CSV package in both features and performance, and whenever anyone has a question the answer ends up being to use the CSV package. The fact that CSV.read doesn’t handle UTF-8 data correctly doesn’t mean that we should continue to split efforts between readcsv and CSV; it means that CSV should be fixed and readcsv should be deleted.

StefanKarpinski · March 24, 2017, 8:38pm

In general, we should not have several half-baked solutions to problems – be it integration, csv reading/writing, or whatever. Instead, we should strive to have a single good solution. Having a half-baked solution in the standard library actively prevents us from reaching that better state of affairs.

giordano · March 24, 2017, 8:54pm

“Disdain” was between quotes for a reason

Anyway, I see your point, but really, for many simple uses readdlm and writedlm are very handy. Of course, for more fine-grained control other tools are more suited.

How about moving them to an external package at least? They wouldn’t look like “the official way of reading/writing matrices in Julia”, but they would at least still be available.

Paul_Soderlind · March 24, 2017, 9:15pm

It is probably reasonable to move readdlm and similar things out of the standard library, but I would kindly ask you to offload them to simple and stand-alone packages. QuadGK is a good example.

There is something to be said for packages that do not have lots of cross-dependencies (on other packages).

/Paul S

mkborregaard · March 24, 2017, 9:47pm

I think there might be some confusion about words here? I am sure almost everybody agrees that CSV.read is superior to readcsv and readtable for reading DataFrames, and that these should be removed in favour of a CSV dependency. But the question was about readdlm / writedlm which are also used for directly reading numerical matrices into the Matrix type. That is still really useful.

Or am I misunderstanding this? (EDIT I was misrepresenting this slightly, so I edited the text above).
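The Matrix use case mentioned above can be illustrated with a short round trip. This is a sketch assuming a recent Julia 1.x, where writedlm and readdlm are provided by the DelimitedFiles standard library rather than Base; the sample values and the comma delimiter are arbitrary.

```julia
using DelimitedFiles  # assumption: Julia 1.x location of writedlm/readdlm

M = [1.0 2.5; 3.0 4.25]

io = IOBuffer()
writedlm(io, M, ',')            # write the matrix as comma-delimited text
seekstart(io)                   # rewind so the same buffer can be read back
M2 = readdlm(io, ',', Float64)  # parse straight into a Matrix{Float64}
```

No table type is involved at any point: the data goes from a Matrix to text and back to a Matrix.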

mzaffalon · March 25, 2017, 4:42am

If this happens, are there plans to add a section in the official documentation with a collection of recommended libraries as opposed to listing 1300+ packages and their build status? For somebody not keeping a close eye on the current state of development, an extended standard library documentation section would be the fastest way to get started.

(Since it was mentioned, quadgk is no longer a standard library routine in v0.6, and the only reference I could find to the QuadGK.jl package is in the developer section about compiler efficiency.)

mcopik · March 25, 2017, 3:34pm

I agree fully. I believe it is a very common problem and having a simple file I/O for numerical matrices would be very beneficial to users.

The real question here is not whether writedlm should be more powerful or should be abandoned. I still don’t understand why 215234e-10 is considered a better format than 2.15234e-5. It’s not easier to read, it’s barely shorter, and it is terribly confusing. Is there anyone else here who agrees with me that this is just a bug in print_shortest?

mcopik · March 25, 2017, 3:57pm

I understand your POV but, as I said above in another reply, saving a small array of numbers to a file is a very common task which requires neither great flexibility nor high throughput, and using a full library for it is overkill. One can argue that my complaint is based on a lack of configurability in writedlm(), but I did not expect it to be flexible or efficient. I expected a simple utility to save data in a human-readable format, and it failed me.

Do you think that I should download an additional and much more complex package to just create a datafile for pgfplots? I’d have to put this burden on each user of my very small and simple library. I downloaded the package and here is the output:

INFO: Installing CSV v0.1.2
INFO: Installing CategoricalArrays v0.1.3
INFO: Installing Compat v0.21.0
INFO: Installing DataArrays v0.4.0
INFO: Installing DataFrames v0.9.0
INFO: Installing DataStreams v0.1.2
INFO: Installing DataStructures v0.5.3
INFO: Installing FileIO v0.3.1
INFO: Installing GZip v0.3.0
INFO: Installing NullableArrays v0.1.0
INFO: Installing Reexport v0.0.3
INFO: Installing SortingAlgorithms v0.1.1
INFO: Installing SpecialFunctions v0.1.1
INFO: Installing StatsBase v0.13.1
INFO: Installing WeakRefStrings v0.2.0

Fifteen packages to create a simple text file with a few strings and numbers, and it is very likely that I will never use any other functionality from these packages in my project. I can achieve the same thing with one function call in NumPy (savetxt) or two in MATLAB (fprintf for the header and dlmwrite).

The most likely scenario here is that I’ll just implement a loop over an IOBuffer, generating a very long string representing my data. It won’t be efficient, but it doesn’t have to be. It won’t be elegant, but it will keep my library simple and easy to use.
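A loop of that kind could look roughly like the sketch below. The helper name `save_table`, its `delim` keyword, and the `%.6e` format are hypothetical choices made for illustration; the code assumes Julia 1.x, where @sprintf comes from the Printf standard library.

```julia
using Printf  # provides @sprintf in Julia 1.x

# Hypothetical helper: write a header line, then each row of `A`
# in fixed scientific notation, separated by `delim`.
function save_table(io::IO, header::AbstractString, A::AbstractMatrix{<:Real};
                    delim::AbstractString = " ")
    println(io, header)
    for i in axes(A, 1)
        cells = [@sprintf("%.6e", A[i, j]) for j in axes(A, 2)]
        println(io, join(cells, delim))
    end
    return nothing
end

buf = IOBuffer()
save_table(buf, "x y", [1.0 6.0444e-5; 2.0 2.15234e-5])
text = String(take!(buf))
print(text)
```

Passing an open file handle instead of the IOBuffer writes the same text to disk, which covers the header-plus-data layout that savetxt or fprintf with dlmwrite would produce.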

nalimilan · March 25, 2017, 6:37pm

What’s the actual problem with installing these 15 packages? They are all pure Julia and very lightweight. They don’t make your library complex nor hard to use. At some point there will be a set of preinstalled standard packages, in which CSV might be included.

The problem with having simple default functions is that we keep having to tell people to use the more complex implementation when the basic one doesn’t suit their needs. This wastes everybody’s time.

It should be possible to allow CSV.jl to return data as a matrix for when you don’t want to use a data frame. The code for that just needs to be written.

mkborregaard · March 25, 2017, 7:00pm

Though I may not care about installing 15 packages (especially basic ones like these) it still seems like a big dependency to incur for a package to be able to read a CSV file, given the efforts we otherwise take to keep packages dependency-light and modular (e.g. StatPlots is a distinct package from Plots precisely so that Plots can avoid depending on those packages).

My main concern was to point out that readcsv and CSV.read do not have completely overlapping functionalities, which should be considered if readcsv should be deleted. That would be remedied if the Matrix type would define a Data.Sink as you say. But is that really the best design?

Would it not seem more obvious to make the methods distinct by deprecating readcsv’s ability to read DataFrames and other general table structures (via the ::Type positional argument), but keeping the function as a way to read and write a Matrix in Base, and then making CSV.read the only way of reading DataFrames (and friends)? It would still avoid the issue of having to tell people to use a different function.

mkborregaard · March 25, 2017, 7:11pm

(maybe the discussion about deprecating readdlm should be split into a different topic and this thread kept to discuss the issue of print_shortest formatting)

StefanKarpinski · March 25, 2017, 8:00pm

Having functionality for loading an array of delimited values with a single common element type, with no support for escaping data or anything complex, seems fine, but readdlm and writedlm try to do way too much more than that. The number of dependencies of CSV.jl is an issue, but the fact that it currently has too many dependencies is not really an argument that we shouldn’t strive to have a simpler, better factored, generic CSV reading/writing in an external package instead of in the base library.
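To give a sense of how small that minimal functionality is, here is a rough sketch of a reader for delimited values with a single element type and no quoting or escaping. The function `read_matrix` is hypothetical and is not part of Base, DelimitedFiles, or CSV.jl.

```julia
# Hypothetical helper: parse delimited text into a Matrix{T}.
# No quoting, escaping, headers, or missing values are supported.
function read_matrix(io::IO, ::Type{T} = Float64;
                     delim::AbstractChar = ',') where {T<:Number}
    rows = Vector{Vector{T}}()
    for line in eachline(io)
        isempty(strip(line)) && continue   # skip blank lines
        push!(rows, [parse(T, cell) for cell in split(line, delim)])
    end
    isempty(rows) && return Matrix{T}(undef, 0, 0)
    ncols = length(first(rows))
    all(r -> length(r) == ncols, rows) || error("rows have differing lengths")
    # Copy the parsed rows into the result, one row at a time.
    M = Matrix{T}(undef, length(rows), ncols)
    for (i, r) in enumerate(rows)
        M[i, :] = r
    end
    return M
end

M = read_matrix(IOBuffer("1,2,3\n4,5,6\n"), Int)
```

Anything beyond this (quoted fields, mixed column types, headers) is where a dedicated package becomes the more reasonable place for the code.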

mkborregaard · March 25, 2017, 8:26pm

I agree completely with that.

quinnj · March 25, 2017, 9:23pm

The dependencies situation for CSV is indeed currently unfortunate, but it’s entirely due to the current dependency on DataFrames, which has dependency-bloated over the years and is the number one example I’m aware of that would benefit from optional dependencies. I could remove the DataFrames dependency, which means users would have to write CSV.read(file, DataFrame) instead, which isn’t terrible.

Also, there’s absolutely no reason we can’t define Data.Sink interface methods for Matrix, and it’s been on my list for a while; I just haven’t gotten around to it.

mkborregaard · March 25, 2017, 9:30pm

Most of the dependencies seem to also come through DataStreams though? Or could you also remove DataFrames and the others from that package? (Don’t be mistaken, I am a huge fan of CSV.jl.)

mcopik · March 27, 2017, 12:48pm

Do you think I should start another thread here or open an issue on Github?
