David Sander’s question: [quote]First question is why don’t you just write it in Julia, [/quote]
Some more numbers.
Reading my cd.csv file into a Julia string buffer takes about 3 seconds. Although this is much slower than C, it is a one-time operation and thus acceptable for many uses.
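Concretely, the read I mean is just something like this (buf is a placeholder name):

buf = read("cd.csv", String)   # slurp the whole file into one String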
More problematic:
A parse.(Float32, stringvector) of 100 million strings takes about 20-25 seconds on my iMac. A @time begin; mysum=0.0; for i=1:n; global mysum+= parse(Float32, sv[1]); end; end takes about 50 seconds.
In contrast, the C version takes about 2-5 seconds. I am assuming this is also why R's fread is fast.
I do not know why Julia is so slow at the string-to-float conversion, but it limits the efficiency of native Julia code. Just guessing, but this is probably the bottleneck for CSV parsing.
I have no code calling C from Julia; I only experimented with pure C and with pure Julia. Both programs were trivial:
julia> @time begin; mystring="1.2"; mysum=0.0; for i=1:100_000_000; global mysum+= parse(Float32, mystring); end; end
27.745808 seconds (200.02 M allocations: 2.981 GiB, 0.92% gc time)
Runs take about 25-50 seconds.
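For reference, here is the same loop with the accumulator kept local inside a function (a sketch, untimed here), so the compiler can infer the accumulator's type:

function sumparse(s, n)
    mysum = 0.0                      # local accumulator instead of a global
    for i in 1:n
        mysum += parse(Float32, s)
    end
    return mysum
end
@time sumparse("1.2", 100_000_000)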
And the quick-and-dirty C code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    char s[100];
    int n = 100000000;
    float mysum = 0.0;
    sprintf(s, "%.1f", 1.2);    // s = "1.2", the same string the Julia loop parses
    for (int i = 0; i < n; ++i) {
        s[0] = '0' + i % 10;    // vary the string so the conversion is not optimized away
        mysum += atof(s);
    }
    printf("%f\n", mysum);
    return 0;
}
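For a meaningful comparison, this should be compiled with optimizations (e.g. gcc -O2); an unoptimized build would understate C's speed.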
Yes, I did know this. Unfortunately, the emphasis was on the did and not on the know. When one gets old, memory has already been used up and is no longer freed by the garbage collector.
Heck, at least I do remember always to type my variables and functions, because this avoids a set of other gotchas.
But can I ask not just about the mechanics, but also about the why? Is this a temporary problem, or something that is intrinsically unavoidable in Julia?
I mean, the variable is global and typed. When the code is compiled and run, the compiler can see that the variable exists and what type it has. I understand why type variability can cause extra code and dispatch time, but why here?
I am wondering: what is actually causing the slowdown of 100M native-C atofs vs. 100M Julia parses, then? Is it the C-Julia barrier, which is also crossed 100M times?
Even after my global-variable mishap (which, I apologize, sidetracked everyone), it still does not seem feasible to get a native Julia CSV decoder to the speed of fread, because the necessary string-to-float conversions alone already exceed the budget.
In the CRSP daily data, the ret variable sometimes contains characters like “C”, which will force the whole column to be read as strings. You then need to convert it back to numeric and replace these special observations with missing values. I think such tasks are already handled by most existing packages. If you write your own parser, a small change in the data structure may require a rewrite of the code. Are you running the code in the cloud?
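A minimal sketch of that cleanup step, assuming the column has already been read in as strings (the sample values are made up):

raw  = ["0.012", "C", "-0.003"]                      # e.g. a ret column read in as strings
vals = tryparse.(Float64, raw)                       # tryparse returns nothing on failure
ret  = [v === nothing ? missing : v for v in vals]   # a Union{Missing, Float64} column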
No, pointer. This is undefined behavior. f_safe is also not safer than f (other than the incorrect use of pointer there); it just checks for an embedded NUL, which is not really a safety issue. The actual safety issue, i.e. whether the string is NUL-terminated, is guaranteed in both cases.
I think I see. Although the compiler could easily figure out the type in my example, in more general cases another function could have the side effect of changing a global variable. This is not the case with variables defined in a stack frame.
In some sense, I would need the ability to lock a global variable to a certain type, and Julia 1.0.0 does not yet have that.
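The usual workaround, as far as I understand it, is a const binding, since const fixes what the name refers to and lets the compiler rely on its type (a sketch, not tested):

const total = Ref(0.0)       # const fixes the binding; the Ref makes the value mutable
add!(x) = (total[] += x)     # updates through the Ref are type-stable
add!(parse(Float32, "1.2"))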
Do note that CSV.jl uses the native-Julia float parser in the Parsers.jl package, which is different from what’s provided in Base Julia (Base just calls the C function strtod). There’s a known performance issue when parsing floats at full precision, but if the float is rounded to even one less significant digit, the performance is comparable with C. A few of us have ideas on how to fix this, it’s just a matter of deciding what would be easiest/best for the fix.
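To illustrate the distinction, here is a sketch (the Parsers.parse entry point and the exact digit counts are my assumptions, and no timing claims are made):

using Parsers
s17 = "0.10000000000000001"   # 17 significant digits: full Float64 precision, the slow path
s16 = "0.1000000000000000"    # one fewer significant digit: reportedly comparable to strtod
Parsers.parse(Float64, s17)
Parsers.parse(Float64, s16)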
Again, it would be most helpful if specific files could be shared, even just privately, to compare the performance of CSV.jl against other parsers.
Hi, thanks for the great work on CSV.jl; it's very nice to have the DataFrame interaction. What workflow do you recommend for large CSV files? Can we dump to another format at the moment? Or can we partially |> to a DataFrame?
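To make the question concrete, something like this is what I have in mind (assuming a limit keyword and the pipe to DataFrame; both may be wrong):

using CSV, DataFrames
df = CSV.File("big.csv"; limit=10_000) |> DataFrame   # read only the first 10_000 rows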
Would it make sense to generate a big test-data CSV (and why not other formats as well) with all the known bad-data features? Or maybe a package that generates the data file(s) locally for testing.
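Something along these lines, say (a throwaway sketch with made-up rows):

open("testdata.csv", "w") do io
    println(io, "id,ret,name")
    println(io, "1,0.012,alice")
    println(io, "2,C,bob")              # a non-numeric code in a numeric column
    println(io, "3,,carol")             # a missing field
    println(io, "4,-0.003,\"d,ave\"")   # a quoted field with an embedded comma
end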
As this question is orthogonal to the original, it may be worth starting a separate thread. When there are multiple conversations happening in a single thread, it can be hard to follow, and harder to find for future reference.