David Sander’s question: [quote]First question is why don’t you just write it in Julia, [/quote]
Some more numbers.
Reading my cd.csv file into a Julia string buffer takes about 3 seconds. Although this is much slower than C, it is a one-time operation and thus acceptable for many uses.
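Concretely, the read I mean is just something like this (buf is a placeholder name):

buf = read("cd.csv", String)   # slurp the whole file into one String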
More problematic:
A parse.(Float32, stringvector) of 100 million strings takes about 20-25 seconds on my iMac. A @time begin; mysum=0.0; for i=1:n; global mysum+= parse(Float32, sv[1]); end; end takes about 50 seconds.
In contrast, the C version takes about 2-5 seconds. I am assuming this is also why R's fread is fast.
I do not know why Julia is so slow at the string-to-float conversion, but it limits the efficiency of native Julia code. Just guessing, but this is probably the bottleneck for CSV parsing.
I have no code calling C from Julia; I only experimented with pure C and with pure Julia. Both programs were trivial:
julia> @time begin; mystring="1.2"; mysum=0.0; for i=1:100_000_000; global mysum+= parse(Float32, mystring); end; end
27.745808 seconds (200.02 M allocations: 2.981 GiB, 0.92% gc time)
Runs take about 25-50 seconds.
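For reference, here is the same loop with the accumulator kept local inside a function (a sketch, untimed here), so the compiler can infer the accumulator's type:

function sumparse(s, n)
    mysum = 0.0                      # local accumulator instead of a global
    for i in 1:n
        mysum += parse(Float32, s)
    end
    return mysum
end
@time sumparse("1.2", 100_000_000)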
And the quick-and-dirty C code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    char s[100];
    int n = 100000000;
    float mysum = 0.0;
    sprintf(s, "%.1f", 1.2);    // s = "1.2", the same string the Julia loop parses
    for (int i = 0; i < n; ++i) {
        s[0] = '0' + i % 10;    // vary the string so the conversion is not optimized away
        mysum += atof(s);
    }
    printf("%f\n", mysum);
    return 0;
}
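For a meaningful comparison, this should be compiled with optimizations (e.g. gcc -O2); an unoptimized build would understate C's speed.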
Yes, I did know this. Unfortunately, the emphasis was on the did and not on the know. When one gets old, memory has already been used up and is no longer freed by the garbage collector.
Heck, at least I do remember always to type my variables and functions, because this avoids a set of other gotchas.
But can I ask not just about the mechanics, but also about the why? Is this a temporary problem, or something that is intrinsically unavoidable in Julia?
I mean, the variable is global and typed. When the code is compiled and run, the compiler can see that the variable exists and what type it has. I understand why type variability can cause extra code and dispatch time, but why here?
I am wondering: what is actually causing the slowdown of 100M native-C atofs vs. 100M Julia parses, then? Is it the C-Julia barrier, which is also crossed 100M times?
Even after my global-variable mishap (which, I apologize, sidetracked everyone), it still does not seem feasible to get a native Julia CSV decoder to the speed of fread, because the necessary string-to-float conversions alone already exceed the budget.
In the CRSP daily data, the ret variable sometimes contains characters like “C”, which will force the whole column to be read as strings. You then need to convert it back to numeric and replace these special observations with missing values. I think such tasks are already handled by most existing packages. If you write your own parser, a small change in the data structure may require a rewrite of the code. Are you running the code in the cloud?
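A minimal sketch of that cleanup step, assuming the column has already been read in as strings (the sample values are made up):

raw  = ["0.012", "C", "-0.003"]                      # e.g. a ret column read in as strings
vals = tryparse.(Float64, raw)                       # tryparse returns nothing on failure
ret  = [v === nothing ? missing : v for v in vals]   # a Union{Missing, Float64} column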
No, pointer. This is undefined behavior. f_safe is also not safer than f (other than the incorrect use of pointer there); it just checks for an embedded NUL, which is not really a safety issue. The actual safety issue, i.e. whether the string is NUL-terminated, is guaranteed in both cases.
I think I see. Although the compiler could easily figure out the type in my example, in more general cases another function could have the side effect of changing a global variable. This is not the case with variables defined in a stack frame.
In some sense, I would need the ability to lock a global variable to a certain type, and Julia 1.0.0 does not yet have that.
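The usual workaround, as far as I understand it, is a const binding, since const fixes what the name refers to and lets the compiler rely on its type (a sketch, not tested):

const total = Ref(0.0)       # const fixes the binding; the Ref makes the value mutable
add!(x) = (total[] += x)     # updates through the Ref are type-stable
add!(parse(Float32, "1.2"))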
Do note that CSV.jl uses the native-Julia float parser in the Parsers.jl package, which is different from what’s provided in Base Julia (Base just calls the C function strtod). There’s a known performance issue when parsing floats at full precision, but if the float is rounded to even one less significant digit, the performance is comparable with C. A few of us have ideas on how to fix this, it’s just a matter of deciding what would be easiest/best for the fix.
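To illustrate the distinction, here is a sketch (the Parsers.parse entry point and the exact digit counts are my assumptions, and no timing claims are made):

using Parsers
s17 = "0.10000000000000001"   # 17 significant digits: full Float64 precision, the slow path
s16 = "0.1000000000000000"    # one fewer significant digit: reportedly comparable to strtod
Parsers.parse(Float64, s17)
Parsers.parse(Float64, s16)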
Again, it would be most helpful if specific files could be shared, even just privately, to compare the performance of CSV.jl against other parsers.
Hi, thanks for the great work on CSV.jl; it's very nice to have the DataFrame interaction. What workflow do you recommend for large CSV files? Can we dump to another format at the moment? Or can we partially |> to a DataFrame?
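To make the question concrete, something like this is what I have in mind (assuming a limit keyword and the pipe to DataFrame; both may be wrong):

using CSV, DataFrames
df = CSV.File("big.csv"; limit=10_000) |> DataFrame   # read only the first 10_000 rows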
Would it make sense to generate a big test-data CSV (and why not other formats as well) with all the known bad-data features? Or maybe a package that generates the data file(s) locally for testing.
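Something along these lines, say (a throwaway sketch with made-up rows):

open("testdata.csv", "w") do io
    println(io, "id,ret,name")
    println(io, "1,0.012,alice")
    println(io, "2,C,bob")              # a non-numeric code in a numeric column
    println(io, "3,,carol")             # a missing field
    println(io, "4,-0.003,\"d,ave\"")   # a quoted field with an embedded comma
end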
As this question is orthogonal to the original, it may be worth starting a separate thread. When there are multiple conversations happening in a single thread, it can be hard to follow, and harder to find for future reference.