CSV Reading (rewrite in C?)

David Sander’s question: “First question is why don’t you just write it in Julia”

some more numbers.

Reading my cd.csv file into a Julia string buffer takes about 3 seconds. Although this is much slower than C, it is a one-time operation and thus acceptable for many uses.
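For reference, the C equivalent of that one-time buffer read is a few lines. This is a generic sketch (the helper name and error handling are illustrative, not from the original post): read the whole file into one malloc’d, NUL-terminated buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a malloc'd, NUL-terminated buffer.
   Returns NULL on failure; on success stores the byte count in
   *len and the caller frees the buffer. */
char *slurp(const char *path, long *len) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);          /* file size in bytes */
    fseek(f, 0, SEEK_SET);
    char *buf = malloc(n + 1);
    if (buf && fread(buf, 1, n, f) != (size_t)n) {
        free(buf);
        buf = NULL;
    }
    if (buf) {
        buf[n] = '\0';          /* NUL-terminate for string APIs */
        if (len) *len = n;
    }
    fclose(f);
    return buf;
}
```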

More problematic:

A parse.(Float32, stringvector) over 100 million strings takes about 20-25 seconds on my iMac. A @time begin; mysum=0.0; for i=1:n; global mysum+= parse(Float32, sv[1]); end; end takes about 50 seconds.

In contrast, the C version takes about 2-5 seconds. I am assuming that this is also why R’s fread is fast.

I do not know why Julia is so slow at the string->Float conversion, but it limits the efficiency of native Julia code. Just guessing, this is the bottleneck for CSV.

Please post code when posting benchmarks if possible. Julia calls into C for parsing e.g. Float32s.

Trying on master I get

julia> strs = string.(1:100*10^6);

julia> @time parse.(Float32, strs);
 10.913384 seconds (9 allocations: 381.470 MiB)

(Master and 1.0.1 do have the improvement from “improve performance for parsing Floats by KristofferC · Pull Request #27764 · JuliaLang/julia · GitHub”, which might explain why you get slower results on an older version.)

For performance, don’t use globals. With that fixed, this takes (of course) just as long as parsing the strings alone.

1 Like

I have no code mixing C and Julia. I only experimented with pure C and with pure Julia. Both pieces of code were trivial:

julia> @time begin; mystring="1.2"; mysum=0.0; for i=1:100_000_000; global mysum+= parse(Float32, mystring); end; end
 27.745808 seconds (200.02 M allocations: 2.981 GiB, 0.92% gc time)

25-50 seconds.

and quick-and-dirty C code:

#include <stdio.h>
#include <stdlib.h>

int main() {
  char s[100];
  int n=100000000;

  float mysum=0.0; sprintf(s, "%f", 1.2);
  for (int i=0; i<n; ++i) {
    s[0]= '0'+i%10;  /* make sure we are not optimizing it away */
    mysum+= atof(s);
  }
  printf("%f\n", mysum);
}

2 seconds.

That’s spending a fair amount of time updating a non-const global.

julia> @time begin; mystring="1.2"; mysum=0.0; for i=1:100_000_000; global mysum+= parse(Float32, mystring); end; end
 38.471735 seconds (200.02 M allocations: 2.981 GiB, 0.21% gc time)

julia> function f()
           mystring="1.2"
           mysum=0.0
           for i=1:100_000_000
               mysum += parse(Float32, mystring)
           end
           return mysum
       end
       @time f()
  9.861511 seconds (24.90 k allocations: 1.305 MiB)
1.2000000476837158e8

julia> @time f()
  9.834558 seconds (5 allocations: 176 bytes)
1.2000000476837158e8
3 Likes

Oh, one of those Julia gotchas.

The global was a typed floating point, one storage location. I hope that Julia will eventually recognize such situations.

OK, so we are down to a factor of 4 rather than 10. Still, for a fast CSV reader, this is a killer.

julia> f(s) = ccall((:atof, "libc"), Float64, (Ptr{UInt8},), pointer(s))
f (generic function with 2 methods)

julia> @btime f("1.23")
  27.176 ns (0 allocations: 0 bytes)
1.23

julia> f_safe(s) = ccall((:atof, "libc"), Float64, (Cstring,), s)
f_safe (generic function with 1 method)

julia> @btime f_safe("1.23")
  34.547 ns (0 allocations: 0 bytes)
1.23

julia> @btime parse(Float64, "1.23")
  31.142 ns (0 allocations: 0 bytes)
1.23

FWIW.

Benchmarking in global scope is probably the most common mistake that Julia newbies make, and is the topic of countless posts. But I would think you already knew that considering THIS. :slight_smile:

6 Likes

Use Traceur.jl to avoid such common performance issues.

1 Like

“But I would think you already knew that considering THIS”

Yes, I did know this. Unfortunately, the emphasis was on the did and not on the know. When one gets old, memory has already been used up and is no longer freed by the garbage collector.

Heck, at least I do remember always to type my variables and functions, because this avoids a set of other gotchas.

Yes, for real code, Traceur would save my behind.

But can I ask not just about the mechanics, but also about the why? Is this a temporary problem, or something intrinsically unavoidable in Julia?

I mean, the variable is global and typed. The code is compiled and run, and at that point the compiler can see that the variable exists and is typed. I understand why type variability can cause extra code and dispatch time, but why here?
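One way to see the cost in C terms: an untyped global behaves like a tagged box whose type tag must be checked, and whose payload unboxed, on every single access, whereas a value of known type compiles to a bare machine double. A toy sketch of that difference (illustrative only; this is not Julia’s actual internal representation):

```c
#include <stdio.h>

/* Toy model of a dynamically typed slot: a type tag that must be
   inspected on every access before the payload can be used. */
typedef struct {
    enum { TAG_INT, TAG_FLOAT } tag;
    union { long i; double f; } u;
} Box;

/* Accumulate into a boxed slot: one branch + unbox per iteration,
   because the slot's type could in principle change at any time. */
void box_add(Box *acc, double x) {
    if (acc->tag == TAG_FLOAT) {
        acc->u.f += x;
    } else {                        /* slot currently holds an int */
        acc->u.f = (double)acc->u.i + x;
        acc->tag = TAG_FLOAT;
    }
}

/* The statically typed version: a single machine add, no branch. */
double typed_add(double acc, double x) { return acc + x; }
```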

I am wondering: what is actually causing the slowdown of 100M C-native atof’s vs. 100M Julia parses, then? Is it the C-Julia barrier, which is also crossed 100M times?

Even after my global-variable mishap (which, I apologize, sidetracked everyone), it still does not seem feasible to get a native Julia CSV decoder to the speed of fread, because the necessary string-to-float conversions already exceed the budget.
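For what it’s worth, one reason fast CSV readers stay within the budget is that they never materialize a string per field: they parse floats directly out of the file buffer. A minimal sketch of that pattern in C, using strtod’s end pointer to both convert and advance the cursor, with no per-field copies (the function name is illustrative):

```c
#include <stdlib.h>

/* Sum comma/newline-separated floats straight out of a buffer.
   strtod converts the leading number and reports where it stopped,
   so the cursor walks the buffer without allocating substrings. */
double sum_fields(const char *buf) {
    double total = 0.0;
    const char *p = buf;
    while (*p) {
        char *end;
        total += strtod(p, &end);
        if (end == p)                 /* no progress: not a number */
            break;
        p = end;
        if (*p == ',' || *p == '\n')  /* step over one delimiter */
            ++p;
    }
    return total;
}
```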

In the CRSP daily data, the ret variable sometimes contains characters like “C”, which will force the whole column into strings. You then need to convert it back to numeric and replace these special observations with missing values. I think such tasks have been handled by most existing packages. If you write your own parser, a small change in the data structure may require a rewrite of the code. Are you running the code in the cloud?
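Tolerant parsing of this kind is cheap even at the C level: strtod’s end pointer tells you whether the whole field was consumed as a number, so a sentinel like “C” can be flagged as missing instead of poisoning the column. A sketch (the function name and missing-value convention are illustrative):

```c
#include <stdlib.h>

/* Parse one field; returns 1 and stores the value on success,
   or 0 (treat as missing) when the field is a non-numeric code
   such as "C" or contains trailing junk. */
int parse_field(const char *s, double *out) {
    char *end;
    double v = strtod(s, &end);
    if (end == s || *end != '\0')   /* nothing parsed, or leftovers */
        return 0;
    *out = v;
    return 1;
}
```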

No it’s not typed.

No that’s undecidable.

1 Like

No, pointer. This is undefined behavior. f_safe is also not safer than f (other than avoiding the incorrect use of pointer there); it just checks for an embedded NUL, which isn’t really a safety issue. The actual safety issue, i.e. whether the string is NUL-terminated, is guaranteed in both cases.

I think I see. Although the compiler could easily figure out the type in my example, in more general cases another function could have the side effect of changing a global variable. This is not the case with variables defined in a stack frame.

In some sense, I would need the ability to lock a global variable to a certain type, and Julia 1.0.0 does not yet have that.

Do note that CSV.jl uses the native-Julia float parser in the Parsers.jl package, which is different from what’s provided in Base Julia (Base just calls the C function strtod). There’s a known performance issue when parsing floats at full precision, but if the float is rounded to even one less significant digit, the performance is comparable with C. A few of us have ideas on how to fix this, it’s just a matter of deciding what would be easiest/best for the fix.

Again, it would be most helpful if there are specific files that can be shared, even if just privately, to compare the performance of CSV.jl vs. other parsers.

1 Like

Hi, thanks for the great work on CSV.jl; it’s very nice to have DataFrame interaction. What workflow do you recommend for large CSV files? Can we dump to another format at the moment? Or can we partially |> to a DataFrame?

Would it make sense to generate a big test data CSV (why not other formats as well) with all the known bad data features? Or maybe a package that generates the data file(s) locally for testing.

1 Like

Cf. GitHub - JuliaImages/TestImages.jl (commonly used test images) for image test data.

As this question is orthogonal to the original, it may be worth starting a separate thread. When there are multiple conversations happening in a single thread, it can be hard to follow and harder to find for future reference.

2 Likes