Rough start with Julia (with CSV package)

Hi,

I’ve been trying to get started with Julia, and something I thought might provide a good starting point was installing the CSV package.

I installed the Ubuntu package for Julia, which is 0.4.5, and that was a disaster. The CSV package took a VERY long time to install: minutes. Once it did finally install, “using” it resulted in all sorts of error/warning messages. Finding that the latest Julia version was 0.5, I downloaded the 0.5 binary tarball and tried again with CSV. This time the installation was much more reasonable.

The precompilation did take what seemed like a longish time, probably 5-10s, but it did work.

Here’s where things get weird.

On my home machine, if I run CSV.read on a largish (300-line) CSV file, it only takes about 4 seconds (which still seems unreasonable).

However, at work (I’m thinking about using Julia for some non-trivial development), reading a 4-line test file took 15+ seconds. No, really, at least 15s! Naturally, while this is going on the Julia process has the CPU pegged. For all practical purposes it appears to be broken.

I am using the 0.5 binary tarball at work too.

I’m mystified as to why there would be such a disparity, and I’m wondering if anyone has any ideas as to what I might check.

thanks!

We need more specifics: what is displayed when you run versioninfo() on each of the two machines? How do the two systems differ, if you know? Is the time required to read the CSV file[s] the same when you do it a second time on each machine?

Agreed that you need to tell us more so we can actually help. You mention “all sorts of error/warning messages”. Do you know what any of them were?

For reference, I can read a 200,000+ line file with CSV.read in about 0.1 seconds on a 5-year-old macbook pro. Note that the first time you call CSV.read within a julia session, it will compile that method, so it will be slow. On my machine, that takes about 5 seconds. Did you try running the function more than once?

Hi,

machine:

Linux bluefin 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:50:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

The 0.5 experience is now better, although not consistent with what I was getting earlier. I deleted the .julia directory and started over, and now I seem to be getting consistent results on both machines.

Does a 60+s precompile time seem reasonable?

If I reload Julia and execute "using CSV", the time is now much shorter, < 1s.

julia> tic()
0x0000a569093dfe19

julia> using CSV
INFO: Precompiling module CSV.

julia> toc()
elapsed time: 63.172235846 seconds
63.172235846

julia> tic(); dt=CSV.read("src/julia/test.csv"); toc();
elapsed time: 10.617358486 seconds

julia> tic(); dt=CSV.read("src/julia/test.csv"); toc();
elapsed time: 0.003795728 seconds

I then proceeded to run a second test on a file of roughly 500,000 lines, and it finished in less than 1s! So that’s great!

Here’s the 0.4.5 experience. I have cut a LOT of the messages out because they go on for quite a while…

julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Core2)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.8

INFO: Precompiling module CSV…
WARNING: New definition
promote_op(Type{#T<:Any}, Any) at /home/brian/.julia/v0.4/NullableArrays/src/operators.jl:16
is ambiguous with:
promote_op(Any, Type{#R<:Number}) at /home/brian/.julia/v0.4/NullableArrays/src/operators.jl:18.
To fix, define
promote_op(Type{#T<:Any}, Type{#R<:Number})
before the new definition.
WARNING: New definition
broadcast(Any, NullableArrays.NullableArray…)
is ambiguous with:
broadcast(Function, DataArrays.PooledDataArray…) at /home/brian/.julia/v0.4/DataArrays/src/broadcast.jl:323.
To fix, define
broadcast(Function)
before the new definition.
WARNING: New definition
map(Any, NullableArrays.NullableArray…) at /home/brian/.julia/v0.4/NullableArrays/src/map.jl:109
is ambiguous with:
map(Function, Lazy.List…) at /home/brian/.julia/v0.4/Lazy/src/liblazy.jl:104.

etc… etc… etc…

elapsed time: 345.335024124 seconds

I didn’t bother to run CSV.read() again with 0.4.5. It wasn’t obvious to me that there was any point in continuing to try to use the stock Ubuntu Julia package.

60s is a lot. But if this is a brand-new Julia directory, then precompiling CSV.jl will also mean precompiling all of its dependencies (like DataFrames.jl, etc.). So it’s not out of the question. Fortunately, you should never need to compile all of those packages together again unless they all update simultaneously.

It’s also worth noting that if you just want to read a CSV file into an array, you can use the built-in readdlm function without installing any packages at all.
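As a quick sketch of that approach (the file name and contents here are made up for illustration; note that on Julia 0.7+ readdlm moved to the DelimitedFiles standard library, while on 0.5/0.6 it is available without any import):

```julia
# Write a small hypothetical example file; substitute your own path.
open("example.csv", "w") do io
    write(io, "name,score\nalice,10\nbob,12\n")
end

# Read the comma-delimited file into a plain Array, treating the first
# row as a header. On Julia 0.7+, first do `using DelimitedFiles`.
data, header = readdlm("example.csv", ',', header=true)

# `data` is a 2x2 Matrix (Any, since the columns mix strings and
# numbers); `header` is a 1x2 array of column names.
```

No DataFrame comes back, just a raw array, but for simple numeric or mixed tables that is often enough.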


On my relatively old iMac (mid 2011) the times are as follows, after adding the CSV module and doing the precompile:

julia> tic()
0x0003ecba74d69849

julia> using CSV

julia> toc()
elapsed time: 0.737481587 seconds
0.737481587

julia> tic(); dt = CSV.read(Pkg.dir() * "/CSV/test/test_files/SalesJan2009.csv"); toc()
elapsed time: 7.460471108 seconds
7.460471108

julia> tic(); dt = CSV.read(Pkg.dir() * "/CSV/test/test_files/SalesJan2009.csv"); toc()
elapsed time: 0.297707282 seconds
0.297707282

The size of the first file you run doesn’t appear to make any difference to that slowish first time: as @rdeits says, Julia is calling the functions for the first time, so you’re measuring JIT compilation. Here’s the same procedure run in a fresh session, but using a short CSV file rather than the larger one:

julia> tic()
0x0003ecd7f0df967a

julia> using CSV

julia> toc()
elapsed time: 0.780781317 seconds
0.780781317

julia> tic(); dt = CSV.read(Pkg.dir() * "/CSV/test/test_files/test_basic.csv"); toc()
elapsed time: 7.296985419 seconds
7.296985419

julia> tic(); dt = CSV.read(Pkg.dir() * "/CSV/test/test_files/SalesJan2009.csv"); toc()
elapsed time: 0.581490229 seconds
0.581490229

There’s hardly any difference.

For the precompilation time rather than the JIT-compilation time, you do appear to be running a bit slow at 60+ seconds:

julia> tic(); using CSV
INFO: Precompiling module CSV.
toc()

julia> toc()
elapsed time: 13.794103148 seconds
13.794103148

(and another Julia process is running full tilt at the moment).

I don’t know if anyone has done comparative benchmarks, comparing specific Julia versions running on different hardware configurations. For example, it might be interesting to know whether running with 4GB of RAM makes much of a difference (mine has 12GB). I tried to find a “simple” general purpose benchmark module for people to just download and run a standard set of miscellaneous benchmarks for comparing performance, but didn’t get much further than BenchmarkTools.jl.

Hi.
The box with the long precompilation time has 2GB of memory.
The faster box has 8GB.
2GB is not enough? Seems unlikely.

You can add a repository for Julia. I’m running Julia 0.5.0 on Kubuntu.

Not sure about Linux, to be honest. I would have thought that the more RAM the better so that swapping to disk is minimized (and if so, it’s a cheap way to upgrade). The best way to speed up a Mac is to give it more RAM (probably because lots of other things are running at the same time, such as iCloud shenanigans…).

I don’t understand why you’re talking about RAM here. @purplishrock’s timings in his second post give about 10s the first time CSV.read is called. @cormullion’s timings give 7s, which is in the same ballpark.

To improve on the current status, CSV.jl could probably precompile simple calls to CSV.read so that the first call is faster (see SnoopCompile.jl).
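The idea, sketched here with a hypothetical toy module (TinyCSV stands in for CSV.jl; this is not its real source), is that a package can emit explicit `precompile` directives so common call signatures are compiled during package precompilation rather than at the user’s first call. SnoopCompile.jl can generate such directive lists automatically from a trial run:

```julia
module TinyCSV  # hypothetical stand-in, not the real CSV.jl

# A toy stand-in for a CSV reader: split each line on commas.
read_rows(path::String) = [split(line, ',') for line in eachline(path)]

# Ask the compiler to compile read_rows(::String) now, so the first
# real call skips most of the JIT cost. In a precompiled package this
# work happens once, when the .ji cache file is built.
precompile(read_rows, (String,))

end
```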

I think it was his long precompile time we were exploring…

OK, but as @rdeits said precompilation time is going to depend on whether you have already compiled dependencies. Anyway, I wouldn’t worry too much about it as precompilation shouldn’t happen frequently.

Yes, I think people get worried by the initial impressions of slow (pre)-compilation, but in practice, except around release times, it’s not a big issue…

@nalimilan: memory came up because I have a machine that takes 6x as long to compile. Does that seem reasonable to you?
Granted, on a practical level it’s not an issue because it’s a one-time cost, but such a long precompile time certainly seems indicative of some sort of problem.
As I said, that 10s timing was on a machine with 8GB, so it looks like 2GB causes a 6x increase in compilation time. That’s crazy.
I’m going to run it again just to make sure that what I saw is correct.

Hard to tell without knowing what the CPUs are; often a machine with more RAM will also have a better one. Also check whether all the RAM is used during precompilation; if it’s not, then it’s unlikely to be the explanation.

I’ve found Julia to be rather sensitive to the amount of RAM (even with exactly the same processor).
(Especially when dealing with lots of short strings, it’s a bit of a memory hog; I’m very much looking forward to Jeff’s string changes.)
One thing to do is to check with @time or @timev instead of tic()/toc() and see how many GC cycles occurred on the machine with 2GB versus the one with 8GB.
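For example (with a hypothetical allocation-heavy function standing in for CSV.read), @time reports allocation counts and the fraction of time spent in garbage collection alongside the wall time, which tic()/toc() cannot show. A much higher GC percentage on the 2GB box would point at memory pressure:

```julia
# Hypothetical allocation-heavy workload standing in for CSV.read:
# build 100,000 strings and join them.
f() = join(string.(1:10^5), ",")

f()        # warm-up call, so the measurement excludes JIT compilation
@time f(); # prints elapsed time, allocations, and % gc time
```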

result of

@time using CSV

58.500567 seconds (1.13 M allocations: 53.839 MB, 0.15% gc time)

The really slow box is just a really slow box; memory doesn’t seem to be the issue. The processor (cat /proc/cpuinfo):

Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
cache size : 2048 KB

This is an old Dell. Apparently Julia is CPU intensive…

Hi Jeffrey,

I just read this post as I too was after the same information, but I had read in the Julia docs about a version of versioninfo() with a verbose option. This verbose report gives full Julia-related information and helpful system information as well.

After much trial & error learning, the following command will output the Verbose option:

julia> versioninfo(true)

This is likely a good command for newbs like myself to know and use when we have “Rough start” issues.

Thanks, JNA.

appreciated