Julia is unable to load CSV files from the Kaggle competition

question
package

#1

I am using Julia (v0.6.4) to load CSV files. These files can be found at this link: https://www.kaggle.com/c/santander-value-prediction-challenge/data

Julia is unable to load them (train.csv and test.csv). I tried to use the CSV package with method CSV.read.

In Python, pandas loads them in no time, and everything works perfectly.

It appears there is a serious issue in parsing/loading of these files in Julia.

Has anyone tried and succeeded, or does anyone have an answer or tip to make this work?

Thanks.


#2

I just googled, and it seems there is a Pandas wrapper for Julia: https://github.com/JuliaPy/Pandas.jl
Maybe that will work for you?


#3

Works for me on some simple test data.

julia> using CSV

julia> d = CSV.read("benchmarkdata.csv", header=false)
10×3 DataFrames.DataFrame
│ Row │ Column1 │ Column2             │ Column3  │
├─────┼─────────┼─────────────────────┼──────────┤
│ 1   │ c       │ iteration_pi_sum    │ 27.369   │
│ 2   │ c       │ matrix_multiply     │ 72.068   │
│ 3   │ c       │ matrix_statistics   │ 4.52399  │
│ 4   │ c       │ parse_integers      │ 0.099092 │
│ 5   │ c       │ print_to_file       │ 9.93013  │
│ 6   │ c       │ recursion_fibonacci │ 0.022726 │
│ 7   │ c       │ recursion_quicksort │ 0.258923 │
│ 8   │ c       │ userfunc_mandelbrot │ 0.07669  │
│ 9   │ fortran │ iteration_pi_sum    │ 27.3692  │
│ 10  │ fortran │ matrix_multiply     │ 83.5437  │

julia> versioninfo()
Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, sandybridge)

gibson@sophist$ cat benchmarkdata.csv 
c,iteration_pi_sum,27.369022
c,matrix_multiply,72.067976
c,matrix_statistics,4.523993
c,parse_integers,0.099092
c,print_to_file,9.930134
c,recursion_fibonacci,0.022726
c,recursion_quicksort,0.258923
c,userfunc_mandelbrot,0.07669
fortran,iteration_pi_sum,27.369179
fortran,matrix_multiply,83.543703

You should post exactly what you tried in Julia and the resulting error message or incorrect output. Use triple backticks to quote the code blocks.
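
When putting such a report together, it can also help to show what the file looks like structurally. Below is a minimal sketch in plain Julia, with no packages; `field_counts` is a hypothetical helper name, not part of CSV.jl, and it assumes a comma delimiter and ignores quoting, so treat the counts as a rough check only.

```julia
# Hypothetical helper (not part of CSV.jl): count the comma-separated
# fields on the first `nlines` lines of a file.
# Assumption: fields never contain quoted commas; this is a naive split.
function field_counts(path; nlines=5)
    counts = Int[]
    open(path) do io
        for (i, line) in enumerate(eachline(io))
            i > nlines && break
            push!(counts, length(split(line, ',')))
        end
    end
    return counts
end
```

Calling `field_counts("train.csv")` and pasting the result next to the error message shows how many columns the parser has to handle and whether the first rows agree with the header.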


#4

You can also try CSVFiles.jl. It uses a different parser under the hood, so if you are lucky, it might be able to deal with those files. The syntax would be:

using CSVFiles, DataFrames

df = load("foo.csv") |> DataFrame

#5

Check out the post for “inspiration”. I often find that using R’s data.table’s fread is the fastest.


#6

Thanks. This works. But when I try to load the same file with CSV.jl or CSVFiles.jl, it fails. These files can be downloaded from the Kaggle competition website (https://www.kaggle.com/c/santander-value-prediction-challenge/data).


#7

Normally, most CSV files load fine with CSV.jl or CSVFiles.jl. In this case I am referring to specific files from the Kaggle competition site that neither CSV.jl nor CSVFiles.jl can load. These files can be downloaded from the Kaggle competition website (https://www.kaggle.com/c/santander-value-prediction-challenge/data).
This clearly means there is a bug in CSV.jl or CSVFiles.jl.