CSV.read extremely slow wrt readtable

Hello,

I have found a huge difference in speed and allocations when switching from the deprecated DataFrames.readtable to CSV.read. This happens with large datasets (a high number of columns), so I suspect it could be related to CSV.read's type detection or to the Union{T, Missing} column types.

Here is an example loading datasets ranging from around 300 columns to more than 100,000.

using DataFrames, Plots
T = zeros(9)
K = zeros(9)
for i in 2:10
    t = @timed readtable("filename$i.csv")
    T[i-1] = t[2]            # elapsed time in seconds
    K[i-1] = size(t[1], 2)   # number of columns
end

bar(string.(Int.(K)),T, ylabel = "seconds", xlabel = "ncol", legend = false, title = "readtable")

I tried to run the same code using both CSVFiles and CSV.read, but after one hour they were still running (so I gave up!).

The example below shows the time difference using only a small dataset (380 columns).
Here are the results using CSV.read:

using CSV
@time CSV.read(filename)
42.808322 seconds (3.55 M allocations: 184.522 MiB, 0.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Here are the results using readtable (edited with a clean console):

using DataFrames
@time readtable(filename)
3.518891 seconds (2.02 M allocations: 105.854 MiB, 1.18% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Loading into a DataFrame with CSVFiles:

using CSVFiles, DataFrames
@time DataFrame(load(filename))
  8.311439 seconds (3.89 M allocations: 441.332 MiB, 3.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Please let me know if you have any workaround. I was thinking of presetting the type of each column to confirm my hypothesis, but I would like to avoid doing that manually, especially for large datasets.
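For example, something like this might avoid typing the columns out by hand (just a sketch; it assumes all 380 columns are Float64 and that CSV.read's types keyword accepts a vector of column types):

using CSV

# Sketch: build the column-type vector programmatically instead of by hand.
# Assumes all 380 columns are Float64; adjust as needed.
coltypes = fill(Float64, 380)
@time CSV.read(filename; types = coltypes)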

I am on Ubuntu 16.04, Julia 0.6.3. My package list is below.

Thanks,

Luca

  • Atom 0.6.14
  • BenchmarkTools 0.3.1
  • CSV 0.2.5
  • CSVFiles 0.7.0
  • Clp 0.4.0
  • Combinatorics 0.6.0
  • CovarianceMatrices 0.5.0
  • DataArrays 0.7.0
  • DataFrames 0.11.6
  • DataFramesMeta 0.3.0
  • Dates 0.4.4
  • DecisionTree 0.6.5
  • Devectorize 0.4.2
  • Distributions 0.15.0
  • ExcelFiles 0.5.0
  • ExcelReaders 0.9.0
  • GLM 0.11.0
  • JuMP 0.18.2
  • Lazy 0.12.1
  • MLBase 0.7.0
  • MultivariateStats 0.4.0
  • Parameters 0.9.0
  • ParticleFilters 0.1.2
  • PlotlyJS 0.10.2
  • Plots 0.17.2
  • PyPlot 2.5.0
  • Reel 1.0.1
  • Revise 0.1.1
  • ScikitLearn 0.4.0
  • StatPlots 0.7.2
  • StatsFuns 0.6.0
  • Yeppp 0.2.0
    143 additional packages:
  • ASTInterpreter2 0.1.1
  • AUC 0.0.0- master (unregistered)
  • AbstractFFTs 0.3.1
  • AssetRegistry 0.0.1
  • AxisAlgorithms 0.2.0
  • AxisArrays 0.2.1
  • BinDeps 0.8.8
  • BinaryProvider 0.3.2
  • Blink 0.6.2
  • Cairo 0.5.2
  • Calculus 0.4.0
  • CatIndices 0.1.0
  • CategoricalArrays 0.3.9
  • Cbc 0.3.2
  • CodeTools 0.5.0
  • CodecZlib 0.4.3
  • ColorTypes 0.6.7
  • ColorVectorSpace 0.5.2
  • Colors 0.8.2
  • CommonSubexpressions 0.1.0
  • Compat 0.68.0
  • ComputationalResources 0.2.0
  • Conda 0.8.1
  • Contour 0.4.0
  • CoordinateTransformations 0.4.1
  • CustomUnitRanges 0.1.0
  • DataStreams 0.3.6
  • DataStructures 0.8.3
  • DataValues 0.3.3
  • DebuggerFramework 0.1.2
  • DiffEqDiffTools 0.4.1
  • DiffResults 0.0.3
  • DiffRules 0.0.4
  • Distances 0.6.0
  • DocSeeker 0.1.0
  • DocStringExtensions 0.4.4
  • DualNumbers 0.3.0
  • FFTViews 0.1.0
  • FFTW 0.0.4
  • FileIO 0.9.0
  • FixedPointNumbers 0.4.6
  • ForwardDiff 0.7.5
  • GR 0.31.0
  • Glob 1.1.1
  • Graphics 0.3.0
  • Gtk 0.14.0
  • GtkReactive 0.4.1
  • HTTP 0.6.12
  • Hiccup 0.1.1
  • HttpCommon 0.4.0
  • HttpParser 0.3.1
  • HttpServer 0.3.1
  • IdentityRanges 0.1.0
  • ImageAxes 0.4.0
  • ImageCore 0.6.0
  • ImageDistances 0.0.2
  • ImageFiltering 0.3.0
  • ImageMetadata 0.4.0
  • ImageMorphology 0.0.2
  • ImageTransformations 0.4.1
  • ImageView 0.6.0
  • Images 0.14.0
  • IndirectArrays 0.4.2
  • IniFile 0.4.0
  • InternedStrings 0.6.2
  • Interpolations 0.7.3
  • IntervalSets 0.2.0
  • IterTools 0.2.1
  • IterableTables 0.7.1
  • IteratorInterfaceExtensions 0.0.2
  • Iterators 0.3.1
  • JSON 0.17.2
  • Juno 0.4.1
  • KernelDensity 0.4.1
  • LNR 0.0.2
  • LaTeXStrings 0.3.0
  • LineSearches 3.2.5
  • MacroTools 0.4.1
  • MappedArrays 0.0.7
  • MathProgBase 0.7.1
  • MbedTLS 0.5.11
  • Measures 0.2.0
  • Media 0.3.0
  • Missings 0.2.9
  • Mustache 0.3.2
  • Mux 0.3.0
  • NFP 0.0.0- loss_function_implementation (unregistered)
  • NLSolversBase 4.4.1
  • NaNMath 0.3.1
  • NamedTuples 4.0.2
  • Nullables 0.0.5
  • OffsetArrays 0.5.0
  • Optim 0.14.1
  • PDMats 0.8.0
  • POMDPToolbox 0.2.7
  • POMDPs 0.6.7
  • PaddedViews 0.3.0
  • Pidfile 1.0.0
  • PlotThemes 0.2.0
  • PlotUtils 0.4.4
  • PlotlyBase 0.1.1
  • Polynomials 0.3.2
  • PooledArrays 0.2.1
  • PositiveFactorizations 0.1.0
  • ProgressMeter 0.5.5
  • PyCall 1.16.1
  • QuadGK 0.2.1
  • RangeArrays 0.2.0
  • Ratios 0.2.0
  • Reactive 0.7.0
  • RecipesBase 0.3.1
  • Reexport 0.1.0
  • Requires 0.4.4
  • ReverseDiffSparse 0.8.1
  • Rmath 0.3.3
  • Rotations 0.7.0
  • RoundingIntegers 0.0.3
  • SHA 0.5.7
  • SIUnits 0.1.0
  • ScikitLearnBase 0.3.0
  • ShowItLikeYouBuildIt 0.2.0
  • Showoff 0.2.0
  • SimpleTraits 0.6.0
  • SortingAlgorithms 0.2.1
  • SpecialFunctions 0.5.0
  • StaticArrays 0.7.1
  • StatsBase 0.23.0
  • StatsModels 0.2.5
  • StringDistances 0.2.1
  • TableShowUtils 0.0.1
  • TableTraits 0.2.0
  • TableTraitsUtils 0.1.3
  • TexExtensions 0.1.0
  • TextParse 0.5.0
  • TiledIteration 0.1.0
  • Tokenize 0.4.2
  • TranscodingStreams 0.5.2
  • URIParser 0.3.1
  • VersionParsing 1.1.1
  • VideoIO 0.2.0
  • WeakRefStrings 0.4.7
  • WebSockets 0.5.0
  • WoodburyMatrices 0.2.2

Would be great if you could also try CSVFiles.jl and report back how that fares. I have no idea (never tried it with a file with many columns), but would be quite interested.

If you want to avoid the Union{Missing,T} story entirely, try loading it into an IndexedTables.jl table:

using CSVFiles, IndexedTables

@time it = load("foo.csv") |> table

It would also be great if you could try the native JuliaDB loading story and report that back as well :slight_smile:
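Something along these lines (a sketch; loadtable is JuliaDB's native CSV reader, file name as in the snippet above):

using JuliaDB

# Sketch: JuliaDB's own CSV loading path, for comparison with the CSVFiles route.
@time t = loadtable("foo.csv")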

I added an example to better describe the problem. Running the code with readtable took around 60 seconds; with both CSV.read and CSVFiles.load |> DataFrame it seemed to run forever (my PC was close to crashing in both cases, and after I sent SIGTERM to Julia it was still unresponsive for a while).

Interesting. Could you test whether it’s faster if you specify column types manually? And what happens if you pass allowmissing=:none? It would be useful to check under Julia 0.7, just in case (it should make a large difference when missing values are involved).
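For reference, a minimal sketch of that suggestion (hypothetical file name; allowmissing controls whether columns get Union{T,Missing} element types):

using CSV

# Sketch: disable Union{T,Missing} column types when no values are missing.
df = CSV.read("filename.csv"; allowmissing = :none)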

BTW, what is the typical number of rows and columns in your datasets? 100,000 sounds really really large.

Would also be good to check this with IndexedTables.jl with the code I suggested above. If the reason for the problems here is the Union{T,Missing} storage on Julia 0.6, you would get around that with IndexedTables.jl. If that turns out to be the issue, then I could work around it for the CSVFiles/DataFrames combo if https://github.com/JuliaData/DataFrames.jl/pull/1414 was merged.

Oh, one more thing: does your data actually have missing values? CSVFiles will automatically detect those and only use a data type for missing values if they are there. So if your files don’t have missing values, then my theories above could not explain the problems you are seeing when loading with CSVFiles.

Sorry for the late answer, busy period. I have no actual missings in the dataset…

Any chance you could share a dataset that we could use for replication?

These files are the data I was using: Dropbox - File Deleted

Hi all,

I noticed the same problem Luca highlighted. I am using a CSV with ~4500×5000 entries and missing observations. Any news?

Filippo


Edit

For now, I have found that a combination of DataArray and readdlm is faster than CSV or CSVFiles for many large input files:

e.g., DataArray(readdlm("./data/train.csv", ',')[2:end,:]);

You would lose the column titles, but you can keep track of them elsewhere or create a DataFrame ex-post. Not sure if the above works for CSVs with missing values. However, it helped me when CSV and CSVFiles were either crashing or running forever.
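For example, something like this keeps the column titles and builds the DataFrame ex-post (a sketch; it assumes a single header row and that the DataFrames matrix constructor and names! behave as on 0.6):

using DataFrames

# Sketch: ask readdlm for the header row, then restore the titles after the fact.
raw, header = readdlm("./data/train.csv", ','; header = true)
df = DataFrame(raw)               # columns get auto-generated names x1, x2, ...
names!(df, Symbol.(vec(header)))  # put the original column titles back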

Same observation here with a ~5.5k x 7k “.tsv” data set (no missings). My timing results on Julia 0.6.2:

Base.readdlm: 35.19s
DataFrames.readtable: 39.1s
pandas.read_csv (python): 12.9s
CSV.read: > 1h (didn’t finish)
Queryverse.load |> DataFrame: > 1h (didn’t finish)
Queryverse.load |> Table (IndexedTables): > 1h (aborted)
JuliaDB.loadtable: > 1h (StackOverflowError)

Edit

In response to @davidanthoff:
TextParse.csvread: 663.1s

Could you also try TextParse.jl?
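e.g. something along these lines (a sketch; as far as I remember csvread returns the parsed columns together with the column names):

using TextParse

# Sketch: csvread parses the file and returns the data columns plus the header names.
@time data, colnames = csvread("filename.csv")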

I have to use CSV fairly often (oh how I long to go back to the days when I could get people to give me data in proper binary formats) and it seems quite fast to me. I just read in a 7×10^6-row, 7-column table in about 15 s. So it seems that something bad is happening for large numbers of columns. Perhaps there is some issue where the time complexity is not linear in the number of columns? That would be an interesting thing to test.

I tried both TextParse.jl and CSV.read(file; transpose=true). They were taking so long that I decided to give up. However, it takes ~14 seconds to read the same CSV and convert it to a DataArray using the code I posted above.

Note: I am currently using a MacBook Pro with i7-4980HQ, 16 GB 1600 MHz DDR3 and Julia 0.6.4.

Here’s a quick test I did with master on 0.7:

using DataFrames, CSV, BenchmarkTools, DataStructures

const TESTFILE = "test.csv"

# build an n×m DataFrame of zeros to write out and read back
testdf(n::Integer, m::Integer) = DataFrame(zeros(n,m))

function benchcsv(n::Integer, M::AbstractVector{<:Integer}, filename::String=TESTFILE)
    reads = OrderedDict{Any,Any}()
    writes = OrderedDict{Any,Any}()
    for m ∈ M
        tdf = testdf(n,m)
        @info("testing for $n rows, $m columns...")
        # single-evaluation, single-sample benchmarks to keep the total runtime manageable
        writes[m] = @benchmark CSV.write($filename, $tdf) evals=1 samples=1
        reads[m] = @benchmark CSV.read($filename) evals=1 samples=1
    end
    reads, writes
end

julia> reads, writes = benchcsv(100, [i*50 for i ∈ 1:8]);

julia> writes
OrderedDict{Any,Any} with 8 entries:
  50  => Trial(3.801 ms)
  100 => Trial(7.927 ms)
  150 => Trial(13.925 ms)
  200 => Trial(21.150 ms)
  250 => Trial(32.119 ms)
  300 => Trial(42.205 ms)
  350 => Trial(50.099 ms)
  400 => Trial(63.212 ms)

julia> reads
OrderedDict{Any,Any} with 8 entries:
  50  => Trial(22.135 ms)
  100 => Trial(107.798 ms)
  150 => Trial(242.544 ms)
  200 => Trial(426.780 ms)
  250 => Trial(687.501 ms)
  300 => Trial(1.004 s)
  350 => Trial(1.695 s)
  400 => Trial(1.666 s)

Indeed the reads seem to have some rather odd performance characteristics, with some pretty big jumps: going from 50 to 400 columns (8×) makes the read roughly 75× slower, which is closer to quadratic than linear in the number of columns. So something is definitely fishy about how this scales with the number of columns (I’d expect linear).
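A quick (hypothetical) way to eyeball that scaling from the reads dict above:

# Sketch: compare column-count ratios with read-time ratios;
# for linear scaling the pairs should roughly match, here the time ratios are much larger.
cols  = collect(keys(reads))
times = [minimum(reads[m]).time / 1e9 for m in cols]   # seconds
[cols[i+1] / cols[i] => times[i+1] / times[i] for i in 1:length(cols)-1]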