CSV.read extremely slow wrt readtable


#1

Hello,

I have found a huge difference in speed and allocation by switching from the deprecated DataFrames.readtable to CSV.read. I have noticed that this happens with large datasets (high number of columns), thus I start guessing it could be related to the type recognition of CSV.read or the Union{T, Missing} type.

Here is an example loading dataset from around 300 cols to more than 100000.

using DataFrames, Plots
T = zeros(9)
K = zeros(9)
for i in 2:10
    t = @timed readtable(filename$i.csv)
    T[i-1] = t[2]
    K[i-1] = size(t[1],2)
end

bar(string.(Int.(K)),T, ylabel = "seconds", xlabel = "ncol", legend = false, title = "readtable")

I tried to run the same code using both CSVFiles and CSV.read, but after one hour they where still running (thus I gave up!).

The example below shows the time difference using only a small dataset (380 cols).
Here results using CSV.read:

using CSV
CSV.read(filename)
42.808322 seconds (3.55 M allocations: 184.522 MiB, 0.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Here results using readtable (edited with clean console):

using DataFrames
readtable(filename)
3.518891 seconds (2.02 M allocations: 105.854 MiB, 1.18% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Loading into a dataframe with CSVFiles

using CSVFiles, DataFrames
@time DataFrame(load(filename))
  8.311439 seconds (3.89 M allocations: 441.332 MiB, 3.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Please, let me know in case you have any workaround. I was thinking about trying to preset the type of each column to confirm my hypothesis, but I would like to avoid doing it manually, especially for large datasets.

I am on ubuntu 16.04, Julia 0.6.3. Below my Package list

Thanks,

Luca

  • Atom 0.6.14
  • BenchmarkTools 0.3.1
  • CSV 0.2.5
  • CSVFiles 0.7.0
  • Clp 0.4.0
  • Combinatorics 0.6.0
  • CovarianceMatrices 0.5.0
  • DataArrays 0.7.0
  • DataFrames 0.11.6
  • DataFramesMeta 0.3.0
  • Dates 0.4.4
  • DecisionTree 0.6.5
  • Devectorize 0.4.2
  • Distributions 0.15.0
  • ExcelFiles 0.5.0
  • ExcelReaders 0.9.0
  • GLM 0.11.0
  • JuMP 0.18.2
  • Lazy 0.12.1
  • MLBase 0.7.0
  • MultivariateStats 0.4.0
  • Parameters 0.9.0
  • ParticleFilters 0.1.2
  • PlotlyJS 0.10.2
  • Plots 0.17.2
  • PyPlot 2.5.0
  • Reel 1.0.1
  • Revise 0.1.1
  • ScikitLearn 0.4.0
  • StatPlots 0.7.2
  • StatsFuns 0.6.0
  • Yeppp 0.2.0
    143 additional packages:
  • ASTInterpreter2 0.1.1
  • AUC 0.0.0- master (unregistered)
  • AbstractFFTs 0.3.1
  • AssetRegistry 0.0.1
  • AxisAlgorithms 0.2.0
  • AxisArrays 0.2.1
  • BinDeps 0.8.8
  • BinaryProvider 0.3.2
  • Blink 0.6.2
  • Cairo 0.5.2
  • Calculus 0.4.0
  • CatIndices 0.1.0
  • CategoricalArrays 0.3.9
  • Cbc 0.3.2
  • CodeTools 0.5.0
  • CodecZlib 0.4.3
  • ColorTypes 0.6.7
  • ColorVectorSpace 0.5.2
  • Colors 0.8.2
  • CommonSubexpressions 0.1.0
  • Compat 0.68.0
  • ComputationalResources 0.2.0
  • Conda 0.8.1
  • Contour 0.4.0
  • CoordinateTransformations 0.4.1
  • CustomUnitRanges 0.1.0
  • DataStreams 0.3.6
  • DataStructures 0.8.3
  • DataValues 0.3.3
  • DebuggerFramework 0.1.2
  • DiffEqDiffTools 0.4.1
  • DiffResults 0.0.3
  • DiffRules 0.0.4
  • Distances 0.6.0
  • DocSeeker 0.1.0
  • DocStringExtensions 0.4.4
  • DualNumbers 0.3.0
  • FFTViews 0.1.0
  • FFTW 0.0.4
  • FileIO 0.9.0
  • FixedPointNumbers 0.4.6
  • ForwardDiff 0.7.5
  • GR 0.31.0
  • Glob 1.1.1
  • Graphics 0.3.0
  • Gtk 0.14.0
  • GtkReactive 0.4.1
  • HTTP 0.6.12
  • Hiccup 0.1.1
  • HttpCommon 0.4.0
  • HttpParser 0.3.1
  • HttpServer 0.3.1
  • IdentityRanges 0.1.0
  • ImageAxes 0.4.0
  • ImageCore 0.6.0
  • ImageDistances 0.0.2
  • ImageFiltering 0.3.0
  • ImageMetadata 0.4.0
  • ImageMorphology 0.0.2
  • ImageTransformations 0.4.1
  • ImageView 0.6.0
  • Images 0.14.0
  • IndirectArrays 0.4.2
  • IniFile 0.4.0
  • InternedStrings 0.6.2
  • Interpolations 0.7.3
  • IntervalSets 0.2.0
  • IterTools 0.2.1
  • IterableTables 0.7.1
  • IteratorInterfaceExtensions 0.0.2
  • Iterators 0.3.1
  • JSON 0.17.2
  • Juno 0.4.1
  • KernelDensity 0.4.1
  • LNR 0.0.2
  • LaTeXStrings 0.3.0
  • LineSearches 3.2.5
  • MacroTools 0.4.1
  • MappedArrays 0.0.7
  • MathProgBase 0.7.1
  • MbedTLS 0.5.11
  • Measures 0.2.0
  • Media 0.3.0
  • Missings 0.2.9
  • Mustache 0.3.2
  • Mux 0.3.0
  • NFP 0.0.0- loss_function_implementation (unregistered)
  • NLSolversBase 4.4.1
  • NaNMath 0.3.1
  • NamedTuples 4.0.2
  • Nullables 0.0.5
  • OffsetArrays 0.5.0
  • Optim 0.14.1
  • PDMats 0.8.0
  • POMDPToolbox 0.2.7
  • POMDPs 0.6.7
  • PaddedViews 0.3.0
  • Pidfile 1.0.0
  • PlotThemes 0.2.0
  • PlotUtils 0.4.4
  • PlotlyBase 0.1.1
  • Polynomials 0.3.2
  • PooledArrays 0.2.1
  • PositiveFactorizations 0.1.0
  • ProgressMeter 0.5.5
  • PyCall 1.16.1
  • QuadGK 0.2.1
  • RangeArrays 0.2.0
  • Ratios 0.2.0
  • Reactive 0.7.0
  • RecipesBase 0.3.1
  • Reexport 0.1.0
  • Requires 0.4.4
  • ReverseDiffSparse 0.8.1
  • Rmath 0.3.3
  • Rotations 0.7.0
  • RoundingIntegers 0.0.3
  • SHA 0.5.7
  • SIUnits 0.1.0
  • ScikitLearnBase 0.3.0
  • ShowItLikeYouBuildIt 0.2.0
  • Showoff 0.2.0
  • SimpleTraits 0.6.0
  • SortingAlgorithms 0.2.1
  • SpecialFunctions 0.5.0
  • StaticArrays 0.7.1
  • StatsBase 0.23.0
  • StatsModels 0.2.5
  • StringDistances 0.2.1
  • TableShowUtils 0.0.1
  • TableTraits 0.2.0
  • TableTraitsUtils 0.1.3
  • TexExtensions 0.1.0
  • TextParse 0.5.0
  • TiledIteration 0.1.0
  • Tokenize 0.4.2
  • TranscodingStreams 0.5.2
  • URIParser 0.3.1
  • VersionParsing 1.1.1
  • VideoIO 0.2.0
  • WeakRefStrings 0.4.7
  • WebSockets 0.5.0
  • WoodburyMatrices 0.2.2

#2

Would be great if you could also try CSVFiles.jl and report back how that fares. I have no idea (never tried it with a file with many columns), but would be quite interested.

If you want to avoid the Union{Missing,T} story entirely, try to load it into into an IndexedTables.jl:

using CSVFiles, IndexedTables

@time it = load("foo.csv") |> table

Would also be great if you could also try the native JuliaDB loading story and report that back as well :slight_smile:


#3

I added an example to better describe the problem. Running the code with readtable took around 60 sec, with both CSV.read and CSVFiles.load |> DataFrame potentially it can take +infinity (my pc was close to crashing in both cases and after I gave the sigterm to julia it was still stunned for a while)


#4

Interesting. Could you test whether it’s faster if you specify column types manually? And what happens if you pass allowmissing=:none? It would be useful to check under Julia 0.7, just in case (it should make a large difference when missing values are involved).

BTW, what is the typical number of rows and columns in your datasets? 100,000 sounds really really large.


#5

Would also be good to check this with IndexedTables.jl with the code I suggested above. If the reason for the problems here is the Union{T,Missing} storage on Julia 0.6, you would get around that with IndexedTables.jl. If that turns out to be the issue, then I could work around it for the CSVFiles/DataFrames combo if https://github.com/JuliaData/DataFrames.jl/pull/1414 was merged.


#6

Oh, one more thing: does your data actually have missing values? CSVFiles will automatically detect those and only use a data type for missing values if they are there. So if your files don’t have missing values, then my theories above could not explain the problems you are seeing when loading with CSVFiles.


#7

Sorry for the late answer, busy period. I have no actual missings in the dataset…


#8

Any chance you could share a dataset that we could use for replication?


#9

These files are the data I was using: https://www.dropbox.com/sh/uj1i7jvolnwfbzf/AABFH0qENzdB6x0ZNTTG9kwEa?dl=0