CSV.read extremely slow wrt readtable

lucabrugnolini · June 16, 2018, 9:31pm

Hello,

I have found a huge difference in speed and allocation by switching from the deprecated DataFrames.readtable to CSV.read. I have noticed that this happens with large datasets (high number of columns), thus I start guessing it could be related to the type recognition of CSV.read or the Union{T, Missing} type.

Here is an example loading dataset from around 300 cols to more than 100000.

using DataFrames, Plots
T = zeros(9)
K = zeros(9)
for i in 2:10
    t = @timed readtable(filename$i.csv)
    T[i-1] = t[2]
    K[i-1] = size(t[1],2)
end

bar(string.(Int.(K)),T, ylabel = "seconds", xlabel = "ncol", legend = false, title = "readtable")

I tried to run the same code using both CSVFiles and CSV.read, but after one hour they where still running (thus I gave up!).

The example below shows the time difference using only a small dataset (380 cols).
Here results using CSV.read:

using CSV
CSV.read(filename)
42.808322 seconds (3.55 M allocations: 184.522 MiB, 0.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Here results using readtable (edited with clean console):

using DataFrames
readtable(filename)
3.518891 seconds (2.02 M allocations: 105.854 MiB, 1.18% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Loading into a dataframe with CSVFiles

using CSVFiles, DataFrames
@time DataFrame(load(filename))
  8.311439 seconds (3.89 M allocations: 441.332 MiB, 3.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns

Please, let me know in case you have any workaround. I was thinking about trying to preset the type of each column to confirm my hypothesis, but I would like to avoid doing it manually, especially for large datasets.

I am on ubuntu 16.04, Julia 0.6.3. Below my Package list

Thanks,

Luca

Atom 0.6.14
BenchmarkTools 0.3.1
CSV 0.2.5
CSVFiles 0.7.0
Clp 0.4.0
Combinatorics 0.6.0
CovarianceMatrices 0.5.0
DataArrays 0.7.0
DataFrames 0.11.6
DataFramesMeta 0.3.0
Dates 0.4.4
DecisionTree 0.6.5
Devectorize 0.4.2
Distributions 0.15.0
ExcelFiles 0.5.0
ExcelReaders 0.9.0
GLM 0.11.0
JuMP 0.18.2
Lazy 0.12.1
MLBase 0.7.0
MultivariateStats 0.4.0
Parameters 0.9.0
ParticleFilters 0.1.2
PlotlyJS 0.10.2
Plots 0.17.2
PyPlot 2.5.0
Reel 1.0.1
Revise 0.1.1
ScikitLearn 0.4.0
StatPlots 0.7.2
StatsFuns 0.6.0
Yeppp 0.2.0
143 additional packages:
ASTInterpreter2 0.1.1
AUC 0.0.0- master (unregistered)
AbstractFFTs 0.3.1
AssetRegistry 0.0.1
AxisAlgorithms 0.2.0
AxisArrays 0.2.1
BinDeps 0.8.8
BinaryProvider 0.3.2
Blink 0.6.2
Cairo 0.5.2
Calculus 0.4.0
CatIndices 0.1.0
CategoricalArrays 0.3.9
Cbc 0.3.2
CodeTools 0.5.0
CodecZlib 0.4.3
ColorTypes 0.6.7
ColorVectorSpace 0.5.2
Colors 0.8.2
CommonSubexpressions 0.1.0
Compat 0.68.0
ComputationalResources 0.2.0
Conda 0.8.1
Contour 0.4.0
CoordinateTransformations 0.4.1
CustomUnitRanges 0.1.0
DataStreams 0.3.6
DataStructures 0.8.3
DataValues 0.3.3
DebuggerFramework 0.1.2
DiffEqDiffTools 0.4.1
DiffResults 0.0.3
DiffRules 0.0.4
Distances 0.6.0
DocSeeker 0.1.0
DocStringExtensions 0.4.4
DualNumbers 0.3.0
FFTViews 0.1.0
FFTW 0.0.4
FileIO 0.9.0
FixedPointNumbers 0.4.6
ForwardDiff 0.7.5
GR 0.31.0
Glob 1.1.1
Graphics 0.3.0
Gtk 0.14.0
GtkReactive 0.4.1
HTTP 0.6.12
Hiccup 0.1.1
HttpCommon 0.4.0
HttpParser 0.3.1
HttpServer 0.3.1
IdentityRanges 0.1.0
ImageAxes 0.4.0
ImageCore 0.6.0
ImageDistances 0.0.2
ImageFiltering 0.3.0
ImageMetadata 0.4.0
ImageMorphology 0.0.2
ImageTransformations 0.4.1
ImageView 0.6.0
Images 0.14.0
IndirectArrays 0.4.2
IniFile 0.4.0
InternedStrings 0.6.2
Interpolations 0.7.3
IntervalSets 0.2.0
IterTools 0.2.1
IterableTables 0.7.1
IteratorInterfaceExtensions 0.0.2
Iterators 0.3.1
JSON 0.17.2
Juno 0.4.1
KernelDensity 0.4.1
LNR 0.0.2
LaTeXStrings 0.3.0
LineSearches 3.2.5
MacroTools 0.4.1
MappedArrays 0.0.7
MathProgBase 0.7.1
MbedTLS 0.5.11
Measures 0.2.0
Media 0.3.0
Missings 0.2.9
Mustache 0.3.2
Mux 0.3.0
NFP 0.0.0- loss_function_implementation (unregistered)
NLSolversBase 4.4.1
NaNMath 0.3.1
NamedTuples 4.0.2
Nullables 0.0.5
OffsetArrays 0.5.0
Optim 0.14.1
PDMats 0.8.0
POMDPToolbox 0.2.7
POMDPs 0.6.7
PaddedViews 0.3.0
Pidfile 1.0.0
PlotThemes 0.2.0
PlotUtils 0.4.4
PlotlyBase 0.1.1
Polynomials 0.3.2
PooledArrays 0.2.1
PositiveFactorizations 0.1.0
ProgressMeter 0.5.5
PyCall 1.16.1
QuadGK 0.2.1
RangeArrays 0.2.0
Ratios 0.2.0
Reactive 0.7.0
RecipesBase 0.3.1
Reexport 0.1.0
Requires 0.4.4
ReverseDiffSparse 0.8.1
Rmath 0.3.3
Rotations 0.7.0
RoundingIntegers 0.0.3
SHA 0.5.7
SIUnits 0.1.0
ScikitLearnBase 0.3.0
ShowItLikeYouBuildIt 0.2.0
Showoff 0.2.0
SimpleTraits 0.6.0
SortingAlgorithms 0.2.1
SpecialFunctions 0.5.0
StaticArrays 0.7.1
StatsBase 0.23.0
StatsModels 0.2.5
StringDistances 0.2.1
TableShowUtils 0.0.1
TableTraits 0.2.0
TableTraitsUtils 0.1.3
TexExtensions 0.1.0
TextParse 0.5.0
TiledIteration 0.1.0
Tokenize 0.4.2
TranscodingStreams 0.5.2
URIParser 0.3.1
VersionParsing 1.1.1
VideoIO 0.2.0
WeakRefStrings 0.4.7
WebSockets 0.5.0
WoodburyMatrices 0.2.2

davidanthoff · June 17, 2018, 1:09am

Would be great if you could also try CSVFiles.jl and report back how that fares. I have no idea (never tried it with a file with many columns), but would be quite interested.

If you want to avoid the Union{Missing,T} story entirely, try to load it into into an IndexedTables.jl:

using CSVFiles, IndexedTables

@time it = load("foo.csv") |> table

Would also be great if you could also try the native JuliaDB loading story and report that back as well

lucabrugnolini · June 17, 2018, 9:35am

I added an example to better describe the problem. Running the code with readtable took around 60 sec, with both CSV.read and CSVFiles.load |> DataFrame potentially it can take +infinity (my pc was close to crashing in both cases and after I gave the sigterm to julia it was still stunned for a while)

nalimilan · June 17, 2018, 5:35pm

Interesting. Could you test whether it’s faster if you specify column types manually? And what happens if you pass allowmissing=:none? It would be useful to check under Julia 0.7, just in case (it should make a large difference when missing values are involved).

BTW, what is the typical number of rows and columns in your datasets? 100,000 sounds really really large.

davidanthoff · June 17, 2018, 6:17pm

Would also be good to check this with IndexedTables.jl with the code I suggested above. If the reason for the problems here is the Union{T,Missing} storage on Julia 0.6, you would get around that with IndexedTables.jl. If that turns out to be the issue, then I could work around it for the CSVFiles/DataFrames combo if https://github.com/JuliaData/DataFrames.jl/pull/1414 was merged.

davidanthoff · June 17, 2018, 6:27pm

Oh, one more thing: does your data actually have missing values? CSVFiles will automatically detect those and only use a data type for missing values if they are there. So if your files don’t have missing values, then my theories above could not explain the problems you are seeing when loading with CSVFiles.

lucabrugnolini · July 7, 2018, 4:49pm

Sorry for the late answer, busy period. I have no actual missings in the dataset…

davidanthoff · July 7, 2018, 4:59pm

Any chance you could share a dataset that we could use for replication?

lucabrugnolini · July 7, 2018, 5:47pm

These files are the data I was using: Dropbox - File Deleted

fipelle · July 27, 2018, 2:09am

Hi all,

I noticed the same problem Luca highlighted. I am using a csv with ~4500x5000 entries and missing observations. Any news?

Filippo

Edit

Right now, I figured that a combination of DataArray and readdlm is faster than CSV or CSVFiles for many large input files:

e.g., DataArray(readdlm(“./data/train.csv”, ‘,’)[2:end,:]);

You would lose the column titles, but you can keep track of them elsewhere or create a DataFrame ex-post. Not sure if the above works for CSVs with missing values. However, it helped me when CSV and CSVFiles were either crashing or running forever.

jtackm · July 27, 2018, 11:47am

Same observation here with a ~5.5k x 7k “.tsv” data set (no missings). My timing results on Julia 0.6.2:

Base.readdlm: 35.19s
DataFrames.readtable: 39.1s
pandas.read_csv (python): 12.9s
CSV.read: > 1h (didn’t finish)
Queryverse.load |> DataFrame: > 1h (didn’t finish)
Queryverse.load |> Table (IndexedTables): > 1h (aborted)
JuliaDB.loadtable: > 1h (StackOverflowError)

Edit

In response to @davidanthoff:
TextParse.csvread: 663.1s

davidanthoff · July 27, 2018, 1:47pm

Could you also try TextParse.jl?

ExpandingMan · July 27, 2018, 2:08pm

I have to use CSV fairly often (oh how I long to go back to the days where I could get people to give me data in proper binary formats) and it seems quite fast to me. I just read in a 7\cdot 10^6 row, 7 column table in about 15 s. So, it seems that something bad is happening for large number of columns. Perhaps there is some issue where the time complexity is not linear in the number of columns? That would be an interesting thing to test.

fipelle · July 27, 2018, 5:15pm

I tried with both TextParse.jl and CSV.read(file; transpose=true). It is taking so long that I decided to give up. However, it takes ~14 seconds to read and convert to DataArray the same csv using the code I posted above.

Note: I am currently using a MacBook Pro with i7-4980HQ, 16 GB 1600 MHz DDR3 and Julia 0.6.4.

ExpandingMan · July 27, 2018, 6:57pm

Here’s a quick test I did with master on 0.7:

using DataFrames, CSV, BenchmarkTools, DataStructures

const TESTFILE = "test.csv"

testdf(n::Integer, m::Integer) = DataFrame(zeros(n,m))

function benchcsv(n::Integer, M::AbstractVector{<:Integer}, filename::String=TESTFILE)
    reads = OrderedDict{Any,Any}()
    writes = OrderedDict{Any,Any}()
    for m ∈ M
        tdf = testdf(n,m)
        @info("testing for $n rows, $m columns...")
        writes[m] = @benchmark CSV.write($filename, $tdf) evals=1 samples=1
        reads[m] = @benchmark CSV.read($filename) evals=1 samples=1
    end
    reads, writes
end

julia> reads, writes = benchcsv(100, [i*50 for i ∈ 1:8]);

julia> writes
OrderedDict{Any,Any} with 8 entries:
  50  => Trial(3.801 ms)
  100 => Trial(7.927 ms)
  150 => Trial(13.925 ms)
  200 => Trial(21.150 ms)
  250 => Trial(32.119 ms)
  300 => Trial(42.205 ms)
  350 => Trial(50.099 ms)
  400 => Trial(63.212 ms)

julia> reads
OrderedDict{Any,Any} with 8 entries:
  50  => Trial(22.135 ms)
  100 => Trial(107.798 ms)
  150 => Trial(242.544 ms)
  200 => Trial(426.780 ms)
  250 => Trial(687.501 ms)
  300 => Trial(1.004 s)
  350 => Trial(1.695 s)
  400 => Trial(1.666 s)

Indeed the reads seem to have some rather odd performance characteristics as there are some pretty big jumps. So something is definitely fishy about how this scales with the number of columns (I’d expect linear).

Topic		Replies	Views
CSV read in is too slow than other language General Usage performance	13	1358	June 21, 2023
Alternative to DataFrame Readtable to read large data files with headers Data	17	4042	November 12, 2018
Reading Data Is Still Too Slow Data	35	8815	August 2, 2019
My experiences reading CSVs from the Fannie Mae datasets Data performance , csv	62	6143	August 26, 2019
CSV : problem to write big dataframes Data csv	20	2812	May 29, 2023

CSV.read extremely slow wrt readtable

Related topics