Hello,
I have found a huge difference in speed and allocation by switching from the deprecated DataFrames.readtable to CSV.read. I have noticed that this happens with large datasets (high number of columns), thus I start guessing it could be related to the type recognition of CSV.read or the Union{T, Missing} type.
Here is an example loading dataset from around 300 cols to more than 100000.
using DataFrames, Plots
T = zeros(9)
K = zeros(9)
for i in 2:10
t = @timed readtable(filename$i.csv)
T[i-1] = t[2]
K[i-1] = size(t[1],2)
end
bar(string.(Int.(K)),T, ylabel = "seconds", xlabel = "ncol", legend = false, title = "readtable")
I tried to run the same code using both CSVFiles and CSV.read, but after one hour they where still running (thus I gave up!).
The example below shows the time difference using only a small dataset (380 cols).
Here results using CSV.read:
using CSV
CSV.read(filename)
42.808322 seconds (3.55 M allocations: 184.522 MiB, 0.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Here results using readtable (edited with clean console):
using DataFrames
readtable(filename)
3.518891 seconds (2.02 M allocations: 105.854 MiB, 1.18% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Loading into a dataframe with CSVFiles
using CSVFiles, DataFrames
@time DataFrame(load(filename))
8.311439 seconds (3.89 M allocations: 441.332 MiB, 3.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Please, let me know in case you have any workaround. I was thinking about trying to preset the type of each column to confirm my hypothesis, but I would like to avoid doing it manually, especially for large datasets.
I am on ubuntu 16.04, Julia 0.6.3. Below my Package list
Thanks,
Luca
- Atom 0.6.14
- BenchmarkTools 0.3.1
- CSV 0.2.5
- CSVFiles 0.7.0
- Clp 0.4.0
- Combinatorics 0.6.0
- CovarianceMatrices 0.5.0
- DataArrays 0.7.0
- DataFrames 0.11.6
- DataFramesMeta 0.3.0
- Dates 0.4.4
- DecisionTree 0.6.5
- Devectorize 0.4.2
- Distributions 0.15.0
- ExcelFiles 0.5.0
- ExcelReaders 0.9.0
- GLM 0.11.0
- JuMP 0.18.2
- Lazy 0.12.1
- MLBase 0.7.0
- MultivariateStats 0.4.0
- Parameters 0.9.0
- ParticleFilters 0.1.2
- PlotlyJS 0.10.2
- Plots 0.17.2
- PyPlot 2.5.0
- Reel 1.0.1
- Revise 0.1.1
- ScikitLearn 0.4.0
- StatPlots 0.7.2
- StatsFuns 0.6.0
- Yeppp 0.2.0
143 additional packages: - ASTInterpreter2 0.1.1
- AUC 0.0.0- master (unregistered)
- AbstractFFTs 0.3.1
- AssetRegistry 0.0.1
- AxisAlgorithms 0.2.0
- AxisArrays 0.2.1
- BinDeps 0.8.8
- BinaryProvider 0.3.2
- Blink 0.6.2
- Cairo 0.5.2
- Calculus 0.4.0
- CatIndices 0.1.0
- CategoricalArrays 0.3.9
- Cbc 0.3.2
- CodeTools 0.5.0
- CodecZlib 0.4.3
- ColorTypes 0.6.7
- ColorVectorSpace 0.5.2
- Colors 0.8.2
- CommonSubexpressions 0.1.0
- Compat 0.68.0
- ComputationalResources 0.2.0
- Conda 0.8.1
- Contour 0.4.0
- CoordinateTransformations 0.4.1
- CustomUnitRanges 0.1.0
- DataStreams 0.3.6
- DataStructures 0.8.3
- DataValues 0.3.3
- DebuggerFramework 0.1.2
- DiffEqDiffTools 0.4.1
- DiffResults 0.0.3
- DiffRules 0.0.4
- Distances 0.6.0
- DocSeeker 0.1.0
- DocStringExtensions 0.4.4
- DualNumbers 0.3.0
- FFTViews 0.1.0
- FFTW 0.0.4
- FileIO 0.9.0
- FixedPointNumbers 0.4.6
- ForwardDiff 0.7.5
- GR 0.31.0
- Glob 1.1.1
- Graphics 0.3.0
- Gtk 0.14.0
- GtkReactive 0.4.1
- HTTP 0.6.12
- Hiccup 0.1.1
- HttpCommon 0.4.0
- HttpParser 0.3.1
- HttpServer 0.3.1
- IdentityRanges 0.1.0
- ImageAxes 0.4.0
- ImageCore 0.6.0
- ImageDistances 0.0.2
- ImageFiltering 0.3.0
- ImageMetadata 0.4.0
- ImageMorphology 0.0.2
- ImageTransformations 0.4.1
- ImageView 0.6.0
- Images 0.14.0
- IndirectArrays 0.4.2
- IniFile 0.4.0
- InternedStrings 0.6.2
- Interpolations 0.7.3
- IntervalSets 0.2.0
- IterTools 0.2.1
- IterableTables 0.7.1
- IteratorInterfaceExtensions 0.0.2
- Iterators 0.3.1
- JSON 0.17.2
- Juno 0.4.1
- KernelDensity 0.4.1
- LNR 0.0.2
- LaTeXStrings 0.3.0
- LineSearches 3.2.5
- MacroTools 0.4.1
- MappedArrays 0.0.7
- MathProgBase 0.7.1
- MbedTLS 0.5.11
- Measures 0.2.0
- Media 0.3.0
- Missings 0.2.9
- Mustache 0.3.2
- Mux 0.3.0
- NFP 0.0.0- loss_function_implementation (unregistered)
- NLSolversBase 4.4.1
- NaNMath 0.3.1
- NamedTuples 4.0.2
- Nullables 0.0.5
- OffsetArrays 0.5.0
- Optim 0.14.1
- PDMats 0.8.0
- POMDPToolbox 0.2.7
- POMDPs 0.6.7
- PaddedViews 0.3.0
- Pidfile 1.0.0
- PlotThemes 0.2.0
- PlotUtils 0.4.4
- PlotlyBase 0.1.1
- Polynomials 0.3.2
- PooledArrays 0.2.1
- PositiveFactorizations 0.1.0
- ProgressMeter 0.5.5
- PyCall 1.16.1
- QuadGK 0.2.1
- RangeArrays 0.2.0
- Ratios 0.2.0
- Reactive 0.7.0
- RecipesBase 0.3.1
- Reexport 0.1.0
- Requires 0.4.4
- ReverseDiffSparse 0.8.1
- Rmath 0.3.3
- Rotations 0.7.0
- RoundingIntegers 0.0.3
- SHA 0.5.7
- SIUnits 0.1.0
- ScikitLearnBase 0.3.0
- ShowItLikeYouBuildIt 0.2.0
- Showoff 0.2.0
- SimpleTraits 0.6.0
- SortingAlgorithms 0.2.1
- SpecialFunctions 0.5.0
- StaticArrays 0.7.1
- StatsBase 0.23.0
- StatsModels 0.2.5
- StringDistances 0.2.1
- TableShowUtils 0.0.1
- TableTraits 0.2.0
- TableTraitsUtils 0.1.3
- TexExtensions 0.1.0
- TextParse 0.5.0
- TiledIteration 0.1.0
- Tokenize 0.4.2
- TranscodingStreams 0.5.2
- URIParser 0.3.1
- VersionParsing 1.1.1
- VideoIO 0.2.0
- WeakRefStrings 0.4.7
- WebSockets 0.5.0
- WoodburyMatrices 0.2.2