I found a huge difference in speed and allocations after switching from the deprecated DataFrames.readtable to CSV.read. It shows up with wide datasets (a large number of columns), so my guess is that it is related either to the type detection in CSV.read or to the Union{T, Missing} column type.
Here is an example that loads datasets ranging from around 300 columns to more than 100,000.
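using DataFrames, Plots

T = zeros(9)  # load time in seconds for each file
K = zeros(9)  # number of columns in each file
for i in 2:10
    # @timed returns a tuple: (value, elapsed seconds, bytes allocated, ...)
    t = @timed readtable("filename$(i).csv")
    T[i-1] = t[2]           # elapsed seconds
    K[i-1] = size(t[1], 2)  # column count of the loaded DataFrame
end
bar(string.(Int.(K)), T, ylabel = "seconds", xlabel = "ncol", legend = false, title = "readtable")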
I tried to run the same code with both CSVFiles and CSV.read, but after one hour they were still running, so I gave up.
The example below shows the time difference on a small dataset only (380 columns).
Here are the results using CSV.read:
using CSV
@time CSV.read(filename)
42.808322 seconds (3.55 M allocations: 184.522 MiB, 0.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Here are the results using readtable (edited: rerun in a clean console):
using DataFrames
@time readtable(filename)
3.518891 seconds (2.02 M allocations: 105.854 MiB, 1.18% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Here are the results when loading into a DataFrame with CSVFiles:
using CSVFiles, DataFrames
@time DataFrame(load(filename))
8.311439 seconds (3.89 M allocations: 441.332 MiB, 3.30% gc time)
8×380 DataFrames.DataFrame. Omitted printing of 366 columns
Please let me know if you have a workaround. I was thinking of presetting the type of each column to test my hypothesis, but I would like to avoid doing that manually, especially for large datasets.
I am on Ubuntu 16.04 with Julia 0.6.3. Below is my package list:
- Atom 0.6.14
- BenchmarkTools 0.3.1
- CSV 0.2.5
- CSVFiles 0.7.0
- Clp 0.4.0
- Combinatorics 0.6.0
- CovarianceMatrices 0.5.0
- DataArrays 0.7.0
- DataFrames 0.11.6
- DataFramesMeta 0.3.0
- Dates 0.4.4
- DecisionTree 0.6.5
- Devectorize 0.4.2
- Distributions 0.15.0
- ExcelFiles 0.5.0
- ExcelReaders 0.9.0
- GLM 0.11.0
- JuMP 0.18.2
- Lazy 0.12.1
- MLBase 0.7.0
- MultivariateStats 0.4.0
- Parameters 0.9.0
- ParticleFilters 0.1.2
- PlotlyJS 0.10.2
- Plots 0.17.2
- PyPlot 2.5.0
- Reel 1.0.1
- Revise 0.1.1
- ScikitLearn 0.4.0
- StatPlots 0.7.2
- StatsFuns 0.6.0
- Yeppp 0.2.0
