Here is a benchmark you can run:
using Random
using DataFrames
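function testspeed(m, n)
    Random.seed!(1234)
    println("\n$m categories, $(m*n) total rows")
    # m distinct random strings, each repeated n times
    x = repeat([randstring() for i in 1:m], n)
    println("Categorical generation time")
    @time y = categorical(x)
    df = DataFrame(x=x, y=y, z=1)
    # `by` was deprecated in DataFrames 0.21; newer versions use combine(groupby(df, :x), :z => sum)
    println("String time")
    @time by(df, :x, :z=>sum)
    println("Categorical time")
    @time by(df, :y, :z=>sum)
    nothing
end

testspeed(10, 10) # precompile
# keep the total at 10^8 rows while the number of categories grows from 10 to 10^6
for i in 1:6
    testspeed(10^i, 10^(8-i))
end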
which produces (output of the precompile call omitted):
10 categories, 100000000 total rows
Categorical generation time
7.474020 seconds (100.00 M allocations: 3.353 GiB, 19.12% gc time)
String time
5.538156 seconds (253 allocations: 3.980 GiB, 19.41% gc time)
Categorical time
2.484464 seconds (303 allocations: 2.235 GiB, 26.33% gc time)

100 categories, 100000000 total rows
Categorical generation time
5.180744 seconds (100.00 M allocations: 3.353 GiB, 19.99% gc time)
String time
7.026005 seconds (1.06 k allocations: 3.980 GiB, 16.44% gc time)
Categorical time
3.776462 seconds (1.30 k allocations: 2.235 GiB, 18.57% gc time)

1000 categories, 100000000 total rows
Categorical generation time
5.077316 seconds (100.00 M allocations: 3.353 GiB, 21.20% gc time)
String time
6.861087 seconds (10.15 k allocations: 3.981 GiB, 16.28% gc time)
Categorical time
3.750762 seconds (12.21 k allocations: 2.236 GiB, 17.40% gc time)

10000 categories, 100000000 total rows
Categorical generation time
6.873999 seconds (100.01 M allocations: 3.355 GiB, 15.22% gc time)
String time
8.694547 seconds (109.16 k allocations: 3.983 GiB, 12.69% gc time)
Categorical time
5.585156 seconds (138.20 k allocations: 2.240 GiB, 12.53% gc time)

100000 categories, 100000000 total rows
Categorical generation time
7.396810 seconds (100.10 M allocations: 3.377 GiB, 14.65% gc time)
String time
11.422482 seconds (999.15 k allocations: 4.012 GiB, 9.91% gc time)
Categorical time
7.270968 seconds (1.30 M allocations: 2.287 GiB, 9.57% gc time)

1000000 categories, 100000000 total rows
Categorical generation time
27.256516 seconds (101.00 M allocations: 3.560 GiB, 10.31% gc time)
String time
12.614024 seconds (8.00 M allocations: 4.241 GiB, 15.22% gc time)
Categorical time
28.275873 seconds (11.00 M allocations: 2.678 GiB, 8.29% gc time)

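So the problem kicks in when there are many small categories: at 10^6 categories of 100 rows each, grouping on the categorical column takes about 28.3 s versus 12.6 s for strings, while in every other case the categorical column is faster (though maybe we could handle that case too). Also note that the categorical generation time is large (5 to 7.5 s up to 10^5 categories, 27.3 s at 10^6), and that for a smaller number of categories we might try to get bigger gains.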
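To make the comparison easier to read, here is a rough sketch that turns the grouping timings printed above into string/categorical ratios. The numbers are copied by hand from that output, and the names `timings` and `speedups` are just illustrative. Categorical generation time is not included.

```julia
# Grouping timings in seconds, copied from the benchmark output above:
# (number of categories, string time, categorical time)
timings = [
    (10,       5.538156,  2.484464),
    (100,      7.026005,  3.776462),
    (1000,     6.861087,  3.750762),
    (10000,    8.694547,  5.585156),
    (100000,  11.422482,  7.270968),
    (1000000, 12.614024, 28.275873),
]

# A ratio above 1 means grouping on the categorical column was faster.
speedups = [(ncat, round(t_str / t_cat, digits=2)) for (ncat, t_str, t_cat) in timings]

for (ncat, s) in speedups
    println(rpad(ncat, 8), " categories: string/categorical = ", s)
end
```

The ratio is a bit above 2 for 10 categories, falls as the number of categories grows, and drops below 1 only in the 10^6 case.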