The best I could find is map(x -> dict[x], DF.column)
Your solution is perfectly fine.
The main reason Python has a dedicated function for this is performance: Pandas' map on a dictionary is much faster than, e.g., applying an anonymous Python function.
As an alternative to an anonymous function you could use broadcasting:
getindex.(Ref(dict), DF.column)
The Ref here indicates that broadcasting is not done over the dictionary, but only over the array (the 2nd argument).
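A minimal, self-contained sketch comparing the two approaches. The dictionary `d` and the vector `keys_vec` are made-up data for illustration, not from the original question:

```julia
# Hypothetical data for illustration only.
d = Dict("a" => 1, "b" => 2, "c" => 3)
keys_vec = ["a", "c", "b", "a"]

# Anonymous-function version, as in the question.
via_map = map(k -> d[k], keys_vec)

# Broadcast version: Ref(d) makes broadcasting treat d as a scalar,
# so only keys_vec is iterated over.
via_broadcast = getindex.(Ref(d), keys_vec)

# Wrapping d in a one-element tuple has the same effect as Ref.
via_tuple = getindex.((d,), keys_vec)
```

All three calls should produce the same vector of looked-up values.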
Thanks lungben. I didn't know about the Ref syntax. That's useful.
Dictionary lookup can't be broadcast directly, i.e. dict.[DF.column] doesn't work.
Unfortunately not, but I agree that it would be nice to have.
Maybe it is worth filing an issue to introduce a syntax like this?
d.[A]
# should be equivalent to
getindex.(Ref(d), A)
Or is there a fundamental reason against it?
That would be great.
The experts may chime in soon with a concise explanation, but in the meantime if you’d like to dig into the reasoning that led to the current design, check out issues #18618 and #25904.
For me at least, using ggplot2's mpg dataset:
idx_map = Dict(key => idx for (idx, key) in enumerate(unique(df.class)))
julia> @time map(x -> idx_map[x], df.class)
0.060553 seconds (124.05 k allocations: 8.437 MiB, 99.45% compilation time)
julia> @time getindex.(Ref(idx_map), df.class)
0.000025 seconds (4 allocations: 2.062 KiB)
Huge win for broadcasting getindex!
Maybe. But @time is not suitable for microbenchmarks. Can you try the same using BenchmarkTools? And remember variable interpolation.
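A sketch of what that could look like, assuming the BenchmarkTools package is installed. The `classes` vector is a hypothetical stand-in for `df.class` (the actual mpg dataset is not loaded here). The `$` interpolation keeps the benchmark from measuring access to non-constant global variables:

```julia
using BenchmarkTools  # assumes the package is installed

# Hypothetical stand-in for df.class; not the real mpg data.
classes = rand(["compact", "suv", "pickup"], 1000)
idx_map = Dict(key => idx for (idx, key) in enumerate(unique(classes)))

# Interpolate globals with $ so they are treated as local values.
@btime map(x -> $idx_map[x], $classes);
@btime getindex.(Ref($idx_map), $classes);
```

Because `@btime` runs the expression many times and reports the minimum, the one-off compilation cost seen with `@time` should no longer dominate the result.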
This is a compilation-time thing. Creating a new anonymous function has a fixed compilation cost, so map(t -> ..., x) is slow each time it is entered at the REPL:
julia> x = rand(1:5, 100);
julia> d = Dict(1 => "A", 2 => "B", 3 => "C", 4 => "D", 5 => "E");
julia> @time getindex.(Ref(d), x);
0.144137 seconds (172.92 k allocations: 9.074 MiB, 42.83% gc time)
julia> @time getindex.(Ref(d), x);
0.000016 seconds (4 allocations: 976 bytes)
julia> @time map(xi -> d[xi], x);
0.081783 seconds (98.49 k allocations: 5.256 MiB)
julia> @time map(xi -> d[xi], x);
0.078627 seconds (55.19 k allocations: 2.922 MiB)
julia> get_from_d = let d = d
           xi -> d[xi]
       end;
julia> @time map(get_from_d, x);
0.043167 seconds (43.72 k allocations: 2.272 MiB)
julia> @time map(get_from_d, x);
0.000024 seconds (2 allocations: 928 bytes)
This re-compilation problem only shows up in global scope, though.
julia> function get_from_d_wrapper(d, x)
           map(xi -> d[xi], x)
       end;
julia> @time get_from_d_wrapper(d, x);
0.037174 seconds (46.42 k allocations: 2.426 MiB)
julia> @time get_from_d_wrapper(d, x);
0.000006 seconds (1 allocation: 896 bytes)