Converting all strings to either upper- or lower-case is a common task in data prep/cleaning. Currently, I do this as follows:
function uppercase_strings(df::DataFrame)
for row in eachrow(describe(df))
if row[:eltype] === String
df[row[:variable]] = [uppercase(df[row[:variable]][i]) for i in 1:length(df[row[:variable]])]
end
end
end
The performance has been fine so far (for my needs), but I’m wondering if someone can help optimize this to make it as performant as possible?
julia> a = Array{String}(undef, 30000,10);
julia> for i in 1:length(a)
a[i] = randstring(20)
end
julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))
174.917 ms (1800494 allocations: 99.70 MiB)
(I have to change df[!, row[:variable]] to comply with new syntax.
Some attempts at improving below; as @jiling says the syntax needs to be updated for the new DataFrames indexing, and I’ve changed the comprehension into a broadcast (there was a Slack discussion recently I think where @bkamins showed that this is faster for column operations on DataFrames) and in the third function I’m using the eachcol function to iterate over columns (maybe the type check could be made less brittle):
using Random, DataFrames
function uppercase_strings(df::DataFrame)
for row in eachrow(describe(df))
if row[:eltype] === String
df[!, row[:variable]] = [uppercase(df[!, row[:variable]][i]) for i in 1:length(df[!, row[:variable]])]
end
end
end
function ucs2(df)
for row in eachrow(describe(df))
if row[:eltype] === String
df[!, row[:variable]] = uppercase.(df[!, row[:variable]])
end
end
end
function ucs3(df)
for (i, c) in enumerate(eachcol(df))
if typeof(c) == Array{String, 1}
df[!, i] = uppercase.(c)
end
end
end
a = Array{String}(undef, 30000,10);
for i in 1:length(a)
a[i] = randstring(20)
end
Gives for me:
julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))
203.504 ms (1801504 allocations: 99.78 MiB)
julia> @btime ucs2(df) setup=(df=DataFrame(copy($a)))
167.436 ms (1506564 allocations: 95.27 MiB)
julia> @btime ucs3(df) setup=(df=DataFrame(copy($a)))
130.182 ms (1500051 allocations: 80.11 MiB)
function ucs4(df)
for c in eachcol(df)
if typeof(c) == Array{String, 1}
c .= uppercase.(c)
end
end
end
which is in place and should be a bit faster because it is an in-place operation (but it replaces old vectors - which sometimes might not be desirable - the earlier solution avoided overwriting).
Can you please benchmark it on the same machine so that we can see the difference?
julia> @btime ucs4(df) setup=(df=DataFrame(copy($a)))
126.057 ms (1500021 allocations: 77.82 MiB)
(I actually had a faster version of ucs4 before thay ran in <100ms, but that was because I forgot the dot in .= so it didn’t actually do the assignment )
You do not notice differences, because the major cost is in uppercase function. Here is a benchmark that shows it (we can see the difference in this benchmark because we do a minimal number of uppercase operations and we have a lot of rows). I have also added another version 5 - with function barrier that is yet faster. Here are the definitions:
function ucs3(df)
for (i, c) in enumerate(eachcol(df))
if typeof(c) == Array{String, 1}
df[!, i] = uppercase.(c)
end
end
end
function ucs4(df)
for c in eachcol(df)
if typeof(c) == Array{String, 1}
c .= uppercase.(c)
end
end
end
_helper(c) = nothing
_helper(c::AbstractVector{String}) = (c .= uppercase.(c))
function ucs5(df)
foreach(_helper, eachcol(df))
end
and runtimes:
julia> using DataFrames, BenchmarkTools
julia> df = DataFrame([["a"] for i in 1:10^5]);
julia> @benchmark ucs3($df)
BenchmarkTools.Trial:
memory estimate: 38.14 MiB
allocs estimate: 899490
--------------
minimum time: 75.094 ms (12.09% GC)
median time: 88.880 ms (21.55% GC)
mean time: 88.182 ms (19.22% GC)
maximum time: 108.175 ms (17.67% GC)
--------------
samples: 57
evals/sample: 1
julia> @benchmark ucs4($df)
BenchmarkTools.Trial:
memory estimate: 27.47 MiB
allocs estimate: 700001
--------------
minimum time: 67.443 ms (15.62% GC)
median time: 73.634 ms (15.44% GC)
mean time: 77.277 ms (18.06% GC)
maximum time: 102.569 ms (21.50% GC)
--------------
samples: 65
evals/sample: 1
julia> @benchmark ucs5($df)
BenchmarkTools.Trial:
memory estimate: 24.41 MiB
allocs estimate: 500001
--------------
minimum time: 27.760 ms (41.79% GC)
median time: 28.618 ms (41.13% GC)
mean time: 30.502 ms (43.29% GC)
maximum time: 47.211 ms (49.03% GC)
--------------
samples: 164
evals/sample: 1