Convert all strings in a DataFrame to upper case in the most performant way possible

Converting all strings to either upper- or lower-case is a common task in data prep/cleaning. Currently, I do this as follows:

function uppercase_strings(df::DataFrame)
    for row in eachrow(describe(df))
        if row[:eltype] === String
            df[row[:variable]] = [uppercase(df[row[:variable]][i]) for i in 1:length(df[row[:variable]])]
        end
    end
end 

The performance has been fine so far (for my needs), but I’m wondering if someone can help optimize this to make it as performant as possible?

why different rows have different types?

Hi @jling, I’m not sure I understand the question. For the function above to work, all items in a column would have to be of type String.

ah, sorry I mis-read your code.

Your current implementation is pretty fast:

julia> a = Array{String}(undef, 30000,10);
julia> for i in 1:length(a)
           a[i] = randstring(20)
       end
julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))
  174.917 ms (1800494 allocations: 99.70 MiB)

(I have to change df[!, row[:variable]] to comply with new syntax.

1 Like

I think there might have been some confusion over your usage of row(describe(df)) to iterate through the columns of the underlying df and check types…

Are you assuming that all columns are single type? I.e. no Unions, missings etc?

1 Like

Some attempts at improving below; as @jiling says the syntax needs to be updated for the new DataFrames indexing, and I’ve changed the comprehension into a broadcast (there was a Slack discussion recently I think where @bkamins showed that this is faster for column operations on DataFrames) and in the third function I’m using the eachcol function to iterate over columns (maybe the type check could be made less brittle):

using Random, DataFrames

function uppercase_strings(df::DataFrame)
    for row in eachrow(describe(df))
        if row[:eltype] === String
            df[!, row[:variable]] = [uppercase(df[!, row[:variable]][i]) for i in 1:length(df[!, row[:variable]])]
        end
    end
end 

function ucs2(df)
	for row in eachrow(describe(df))
        if row[:eltype] === String
            df[!, row[:variable]] = uppercase.(df[!, row[:variable]])
        end
    end
end 	

function ucs3(df)
	for (i, c) in enumerate(eachcol(df))
        if typeof(c) == Array{String, 1}
            df[!, i] = uppercase.(c)
        end
    end
end 	

a = Array{String}(undef, 30000,10);
for i in 1:length(a)
   a[i] = randstring(20)
end

Gives for me:

julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))                                                                                          
  203.504 ms (1801504 allocations: 99.78 MiB)                                                                                                               
                                                                                                                                                            
julia> @btime ucs2(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  167.436 ms (1506564 allocations: 95.27 MiB)                                                                                                               
                                                                                                                                                            
julia> @btime ucs3(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  130.182 ms (1500051 allocations: 80.11 MiB)    
1 Like

You can do this:

function ucs4(df)
	for c in eachcol(df)
        if typeof(c) == Array{String, 1}
            c .= uppercase.(c)
        end
    end
end 

which is in place and should be a bit faster because it is an in-place operation (but it replaces old vectors - which sometimes might not be desirable - the earlier solution avoided overwriting).

Can you please benchmark it on the same machine so that we can see the difference? :smile:

1 Like

Seems to benchmark exactly the same as ucs3:

julia> @btime ucs4(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  126.057 ms (1500021 allocations: 77.82 MiB) 

(I actually had a faster version of ucs4 before thay ran in <100ms, but that was because I forgot the dot in .= so it didn’t actually do the assignment :slight_smile: )

1 Like

All of you are awesome. Thanks so much!!!

You do not notice differences, because the major cost is in uppercase function. Here is a benchmark that shows it (we can see the difference in this benchmark because we do a minimal number of uppercase operations and we have a lot of rows). I have also added another version 5 - with function barrier that is yet faster. Here are the definitions:

function ucs3(df)
    for (i, c) in enumerate(eachcol(df))
        if typeof(c) == Array{String, 1}
            df[!, i] = uppercase.(c)
        end
    end
end     

function ucs4(df)
    for c in eachcol(df)
        if typeof(c) == Array{String, 1}
            c .= uppercase.(c)
        end
    end
end 

_helper(c) = nothing
_helper(c::AbstractVector{String}) = (c .= uppercase.(c))

function ucs5(df)
    foreach(_helper, eachcol(df))
end

and runtimes:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame([["a"] for i in 1:10^5]);

julia> @benchmark ucs3($df)
BenchmarkTools.Trial:
  memory estimate:  38.14 MiB
  allocs estimate:  899490
  --------------
  minimum time:     75.094 ms (12.09% GC)
  median time:      88.880 ms (21.55% GC)
  mean time:        88.182 ms (19.22% GC)
  maximum time:     108.175 ms (17.67% GC)
  --------------
  samples:          57
  evals/sample:     1

julia> @benchmark ucs4($df)
BenchmarkTools.Trial:
  memory estimate:  27.47 MiB
  allocs estimate:  700001
  --------------
  minimum time:     67.443 ms (15.62% GC)
  median time:      73.634 ms (15.44% GC)
  mean time:        77.277 ms (18.06% GC)
  maximum time:     102.569 ms (21.50% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark ucs5($df)
BenchmarkTools.Trial:
  memory estimate:  24.41 MiB
  allocs estimate:  500001
  --------------
  minimum time:     27.760 ms (41.79% GC)
  median time:      28.618 ms (41.13% GC)
  mean time:        30.502 ms (43.29% GC)
  maximum time:     47.211 ms (49.03% GC)
  --------------
  samples:          164
  evals/sample:     1