Convert all strings in a DataFrame to upper case in the most performant way possible

mthelm85 · July 25, 2019, 1:57pm

Converting all strings to either upper- or lower-case is a common task in data prep/cleaning. Currently, I do this as follows:

function uppercase_strings(df::DataFrame)
    for row in eachrow(describe(df))
        if row[:eltype] === String
            df[row[:variable]] = [uppercase(df[row[:variable]][i]) for i in 1:length(df[row[:variable]])]
        end
    end
end

The performance has been fine so far (for my needs), but I’m wondering if someone can help optimize this to make it as performant as possible?

jling · July 25, 2019, 2:09pm

why different rows have different types?

mthelm85 · July 25, 2019, 2:49pm

Hi @jling, I’m not sure I understand the question. For the function above to work, all items in a column would have to be of type String.

jling · July 25, 2019, 3:14pm

ah, sorry I mis-read your code.

Your current implementation is pretty fast:

julia> a = Array{String}(undef, 30000,10);
julia> for i in 1:length(a)
           a[i] = randstring(20)
       end
julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))
  174.917 ms (1800494 allocations: 99.70 MiB)

(I have to change df[!, row[:variable]] to comply with new syntax.

nilshg · July 25, 2019, 3:15pm

I think there might have been some confusion over your usage of row(describe(df)) to iterate through the columns of the underlying df and check types…

Are you assuming that all columns are single type? I.e. no Unions, missings etc?

nilshg · July 25, 2019, 3:35pm

Some attempts at improving below; as @jiling says the syntax needs to be updated for the new DataFrames indexing, and I’ve changed the comprehension into a broadcast (there was a Slack discussion recently I think where @bkamins showed that this is faster for column operations on DataFrames) and in the third function I’m using the eachcol function to iterate over columns (maybe the type check could be made less brittle):

using Random, DataFrames

function uppercase_strings(df::DataFrame)
    for row in eachrow(describe(df))
        if row[:eltype] === String
            df[!, row[:variable]] = [uppercase(df[!, row[:variable]][i]) for i in 1:length(df[!, row[:variable]])]
        end
    end
end 

function ucs2(df)
	for row in eachrow(describe(df))
        if row[:eltype] === String
            df[!, row[:variable]] = uppercase.(df[!, row[:variable]])
        end
    end
end 	

function ucs3(df)
	for (i, c) in enumerate(eachcol(df))
        if typeof(c) == Array{String, 1}
            df[!, i] = uppercase.(c)
        end
    end
end 	

a = Array{String}(undef, 30000,10);
for i in 1:length(a)
   a[i] = randstring(20)
end

Gives for me:

julia> @btime uppercase_strings(df) setup=(df=DataFrame(copy($a)))                                                                                          
  203.504 ms (1801504 allocations: 99.78 MiB)                                                                                                               
                                                                                                                                                            
julia> @btime ucs2(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  167.436 ms (1506564 allocations: 95.27 MiB)                                                                                                               
                                                                                                                                                            
julia> @btime ucs3(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  130.182 ms (1500051 allocations: 80.11 MiB)

bkamins · July 25, 2019, 3:40pm

You can do this:

function ucs4(df)
	for c in eachcol(df)
        if typeof(c) == Array{String, 1}
            c .= uppercase.(c)
        end
    end
end

which is in place and should be a bit faster because it is an in-place operation (but it replaces old vectors - which sometimes might not be desirable - the earlier solution avoided overwriting).

Can you please benchmark it on the same machine so that we can see the difference?

nilshg · July 25, 2019, 3:42pm

Seems to benchmark exactly the same as ucs3:

julia> @btime ucs4(df) setup=(df=DataFrame(copy($a)))                                                                                                       
  126.057 ms (1500021 allocations: 77.82 MiB)

(I actually had a faster version of ucs4 before thay ran in <100ms, but that was because I forgot the dot in .= so it didn’t actually do the assignment )

mthelm85 · July 25, 2019, 4:14pm

All of you are awesome. Thanks so much!!!

bkamins · July 25, 2019, 5:24pm

You do not notice differences, because the major cost is in uppercase function. Here is a benchmark that shows it (we can see the difference in this benchmark because we do a minimal number of uppercase operations and we have a lot of rows). I have also added another version 5 - with function barrier that is yet faster. Here are the definitions:

function ucs3(df)
    for (i, c) in enumerate(eachcol(df))
        if typeof(c) == Array{String, 1}
            df[!, i] = uppercase.(c)
        end
    end
end     

function ucs4(df)
    for c in eachcol(df)
        if typeof(c) == Array{String, 1}
            c .= uppercase.(c)
        end
    end
end 

_helper(c) = nothing
_helper(c::AbstractVector{String}) = (c .= uppercase.(c))

function ucs5(df)
    foreach(_helper, eachcol(df))
end

and runtimes:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame([["a"] for i in 1:10^5]);

julia> @benchmark ucs3($df)
BenchmarkTools.Trial:
  memory estimate:  38.14 MiB
  allocs estimate:  899490
  --------------
  minimum time:     75.094 ms (12.09% GC)
  median time:      88.880 ms (21.55% GC)
  mean time:        88.182 ms (19.22% GC)
  maximum time:     108.175 ms (17.67% GC)
  --------------
  samples:          57
  evals/sample:     1

julia> @benchmark ucs4($df)
BenchmarkTools.Trial:
  memory estimate:  27.47 MiB
  allocs estimate:  700001
  --------------
  minimum time:     67.443 ms (15.62% GC)
  median time:      73.634 ms (15.44% GC)
  mean time:        77.277 ms (18.06% GC)
  maximum time:     102.569 ms (21.50% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark ucs5($df)
BenchmarkTools.Trial:
  memory estimate:  24.41 MiB
  allocs estimate:  500001
  --------------
  minimum time:     27.760 ms (41.79% GC)
  median time:      28.618 ms (41.13% GC)
  mean time:        30.502 ms (43.29% GC)
  maximum time:     47.211 ms (49.03% GC)
  --------------
  samples:          164
  evals/sample:     1

Topic		Replies	Views
Lowercase dataframe column strings with missing values New to Julia question , dataframes	1	492	March 28, 2022
String column dataframe: replace for another string built from a substring from each row Data	1	1280	January 20, 2021
On iterating through columns of a data frame and check if they contain String values General Usage type , dataframes , data_structures	4	2069	April 28, 2020
Better way to clean and normalize a DataFrame of strings? Data strings , dataframes	4	1267	April 12, 2021
Having problems to modifying a DataFrame using a for loop and eachcol New to Julia question	2	330	June 21, 2022

Convert all strings in a DataFrame to upper case in the most performant way possible

Related topics