Make this more efficient?

r0b0ty · February 25, 2022, 11:57pm

Hello, all.

I have a DataFrame column, df.polygons, with strings of ten numbers separated by commas as shown below

julia> df.polygons
400000-element Vector{String}:
“79,8,79.09999999999999,8,79.09999999999999,7.9,79,7.9,79,8”
“79.09999999999999,8,79.2,8,79.2,7.9,79.09999999999999,7.9,79.09999999999999,8”
“79.2,8,79.3,8,79.3,7.9,79.2,7.9,79.2,8”
“79.3,8,79.40000000000001,8,79.40000000000001,7.9,79.3,7.9,79.3,8”
“79.40000000000001,8,79.5,8,79.5,7.9,79.40000000000001,7.9,79.40000000000001,8”
“79.5,8,79.59999999999999,8,79.59999999999999,7.9,79.5,7.9,79.5,8”
“79.59999999999999,8,79.7,8,79.7,7.9,79.59999999999999,7.9,79.59999999999999,8”
“79.7,8,79.8,8,79.8,7.9,79.7,7.9,79.7,8”
“79.8,8,79.90000000000001,8,79.90000000000001,7.9,79.8,7.9,79.8,8”
“79.90000000000001,8,80,8,80,7.9,79.90000000000001,7.9,79.90000000000001,8”
“79,7.9,79.09999999999999,7.9,79.09999999999999,7.8,79,7.8,79,7.9”
⋮
“12,29.1,12.1,29.1,12.1,29,12,29,12,29.1”

I want to place each number (excluding the last two), as a Float64, into its own DataFrame column (either in the original DataFrame or a new one, as done below).

I’ve tried two methods (both work, with the second variation being 300 ms slower than the first), but I feel like there’s a more efficient way.

Can you come up with something more clever or efficient?

dfpoly = DataFrame(
        :corner1lat => Float64[],
        :corner1lon => Float64[],
        :corner2lat => Float64[],
        :corner2lon => Float64[],
        :corner3lat => Float64[],
        :corner3lon => Float64[],
        :corner4lat => Float64[],
        :corner4lon => Float64[]
    )
 
 # VARIATION 1
 
    for i in 1:length(df.polygons)
        @chain split(df[i, :polygons], ",") begin
            [parse(Float64, _[i]) for i in 1:8]
            push!(dfpoly, _')
        end
    end
    
# VARIATION 2
    for stringvector in df.polygons
        @chain split.(stringvector, ",") begin
            parse.(Float64, _)
            push!(dfpoly, _[1:8]')
        end
    end

awasserman · February 26, 2022, 5:07am

Looks like it’s possible to parse these lines using the CSV package.

Try the following:

using CSV
dfpoly = CSV.read(IOBuffer(join(df.polygons, "\n")), DataFrame; header = 0)

bkamins · February 26, 2022, 6:01am

julia> df
3×1 DataFrame
 Row │ polygons
     │ String
─────┼───────────────────────────────────
   1 │ 79,8,79.09999999999999,8,79.0999…
   2 │ 79.09999999999999,8,79.2,8,79.2,…
   3 │ 79.2,8,79.3,8,79.3,7.9,79.2,7.9,…

julia> using DataFramesMeta

julia> @rtransform(df, $AsTable = parse.(Float64, split(:polygons,",")))
3×11 DataFrame
 Row │ polygons                           x1       x2       x3       x4       x5       x6       x7       x8       x9       x10
     │ String                             Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 79,8,79.09999999999999,8,79.0999…     79.0      8.0     79.1      8.0     79.1      7.9     79.0      7.9     79.0      8.0
   2 │ 79.09999999999999,8,79.2,8,79.2,…     79.1      8.0     79.2      8.0     79.2      7.9     79.1      7.9     79.1      8.0
   3 │ 79.2,8,79.3,8,79.3,7.9,79.2,7.9,…     79.2      8.0     79.3      8.0     79.3      7.9     79.2      7.9     79.2      8.0

rocco_sprmnt21 · February 26, 2022, 12:40pm


col=[
"79,8,79.09999999999999,8,79.09999999999999,7.9,79,7.9,79,8"
,"79.09999999999999,8,79.2,8,79.2,7.9,79.09999999999999,7.9,79.09999999999999,8"
,"79.2,8,79.3,8,79.3,7.9,79.2,7.9,79.2,8"
,"79.3,8,79.40000000000001,8,79.40000000000001,7.9,79.3,7.9,79.3,8"
,"79.40000000000001,8,79.5,8,79.5,7.9,79.40000000000001,7.9,79.40000000000001,8"
,"79.5,8,79.59999999999999,8,79.59999999999999,7.9,79.5,7.9,79.5,8"
,"79.59999999999999,8,79.7,8,79.7,7.9,79.59999999999999,7.9,79.59999999999999,8"
,"79.7,8,79.8,8,79.8,7.9,79.7,7.9,79.7,8"
,"79.8,8,79.90000000000001,8,79.90000000000001,7.9,79.8,7.9,79.8,8"
,"79.90000000000001,8,80,8,80,7.9,79.90000000000001,7.9,79.90000000000001,8"
,"79,7.9,79.09999999999999,7.9,79.09999999999999,7.8,79,7.8,79,7.9"]


text=(join(col,"\n"))


open("poligon.csv", "w") do io
    write(io, text)
end


dfs=CSV.read("poligon.csv",DataFrame,header=false, delim=",")

EDITED

opss … I had not seen the answer from @awasserman. pardon!

r0b0ty · February 26, 2022, 1:51pm

Very interesting, guys. I really appreciate your feedback. I’ve been trying to train my brain to use vectorization when possible, so this is all quite helpful. Also on my “to do” list is to become familiar with DataFramesMeta, as @bkamins employed. With it, the desired result was obtained in a single line.
Thanks again!

rafael.guerra · February 26, 2022, 3:30pm

Another one-liner:

DataFrame(reduce(hcat,[parse.(Float64, vi) for vi in split.(df.polygons,',')])'[:,1:8], :auto)

r0b0ty · February 26, 2022, 3:41pm

Nice, @rafael.guerra. The real data set will have over 2.2 million rows, and I’ll probably run this often, so I’ll time the execution. I’m also trying learn about multi-threading and the packages that exist. Thanks again for your input.

lawless-m · March 7, 2022, 11:10am

So I pre added the columns

function columnate!(df)
    for k in 2:9
        df[!, Symbol("k$k")] = zeros(size(df, 1))
    end
end

The naive multi threaded version using split is slower than single threaded (for me) [this is for 2m rows]


function splitize(df)
    ks = names(df)
    Threads.@threads for k in 1:size(df, 1)
        vs = split(df[!, :ps][k], ',')
        for j in 2:9
            df[!, ks[j]][k] = parse(Float64, vs[j-1])
        end
    end
end

julia> Threads.nthreads()
4
julia> @time  splitize(df)
 38.600944 seconds (56.00 M allocations: 3.368 GiB, 3.18% gc time)

julia> Threads.nthreads()
1
julia> @time  splitize(df)
 31.398387 seconds (56.00 M allocations: 3.368 GiB, 2.76% gc time)

So I tried splitting into pieces, thinking perhaps the overhead of contexing 1 row at a time was outweighing the actual work

function splitize(df)
    ks = names(df)
    step = 250_000
    @sync begin
        for i in 1:step:2_000_000
            Threads.@spawn begin
                for k in 0:(step-1)
                    vs = split(df[!, :ps][i+k], ',')
                    for j in 2:9
                        df[!, ks[j]][i+k] = parse(Float64, vs[j-1])
                    end
                end
            end
        end
    end
end

And no matter what step size, the results were always about the same


julia> @time  splitize(df)
 38.745159 seconds (56.21 M allocations: 3.380 GiB, 2.54% gc time, 0.26% compilation time)

nilshg · March 7, 2022, 11:24am

Not exactly sure what you are benchmarking, but creating some fake data by just taking the OP’s 11 example rows and repeating them 200k times (for a total size of 2.2m rows), I get with CSV.jl:

julia> @btime CSV.read(IOBuffer(join(col_2m, "\n")), DataFrame; header = false);
  3.527 s (88000636 allocations: 4.04 GiB)

single threaded, and with 4 threads:

julia> @btime CSV.read(IOBuffer(join(col_2m, "\n")), DataFrame; header = false);
  510.617 ms (8064 allocations: 345.45 MiB)

lawless-m · March 7, 2022, 11:29am

 rows = Vector(undef, 2_000_000)
 for r in 1:length(rows)
        rows[r] = join(65.5 .+ 5rand(10), ",")
 end        
 df = DataFrame(ps=rows)

nilshg · March 7, 2022, 11:52am

For some reason that’s a bit slower than the repeat(col, 200_000) which is what I used in my benchmark above - with four threads:

julia> @btime CSV.read(IOBuffer(join(df.ps, "\n")), DataFrame; header = false);
  1.022 s (9303 allocations: 575.91 MiB)

(note I did df.ps = identity.(df.ps) as for some reason your code produced an Any vector)

lawless-m · March 7, 2022, 12:37pm

oh yeah, that made a difference

rows = Vector{String}(undef, 2_000_000)

...

 30.352554 seconds (58.11 M allocations: 3.434 GiB, 5.23% gc time, 0.28% compilation time)

Topic		Replies	Views
How to split up a Float column in array or dataframe (not String) New to Julia	8	1577	June 2, 2020
Split dataframe row into multiple rows Data dataframes	8	1795	May 1, 2022
Expand row to multiple rows Data	11	1293	December 15, 2020
Converting DataValues.DataValue{String} to String New to Julia query , dataframes	10	1271	May 11, 2020
Split and flatten a field from CSV and use as DataFrame New to Julia dataframes	5	725	May 27, 2020

Make this more efficient?

Related topics