Make this more efficient?

Hello, all.

I have a DataFrame column, df.polygons, with strings of ten numbers separated by commas as shown below

julia> df.polygons
400000-element Vector{String}:
"79,8,79.09999999999999,8,79.09999999999999,7.9,79,7.9,79,8"
"79.09999999999999,8,79.2,8,79.2,7.9,79.09999999999999,7.9,79.09999999999999,8"
"79.2,8,79.3,8,79.3,7.9,79.2,7.9,79.2,8"
"79.3,8,79.40000000000001,8,79.40000000000001,7.9,79.3,7.9,79.3,8"
"79.40000000000001,8,79.5,8,79.5,7.9,79.40000000000001,7.9,79.40000000000001,8"
"79.5,8,79.59999999999999,8,79.59999999999999,7.9,79.5,7.9,79.5,8"
"79.59999999999999,8,79.7,8,79.7,7.9,79.59999999999999,7.9,79.59999999999999,8"
"79.7,8,79.8,8,79.8,7.9,79.7,7.9,79.7,8"
"79.8,8,79.90000000000001,8,79.90000000000001,7.9,79.8,7.9,79.8,8"
"79.90000000000001,8,80,8,80,7.9,79.90000000000001,7.9,79.90000000000001,8"
"79,7.9,79.09999999999999,7.9,79.09999999999999,7.8,79,7.8,79,7.9"
 ⋮
"12,29.1,12.1,29.1,12.1,29,12,29,12,29.1"

I want to place each number (excluding the last two), as a Float64, into its own DataFrame column (either in the original DataFrame or a new one, as done below).

I’ve tried two methods (both work, with the second variation being 300 ms slower than the first), but I feel like there’s a more efficient way.

Can you come up with something more clever or efficient?

using DataFrames, Chain

dfpoly = DataFrame(
    :corner1lat => Float64[],
    :corner1lon => Float64[],
    :corner2lat => Float64[],
    :corner2lon => Float64[],
    :corner3lat => Float64[],
    :corner3lon => Float64[],
    :corner4lat => Float64[],
    :corner4lon => Float64[]
)

# VARIATION 1
for i in 1:length(df.polygons)
    @chain split(df[i, :polygons], ",") begin
        [parse(Float64, _[j]) for j in 1:8]   # j, so the comprehension index no longer shadows i
        push!(dfpoly, _')
    end
end

# VARIATION 2
for str in df.polygons               # each element is a single String, so split needs no broadcasting dot
    @chain split(str, ",") begin
        parse.(Float64, _)
        push!(dfpoly, _[1:8]')
    end
end
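For comparison, a preallocation-based sketch of the same loop (illustrative only, not one of the two variations above): fill a Matrix{Float64} and build the DataFrame in a single call instead of push!-ing one row at a time.

vals = Matrix{Float64}(undef, length(df.polygons), 8)
for (i, s) in enumerate(df.polygons)
    fields = split(s, ",")
    for j in 1:8                                 # only the first eight numbers are kept
        vals[i, j] = parse(Float64, fields[j])
    end
end
dfpoly = DataFrame(vals, names(dfpoly))          # reuse the corner names defined above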

Looks like it’s possible to parse these lines using the CSV package.

Try the following:

using CSV
dfpoly = CSV.read(IOBuffer(join(df.polygons, "\n")), DataFrame; header = 0)
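If only the first eight values per row are wanted, CSV.jl can drop the rest while parsing via its select keyword; a sketch (the columns can then be renamed with rename! to the corner names):

using CSV
dfpoly = CSV.read(IOBuffer(join(df.polygons, "\n")), DataFrame;
                  header = false, select = collect(1:8))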
julia> df
3×1 DataFrame
 Row │ polygons
     │ String
─────┼───────────────────────────────────
   1 │ 79,8,79.09999999999999,8,79.0999…
   2 │ 79.09999999999999,8,79.2,8,79.2,…
   3 │ 79.2,8,79.3,8,79.3,7.9,79.2,7.9,…

julia> using DataFramesMeta

julia> @rtransform(df, $AsTable = parse.(Float64, split(:polygons,",")))
3×11 DataFrame
 Row │ polygons                           x1       x2       x3       x4       x5       x6       x7       x8       x9       x10
     │ String                             Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 79,8,79.09999999999999,8,79.0999…     79.0      8.0     79.1      8.0     79.1      7.9     79.0      7.9     79.0      8.0
   2 │ 79.09999999999999,8,79.2,8,79.2,…     79.1      8.0     79.2      8.0     79.2      7.9     79.1      7.9     79.1      8.0
   3 │ 79.2,8,79.3,8,79.3,7.9,79.2,7.9,…     79.2      8.0     79.3      8.0     79.3      7.9     79.2      7.9     79.2      8.0
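If the parsed values should live in the original DataFrame (the question allowed either), the mutating variant is a one-line sketch as well:

@rtransform!(df, $AsTable = parse.(Float64, split(:polygons, ",")))

This adds the x1 … x10 columns to df in place instead of returning a new DataFrame.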

col=[
"79,8,79.09999999999999,8,79.09999999999999,7.9,79,7.9,79,8"
,"79.09999999999999,8,79.2,8,79.2,7.9,79.09999999999999,7.9,79.09999999999999,8"
,"79.2,8,79.3,8,79.3,7.9,79.2,7.9,79.2,8"
,"79.3,8,79.40000000000001,8,79.40000000000001,7.9,79.3,7.9,79.3,8"
,"79.40000000000001,8,79.5,8,79.5,7.9,79.40000000000001,7.9,79.40000000000001,8"
,"79.5,8,79.59999999999999,8,79.59999999999999,7.9,79.5,7.9,79.5,8"
,"79.59999999999999,8,79.7,8,79.7,7.9,79.59999999999999,7.9,79.59999999999999,8"
,"79.7,8,79.8,8,79.8,7.9,79.7,7.9,79.7,8"
,"79.8,8,79.90000000000001,8,79.90000000000001,7.9,79.8,7.9,79.8,8"
,"79.90000000000001,8,80,8,80,7.9,79.90000000000001,7.9,79.90000000000001,8"
,"79,7.9,79.09999999999999,7.9,79.09999999999999,7.8,79,7.8,79,7.9"]


text=(join(col,"\n"))


open("poligon.csv", "w") do io
    write(io, text)
end


dfs = CSV.read("poligon.csv", DataFrame; header=false, delim=",")

EDITED

Oops… I had not seen the answer from @awasserman. Pardon!

Very interesting, guys. I really appreciate your feedback. I’ve been trying to train my brain to use vectorization when possible, so this is all quite helpful. Also on my “to do” list is to become familiar with DataFramesMeta, as @bkamins employed. With it, the desired result was obtained in a single line.
Thanks again!

Another one-liner:

DataFrame(reduce(hcat, [parse.(Float64, vi) for vi in split.(df.polygons, ',')])'[:, 1:8], :auto)
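The same idea spread over a few lines, to make the intermediate shapes explicit (an illustrative rewrite, not a different method):

parsed = [parse.(Float64, v) for v in split.(df.polygons, ',')]   # Vector of 10-element Vector{Float64}
mat    = permutedims(reduce(hcat, parsed))                        # N×10 Matrix{Float64}
dfpoly = DataFrame(mat[:, 1:8], :auto)                            # keep only the first eight values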

Nice, @rafael.guerra. The real data set will have over 2.2 million rows, and I’ll probably run this often, so I’ll time the execution. I’m also trying to learn about multi-threading and the packages that exist. Thanks again for your input.

So I pre-added the columns:

function columnate!(df)
    for k in 2:9
        df[!, Symbol("k$k")] = zeros(size(df, 1))
    end
end
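(Usage sketch: call it once on the DataFrame before the loop below, so the Float64 columns k2 … k9 exist.)

columnate!(df)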

The naive multi-threaded version using split is slower than the single-threaded one (for me) [this is for 2m rows]:


function splitize(df)
    ks = names(df)
    Threads.@threads for k in 1:size(df, 1)
        vs = split(df[!, :ps][k], ',')
        for j in 2:9
            df[!, ks[j]][k] = parse(Float64, vs[j-1])
        end
    end
end

julia> Threads.nthreads()
4
julia> @time  splitize(df)
 38.600944 seconds (56.00 M allocations: 3.368 GiB, 3.18% gc time)

julia> Threads.nthreads()
1
julia> @time  splitize(df)
 31.398387 seconds (56.00 M allocations: 3.368 GiB, 2.76% gc time)

So I tried splitting the rows into chunks, thinking perhaps the per-row scheduling and context-switching overhead was outweighing the actual work:

function splitize(df)
    ks = names(df)
    step = 250_000
    @sync begin
        for i in 1:step:2_000_000
            Threads.@spawn begin
                for k in 0:(step-1)
                    vs = split(df[!, :ps][i+k], ',')
                    for j in 2:9
                        df[!, ks[j]][i+k] = parse(Float64, vs[j-1])
                    end
                end
            end
        end
    end
end

And no matter what step size, the results were always about the same


julia> @time  splitize(df)
 38.745159 seconds (56.21 M allocations: 3.380 GiB, 2.54% gc time, 0.26% compilation time)
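One likely contributor to the allocations in both loop versions is that the inner loop indexes untyped DataFrame columns; below is a hedged sketch of a function-barrier variant that hoists the concrete vectors out first (splitize_barrier! is a made-up name, the k2 … k9 columns come from columnate! above, and df.ps is assumed to have a concrete String eltype; not benchmarked here):

function splitize_barrier!(ps::AbstractVector{<:AbstractString},
                           cols::Vector{Vector{Float64}})
    Threads.@threads for k in eachindex(ps)
        fields = split(ps[k], ',')
        for j in eachindex(cols)
            cols[j][k] = parse(Float64, fields[j])
        end
    end
    return nothing
end

cols = [df[!, "k$k"] for k in 2:9]   # the pre-added Float64 columns
splitize_barrier!(df.ps, cols)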

Not exactly sure what you are benchmarking, but creating some fake data by just taking the OP’s 11 example rows and repeating them 200k times (for a total size of 2.2m rows), I get with CSV.jl:

julia> @btime CSV.read(IOBuffer(join(col_2m, "\n")), DataFrame; header = false);
  3.527 s (88000636 allocations: 4.04 GiB)

single threaded, and with 4 threads:

julia> @btime CSV.read(IOBuffer(join(col_2m, "\n")), DataFrame; header = false);
  510.617 ms (8064 allocations: 345.45 MiB)
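For reference, col_2m is just the 11 example strings from the top of the thread repeated (per the remark further down):

col_2m = repeat(col, 200_000)   # 11 strings × 200_000 = 2.2m rows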
For reference, this is how I created the 2m test rows:

rows = Vector(undef, 2_000_000)
for r in 1:length(rows)
    rows[r] = join(65.5 .+ 5rand(10), ",")
end
df = DataFrame(ps=rows)

For some reason that’s a bit slower than the repeat(col, 200_000) data I used in my benchmark above. With four threads:

julia> @btime CSV.read(IOBuffer(join(df.ps, "\n")), DataFrame; header = false);
  1.022 s (9303 allocations: 575.91 MiB)

(note I did df.ps = identity.(df.ps), since your code produced an Any vector; Vector(undef, n) defaults to eltype Any)

oh yeah, that made a difference

rows = Vector{String}(undef, 2_000_000)

...

 30.352554 seconds (58.11 M allocations: 3.434 GiB, 5.23% gc time, 0.28% compilation time)