Why making tuples from excels columns take so much times?

I have an excel sheet that has four colums of length 10000, and I’d like to create list of tuples from two colums. So, I used list comprehension first, and after 10 minutes is was still running. So, I did this:

twos = Vector{Tuple{Float16, Float16}}()

for i in 1:length(df.First_column)
    twos[i] = (df.First_column[i], df.second_column[i])

But the above code shows BoundsError: attempt to access 0-element Vector{Tuple{Float16, Float16}} at index [1]. What is not correct?

Is there a better way to create a list of tuples from two excel columns? In each row , I just want to put two adjance columns values into tuples.


twos is an empty vector as you created. You need

push!(twos, (df.First_column[i], df.second_column[i]))

Or else define the vector with the appropriate length from start, with, for example filling it
with tuples of zeros:

julia> twos = fill((Float16(0),Float16(0)), 3)
3-element Vector{Tuple{Float16, Float16}}:
 (0.0, 0.0)
 (0.0, 0.0)
 (0.0, 0.0)

Concerning the other part, I don’t know, here this is fast:

julia> df = DataFrame()
0×0 DataFrame

julia> df.x = rand(10^4);

julia> df.y = rand(10^4);

julia> twos = Vector{Tuple{Float16, Float16}}()
Tuple{Float16, Float16}[]

julia> @time for i in eachindex(df.x)
           push!(twos, (df.x[i], df.y[i]))
  0.001532 seconds (58.99 k allocations: 1.365 MiB)
1 Like

I find it hard to believe it would take >10 minutes to parse a spreadsheet with 10000 rows and create a vector of 10000 tuples.

Anyway, a simpler approach is possible.

Based on the variable name df, I assume you are using DataFrames.jl. So you could do

M = [df.First_column df.First_column]

to create a 10000 \times 2 matrix. Perhaps, it would already suit your needs.

If not, you can transform that matrix into a vector of Tuples:

twos = Tuple.(eachrow(M))
1 Like

Exactly, I am also wondering why this happens. It takes very long.

Thanks for the suggestion

Can you show us this code? There is no reason it should take that long.

Indeed, I just tested this out of interest and am getting:

julia> using XLSX

julia> @time XLSX.readtable(path_to_test_file, 1)
  0.707339 seconds (1.39 M allocations: 69.331 MiB, 25.17% gc time, 60.57% compilation time)

inferring element types and storing things in a DataFrame:

julia> @time df = DataFrame(XLSX.readtable(path_to_test_file, 1; infer_eltypes = true))
  0.252611 seconds (715.62 k allocations: 25.391 MiB, 51.60% gc time)
9999×2 DataFrame

creating a tuple from this with a list comprehension as OP said:

julia> @time [(x, y) for (x, y) ∈ zip(df[!, 1], df[!, 2])]
  0.057351 seconds (119.60 k allocations: 8.269 MiB, 83.58% compilation time)
9999-element Vector{Tuple{Float64, Float64}}:

I suppose things can get slower if the overall Excel file is bigger (my test file is just 10,000 rows, 2 columns), but I’d say if this is taking longer than a few seconds something is off.

1 Like