Why does making tuples from Excel columns take so much time?

I have an Excel sheet that has four columns of length 10000, and I’d like to create a list of tuples from two of the columns. I tried a list comprehension first, and after 10 minutes it was still running. So I did this:

twos = Vector{Tuple{Float16, Float16}}()

for i in 1:length(df.First_column)
    twos[i] = (df.First_column[i], df.second_column[i])
end

But the above code throws BoundsError: attempt to access 0-element Vector{Tuple{Float16, Float16}} at index [1]. What is wrong here?

Is there a better way to create a list of tuples from two Excel columns? In each row, I just want to put the values of two adjacent columns into a tuple.

Thanks

twos is an empty vector, as you created it. You need

push!(twos, (df.First_column[i], df.second_column[i]))
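
Putting it together, the corrected loop would look like this (a minimal sketch, reusing the column names from your post):

twos = Vector{Tuple{Float16, Float16}}()

for i in 1:length(df.First_column)
    # push! grows the vector one element at a time
    push!(twos, (df.First_column[i], df.second_column[i]))
end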

Or else define the vector with the appropriate length from the start, for example by filling it with tuples of zeros:

julia> twos = fill((Float16(0),Float16(0)), 3)
3-element Vector{Tuple{Float16, Float16}}:
 (0.0, 0.0)
 (0.0, 0.0)
 (0.0, 0.0)
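
With that pre-allocation, your original indexed assignment works as written; a sketch, again assuming the column names from your post:

n = length(df.First_column)
twos = fill((Float16(0), Float16(0)), n)

for i in 1:n
    # in-place assignment into the pre-allocated vector, no push! needed
    twos[i] = (df.First_column[i], df.second_column[i])
end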

Concerning the other part, I don’t know; on my machine this is fast:

julia> df = DataFrame()
0×0 DataFrame

julia> df.x = rand(10^4);

julia> df.y = rand(10^4);

julia> twos = Vector{Tuple{Float16, Float16}}()
Tuple{Float16, Float16}[]

julia> @time for i in eachindex(df.x)
           push!(twos, (df.x[i], df.y[i]))
       end;
  0.001532 seconds (58.99 k allocations: 1.365 MiB)
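
As an aside (not something the OP asked about), the loop can be skipped entirely: collecting a zip of the two columns yields the same vector of tuples, with the element type following the columns (Float64 here; twos2 is just an illustrative name):

julia> twos2 = collect(zip(df.x, df.y));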

I find it hard to believe it would take >10 minutes to parse a spreadsheet with 10000 rows and create a vector of 10000 tuples.

Anyway, a simpler approach is possible.

Based on the variable name df, I assume you are using DataFrames.jl. So you could do

M = [df.First_column df.second_column]

to create a 10000×2 matrix. Perhaps it already suits your needs.

If not, you can transform that matrix into a vector of Tuples:

twos = Tuple.(eachrow(M))
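
A quick self-contained check of that pattern, with toy data rather than the OP’s spreadsheet:

julia> M = [1.0 10.0; 2.0 20.0];

julia> Tuple.(eachrow(M))
2-element Vector{Tuple{Float64, Float64}}:
 (1.0, 10.0)
 (2.0, 20.0)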

Exactly, I am also wondering why this happens. It takes very long.

Thanks for the suggestion

Can you show us this code? There is no reason it should take that long.

Indeed, I just tested this out of interest and am getting:

julia> using XLSX, DataFrames

julia> @time XLSX.readtable(path_to_test_file, 1)
  0.707339 seconds (1.39 M allocations: 69.331 MiB, 25.17% gc time, 60.57% compilation time)

inferring element types and storing things in a DataFrame:

julia> @time df = DataFrame(XLSX.readtable(path_to_test_file, 1; infer_eltypes = true))
  0.252611 seconds (715.62 k allocations: 25.391 MiB, 51.60% gc time)
9999×2 DataFrame

creating a vector of tuples from this with a list comprehension, as the OP described:

julia> @time [(x, y) for (x, y) ∈ zip(df[!, 1], df[!, 2])]
  0.057351 seconds (119.60 k allocations: 8.269 MiB, 83.58% compilation time)
9999-element Vector{Tuple{Float64, Float64}}:
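
If Float16 tuples are specifically wanted, as in the original post, the conversion can happen inside the comprehension (a sketch; twos16 is just an illustrative name):

julia> twos16 = [(Float16(x), Float16(y)) for (x, y) ∈ zip(df[!, 1], df[!, 2])];

julia> eltype(twos16)
Tuple{Float16, Float16}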

I suppose things can get slower if the overall Excel file is bigger (my test file is just 10,000 rows, 2 columns), but I’d say if this is taking longer than a few seconds something is off.
