Is there an equivalent of eachindex() for DataFrames?

phantom · October 19, 2022, 7:17am

Hi, suppose I have an array and want to append some string to each entry. I can iterate across it with eachindex():

singers = [["Marley", "Kiiara", "Sinead"] ["Rynn","Illenium","Nelly"]]

for k in eachindex(singers)
    singers[k] = "$(singers[k]) sold out."
end

Just curious whether there is a similar function or way to achieve this with a DataFrame?
If I try something like

df = DataFrame(singers, :auto)

for (i,j) = (eachrow(df),eachcol(df))
    df[ i , j ] = "$(df[ i , j ]) today"
end

Julia returns a MethodError. Thank you!

rafael.guerra · October 19, 2022, 7:52am

One option might be:

for i in 1:nrow(df), j in 1:ncol(df)
    df[i,j] = "$(df[i,j]) sold out"
end

bkamins · October 19, 2022, 9:57am

eachindex is not supported for data frames. You can use what @rafael.guerra proposed. However, the main reason it is not supported is that such indexing is very inefficient, so it will have reasonable performance only for small data frames. If you need to do such iteration, use a function barrier.

phantom · October 20, 2022, 1:09am

Thanks so much, this is very helpful. Just to clarify: do the performance-boosting methods you discussed in your Efficiency of DataFrame Row Iteration blog posts apply only if I am iterating over the rows of multiple or all columns of a DataFrame? So if I were just iterating over the rows of a single column (say each row of df.x1), then the performance boost would not apply, because at that point I am just iterating over a vector?

phantom · October 20, 2022, 1:12am

thanks so much! this is what I was thinking about but didn’t know the correct format.

bkamins · October 20, 2022, 7:05am

To have high performance you need to use a function barrier. Here is the most basic example:

function function_barrier(vec)
    # ... your code iterating over the elements of vec
end
map(function_barrier, eachcol(df))

The point is that in order to be efficient you must pass a column to a separate function. Then inside this function all will be fast.

The reason is that a DataFrame object is not type stable, so for example even:

for col in eachcol(df)
    for v in col
        # ... your code
    end
end

will be slow, because Julia does not know the element type of col at compilation time.
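
As a rough sketch of the function barrier applied to the singers example from the first post: the kernel name append_suffix! and its suffix argument are made up for illustration and are not part of DataFrames.jl.

```julia
using DataFrames

# Same 3×2 matrix of strings as in the original question.
singers = [["Marley", "Kiiara", "Sinead"] ["Rynn", "Illenium", "Nelly"]]
df = DataFrame(singers, :auto)

# Hypothetical kernel: Julia compiles a specialized method for the concrete
# type of `col` (here Vector{String}), so the loop inside is type stable.
function append_suffix!(col::AbstractVector{<:AbstractString}, suffix)
    for k in eachindex(col)
        col[k] = string(col[k], suffix)
    end
    return col
end

# The barrier: each column is passed to the kernel as a function argument.
foreach(col -> append_suffix!(col, " sold out"), eachcol(df))
```

Since eachcol hands over the stored columns without copying them, the kernel updates df in place.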

rafael.guerra · October 20, 2022, 7:25am

In this case, as all data frame elements are strings, it should be faster to create a matrix of strings, iterate over each index of this matrix and edit the strings, and finally convert the result back to a DataFrame for further work?
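
A minimal sketch of that round trip, assuming every column holds strings so that Matrix(df) has a concrete element type (the name df2 is only for illustration):

```julia
using DataFrames

singers = [["Marley", "Kiiara", "Sinead"] ["Rynn", "Illenium", "Nelly"]]
df = DataFrame(singers, :auto)

# Copy the data into a Matrix{String}; eachindex works on it as usual.
m = Matrix(df)
for k in eachindex(m)
    m[k] = "$(m[k]) sold out"
end

# Rebuild a data frame with the original column names.
df2 = DataFrame(m, names(df))
```

Note that Matrix(df) copies the data, so df itself is left unchanged and the edited values live only in df2.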

bkamins · October 20, 2022, 7:39am

If all columns have the same type, then this is enough:

for col in eachcol(df)
    for v in col::Vector{String} # assuming this is the type of column
        # ... your code
    end
end

Of course, converting to a Matrix or to Tables.columntable will also work in this case.

nilshg · October 20, 2022, 8:12am

I might be missing something here but I would probably write:

function f(x)
    # ... your code
end

f.(eachcol(df))

which should get around the problem?

rafael.guerra · October 20, 2022, 8:14am

In the general case, we would need a function of both indices: (i,j) -> f(i,j)

bkamins · October 20, 2022, 12:00pm

So it would be f.(enumerate(eachcol(df))) or f.(pairs(eachcol(df))) depending on what kind of column index user wants.

phantom · October 20, 2022, 11:41pm

Thanks for all the explanations and additional methods! Sorry I am still a little confused about the distinction between f.(eachcol(df)), f.(enumerate(eachcol(df))), and f.(pairs(eachcol(df))) and how to implement the latter two.

for col in eachcol(df)
    for v in col::Vector{String}
        println("$v today")
    end
end

gives the desired result of

Marley sold out today
Kiiara sold out today
Sinead sold out today
Rynn sold out today
Illenium sold out today
Nelly sold out today

Likewise if I define

function f(x)
    for i in x
        println("$i today")
    end
end

then both

julia> f.(eachcol(df))

and

map(f,eachcol(df))

yield

Marley sold out today
Kiiara sold out today
Sinead sold out today
Rynn sold out today
Illenium sold out today
Nelly sold out today
2-element Vector{Nothing}:
 nothing
 nothing

but

julia> f.(enumerate(eachcol(df)))

1 today
["Marley sold out", "Kiiara sold out", "Sinead sold out"] today
2 today
["Rynn sold out", "Illenium sold out", "Nelly sold out"] today
2-element Vector{Nothing}:
 nothing
 nothing

and

julia> f.(pairs(eachcol(df)))

ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted(::Function, ::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:1295
 [3] top-level scope
   @ REPL[309]:1

Is this because with enumerate and pairs it is no longer a vector being passed into the function? How does the function need to be modified for it to work? Sorry, I’m sure I’m missing something simple here. I thought maybe removing the row iteration in the function might work, but

function g(x)
   println("$x today")
end

yields

g.(enumerate(eachcol(df)))

(1, ["Marley sold out", "Kiiara sold out", "Sinead sold out"]) today
(2, ["Rynn sold out", "Illenium sold out", "Nelly sold out"]) today
2-element Vector{Nothing}:
 nothing
 nothing

and

g.(pairs(eachcol(df)))

ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted(::Function, ::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:1295
 [3] top-level scope
   @ REPL[314]:1

Other modifications to the function I’ve tried also yield errors.

bkamins · October 21, 2022, 7:07am

Let me give you a simpler example of the difference:

julia> df = DataFrame(a=1, b=2, c=3)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> collect(eachcol(df))
3-element Vector{AbstractVector}:
 [1]
 [2]
 [3]

julia> collect(enumerate(eachcol(df)))
3-element Vector{Tuple{Int64, AbstractVector}}:
 (1, [1])
 (2, [2])
 (3, [3])

julia> collect(pairs(eachcol(df)))
3-element Vector{Pair{Symbol, AbstractVector}}:
 :a => [1]
 :b => [2]
 :c => [3]

So, as you can see, the difference is just that you get different objects returned. In the enumerate case you get the column number as the first element. In the pairs case you get the column name as the first element.

As for broadcasting not working for pairs: I had forgotten that pairs returns an AbstractDict, so in this case you need to use foreach instead.
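
A small sketch of how a function can consume these objects, using the same one-row data frame. The names describe_numbered, numbered, and named are made up for illustration. The key point is that each element is a single tuple or pair, so the function destructures its one argument.

```julia
using DataFrames

df = DataFrame(a=1, b=2, c=3)

# enumerate yields (column number, column) tuples, so destructure the argument.
describe_numbered((j, col)) = "column $j has $(length(col)) row(s)"
numbered = map(describe_numbered, enumerate(eachcol(df)))

# pairs yields name => column pairs; foreach (or a plain for loop) iterates
# them without broadcasting.
named = String[]
foreach(pairs(eachcol(df))) do (name, col)
    push!(named, "column $name starts with $(first(col))")
end
```

foreach discards the return values, which is why the results are pushed into a vector here. For printing or in-place updates, the do-block body alone is enough.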

phantom · October 21, 2022, 10:18am

great thanks for clearing that up!
