Is there an equivalent of eachindex() for DataFrames?

Hi, suppose I have an array and want to append some string to each entry. I can iterate across it with eachindex():

singers = [["Marley", "Kiiara", "Sinead"] ["Rynn","Illenium","Nelly"]]

for k in eachindex(singers)
    singers[k] = "$(singers[k]) sold out."
end

Just curious whether there is a similar function or way to achieve this with a DataFrame?
If I try something like

df = DataFrame(singers, :auto) 

for (i,j) = (eachrow(df),eachcol(df))
    df[ i , j ] = "$(df[ i , j ]) today"
end

Julia returns a MethodError. Thank you!

One option might be:

for i in 1:nrow(df), j in 1:ncol(df)
    df[i,j] = "$(df[i,j]) sold out"
end

eachindex is not supported for data frames. You can use what @rafael.guerra proposed. However, the main reason it is not supported is that such indexing is very inefficient, so it will perform reasonably only for small data frames. If you need this kind of iteration, use a function barrier.

Thanks so much, this is very helpful. Just to clarify: do the performance-boosting methods you discussed in your Efficiency of DataFrame Row Iteration blog posts apply only if I am iterating over the rows of multiple or all columns of a DataFrame? So if I were iterating over the rows of a single column (say each row of df.x1), the performance boost would not apply, because at that point I am just iterating over a vector?

Thanks so much! This is what I was thinking about, but I didn’t know the correct format.

To get high performance you need to use a function barrier. Here is the most basic example:

function_barrier(vec) = ... your code iterating elements of a vector
map(function_barrier, eachcol(df))

The point is that, in order to be efficient, you must pass a column to a separate function. Inside that function everything will be fast.
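
As a concrete sketch of the barrier pattern for the string example in this thread (the helper name `append_suffix!` and its `suffix` argument are made up for illustration and are not part of DataFrames.jl):

```julia
using DataFrames

singers = [["Marley", "Kiiara", "Sinead"] ["Rynn", "Illenium", "Nelly"]]
df = DataFrame(singers, :auto)

# Hypothetical helper: Julia compiles a specialized method for the concrete
# type of `col` (here Vector{String}), so the loop inside is type stable.
function append_suffix!(col::AbstractVector, suffix::AbstractString)
    for k in eachindex(col)
        col[k] = "$(col[k])$suffix"
    end
    return col
end

# The dynamic dispatch happens once per column, not once per element.
# eachcol returns the stored columns, so the data frame is updated in place.
foreach(col -> append_suffix!(col, " sold out"), eachcol(df))
```

foreach is used here instead of map only because the return values are not needed; map(function_barrier, eachcol(df)) as written above works the same way.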

The reason is that a DataFrame object is not type stable, so for example even:

for col in eachcol(df)
    for v in col
        ... your code
    end
end

will be slow, because Julia does not know the element type of col at compilation time.
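
One way to see this, as a small sketch on a made-up two-column data frame: the collection of columns only promises AbstractVector for its elements, while the concrete type of each column is known only once you hold that column at run time.

```julia
using DataFrames

# Illustrative data frame with two columns of different types.
df = DataFrame(a = [1, 2], b = ["x", "y"])

# Element type of the collected columns is abstract, so the compiler cannot
# specialize a loop over `col` that is written directly in the calling code.
col_eltype = eltype(collect(eachcol(df)))   # AbstractVector

# The concrete type only shows up once a single column is in hand,
# which is exactly what passing it to a separate function exploits.
b_type = typeof(df.b)                       # Vector{String}
```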

In this case, since all data frame elements are strings, would it be faster to create a matrix of strings, iterate over each index of this matrix to edit the strings, and finally convert the result back to a DataFrame for further work?

If all columns have the same type, then this is enough:

for col in eachcol(df)
    for v in col::Vector{String} # assuming this is the type of column
        ... your code
    end
end

Of course, converting to a Matrix or to a Tables.columntable will also work in this case.
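
A minimal sketch of the Matrix round trip, assuming every column holds strings as in the original question (the name `df2` is just for illustration):

```julia
using DataFrames

singers = [["Marley", "Kiiara", "Sinead"] ["Rynn", "Illenium", "Nelly"]]
df = DataFrame(singers, :auto)

# Matrix(df) allocates a new Matrix{String}, so eachindex works on it
# exactly as it did on the original array.
mat = Matrix(df)
for k in eachindex(mat)
    mat[k] = "$(mat[k]) sold out"
end

# Rebuild a data frame with auto-generated column names x1, x2.
# The original df is left unchanged.
df2 = DataFrame(mat, :auto)
```

The conversion copies the data twice (once in each direction), so whether it pays off compared with the type assertion above depends on the size of the data frame.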

I might be missing something here but I would probably write:

function f(x)
    ... your code
end

f.(eachcol(df))

which should get around the problem?

In the general case, we would need a function of both indices: (i,j) -> f(i,j)

So it would be f.(enumerate(eachcol(df))) or f.(pairs(eachcol(df))) depending on what kind of column index user wants.
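
For the enumerate variant, each element handed to the function is a tuple (j, col), so the function has to accept a tuple rather than a vector. A minimal sketch on a small made-up data frame; `tag_column!` is a hypothetical name:

```julia
using DataFrames

df = DataFrame(x1 = ["Marley", "Kiiara"], x2 = ["Rynn", "Illenium"])

# Hypothetical helper: destructures the (column index, column) tuple in its
# argument list and rewrites every entry with its row and column position.
function tag_column!((j, col))
    for i in eachindex(col)
        col[i] = "$(col[i]) at ($i, $j)"
    end
    return nothing
end

# Broadcasting calls tag_column! once per (j, col) tuple.
tag_column!.(enumerate(eachcol(df)))
```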

Thanks for all the explanations and additional methods! Sorry I am still a little confused about the distinction between f.(eachcol(df)), f.(enumerate(eachcol(df))), and f.(pairs(eachcol(df))) and how to implement the latter two.

for col in eachcol(df)
    for v in col::Vector{String}
        println("$v today")
    end
end

gives the desired result of

Marley sold out today
Kiiara sold out today
Sinead sold out today
Rynn sold out today
Illenium sold out today
Nelly sold out today

Likewise if I define

function f(x)
    for i in x
        println("$i today")
    end
end

then both

julia> f.(eachcol(df))

and

map(f,eachcol(df))

yield

Marley sold out today
Kiiara sold out today
Sinead sold out today
Rynn sold out today
Illenium sold out today
Nelly sold out today
2-element Vector{Nothing}:
 nothing
 nothing

but

julia> f.(enumerate(eachcol(df)))

1 today
["Marley sold out", "Kiiara sold out", "Sinead sold out"] today
2 today
["Rynn sold out", "Illenium sold out", "Nelly sold out"] today
2-element Vector{Nothing}:
 nothing
 nothing

and

julia> f.(pairs(eachcol(df)))

ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted(::Function, ::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:1295
 [3] top-level scope
   @ REPL[309]:1

Is this because with enumerate and pairs it is no longer a vector being passed to the function? How does the function need to be modified for it to work? Sorry, I’m sure I’m missing something simple here. I thought maybe removing the row iteration in the function might work, but

function g(x)
   println("$x today")
end

yields

g.(enumerate(eachcol(df)))

(1, ["Marley sold out", "Kiiara sold out", "Sinead sold out"]) today
(2, ["Rynn sold out", "Illenium sold out", "Nelly sold out"]) today
2-element Vector{Nothing}:
 nothing
 nothing

and

g.(pairs(eachcol(df)))

ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted(::Function, ::Base.Pairs{Symbol, AbstractVector, Vector{Symbol}, DataFrames.DataFrameColumns{DataFrame}})
   @ Base.Broadcast ./broadcast.jl:1295
 [3] top-level scope
   @ REPL[314]:1

Other modifications to the function that I’ve tried also yield errors.

Let me give you a simpler example of the difference:

julia> df = DataFrame(a=1, b=2, c=3)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> collect(eachcol(df))
3-element Vector{AbstractVector}:
 [1]
 [2]
 [3]

julia> collect(enumerate(eachcol(df)))
3-element Vector{Tuple{Int64, AbstractVector}}:
 (1, [1])
 (2, [2])
 (3, [3])

julia> collect(pairs(eachcol(df)))
3-element Vector{Pair{Symbol, AbstractVector}}:
 :a => [1]
 :b => [2]
 :c => [3]

So, as you can see, the difference is just that you get different objects returned. In the enumerate case you get the column number as the first element. In the pairs case you get the column name as the first element.

As for broadcasting not working for pairs: I had forgotten that pairs returns an AbstractDict, and broadcasting over dictionaries is reserved, so in this case you need to use foreach instead.
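
A small sketch of the foreach version on the data frame from this example (the `seen` vector is only there to collect something to look at):

```julia
using DataFrames

df = DataFrame(a = 1, b = 2, c = 3)

seen = Pair{Symbol, Int}[]

# foreach just iterates, so it accepts the dictionary-like object that
# pairs returns; the do-block destructures each name => column pair.
foreach(pairs(eachcol(df))) do (name, col)
    push!(seen, name => sum(col))
end

seen   # [:a => 1, :b => 2, :c => 3]
```

The same do-block form works with enumerate(eachcol(df)) if a column number is wanted instead of a column name.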

great thanks for clearing that up!