Summing and counting values in a df by row

I have a dataframe of the form:

4×4 DataFrame
 Row │ ID     Q1     Q2       Q3
     │ Int64  Int64  Int64?   Int64?
─────┼────────────────────────────────
   1 │     1      1        0        1
   2 │     2      1        1        0
   3 │     3      0        0  missing
   4 │     4      0  missing        1

The data are quiz answers, and missing indicates that the person did not attempt the question.

I want to add two columns:

  • :score for the number of correct responses; a* nd,
  • :attempt for the number of questions attempted.

Expected result

4×6 DataFrame
 Row │ ID     Q1     Q2       Q3       score  attempts
     │ Int64  Int64  Int64?   Int64?   Int64  Int64
─────┼────────────────────────────────────────────────
   1 │     1      1        0        1      2        3
   2 │     2      1        1        0      2        3
   3 │     3      0        0  missing      0        2
   4 │     4      0  missing        1      1        2

My best effort so far is:

julia> df1 = transform(df, r"Q" => (+) => :score)

4×5 DataFrame
 Row │ ID     Q1     Q2       Q3       score
     │ Int64  Int64  Int64?   Int64?   Int64?
─────┼─────────────────────────────────────────
   1 │     1      1        0        1        2
   2 │     2      1        1        0        2
   3 │     3      0        0  missing  missing
   4 │     4      0  missing        1  missing

which doesn’t give me totals for rows with missing values. I have tried various permutations of skipmissing without success. Also I can’t quite work out an expression to count the non-missing values in each row.

Thanks in advance.

EDIT:
@pdeffebach provides an efficient solution to the :score problem. I also found a possibly less efficient solution which I can also apply to the :attempts problem.

transform(df, r"Q" => ByRow((x...) -> sum(skipmissing(x))) => :sum)
transform(df1, r"Q" => ByRow((x...) -> count(>(-1),skipmissing(x))) => :attempts)
1 Like

This is a tricky one, and is only really answered by an in depth understanding of the src => fun => dest syntax.

you want

julia> transform(df, AsTable([:a, :b]) => ByRow(sum ∘ skipmissing) => :score)
2×3 DataFrame
 Row │ a      b        score
     │ Int64  Int64?   Int64
─────┼───────────────────────
   1 │     1        3      4
   2 │     2  missing      2

the AsTable combined with ByRow means the function takes a named tuple. skipmissing works just fine on a named tuple.

4 Likes

Thanks again @pdeffebach
It’s clear I have to get my head around the src => fun => dest syntax.

Your solution worked, though I did just find an alternative solution which also worked for me. Not sure if it is as efficient but for me it is more comprehensible.

transform(df, r"Q" => ByRow((x...) -> sum(skipmissing(x))) => :sum)

Another way:

using DataFrames

ID = 1:4
Q1 = [1, 1, 0, 0]
Q2 = [0, 1, 0, missing]
Q3 = [1, 0, missing, 1]
df = DataFrame(ID=ID, Q1=Q1, Q2=Q2, Q3=Q3)

df[:score] = sum(eachcol(coalesce.(df[Not(:ID)],0)))
df[:attempts] = sum((eachcol(.~ismissing.(df[[:Q1,:Q2,:Q3]]))))

julia> df
4×6 DataFrame
│ Row │ ID    │ Q1    │ Q2      │ Q3      │ score │ attempts │
│     │ Int64 │ Int64 │ Int64?  │ Int64?  │ Int64 │ Int64    │
├─────┼───────┼───────┼─────────┼─────────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0       │ 1       │ 2     │ 3        │
│ 2   │ 2     │ 1     │ 1       │ 0       │ 2     │ 3        │
│ 3   │ 3     │ 0     │ 0       │ missing │ 0     │ 2        │
│ 4   │ 4     │ 0     │ missing │ 1       │ 1     │ 2        │
3 Likes

I am trying these functions out. When all the values of a row are missing, pdeffebach’s solution works as shown:

id = 1:5
q1 = [1,1,0,0,missing]
q2 = [0,1,0,missing,missing]
q3 = [missing, missing, missing,missing, missing]
df = DataFrame(ID=id, q1=q1, q2=q2, q3=q3)
transform(df, AsTable(r"q") => ByRow(sum ∘ skipmissing) => :score)
5×5 DataFrame
 Row │ ID     q1       q2       q3       score 
     │ Int64  Int64?   Int64?   Missing  Int64
─────┼─────────────────────────────────────────
   1 │     1        1        0  missing      1
   2 │     2        1        1  missing      2
   3 │     3        0        0  missing      0
   4 │     4        0  missing  missing      0
   5 │     5  missing  missing  missing      0

A nice to have would be to have the sum of all missings also be a missing rather than a zero. This would only affect row 5 in this example. Is this possible?

This method seems to work:

julia> dorow(v) = foldl((r,t)->coalesce(r+t,t,r),v; init=missing);

julia> transform(df, AsTable(r"q") => ByRow(dorow) => :score)
5×5 DataFrame
 Row │ ID     q1       q2       q3       score   
     │ Int64  Int64?   Int64?   Missing  Int64?  
─────┼───────────────────────────────────────────
   1 │     1        1        0  missing        1
   2 │     2        1        1  missing        2
   3 │     3        0        0  missing        0
   4 │     4        0  missing  missing        0
   5 │     5  missing  missing  missing  missing 

1 Like
julia> df = DataFrame(ID=ID, Q1=Q1, Q2=Q2, Q3=Q3)
5×4 DataFrame
 Row │ ID     Q1       Q2       Q3      
     │ Int64  Int64?   Int64?   Int64?
─────┼──────────────────────────────────
   1 │     1        1        0        1
   3 │     3        0        0  missing
   4 │     4        0  missing        1
   5 │     5  missing  missing  missing

julia> df.score=[sum(filter(!ismissing, values(r))) for r in collect.(eachrow(df[:,Not(:ID)]))]5-element Vector{Union{Missing, Int64}}:
 2
 2
 0
 1
  missing

julia> df
5×5 DataFrame
 Row │ ID     Q1       Q2       Q3       score   
     │ Int64  Int64?   Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1        1        0        1        2
   2 │     2        1        1        0        2
   3 │     3        0        0  missing        0
   4 │     4        0  missing        1        1
   5 │     5  missing  missing  missing  missing

I propose a different solution. However, I would like to understand why the use of collect is necessary.

I note that a filtered missing vector produces an empty vector of type Missing.
Why doesn’t the same thing happen for the missing tuple?


julia> v=[missing,missing,missing]
3-element Vector{Missing}:
 missing
 missing
 missing

julia> typeof(filter(!ismissing,v))
Vector{Missing} (alias for Array{Missing, 1})

julia> t=(missing,missing,missing)
(missing, missing, missing)

julia> typeof(filter(!ismissing,t))
Tuple{}

julia> sum(filter(!ismissing,v))
missing

julia> sum(filter(!ismissing,t))
ERROR: ArgumentError: reducing with add_sum over an empty collection of element type Union{} is not allowed.

Missings.jl has a new function called emptymissing which can help in this case. But it was only just merged and a new version hasn’t been released yet.

As you say, with collect you make a vector such as [missing, missing] of type Vector{Missing}. After filtering this becomes an empty vector of type Vector{Missing}. The sum of an empty vector is simply the “zero value” for the element type of the vector. In this case the zero value is defined as missing:

julia> zero(Missing)  # Zero value for type Missing
missing

julia> sum(Missing[])  # Sum of empty vector of type Missing
missing

Without collect, what you have is a tuple such as (missing, missing). This has type Tuple{Missing, Missing}. When you filter-out one value you get a tuple (missing,) of type Tuple{Missing}. When you filter-out the last value you get a tuple () of type Tuple{}. The sum function should return the zero value for the element type, but there is no element type left, so sum doesn’t work:

julia> sum(())
ERROR: ArgumentError: reducing with add_sum over an empty collection of element type Union{} is not allowed.
1 Like

yes.
But why?
I guess there are some very good reasons why this is the case, but I would like to know why the empty tuple can’t have a specific type.

A tuple is an immutable list of values, where each value can have any type. (1, 2.0, "three") has type Tuple{Int64, Float64, String}. If there is no value, there is no value type!

By the way you also cannot compute the sum of an empty vector of “unspecified” type Any because there is no “zero value” defined for Any:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})
2 Likes

Some additional considerations that may clarify a few more details on the particular handling of tuple types.

julia> vi=[missing,missing]
2-element Vector{Missing}:
 missing
 missing

julia> typeof(vi)
Vector{Missing} (alias for Array{Missing, 1})

julia> vi2=[missing,missing,missing]
3-element Vector{Missing}:
 missing
 missing
 missing

julia> typeof(vi)==typeof(vi2)
true

julia> tm=(missing,missing)
(missing, missing)

julia> typeof(tm)
Tuple{Missing, Missing}

julia> tm2=(missing,missing,missing)
(missing, missing, missing)

julia>             typeof(tm2)
Tuple{Missing, Missing, Missing}

julia> typeof(tm)==typeof(tm2)
false

Thanks all for your help. Julia is indeed a rich language, especially when coupled with experts that frequent discourse!

Here is what I ended up using, where the column to sum starts with the letter “q”:

# sum, count & rank the quizzers and write the results
dfquizzersum.Total=[sum(filter(!ismissing, values(r))) for r in collect.(eachrow(dfquizzersum[:,r"q"]))]
dfquizzersum.Count=[length(filter(!ismissing, values(r))) for r in collect.(eachrow(dfquizzersum[:,r"q"]))]
dfquizzersum.Rank = competerank(-1 .* dfquizzersum.Total)   # use tiedrank for "1 2.5 2.5 4" ranking, competerank for "1 2 2 4"

and competerank is found in StatsBase.

Staying within the mini-language (and complying with the title request) you could get the same result like this (I don’t know if it’s more efficient).

passmiss(r)=filter(!ismissing, collect(r))
transform(df, Cols(r"Q")=>ByRow((r...)->[sum(passmiss(r)), length(passmiss(r))] )=>[:total, :count])

PS
But if you have the two columns of the total and the count, why do you complicate your life by also using the missing (instead of the 0) in the total column?
What additional information does the use of missing give you?