DataFrames: conditional probabilities

hendri54 · April 10, 2024, 8:54pm

I have two DataFrames. One gives Prob(Z | Y), the other Prob(Y | X).
I would like to compute Prob(Z | X). How?

M(non)WE:

using DataFrames

dfZY = DataFrame(
    z = repeat([1,2]; outer = 3),
    y = repeat([1,2,3]; inner = 2),
    probZY = [0.2, 0.8,  0.3, 0.7,  0.6, 0.4]
    )

dfYX = DataFrame(
    y = repeat([1,2,3]; outer = 2),
    x = repeat([1,2]; inner = 3),
    probYX = [0.2, 0.3, 0.5,  0.5, 0.3, 0.2]
    )

# dfZX should be Prob(Z|X) = sum_Y Prob(Z|Y) Prob(Y|X)

Thank you for any suggestions.

pdeffebach · April 10, 2024, 9:16pm

Here’s a DataFramesMeta.jl solution, which I think is what you want:

julia> @chain dfZY begin
           leftjoin(dfYX, on = :y)
           @by [:x, :z] :probZX = sum(:probZY .* :probYX)
       end
4×3 DataFrame
 Row │ x       z      probZX  
     │ Int64?  Int64  Float64 
─────┼────────────────────────
   1 │      1      1     0.43
   2 │      1      2     0.57
   3 │      2      1     0.31
   4 │      2      2     0.69

hendri54 · April 10, 2024, 9:21pm

Thank you, but the result should be “2 x 2”, with entries such as Prob(z = 2 | x = 1) and Prob(z = 1 | x = 2).

Dan · April 10, 2024, 9:23pm

This seems better:

julia> @chain dfYX begin
           innerjoin(dfZY; on=:y)
           groupby([:z,:x])
           combine([:probYX, :probZY] => ((yx,zy)->sum(yx.*zy)) => :probZX)
       end
4×3 DataFrame
 Row │ z      x      probZX  
     │ Int64  Int64  Float64 
─────┼───────────────────────
   1 │     1      1     0.43
   2 │     1      2     0.31
   3 │     2      1     0.57
   4 │     2      2     0.69

jar1 · April 10, 2024, 9:25pm

julia> @chain dfYX begin
           innerjoin(dfZY; on=:y)
           groupby([:x, :z])
           combine([:probYX, :probZY] => sum∘(.*) => :probZX)
       end
4×3 DataFrame
 Row │ x      z      probZX  
     │ Int64  Int64  Float64 
─────┼───────────────────────
   1 │     1      1     0.43
   2 │     1      2     0.57
   3 │     2      1     0.31
   4 │     2      2     0.69
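
For anyone unpacking the pipeline above, here is a rough step-by-step sketch of what the join, groupby and combine are doing, using only DataFrames.jl. The intermediate names `joined` and `term` are illustrative and do not appear in the posts above.

```julia
using DataFrames

dfZY = DataFrame(
    z = repeat([1, 2]; outer = 3),
    y = repeat([1, 2, 3]; inner = 2),
    probZY = [0.2, 0.8, 0.3, 0.7, 0.6, 0.4],
)
dfYX = DataFrame(
    y = repeat([1, 2, 3]; outer = 2),
    x = repeat([1, 2]; inner = 3),
    probYX = [0.2, 0.3, 0.5, 0.5, 0.3, 0.2],
)

# Step 1: pair every (y, x) row with every (z, y) row that shares the same y.
# This gives one row per (x, y, z) combination: 2 * 3 * 2 = 12 rows.
joined = innerjoin(dfYX, dfZY; on = :y)

# Step 2: each row contributes one term Prob(z | y) * Prob(y | x) of the sum.
joined.term = joined.probZY .* joined.probYX

# Step 3: sum the terms over y within each (x, z) cell.
dfZX = combine(groupby(joined, [:x, :z]), :term => sum => :probZX)
sort!(dfZX, [:x, :z])
```

The one-liners above fold steps 2 and 3 into a single `combine` call, so the intermediate `term` column is never materialised.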

hendri54 · April 10, 2024, 9:28pm

That looks right (though it will take me a bit of time to figure it out).
Thanks!

bertschi · April 10, 2024, 9:34pm

Yes, this seems correct.
Basically, you want to multiply the transition matrices p(Z|Y) and p(Y|X). Viewing the data frame as a matrix in coordinate format, one can quickly check:

julia> using SparseArrays

julia> pZX = sparse(dfZY.z, dfZY.y, dfZY.probZY) * sparse(dfYX.y, dfYX.x, dfYX.probYX)
2×2 SparseMatrixCSC{Float64, Int64} with 4 stored entries:
 0.43  0.31
 0.57  0.69

julia> DataFrame([:z, :x, :probZX] .=> findnz(pZX))
4×3 DataFrame
 Row │ z      x      probZX  
     │ Int64  Int64  Float64 
─────┼───────────────────────
   1 │     1      1     0.43
   2 │     2      1     0.57
   3 │     1      2     0.31
   4 │     2      2     0.69
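
Written out for the example data, the product looks as follows. This is only a worked check of the numbers; like the formula in the first post, it assumes that Z depends on X only through Y, with rows of the first factor indexed by z and its columns by y.

```latex
P_{Z \mid X} = P_{Z \mid Y} \, P_{Y \mid X}
= \begin{pmatrix} 0.2 & 0.3 & 0.6 \\ 0.8 & 0.7 & 0.4 \end{pmatrix}
  \begin{pmatrix} 0.2 & 0.5 \\ 0.3 & 0.3 \\ 0.5 & 0.2 \end{pmatrix}
= \begin{pmatrix} 0.43 & 0.31 \\ 0.57 & 0.69 \end{pmatrix},
\qquad
\text{e.g. } 0.2 \cdot 0.2 + 0.3 \cdot 0.3 + 0.6 \cdot 0.5 = 0.43 .
```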

Dan · April 10, 2024, 9:39pm

Yep, matrix multiplication is nice for this question. Initially I tried to get to the matrices without the sparse trick, which depends a bit on the values of X, Y, Z (they are used directly as row and column indices). It went something like:

julia> using NamedArrays

julia> uYX = unstack(dfYX, :x, :y, :probYX);

julia> naYX = NamedArray(Matrix(uYX[:,2:end]),(uYX.x, names(uYX)[2:end]),(:X,:Y));

julia> uZY = unstack(dfZY, :y, :z, :probZY);

julia> naZY = NamedArray(Matrix(uZY[:,2:end]),(uZY.y, names(uZY)[2:end]),(:Y,:Z));

julia> naYX*naZY
2×2 Named Matrix{Union{Missing, Float64}}
X ╲ Z │    1     2
──────┼───────────
1     │ 0.43  0.57
2     │ 0.31  0.69

It’s always useful to remember unstack and friends.
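
The `Union{Missing, Float64}` element type in the result above comes from `unstack`, which fills absent combinations with `missing`. A minimal sketch of the same route with plain matrices follows. It assumes that every (x, y) and (y, z) combination is present and that both data frames list the y values in the same order; `mYX`, `mZY` and `pXZ` are illustrative names.

```julia
using DataFrames

dfZY = DataFrame(
    z = repeat([1, 2]; outer = 3),
    y = repeat([1, 2, 3]; inner = 2),
    probZY = [0.2, 0.8, 0.3, 0.7, 0.6, 0.4],
)
dfYX = DataFrame(
    y = repeat([1, 2, 3]; outer = 2),
    x = repeat([1, 2]; inner = 3),
    probYX = [0.2, 0.3, 0.5, 0.5, 0.3, 0.2],
)

uYX = unstack(dfYX, :x, :y, :probYX)   # one row per x, one column per y
uZY = unstack(dfZY, :y, :z, :probZY)   # one row per y, one column per z

# Drop the key column and replace any `missing` by 0.0 to get a Matrix{Float64}.
mYX = coalesce.(Matrix(uYX[:, Not(:x)]), 0.0)   # X × Y
mZY = coalesce.(Matrix(uZY[:, Not(:y)]), 0.0)   # Y × Z

# Rows are x, columns are z: the transpose of p(Z|X) stored as a Z × X matrix.
pXZ = mYX * mZY
```

This drops the row and column labels that NamedArrays keeps, so it is mainly useful when only the numbers are needed.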

bertschi · April 10, 2024, 9:53pm

I had seen this trick somewhere on the J page, where it had been used to implement stack. Working with indices can be quite cool, and there are some neat identities, e.g., the rank of each element in a vector can be obtained as sortperm ∘ sortperm.
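
A small illustration of that identity, as a sketch that assumes the values are distinct (`ranks` and `v` are just illustrative names):

```julia
# Applying sortperm twice turns a vector into the rank of each element.
ranks = sortperm ∘ sortperm

v = [30, 10, 20]
r = ranks(v)   # sortperm(v) == [2, 3, 1], and sortperm of that is [3, 1, 2]
```

Here 30 is the largest value and gets rank 3, 10 is the smallest and gets rank 1.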
Ideally, I would like to have a data notation that is somewhat independent of how the data is stored, i.e., long or wide. E.g., in TensorCast the matrix multiplication would be expressed as

@reduce pZX[z,x] := sum(y) pZY[z,y] * pYX[y,x]

Just imagine that something similar worked on data frames

@combine dfZX[:probZX | :z, :x] := sum(:y) dfZY[:probZY | :y, :z] * dfYX[:probYX | :x, :y]

and compiled into join, groupby, combine, etc.
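
For comparison, the contraction that the `@reduce` line expresses can be spelled out in plain Julia. This is only a sketch on dense matrices typed in by hand from the example data, with `contract` as an illustrative helper name; it does not use TensorCast itself.

```julia
# pZY[z, y] = Prob(z | y), pYX[y, x] = Prob(y | x), entered from the example data.
pZY = [0.2 0.3 0.6;
       0.8 0.7 0.4]
pYX = [0.2 0.5;
       0.3 0.3;
       0.5 0.2]

# Sum over the shared index y for one (z, x) pair.
contract(z, x) = sum(pZY[z, y] * pYX[y, x] for y in axes(pZY, 2))

# Build the Z × X result one entry at a time.
pZX = [contract(z, x) for z in axes(pZY, 1), x in axes(pYX, 2)]
```

The result agrees with the ordinary matrix product `pZY * pYX`; the index notation just makes explicit which index is summed away.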

Dan · April 10, 2024, 10:30pm

Seems to me unstack is a bit stuck in the matrix/pivot-table world and hasn’t advanced to the tensor world. More precisely, unstack takes one colkey variable, which turns into an additional dimension, when it should be able to accept several colkeys and make an (N+1)-dimensional Array-like type. Perhaps even directly into a NamedArray.
Perhaps the syntax should follow read, which takes a “sink” type.

(This message can be moved to a different thread)

pdeffebach · April 10, 2024, 10:47pm

Sorry! I’ve edited my post with the correct outcome.

rocco_sprmnt21 · April 11, 2024, 7:19pm

You can use a function like this

gg(df,col)=groupindices(groupby(df,col))

to get the “indexes” to use when constructing the transition matrices.

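function trnsmatr(df)
    df=copy(df)                 # work on a copy so the caller's data frame is not modified
    df[:,1]=gg(df,1)            # recode column 1 (row variable) as group indices
    df[:,2]=gg(df,2)            # recode column 2 (column variable) as group indices
    sdf=sort(df,[2,1])          # column-major order: column 2 varies slowest
    n=nrow(unique(df,1))        # number of distinct values in column 1
    reshape(sdf[:,3],n,:)       # assumes each (row, column) pair occurs exactly once
end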



trnsmatr(dfZY) * trnsmatr(dfYX)
