DataFrames: conditional probabilities

I have two DataFrames. One gives Prob(Z | Y), the other Prob(Y | X).
I would like to compute Prob(Z | X). How?

M(non)WE:

using DataFrames, Random

dfZY = DataFrame(
    z = repeat([1,2]; outer = 3),
    y = repeat([1,2,3]; inner = 2),
    probZY = [0.2, 0.8,  0.3, 0.7,  0.6, 0.4]
    )

dfYX = DataFrame(
    y = repeat([1,2,3]; outer = 2),
    x = repeat([1,2]; inner = 3),
    probYX = [0.2, 0.3, 0.5,  0.5, 0.3, 0.2]
    )

# dfZX should be Prob(Z|X) = sum_Y Prob(Z|Y) Prob(Y|X)

Thank you for suggestions.

1 Like

Here’s a DataFramesMeta.jl solution which I think is what you want

julia> @chain dfZY begin
           leftjoin(dfYX, on = :y)
           @by [:x, :z] :probZX = sum(:probZY .* :probYX)
       end
4Γ—3 DataFrame
 Row β”‚ x       z      probZX  
     β”‚ Int64?  Int64  Float64 
─────┼────────────────────────
   1 β”‚      1      1     0.43
   2 β”‚      1      2     0.57
   3 β”‚      2      1     0.31
   4 β”‚      2      2     0.69

Thank you, but the result should be β€œ2 x 2” with entries such as Prob(z = 2 | x = 1) and Prob(z = 1 | x = 2)

2 Likes

This seems better:

julia> @chain dfYX begin
           innerjoin(dfZY; on=:y)
           groupby([:z,:x])
           combine([:probYX, :probZY] => ((yx,zy)->sum(yx.*zy)) => :probZX)
       end
4Γ—3 DataFrame
 Row β”‚ z      x      probZX  
     β”‚ Int64  Int64  Float64 
─────┼───────────────────────
   1 β”‚     1      1     0.43
   2 β”‚     1      2     0.31
   3 β”‚     2      1     0.57
   4 β”‚     2      2     0.69
2 Likes
julia> @chain dfYX begin
           innerjoin(dfZY; on=:y)
           groupby([:x, :z])
           combine([:probYX, :probZY] => sum∘(.*) => :probZX)
       end
4Γ—3 DataFrame
 Row β”‚ x      z      probZX  
     β”‚ Int64  Int64  Float64 
─────┼───────────────────────
   1 β”‚     1      1     0.43
   2 β”‚     1      2     0.57
   3 β”‚     2      1     0.31
   4 β”‚     2      2     0.69
1 Like

That looks right (though it will take me a bit of time to figure it out).
Thanks!

Yes, this seems correct.
Basically, you want to multiply the transition matrices p(Z|Y) and p(Y|X). Viewing the data frame as a matrix in coordinate format, one can quickly check:

julia> using SparseArrays

julia> pZX = sparse(dfZY.z, dfZY.y, dfZY.probZY) * sparse(dfYX.y, dfYX.x, dfYX.probYX)
2Γ—2 SparseMatrixCSC{Float64, Int64} with 4 stored entries:
 0.43  0.31
 0.57  0.69

julia> DataFrame([:z, :x, :probZX] .=> findnz(pZX))
4Γ—3 DataFrame
 Row β”‚ z      x      probZX  
     β”‚ Int64  Int64  Float64 
─────┼───────────────────────
   1 β”‚     1      1     0.43
   2 β”‚     2      1     0.57
   3 β”‚     1      2     0.31
   4 β”‚     2      2     0.69
3 Likes

Yep, matrix multiplication is nice for this question. Initially I tried to get to the matrices without the sparse trick which depends a bit on the values of X,Y,Z. It went something like:

julia> using NamedArrays

julia> uYX = unstack(dfYX, :x, :y, :probYX);

julia> naYX = NamedArray(Matrix(uYX[:,2:end]),(uYX.x, names(uYX)[2:end]),(:X,:Y));

julia> uZY = unstack(dfZY, :y, :z, :probZY);

julia> naZY = NamedArray(Matrix(uZY[:,2:end]),(uZY.y, names(uZY)[2:end]),(:Y,:Z));

julia> naYX*naZY
2Γ—2 Named Matrix{Union{Missing, Float64}}
X β•² Z β”‚    1     2
──────┼───────────
1     β”‚ 0.43  0.57
2     β”‚ 0.31  0.69

It’s always useful to remember unstack and friends.

1 Like

Had seen this trick somewhere on the J page where it had been used to implement stack. Working with indices can be quite cool and there are some neat identities, e.g., the rank of each element in a vector can be obtained as sortperm ∘ sortperm.
Ideally, I would like to have a data notation that is somewhat independent on how its stored, i.e., long or wide. E.g., in TensorCast the matrix multiplication would be expressed as

@reduce pZX[z,x] := sum(y) pZY[z,y] * pYX[y,x]

just imagine that something similar would work on data frames

@combine dfZX[:probZX | :z, :x] := sum(:y) dfZY[:probZY | :y, :z] * dfYX[:probYX | :x, :y]

and compile into join, groupby, combine etc.

2 Likes

Seems to me unstack is a bit stuck in the matrix/pivot-table world and hasn’t advanced to the tensor world. More precisely,
unstack takes one colkey variable which turns into an additional dimension, when it should be able to accept several colkeys and make an N+1 dimensional Array like type. Perhaps even directly into a NamedArray.
Perhaps the syntax should follow read which takes a β€œsink” type.

(This message can be moved to a different thread)

Sorry! I’ve edited my post with the correct outcome

you can use a function like this

gg(df,col)=groupindices(groupby(df,col))

to get the β€œindexes” to use in the transition matrices construction


gg(df,col)=groupindices(groupby(df,col))

function trnsmatr(df)
    df[:,1]=gg(df,1)
    df[:,2]=gg(df,2)
    sdf=sort(df,[2,1])
    n=nrow(unique(df,1))
    reshape(sdf[:,3],n,:)
end



trnsmatr(dfZY) * trnsmatr(dfYX)