# DataFrames: conditional probabilities

I have two DataFrames. One gives Prob(Z | Y), the other Prob(Y | X).
I would like to compute Prob(Z | X). How?

M(non)WE:

using DataFrames, Random

dfZY = DataFrame(
z = repeat([1,2]; outer = 3),
y = repeat([1,2,3]; inner = 2),
probZY = [0.2, 0.8,  0.3, 0.7,  0.6, 0.4]
)

dfYX = DataFrame(
y = repeat([1,2,3]; outer = 2),
x = repeat([1,2]; inner = 3),
probYX = [0.2, 0.3, 0.5,  0.5, 0.3, 0.2]
)

# dfZX should be Prob(Z|X) = sum_Y Prob(Z|Y) Prob(Y|X)

Thank you for suggestions.

1 Like

Hereβs a DataFramesMeta.jl solution which I think is what you want

julia> @chain dfZY begin
leftjoin(dfYX, on = :y)
@by [:x, :z] :probZX = sum(:probZY .* :probYX)
end
4Γ3 DataFrame
Row β x       z      probZX
β Int64?  Int64  Float64
ββββββΌββββββββββββββββββββββββ
1 β      1      1     0.43
2 β      1      2     0.57
3 β      2      1     0.31
4 β      2      2     0.69

Thank you, but the result should be β2 x 2β with entries such as Prob(z = 2 | x = 1) and Prob(z = 1 | x = 2)

2 Likes

This seems better:

julia> @chain dfYX begin
innerjoin(dfZY; on=:y)
groupby([:z,:x])
combine([:probYX, :probZY] => ((yx,zy)->sum(yx.*zy)) => :probZX)
end
4Γ3 DataFrame
Row β z      x      probZX
β Int64  Int64  Float64
ββββββΌβββββββββββββββββββββββ
1 β     1      1     0.43
2 β     1      2     0.31
3 β     2      1     0.57
4 β     2      2     0.69
2 Likes
julia> @chain dfYX begin
innerjoin(dfZY; on=:y)
groupby([:x, :z])
combine([:probYX, :probZY] => sumβ(.*) => :probZX)
end
4Γ3 DataFrame
Row β x      z      probZX
β Int64  Int64  Float64
ββββββΌβββββββββββββββββββββββ
1 β     1      1     0.43
2 β     1      2     0.57
3 β     2      1     0.31
4 β     2      2     0.69
1 Like

That looks right (though it will take me a bit of time to figure it out).
Thanks!

Yes, this seems correct.
Basically, you want to multiply the transition matrices p(Z|Y) and p(Y|X). Viewing the data frame as a matrix in coordinate format, one can quickly check:

julia> using SparseArrays

julia> pZX = sparse(dfZY.z, dfZY.y, dfZY.probZY) * sparse(dfYX.y, dfYX.x, dfYX.probYX)
2Γ2 SparseMatrixCSC{Float64, Int64} with 4 stored entries:
0.43  0.31
0.57  0.69

julia> DataFrame([:z, :x, :probZX] .=> findnz(pZX))
4Γ3 DataFrame
Row β z      x      probZX
β Int64  Int64  Float64
ββββββΌβββββββββββββββββββββββ
1 β     1      1     0.43
2 β     2      1     0.57
3 β     1      2     0.31
4 β     2      2     0.69
3 Likes

Yep, matrix multiplication is nice for this question. Initially I tried to get to the matrices without the sparse trick which depends a bit on the values of X,Y,Z. It went something like:

julia> using NamedArrays

julia> uYX = unstack(dfYX, :x, :y, :probYX);

julia> naYX = NamedArray(Matrix(uYX[:,2:end]),(uYX.x, names(uYX)[2:end]),(:X,:Y));

julia> uZY = unstack(dfZY, :y, :z, :probZY);

julia> naZY = NamedArray(Matrix(uZY[:,2:end]),(uZY.y, names(uZY)[2:end]),(:Y,:Z));

julia> naYX*naZY
2Γ2 Named Matrix{Union{Missing, Float64}}
X β² Z β    1     2
βββββββΌβββββββββββ
1     β 0.43  0.57
2     β 0.31  0.69

Itβs always useful to remember unstack and friends.

1 Like

Had seen this trick somewhere on the J page where it had been used to implement stack. Working with indices can be quite cool and there are some neat identities, e.g., the rank of each element in a vector can be obtained as sortperm β sortperm.
Ideally, I would like to have a data notation that is somewhat independent on how its stored, i.e., long or wide. E.g., in TensorCast the matrix multiplication would be expressed as

@reduce pZX[z,x] := sum(y) pZY[z,y] * pYX[y,x]

just imagine that something similar would work on data frames

@combine dfZX[:probZX | :z, :x] := sum(:y) dfZY[:probZY | :y, :z] * dfYX[:probYX | :x, :y]

and compile into join, groupby, combine etc.

2 Likes

Seems to me unstack is a bit stuck in the matrix/pivot-table world and hasnβt advanced to the tensor world. More precisely,
unstack takes one colkey variable which turns into an additional dimension, when it should be able to accept several colkeys and make an N+1 dimensional Array like type. Perhaps even directly into a NamedArray.

(This message can be moved to a different thread)

Sorry! Iβve edited my post with the correct outcome

you can use a function like this

gg(df,col)=groupindices(groupby(df,col))

to get the βindexesβ to use in the transition matrices construction

gg(df,col)=groupindices(groupby(df,col))

function trnsmatr(df)
df[:,1]=gg(df,1)
df[:,2]=gg(df,2)
sdf=sort(df,[2,1])
n=nrow(unique(df,1))
reshape(sdf[:,3],n,:)
end

trnsmatr(dfZY) * trnsmatr(dfYX)