I want to draw N independent (but not identically distributed) samples from a conditional Bernoulli distribution: y[i] ~ Ber(p[i]), where p[i] = logistic(sum(w .* x[i])). I am currently using the code below, but it seems overly complex. Is there a better way?

using Random, StatsFuns

Random.seed!(1)
D = 5
N = 10
X = randn(N, D)
w = randn(D)
logits = X * w
probs = [StatsFuns.logistic(l) for l in logits]
y = zeros(N)
ndx = findall(x -> x, rand(N) .< probs)  # y[i] = 1 with probability probs[i]
for i in ndx; y[i] = 1; end

Also, what’s the best way to generalize this to more than 2 classes? i.e. how best to sample from a conditional categorical distribution where p[i,:] = softmax(W * x[i])?

Thanks a lot. For future reference, this is what I came up with.

using Flux, Distributions

"""
    logreg_sample(W, X)

Sample from a logistic regression model with `C` classes and `D` features.

Inputs:
- `W`: C×D matrix of weights.
- `X`: N×D matrix of feature vectors.

Outputs:
- `y_cat`: N-vector of integer labels from {1,...,C}.
- `y_onehot`: N-vector of `Flux.OneHotVector`.
"""
function logreg_sample(W, X)
    C = size(W, 1)
    N = size(X, 1)
    @assert size(W, 2) == size(X, 2)
    y_cat = Array{Int}(undef, N)
    y_onehot = Array{Flux.OneHotVector}(undef, N)
    # Binary case:
    # y = rand(N) .< Flux.sigmoid.(X * w)
    for i = 1:N
        probs = Flux.softmax(W * X[i, :])
        dist = Distributions.Categorical(probs)
        y_cat[i] = rand(dist)
        y_onehot[i] = Flux.onehot(y_cat[i], 1:C)
    end
    return y_cat, y_onehot
end
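For comparison, here is a dependency-free sketch of the same idea using only Base Julia: softmax computed by hand and inverse-CDF sampling for the categorical draw. The helper names `softmax` and `sample_categorical` are illustrative, not from Flux or Distributions.

```julia
using Random

# Numerically stable softmax (subtracting the max avoids overflow).
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# Inverse-CDF sampling from a probability vector p: draw u ~ U(0,1)
# and return the first index whose cumulative probability reaches u.
function sample_categorical(p)
    u, c = rand(), 0.0
    for (k, pk) in enumerate(p)
        c += pk
        u <= c && return k
    end
    return length(p)   # guard against floating-point round-off
end

Random.seed!(1)
C, D, N = 3, 5, 10
W = randn(C, D)
X = randn(N, D)
y = [sample_categorical(softmax(W * X[i, :])) for i in 1:N]
```

This avoids constructing a `Distributions.Categorical` per observation, at the cost of reimplementing the sampler by hand.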

One annoying detail I came across: in Flux.jl, it seems that design matrices always store the examples along the columns, so they have size D x N instead of N x D. This required the following change to Tamas’s code:

function sample1(w, X)
    D, N = size(X)
    W = reshape(w, (1, D))
    return rand(N) .< vec(Flux.sigmoid.(W * X))
end

Personally I find it easier to use an array comprehension, as shown below. Is there any disadvantage to this?

function sample2(w, X)
    D, N = size(X)
    p = [Flux.sigmoid(sum(w .* X[:, i])) for i in 1:N]
    return rand(N) .< p
end
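A quick sanity check (a sketch, with a plain-Julia `sigmoid` standing in for `Flux.sigmoid`): resetting the RNG to the same seed before each draw, the matrix form and the comprehension form should produce identical samples, since they compute the same probabilities.

```julia
using Random

sigmoid(x) = 1 / (1 + exp(-x))

# Matrix form: one 1×N matrix-vector product, then compare to uniforms.
function s1(w, X)
    D, N = size(X)
    W = reshape(w, (1, D))
    Random.seed!(2)                 # fix the RNG so both draws match
    return rand(N) .< vec(sigmoid.(W * X))
end

# Comprehension form: one dot product per observation.
function s2(w, X)
    D, N = size(X)
    p = [sigmoid(sum(w .* X[:, i])) for i in 1:N]
    Random.seed!(2)
    return rand(N) .< p
end

Random.seed!(1)
w, X = randn(5), randn(5, 20)
# Equal draws (barring an astronomically unlikely floating-point tie,
# since BLAS and the comprehension may sum in different orders).
s1(w, X) == s2(w, X)
```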

If N is the number of observations, I think the D x N layout is quite natural in Julia, especially when you only need a sub-sample of observations to evaluate an objective (say, in stochastic gradient descent). That way you access data according to the underlying column-major memory layout when grabbing an observation. If the layout is Nobs x Nfeatures instead, you jump around in memory whenever you fetch the i’th observation as Data[i, :].
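The column-major point can be made concrete with `strides` (a minimal sketch; the 3×4 array is just for illustration):

```julia
using Random

Random.seed!(0)
X = randn(3, 4)       # 3 features × 4 observations (D × N layout)
col = @view X[:, 2]   # one observation: contiguous in memory
row = @view X[2, :]   # one feature across observations: strided

strides(X)    # (1, 3): elements within a column are adjacent
strides(col)  # (1,): stepping through an observation moves by 1 slot
strides(row)  # (3,): stepping across observations skips 3 slots
```

So with the D x N layout, grabbing observation `i` as `X[:, i]` reads one contiguous block, which is friendlier to the cache.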

Isn’t it just two different paradigms? Classic data analysis (e.g. with DataFrames) usually has observations as rows; machine learning has them as columns. (MultivariateStats uses a bit of both, I think determined by the wind direction at the time of coding.) There are good reasons for both, but it’s also something that can really trip you up.

Yes, I guessed that memory layout is the reason. It’s interesting that nearly all ML textbooks stick to the convention that the design matrix X is N x D. The only exception I know of is the Gaussian process book by Rasmussen and Williams (IIRC). It actually simplifies some of the equations (as well as the code) if X is D x N, but it will look unfamiliar (to some) to write the OLS estimator as (X X’)^{-1} X y instead of the more familiar (X’X)^{-1} X’ y. I am tempted to switch to the D x N convention for v2 of my own book (“Machine Learning: A Probabilistic Perspective”), as I make the switch from Matlab to Julia.
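For what it’s worth, the two conventions are easy to check against each other numerically; a small sketch (variable names illustrative):

```julia
using LinearAlgebra, Random

Random.seed!(1)
N, D = 100, 3
X  = randn(N, D)                   # N×D convention
w  = randn(D)
y  = X * w + 0.1 .* randn(N)       # noisy linear response
Xt = permutedims(X)                # the same data in the D×N convention

b1 = (X' * X) \ (X' * y)           # familiar form: (X'X)^{-1} X'y
b2 = (Xt * Xt') \ (Xt * y)         # D×N form: (X X')^{-1} X y
b1 ≈ b2                            # same estimator, different layout
```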

I’m not so sure it’s just a field-difference thing (and, as @murphyk pointed out, most textbooks in ML use n x p). There’s also something to be said for the fact that it’s usually desirable to have columns be of the same type for data storage (which is a reason why most tabular-data packages in Julia use n x p, as far as I understand). If you just have a big array of floating-point numbers this doesn’t matter, but in general that wouldn’t be the case.