I’m new to Flux.jl, and I’m kind of confused by the fact that there is no stochastic gradient descent algorithm. I checked the source code, and it seemed to me that Descent is just GD instead of SGD.

Of course we can just introduce some random factors into the training data. Anyway, what is the best practice to perform SGD in Flux.jl?

The S in SGD doesn’t come from choosing a random direction. Each step still moves along the negative gradient of the loss. The stochastic part comes from computing that gradient on a small batch of the data instead of the whole dataset, e.g. if your data is 1_000_000 records you only pass it 32 records at a time. Which records end up in each batch is reshuffled every epoch, so every step uses a noisy estimate of the full-data gradient. In other words, Descent behaves as SGD as soon as you feed it mini-batches rather than the full dataset.
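To make that concrete, here is a minimal sketch of mini-batch training with `Descent`, assuming a recent Flux release (0.14 or later) that provides `Flux.setup` and `Flux.DataLoader`. The toy data, the model size, the learning rate, and the helper name `loss` are all made up for illustration, so treat it as one possible way to do it rather than the canonical recipe.

```julia
using Flux, Random

Random.seed!(0)

# Toy regression data (illustrative only): 1_000 samples, 2 features each.
X = randn(Float32, 2, 1_000)
Y = Float32[3 -2] * X .+ 1f0

model = Dense(2 => 1)
opt_state = Flux.setup(Descent(0.05), model)

# shuffle=true reshuffles the samples every time the loader is iterated,
# i.e. once per epoch; batchsize=32 yields mini-batches of 32 columns.
loader = Flux.DataLoader((X, Y); batchsize=32, shuffle=true)

# Hypothetical helper: mean squared error of model m on a batch.
loss(m, x, y) = Flux.mse(m(x), y)

initial_loss = loss(model, X, Y)

for epoch in 1:20
    for (x, y) in loader
        # Gradient of the loss on this mini-batch only.
        grads = Flux.gradient(m -> loss(m, x, y), model)
        # Plain gradient descent step using the mini-batch gradient.
        Flux.update!(opt_state, model, grads[1])
    end
end

final_loss = loss(model, X, Y)
println("loss before: ", initial_loss, "  after: ", final_loss)
```

The update rule itself is ordinary gradient descent. The only source of randomness is the shuffled mini-batches coming out of the loader, so setting `batchsize` to the full dataset size would turn this back into full-batch GD.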