What is the difference between rand(array, n) and sample(array, n)?
sample is from where?
StatsBase.sample(array, n) and StatsBase.rand(array, n)
rand is from Base.
I don’t know how consistent people are about this, but as I think of it, rand is usually lower-level, while sample is often built in terms of (potentially many) calls to rand. sample is also often used for MCMC.
@Tamas_Papp I’ve heard you make similar arguments, anything to add?
If you use @edit to inspect the code behind calls to both functions, you will probably reach the same understanding as @cscherrer. It seems that sample calls a sample! method that dispatches to one of several routines, ranging from direct_sample! (which in turn calls rand) to things like samplepair, fisher_yates_sample!, self_avoid_sample!, and seqsample_c! (more complex ways of sampling, all probably built on rand).
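To make the "built on rand" point concrete, here is a minimal sketch of drawing without replacement using a partial Fisher-Yates shuffle. This is not StatsBase's actual code; the name partial_shuffle_sample is hypothetical and only the Random standard library is assumed.

```julia
using Random

# Hypothetical helper (not the StatsBase implementation): pick k distinct
# items from `a` with a partial Fisher-Yates shuffle built on rand.
function partial_shuffle_sample(rng::AbstractRNG, a::AbstractVector, k::Integer)
    inds = collect(eachindex(a))      # working copy of the indices of `a`
    n = length(inds)
    0 <= k <= n || throw(ArgumentError("k must be between 0 and length(a)"))
    for i in 1:k
        j = rand(rng, i:n)            # one low-level uniform draw per picked item
        inds[i], inds[j] = inds[j], inds[i]
    end
    return a[inds[1:k]]
end

picked = partial_shuffle_sample(MersenneTwister(1), collect(10:10:100), 4)
```

Each picked item costs exactly one call to rand, which is roughly the layering described above: rand supplies the randomness, and the sampling routine adds the without-replacement logic on top.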
StatsBase.sample is for picking random items from a collection, optionally with weights, and with or without replacement. I don’t think it was intended to be a generic function; it is a utility for a special case.
Random.rand is a generic IID sampler.
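A short sketch of that distinction, assuming the StatsBase package is installed (the variable names are only for illustration):

```julia
using Random
using StatsBase  # third-party package, assumed to be installed

rng = MersenneTwister(42)
items = ["a", "b", "c", "d"]

# rand: three IID uniform draws from the collection, so repeats are possible
r = rand(rng, items, 3)

# sample without replacement: the three picked items are distinct
s = sample(rng, items, 3; replace=false)

# sample with weights: "a" is favoured; drawn with replacement by default
w = sample(rng, items, Weights([0.7, 0.1, 0.1, 0.1]), 10)
```

The weights and the replace keyword are the options that rand on a plain collection does not offer.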
@jzr, all of this is documented, which part of the documentation did you find unclear?
I don’t think it was intended to be a generic function, it is a utility for a special case.
My confusion arose because sample is indeed used as a generic function by Turing.jl and other packages. If I understand correctly, there are multiple interpretations of the distinction between these functions. (Namely, “rand is generic iid; sample isn’t generic”, and “rand is simple low-level drawing; sample is higher-level drawing that uses rand”.)
Would it have been appropriate usage for the MCMC packages to use rand for their user-facing sampling interface (instead of sample)?
Personally I think that using StatsBase.sample for MCMC is bad style (“punning”).
I fully agree with this view. The function sample refers to an operation performed on a dataset (or population) and can take weights and replacement options. The function rand is for random variables with a given distribution (e.g. Distributions.jl).
@Tamas_Papp @juliohm what name would you use for a function that samples from a posterior distribution, given a model, observed data, and sampling algorithm?
Soss currently uses a different function name for each algorithm, but I like the idea of having that abstracted away a bit, so making the algorithm an argument is appealing to me.
I think that Turing.sample is OK. It just should not coincide with StatsBase.sample, which is a rather pointless pun.
@cscherrer I think I would end up choosing rand for this posterior distribution, so that it is clear that it is not a sample from a population à la StatsBase.sample. I am trying to follow this pattern in my packages whenever I can, so that users know, for example, when they can pass weights.
And I would maybe describe this functionality as:
A function that draws from a posterior distribution given a model, observed data and algorithm…
But that is all a matter of taste I guess as we all understand the meaning of these concepts. It would be nice to have a consistent terminology across multiple ecosystems though.
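For what it's worth, here is a rough sketch of what "algorithm as an argument" could look like under a package-owned name. Everything here is hypothetical: draw_posterior, AbstractSampler, and RandomWalkMetropolis are made-up names, not the API of Turing, Soss, or StatsBase, and the log-density closure stands in for "model plus observed data".

```julia
using Random

# Hypothetical algorithm types; a real package would define its own.
abstract type AbstractSampler end

struct RandomWalkMetropolis <: AbstractSampler
    stepsize::Float64
end

# Hypothetical entry point: the algorithm is an argument, so adding a new
# algorithm means adding a method, not a new function name.
function draw_posterior(rng::AbstractRNG, logdensity, alg::RandomWalkMetropolis,
                        n::Integer; init::Float64=0.0)
    x = init
    lp = logdensity(x)
    draws = Vector{Float64}(undef, n)
    for i in 1:n
        proposal = x + alg.stepsize * randn(rng)
        lp_new = logdensity(proposal)
        # Metropolis accept/reject step for a symmetric proposal
        if log(rand(rng)) < lp_new - lp
            x, lp = proposal, lp_new
        end
        draws[i] = x   # a rejected proposal repeats the previous value
    end
    return draws
end

# Unnormalized standard normal log-density as a stand-in posterior
chain = draw_posterior(MersenneTwister(0), x -> -0.5 * x^2,
                       RandomWalkMetropolis(1.0), 1_000)
```

Because each draw starts from the previous one, the values in chain are correlated rather than independent, whatever the function ends up being called.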
I’m pretty sure Turing overrides and re-exports StatsBase.sample.
But these samples are not usually iid, which breaks a common assumption about rand. To me this is a more critical distinction than requiring that sample only be used for drawing a sample from a population, a requirement that isn’t at all clear to me. Are you requiring here that a “population” be finite and unweighted?
If someone uses rand thinking the results will be iid, violating that assumption could cause big problems. As for sample, I really don’t see such a risk in extending it.
Oh I had IID samples in mind when you wrote:
I am assuming they are not IID because of an algorithmic (MCMC) detail?
Yes, and I think that is what most people have in mind when they hear the term population in statistics, but I may be wrong.
I think the IID property is relevant, but it isn’t the critical aspect that differentiates the semantics of rand and sample.
Yes, that’s right. Sometimes we can get IID samples, but that’s not usually the case.
Maybe it depends on which branch of statistics. The expression “sample from the posterior” comes up a lot in Bayesian stats, and we usually don’t assume the result will be IID.
That said, StatsBase.sample is a utility function for a very specific stats/probability exercise, i.e. pulling objects from a bag/urn with or without replacement. This meaning predates the Bayesian/MCMC usage by centuries, and is really not the same thing.