What is the difference between rand and sample?

jzr · December 23, 2020, 9:53pm

What is the difference between rand(array, n) and sample(array, n)?

Henrique_Becker · December 23, 2020, 10:48pm

sample is from where?

jzr · December 23, 2020, 10:52pm

sample is from where?

StatsBase.sample(array, n) and StatsBase.rand(array, n)

cscherrer · December 23, 2020, 11:01pm

rand is from Base.

I don’t know how consistent people are about this, but as I think of it, rand is usually lower-level, while sample is often built in terms of (potentially many) calls to rand. sample is also often used for MCMC.

@Tamas_Papp I’ve heard you make similar arguments, anything to add?

Henrique_Becker · December 23, 2020, 11:18pm

If you use @edit to inspect the code behind calls to both functions, you will probably reach the same understanding given by @cscherrer. It seems that sample calls a sample! that may dispatch to many different options, ranging from direct_sample! (which then calls rand) to things like samplepair, fisher_yates_sample!, self_avoid_sample!, and seqsample_c! (more complex ways of sampling, all probably built on top of rand).
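For anyone who wants to check this without opening an editor, here is a minimal sketch using @which, the non-interactive sibling of @edit. It assumes StatsBase is installed; the variable names are only for illustration.

```julia
using InteractiveUtils  # loaded automatically in the REPL; needed in scripts
using StatsBase         # assumed to be installed

# @which returns the Method that a call would dispatch to
m_rand = @which rand([1, 2, 3], 2)
m_sample = @which sample([1, 2, 3], 2)

# parentmodule tells you which module defines each method
println(parentmodule(m_rand))
println(parentmodule(m_sample))
```

Replacing @which with @edit in the REPL opens the corresponding source file, which is where the helper functions listed above show up.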

Tamas_Papp · December 24, 2020, 7:12am

StatsBase.sample is for picking random items from a collection, optionally with weights, and with or without replacement. I don’t think it was intended to be a generic function, it is a utility for a special case.

Random.rand is a generic IID sampler.
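A minimal sketch of the practical difference for the rand(array, n) versus sample(array, n) case from the original question. It assumes StatsBase is installed; the item list and weights are made up for illustration.

```julia
using Random
using StatsBase  # assumed to be installed; provides sample and Weights

rng = MersenneTwister(42)
items = ["a", "b", "c", "d"]

# rand: n independent uniform draws from the collection (with replacement)
x = rand(rng, items, 3)

# sample: by default also uniform and with replacement ...
y = sample(rng, items, 3)

# ... but it can draw without replacement,
z = sample(rng, items, 3; replace=false)

# ... or with weights attached to the items
w = sample(rng, items, Weights([0.7, 0.1, 0.1, 0.1]), 3)
```

With the default keyword arguments the two calls do the same job; the replace keyword and the Weights argument are what sample adds on top.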

@jzr, all of this is documented, which part of the documentation did you find unclear?

jzr · December 24, 2020, 8:10am

I don’t think it was intended to be a generic function, it is a utility for a special case.

My confusion arose because sample is indeed used as a generic function by Turing.jl and other packages. If I understand correctly, there are multiple interpretations of the distinction between these functions. (Namely, “rand is generic iid; sample isn’t generic”, and “rand is simple low-level drawing; sample is higher-level drawing that uses rand”.)

Would it have been appropriate usage for the MCMC packages to use rand for their user-facing sampling interface (instead of sample)?

Tamas_Papp · December 24, 2020, 9:15am

Personally I think that using StatsBase.sample for MCMC is bad style (“punning”).

juliohm · December 24, 2020, 12:53pm

I fully agree with this view. The function sample refers to an operation performed on a dataset (or population) and can take weights and replacement options. The function rand is for random variables with a given distribution (e.g. Distributions.jl).

cscherrer · December 24, 2020, 3:05pm

@Tamas_Papp @juliohm what name would you use for a function that samples from a posterior distribution, given a model, observed data, and sampling algorithm?

Soss currently uses a different function name for each algorithm, but I like the idea of having that abstracted away a bit, so making the algorithm an argument is appealing to me.

Tamas_Papp · December 24, 2020, 3:32pm

I think that Turing.sample is OK. It just should not coincide with StatsBase.sample, which is a rather pointless pun.

juliohm · December 24, 2020, 3:45pm

@cscherrer I think I would end up choosing rand for this posterior distribution so that it is clear that it is not a sample from a population a la StatsBase.sample. I am trying to follow this pattern in my packages whenever I can so that users know when they can pass weights for example.

juliohm · December 24, 2020, 3:51pm

And I would maybe describe this functionality as:

A function that draws from a posterior distribution given a model, observed data and algorithm…

But that is all a matter of taste I guess as we all understand the meaning of these concepts. It would be nice to have a consistent terminology across multiple ecosystems though.

cscherrer · December 24, 2020, 6:06pm

I’m pretty sure Turing overrides and re-exports StatsBase.sample.

But these samples are not usually iid, which breaks a common assumption of rand. To me this is a more critical distinction than requiring that sample is only for drawing a sample from a population, which isn’t at all clear to me. Are you requiring here that a “population” be finite, and unweighted?

If someone uses rand thinking the results will be iid, violating that could cause big problems. As for sample, I really don’t see such a risk in extending it.

juliohm · December 24, 2020, 6:24pm

Oh I had IID samples in mind when you wrote:

I am assuming they are not IID because of an algorithmic (MCMC) detail?

Yes, and I think that is what most people have in mind when they hear the term population in statistics, but I may be wrong.

I think the IID property is relevant, but isn’t the critical aspect to differentiate between rand and sample semantics.

cscherrer · December 24, 2020, 6:29pm

Yes, that’s right. Sometimes we can get IID samples, but that’s not usually the case.

Maybe it depends on which branch of statistics. The expression “sample from the posterior” comes up a lot in Bayesian stats, and we usually don’t assume the result will be IID.

Tamas_Papp · December 26, 2020, 8:26am

That said, StatsBase.sample is a utility function for a very specific stats/probability exercise, i.e., pulling objects from a bag/urn with/without replacement. This meaning predates the Bayesian/MCMC usage by centuries, and is really not the same thing.

juliohm · January 29, 2025, 1:10pm

I would like to revive this topic for two reasons:

1. We should try to be consistent terminology-wise, and I agree with @Tamas_Papp that the term sample was introduced way before MCMC, Turing, etc. If Bayesian packages need a name that is not Base.rand, they could perhaps adopt draw(posterior) and own the name draw as part of their API. It would be better than sample at least.
2. Is there a lightweight, actively maintained alternative to StatsBase.jl nowadays when it comes to sampling with weights and without replacement?
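To make the first point concrete, here is a rough sketch of what a package-owned draw verb with the algorithm as an argument could look like. Everything in it is hypothetical: Posterior, RandomWalkMetropolis, and this draw signature are invented for illustration and are not taken from Turing, Soss, or any other package.

```julia
using Random

# Hypothetical names throughout: Posterior, RandomWalkMetropolis, and draw
# are illustrative only and do not come from any existing package.
struct Posterior{F}
    logdensity::F   # unnormalized log density of a scalar parameter
end

struct RandomWalkMetropolis
    stepsize::Float64
end

# The package owns `draw`; it extends neither Base.rand nor StatsBase.sample.
function draw(rng::AbstractRNG, p::Posterior, alg::RandomWalkMetropolis,
              n::Integer; init=0.0)
    chain = Vector{Float64}(undef, n)
    x = float(init)
    lp = p.logdensity(x)
    for i in 1:n
        proposal = x + alg.stepsize * randn(rng)
        lp_new = p.logdensity(proposal)
        # Accept with probability min(1, exp(lp_new - lp))
        if log(rand(rng)) < lp_new - lp
            x, lp = proposal, lp_new
        end
        chain[i] = x   # consecutive entries are correlated, not IID
    end
    return chain
end

posterior = Posterior(x -> -0.5 * x^2)   # standard normal, up to a constant
chain = draw(MersenneTwister(1), posterior, RandomWalkMetropolis(1.0), 1_000)
```

Other algorithms would get their own type and their own draw method, so the verb stays the same while dispatch picks the sampler.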

Tamas_Papp · January 29, 2025, 6:14pm

Given that all the APIs are different (at this point), I am not sure that a common verb makes sense.

In any case, DynamicHMC.jl does not use sample as part of its API; I have chosen more expressive names (I hope) like mcmc_with_warmup.
