Imagine I want to use a small dataset (from an experiment) to fit a Bayesian model.

Is it valid to first fit that model to a large dataset (such as public health service data) and use the output as the prior for my small analysis?

For example, I have data from 40 people and I want to see whether a regression model shows that smoking has an effect on coronary disease.
I would first fit that model to the data from 1 million people (excluding my 40).

The literature says that the prior should be decided from expert judgment or former experiments.

I think the answer is a bit subjective. If the large dataset and the experiment are measuring the same thing, then it would be logically defensible to use the posteriors of a model fit to the large public health dataset as the prior for the model fit to the small experiment. In that case, though, the results of the small experiment are likely not going to matter, even if they are quite different from the big dataset. They will essentially become the 1,000,001st through 1,000,040th data points, about 0.004% of the data.
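To make the "0.004% of the data" point concrete, here is a minimal sketch using a conjugate normal model for a single mean rather than a full regression. The known observation sd, the two sample means, and the `update` helper are all made up for illustration; only the sample sizes come from the question.

```python
# Conjugate normal-normal update for a mean with known observation sd.
# The known sd is an assumption made only to keep the arithmetic closed-form.

def update(prior_mean, prior_sd, data_mean, data_sd, n):
    """Return the posterior (mean, sd) after n observations with mean data_mean."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / data_sd**2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    return post_mean, post_prec ** -0.5

sigma = 1.0  # hypothetical known observation sd

# Step 1: vague prior, then the large dataset (1,000,000 points, sample mean 0.50).
big_mean, big_sd = update(0.0, 100.0, 0.50, sigma, 1_000_000)

# Step 2: that posterior becomes the prior for the 40-person experiment,
# whose sample mean (0.90) is deliberately very different.
final_mean, final_sd = update(big_mean, big_sd, 0.90, sigma, 40)

print(f"prior from big data:  mean={big_mean:.5f}, sd={big_sd:.5f}")
print(f"after 40 more points: mean={final_mean:.5f}, sd={final_sd:.5f}")
```

Under these assumptions the posterior mean barely shifts from the large-data value, because the 40 points carry roughly 40 / 1,000,040 of the total precision.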

In order to give a better answer, can you tell us a bit more about the purpose of the small experiment? Was it run differently, or on a different population, from the existing public health service data?

Consider the generative model: whether your sample size is 40 or 1 million, the samples come from the same latent data-generating source. This part is the same in classical and Bayesian statistics. From this point of view,
classical statistics still has a prior model, just as Bayesian statistics does. So what are the prior model's parameters? Scientists try different ways to find them.

Now consider uncertainty. Even for a large population, a sample of 1 million does not cover all the information about the population, so there is always uncertainty. But we can't handle this type of uncertainty, because the sample is fixed; we can only handle the model's uncertainty. If we represent the prior model by its parameters, we can assign probability distributions to those parameters, and that gives us the Bayesian prior model. These probability distributions mean we now have many candidate models, so which one, or which interval, is our best model? All we can do is mimic the data-generating process: we sample from the prior model, and that is the job of the prior distributions. The samples have a distribution of their own. So our data follow a joint distribution of the prior distribution and the sampling distribution. That joint distribution is the bridge that lets us use Bayes' theorem to compute the conditional distribution of the model.

In short, the essence of Bayesian statistics is the joint (the intersection) of two distributions; even with a sample size of 1 there is still a joint distribution. It is a computational framework that does not depend on your sample size.
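As a rough illustration that the machinery works even with n = 1, here is a small grid approximation: prior times likelihood gives the joint, and normalizing the joint gives the posterior. The grid range, the Normal(0, 1) prior, the unit observation sd, and the observed value are all hypothetical choices.

```python
import math

def normal_pdf(x, mu, sd):
    """Density of Normal(mu, sd) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Grid of candidate values for the unknown mean (hypothetical range).
step = 0.01
grid = [i * step for i in range(-500, 501)]

y = 1.2  # a single observation, so n = 1

prior = [normal_pdf(m, 0.0, 1.0) for m in grid]       # p(mu)
likelihood = [normal_pdf(y, m, 1.0) for m in grid]    # p(y | mu)
joint = [p * l for p, l in zip(prior, likelihood)]    # p(y, mu)
evidence = sum(joint) * step                          # p(y), Riemann-sum approximation
posterior = [j / evidence for j in joint]             # p(mu | y)

post_mean = sum(m * p for m, p in zip(grid, posterior)) * step
print(f"posterior mean with one observation: {post_mean:.3f}")
```

For this particular prior and likelihood the exact posterior mean is y / 2 = 0.6, so the grid result should land very close to that.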

If you use the posteriors from the large dataset, and your data isn't extremely poorly explained by them, the public health data could act as a nice regularization against biases in your small dataset.

If your data is poorly explained by the posteriors, then a) you will have trouble sampling, because your data won't be able to update the posterior distribution well, and b) it would indicate that the large public dataset might not be a good prior for your data.

Imagine trying to fit a Gaussian to a cluster with a mean at x = 100 while feeding the model a prior of Normal(0,1) on the mean: a) sampling won't go well because p(mu = 100 | Normal(0,1)) is infinitesimal, and b) if you don't have a lot of data the posterior won't move much.
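A quick numeric check of both points, again as a sketch with a conjugate normal model. The observation sd of 50 and the two sample sizes are assumptions picked for illustration, and the helper names are hypothetical.

```python
import math

def normal_logpdf(x, mu, sd):
    """Log density of Normal(mu, sd) evaluated at x."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def posterior_mean(prior_mean, prior_sd, data_mean, data_sd, n):
    """Posterior mean for a normal mean with known observation sd."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / data_sd**2
    return (prior_prec * prior_mean + data_prec * data_mean) / (prior_prec + data_prec)

# a) The Normal(0, 1) prior puts essentially no density at mu = 100.
log_prior_at_100 = normal_logpdf(100.0, 0.0, 1.0)
print(f"log prior density at mu=100: {log_prior_at_100:.1f}")

# b) With noisy data (assumed sd 50) centred at 100, a handful of points
#    barely moves the posterior, while a very large sample overrides the prior.
few = posterior_mean(0.0, 1.0, 100.0, 50.0, 5)
many = posterior_mean(0.0, 1.0, 100.0, 50.0, 50_000)
print(f"posterior mean, n=5:      {few:.2f}")
print(f"posterior mean, n=50000:  {many:.2f}")
```

How little the posterior moves depends on the assumed noise level: with a much smaller observation sd, the same five points would pull the posterior mean most of the way toward 100.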

To quote A. Gelman et al. in their "Bayesian Data Analysis" book:

If an experiment as a whole is replicated several times, then the parameters of the prior distribution can themselves be estimated from data

I think this speaks to what has already been mentioned: you need to know whether the stochastic process generating the earlier data is similar to the one you are later interested in.

I like to think of that in the frame of empirical Bayes which could itself be seen as a dummy hierarchical model.
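A minimal sketch of that empirical Bayes idea, assuming a normal-normal setup with a method-of-moments estimate of the prior. The replicated estimates, the standard errors, and the new experiment's numbers are invented for illustration and do not come from any real study.

```python
import statistics

# Hypothetical effect estimates from replicated experiments, each assumed
# to have the same known standard error.
estimates = [0.42, 0.55, 0.31, 0.60, 0.48, 0.39, 0.52, 0.45]
standard_error = 0.05

# Method of moments: the observed spread of the estimates is the
# between-experiment (prior) variance plus the sampling variance.
prior_mean = statistics.fmean(estimates)
total_var = statistics.variance(estimates)
prior_var = max(total_var - standard_error**2, 0.0)

# Shrink a new, noisier small-experiment estimate toward the estimated prior.
new_estimate, new_se = 0.90, 0.20
weight = prior_var / (prior_var + new_se**2)  # weight given to the new data
shrunk = prior_mean + weight * (new_estimate - prior_mean)

print(f"estimated prior: mean={prior_mean:.3f}, var={prior_var:.4f}")
print(f"weight on new data={weight:.3f}, shrunken estimate={shrunk:.3f}")
```

A full hierarchical model would also propagate the uncertainty in the estimated prior mean and variance, which this plug-in version ignores.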