Can we use logistic regression directly in a case-control study?


Say I have a very imbalanced dataset, so I decide to do a case-control study.

For example, we have 1 million healthy people and 300 people with cancer.
I take 300 people from each group and I want to fit a logistic model.

How do I need to adapt the model or adjust the results to account for the fact that the data don’t come from random sampling?
Or can I just use it as is, since logistic models estimate odds ratios (OR), which are well suited to case-control designs?

Another option would be to get a random sample from the population and use different weights for ill and healthy people.
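
To make the "use it as is" option concrete, here is a minimal sketch on simulated data, not a definitive recipe. It relies on a standard result for logistic regression under outcome-dependent sampling: the slope coefficients (log odds ratios) stay consistent, while the intercept is shifted by the log of the ratio of sampling fractions. The population size, true coefficients, and the helper `fit_logistic` are all assumptions made up for this illustration, and the outcome is made less rare than in the example above so the estimates are stable.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Unweighted logistic regression by Newton-Raphson (hypothetical helper).

    X must already contain an intercept column; y holds 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


rng = np.random.default_rng(0)

# Simulated population; every number here is an assumption for illustration.
n_pop = 1_000_000
true_intercept, true_slope = -6.0, 1.0  # rare outcome
x = rng.normal(size=n_pop)
prob = 1.0 / (1.0 + np.exp(-(true_intercept + true_slope * x)))
y = (rng.random(n_pop) < prob).astype(float)

# Case-control sample: keep every case, draw an equal number of controls.
case_idx = np.flatnonzero(y == 1)
control_pool = np.flatnonzero(y == 0)
control_idx = rng.choice(control_pool, size=case_idx.size, replace=False)
idx = np.concatenate([case_idx, control_idx])

# Fit the ordinary logistic model on the case-control sample.
X_cc = np.column_stack([np.ones(idx.size), x[idx]])
intercept_cc, slope_cc = fit_logistic(X_cc, y[idx])

# Sampling fractions: share of all cases / all controls that were sampled.
f_case = 1.0  # all cases were kept
f_control = control_idx.size / control_pool.size

# Only the intercept needs adjusting; the slope is left untouched.
intercept_corrected = intercept_cc - np.log(f_case / f_control)

print("slope from case-control fit:", slope_cc)
print("raw intercept:", intercept_cc)
print("corrected intercept:", intercept_corrected)
```

The correction needs the population counts of cases and controls (or the sampling fractions) to be known. Without them, the odds ratios can still be read off the slopes, but predicted probabilities from the raw fit are not on the population scale.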

I don’t think this is a question about logistic regression at all, but a question about causal inference more broadly. Your issue is that you are comparing a “treatment” group (those with cancer) to a control group, but the treatment group is selected on the observed outcome. It is therefore unlikely that potential outcomes are independent of treatment status, which is what you need in order to identify treatment effects.

This isn’t really a Julia question either, but one about the basics of causal inference, so I would recommend referring to the standard literature in the field, such as the Imbens/Rubin textbook.

You can also look at propensity score matching techniques. Try looking at this repo.