How to solve a regression problem when the variables appear unrelated to each other

This is more of a problem-solving question than a programming one. I’ll delete this post if it is not appropriate.

I am trying to solve a regression problem, and the scatter plot I looked at is frustrating me. :confused:
I can’t see any continuous independent variable that has a positive or negative relationship with the target variable “perf”.

Are there any models or algorithms that can help in this situation?
Do I need to do feature transformation?

I would really appreciate any advice.

I would start with more basic questions and try to test hypotheses before attempting any regression model.

Formulate your hypothesis in a frequentist or Bayesian setting (Julia is awesome for that) and then you will have more evidence to guide your next steps. Keep in mind that it is not always possible to build a predictive model, and that these kinds of tests can really help you gain insight about the problem at hand.
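As a rough illustration of the frequentist route, here is a minimal Python sketch of a permutation test for the correlation between one feature and perf. The data is synthetic, and the names `pearson`, `permutation_pvalue`, `feature`, `perf_related` and `perf_noise` are made up for this example; they are not taken from the original dataset.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (vx * vy) ** 0.5


def permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: xs and ys are unrelated."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for one feature and the target (assumption, not real data).
rng = random.Random(1)
feature = [rng.gauss(0, 1) for _ in range(200)]
perf_related = [0.5 * x + rng.gauss(0, 1) for x in feature]
perf_noise = [rng.gauss(0, 1) for _ in feature]

p_related = permutation_pvalue(feature, perf_related)
p_noise = permutation_pvalue(feature, perf_noise)
print("p-value, related target:", p_related)
print("p-value, pure-noise target:", p_noise)
```

A small p-value for a feature suggests there is at least some linear signal worth modelling. If every feature looks like the noise case, that is evidence a regression model may not find much.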

Thanks for replying.

Sometimes there’s just nothing in your data… and this may be such a case. I agree with the advice to formulate hypotheses and work with that, but eyeballing the plot, I’d be surprised if you found a strong regression model.

One thing you can do as well is bin your target variable, i.e. instead of trying to predict perf you try to predict perf < t1 vs. perf >= t1 (possibly more classes, but if you already get bad results with 2 classes it’s not a great sign).
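A minimal Python sketch of that binning step, assuming the median of perf as the threshold t1 (any other cut-off could be used). The `perf` values and the helper names `binarize` and `majority_baseline_accuracy` are invented for illustration.

```python
from statistics import median


def binarize(perf, t1):
    """Label each observation 1 if perf >= t1, else 0."""
    return [1 if p >= t1 else 0 for p in perf]


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    ones = sum(labels)
    return max(ones, len(labels) - ones) / len(labels)


# Made-up target values standing in for the real "perf" column.
perf = [12.0, 7.5, 9.1, 15.3, 8.8, 11.0, 6.2, 14.1]

t1 = median(perf)  # assumption: split at the median to get balanced classes
labels = binarize(perf, t1)
baseline = majority_baseline_accuracy(labels)

print("threshold t1:", t1)
print("labels:", labels)
print("majority-class baseline accuracy:", baseline)
```

Any classifier trained on these labels should beat the majority-class baseline by a clear margin; if it does not, the features probably carry little information about perf.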

If you have a strong sense that there should be an exploitable relationship, then I’d dig a bit deeper in the data to try to figure out whether there are sources of noise that you could eliminate.

Just a few thoughts though, good luck!

Thanks for the advice.

I agree with the general comments that you should form a hypothesis first, but it’s worth noting that this plot hides the density of points because they are overlapping. This is most evident in the sex-work plot which only has four points showing. However, there are many more points overlapping, so you can’t see which of the four corners is more common. If the top right and bottom left have more observations than the other two, then you’d have a positive correlation between sex and work.

One way to better see the density of points is to set alpha to a lower value, e.g. 0.3, so you can see where the points overlap. Another option is to look at a 2d histogram or density plot, e.g.
https://docs.juliaplots.org/latest/generated/gr/#gr-ref10

Here is the equivalent idea in R: https://www.r-graph-gallery.com/2d-density-plot-with-ggplot2.html
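For two discrete variables such as sex and work, you can also skip plotting and simply count how many observations fall on each overlapping point. Below is a small Python sketch of that idea; the `sex` and `work` values are made up, and `overlap_counts` is a hypothetical helper name.

```python
from collections import Counter


def overlap_counts(xs, ys):
    """Count how many observations share each (x, y) position."""
    return Counter(zip(xs, ys))


# Made-up binary columns standing in for the real "sex" and "work" data.
sex = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
work = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]

counts = overlap_counts(sex, work)
for (s, w), n in sorted(counts.items()):
    print(f"sex={s}, work={w}: {n} observations")
```

In this toy data the (0, 0) and (1, 1) corners hold most of the observations, which is the kind of positive association that a plain scatter plot with four visible points would hide.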