How to handle/ignore missing values when fitting differential equations?


I have a time series of hourly weather records that you can check here. Each hour in these two days (July 16-17, 2018) has a record for temperature and other variables. I would like to use Julia to fit differential equations to my data, so I can assess whether this setup would work for short-term weather forecasting (like in this example). However, as often happens with real data, there are some missing values in the CSV; that is, for some hours no measurements were taken.

I was wondering whether the differential equations library can work with gaps in the time series, or whether it is mandatory to provide a value at each time step (i.e. each hour in this case). I could impute or interpolate the missing values, but these would not be real measurements, and that could mislead the fitting process.

Is it possible to pass data with gaps in such a setup?
Can you provide any hints on how to do this?

Thanks for your help!

Assuming you’re using DiffEqFlux, the missings only show up in the loss function that you write, so that is the only place you need to be careful. For example, subtract the solution from your data and you’ll get missings wherever the data has gaps. Drop those missings before you do sum(abs2,x) for the sum of squared errors, and it will do what you’re looking for on that kind of loss. For other losses you’d do the same: subtract against the data first, then drop the missings.
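A minimal sketch of that idea in plain Julia, with made-up numbers standing in for the data and the solver output. The names `data`, `pred`, and `loss_skipmissing` are illustrative only and not part of DiffEqFlux. The subtraction propagates `missing`, and Base's `skipmissing` drops those entries before the sum:

```julia
# Hourly observations with one gap; values are invented for illustration.
data = [20.1, missing, 22.4, 23.0]
# Stand-in for the model prediction at the same four time points.
pred = [20.0, 21.0, 22.0, 23.5]

# Sum of squared errors over the observed entries only.
function loss_skipmissing(data, pred)
    residual = data .- pred            # `missing` wherever `data` is missing
    return sum(abs2, skipmissing(residual))
end

loss = loss_skipmissing(data, pred)    # uses entries 1, 3, and 4 only
```

Note that `skipmissing` only removes `missing`; a `NaN` in the data would still propagate into the loss.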


Hi Chris,

Thanks for your reply! Ok, so it seems there is a way for DiffEqFlux to proceed with the fitting regardless of the missing values. Since I am very new to Julia+DiffEqFlux, I have a couple of extra questions, just to have it clear in my mind:

  • Loss function: If I got this correctly, the key would be defining a loss function that can handle these ‘missing’ or ‘NaN’ values for a given timestamp. In this example from DiffEqFlux, an L2 loss function is defined as follows:
function loss_n_ode(p)
    pred = predict_n_ode(p)
    loss = sum(abs2, ode_data .- pred) # L2 loss, I guess
    return loss
end

So in the event that ode_data contains a missing value, and we add the missing-dropping step you kindly suggested, it could be rewritten as:

function loss_n_ode(p)
    pred = predict_n_ode(p)
    diff = ode_data .- pred               # missing wherever ode_data is missing <--
    loss = sum(abs2, skipmissing(diff))   # drop the missings before summing <--
    return loss
end

So this would not crash in the event one of my weather values is missing.

  • Delete empty rows of the DataFrame? While writing the previous item, I realized that each missing value is also an empty row in my DataFrame: a gap in the temporal axis. Maybe the function loss_n_ode should not receive any NaN at all, because such rows should be removed beforehand. If I remove all these empty rows from my DataFrame, would it still be OK to use DiffEqFlux with an irregularly sampled temporal axis? I thought the temporal axis had to be divided into equal steps, but perhaps I am wrong about that.

Could you provide any hint/advice about this?
Sorry for the lengthy message; as you can see, I have a lot to learn and reflect on! :grinning:
Thanks a lot for your help!

Sorry, JuliaCon got me behind.

Yup that should be good.

Irregular sampling should be fine with DiffEqFlux. I just don’t know all of the DataFrame commands, so someone else might need to help there, but you might want to drop those rows in the data iterator, since this kind of data wrangling shouldn’t need to be differentiated.
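One way that preprocessing could look, sketched in plain Julia with invented numbers: treat `NaN` the same as `missing`, keep only the hours that have a measurement, and compare the model at exactly those (now irregular) times. The names `t_all`, `temp_raw`, `isobserved`, and `model` are hypothetical, and `model` is a closed-form stand-in for a solver call. With DifferentialEquations.jl, the `saveat` keyword of `solve` accepts a vector of time points, so the kept times could be passed there instead.

```julia
# Hours since the start of the record, and temperatures with two gaps:
# one stored as `missing`, one as `NaN` (values invented for illustration).
t_all    = [0.0, 1.0, 2.0, 3.0, 4.0]
temp_raw = [20.0, missing, 18.0, NaN, 16.0]

# A value is usable only if it is neither `missing` nor `NaN`.
isobserved(x) = !ismissing(x) && !isnan(x)

# Done once, outside the loss, so it never needs to be differentiated.
keep  = map(isobserved, temp_raw)
t_obs = t_all[keep]                  # irregular time axis: hours 0, 2, 4
y_obs = Float64.(temp_raw[keep])     # plain Float64 vector, no gaps left

# Closed-form stand-in for the ODE solution (a straight line; hypothetical).
model(p, t) = p[1] .+ p[2] .* t

# The loss only sees observed points, so it needs no `missing` handling.
loss(p) = sum(abs2, y_obs .- model(p, t_obs))
```

If the data lives in a DataFrame, `dropmissing` from DataFrames.jl removes rows that contain `missing`, which would play the role of the `keep` mask here.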