Fill gaps in DataFrame

What is an idiomatic way how to fill gaps in DataFrame? “Gaps” for me are some unknown data. One example is - I need to draw a line, but I have values only for some points. I need to supply some values (e.g. zero) for those X points.

Let’s say I have something like this:

using Dates
df = DataFrame(Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 11)], Price = [10, 13, 8], Quantity=[1049, 1145, 901])
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 11)

And I’d like to have DataFrame that contains all the dates from allDates. And if the date is not from original df, then Price and Quantity is 0.

Wanted result:

julia> df = DataFrame(
           Date = [Date(2021,05,06), Date(2021,05,07), Date(2021,05,08), Date(2021,05,09), Date(2021,05,10), Date(2021,05,11)],
           Price = [0, 10, 0, 13, 0, 8],
           Quantity = [0, 1049, 0, 1145, 0, 901])
6×3 DataFrame
 Row │ Date        Price  Quantity 
     │ Date        Int64  Int64    
─────┼─────────────────────────────
   1 │ 2021-05-06      0         0
   2 │ 2021-05-07     10      1049
   3 │ 2021-05-08      0         0
   4 │ 2021-05-09     13      1145
   5 │ 2021-05-10      0         0
   6 │ 2021-05-11      8       901

I can create a DataFrame in a loop row by row, but that’s probably not the easiest way…
What is the best approach here, please?

Just make a DataFrame with all dates as a column, and then leftjoin the data you actually have onto that. Where no data is available, price and quantity will be missing, but you can coalesce with zero afterwards.

5 Likes

Thank you, looks good. So the result code is

@pipe DataFrames.leftjoin(DataFrame(Date=allDates), df, on = :Date) .|>
     coalesce(_, 0) |> 
     sort(_, :Date)
2 Likes

If you don’t mind @nilshg , one more question came on my mind… What if I have a DataFrame with Date and Amount of apples. I have records only about days when somebody took an apple(s), so the amount changed. And I need to fill also the remaining days (when the amount of apples didn’t change). So I’d like to fill the gaps with “previous” value if needed.

applesAmount = DataFrame(
                    Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 13)], 
                    Amount = [10, 8, 2])
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 13)

# wanted:
# either with value from first known date
df = DataFrame(
           Date = allDates,
           Amount = [10, 10, 10, 8, 8, 8, 8, 2])
#or with zero before the first known date
df = DataFrame(
           Date = allDates,
           Amount = [0, 10, 10, 8, 8, 8, 8, 2])
# it depends on usecase

Ugly, but working solution. Everything solves a loop :smiley:
But please… anything better than this?

function fillZerosWithNochangeValues!(indata, column)
    lastvalue = nothing
    for row = eachrow(indata)
        value = row[column]
        if ismissing(value)
            if isnothing(lastvalue)
                row[column] = 0
                lastvalue = 0
            else
                row[column] = lastvalue
            end
        else
            lastvalue = value
        end
    end
end


applesAmount = DataFrame(
                   Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 13)],
                   Amount = [10, 8, 2])
                   
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 13)

allApplesAmount = sort(DataFrames.leftjoin(DataFrame(Date=allDates), applesAmount, on = :Date), :Date)

fillZerosWithNochangeValues!(allApplesAmount, :Amount)
allApplesAmount

Have a look at the Impute package, which I believe offers a fill forward imputation.

Indeed Impute has locf for this (though it won’t replace the first value if it’s missing).

Here’s a short (and probably slower) implementation if you want to do it without Impute:

allApplesAmount.Amount =
    let prev = 0
        map(x -> prev = coalesce(x, prev), allApplesAmount.Amount)
    end

Edit: I looked for a solution with map as an exercise but for production code I would use a for loop like you did: less hackish than a map with side effect and easier to adapt e.g. to backward imputation. Here’s a version that works for different number types:

function impute_forward(v; default=zero(eltype(v)))
    out = similar(v, nonmissingtype(eltype(v)))
    prev = default
    for (i, x) in pairs(v)
        prev = out[i] = coalesce(x, prev)
    end
    return out
end

allApplesAmount.Amount = impute_forward(allApplesAmount.Amount)
1 Like

Thanks a lot both for the answers. I had a brief look at Impute, but hoped there is something that doesn’t require external package.

That trick with coalesce is nice. I even haven’t seen let keyword in julia, so something new again! Great :slight_smile:

Impute is just as external as DataFrames, so there’s no real reason not to use it.

In general the Julia ecosystem is built on a number of small, modular packages which do a limited number of things very well, so trying to avoid using many packages is a losing strategy IMHO when working in Julia.

3 Likes

True, but there is still value in avoiding a dependency when the solution is just a few lines of code. Each dependency has a cost that is easy to forget, in terms of maintenance, security, etc. See this article by Russ Cox (discussed here).

Even with Julia’s ecosystem which makes heavy use of dependencies, I think it makes sense to avoid them in trivial cases. Consider examples at both ends of the spectrum:

  • If you limit yourself to flagship packages like DataFrames and GLM, when something goes wrong, like a change in one breaking the other, you can expect that it will be fixed quickly. Same thing when something breaks in a small dependency of DataFrames you don’t even know about.

  • If you use a dozen low-profile packages and you add another one that forces an update that triggers a bad interaction between two others… it’s probably on you.

1 Like

Right, using packages isn’t free of charge. I know that package dependency might be big pain in general.
I use Julia from time to time, mainly for some basic data processing of large CSVs/DataFrames, some analysis (joins, basic statistics) and plotting. So I haven’t invested enough time to studying how the package development, dependency management etc. works. Several days ago I saw some errors saying that I can’t add package because of some issues, probably because of dependencies. It might be easy to solve, I admit (as I said I still don’t know much about it). But that drives me to using just basic, idiomatic constructs as much as possible.
And this general approach has paid off in long term during all my career.