Fill gaps in DataFrame

stej · May 10, 2021, 8:11pm

What is an idiomatic way how to fill gaps in DataFrame? “Gaps” for me are some unknown data. One example is - I need to draw a line, but I have values only for some points. I need to supply some values (e.g. zero) for those X points.

Let’s say I have something like this:

using Dates
df = DataFrame(Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 11)], Price = [10, 13, 8], Quantity=[1049, 1145, 901])
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 11)

And I’d like to have DataFrame that contains all the dates from allDates. And if the date is not from original df, then Price and Quantity is 0.

Wanted result:

julia> df = DataFrame(
           Date = [Date(2021,05,06), Date(2021,05,07), Date(2021,05,08), Date(2021,05,09), Date(2021,05,10), Date(2021,05,11)],
           Price = [0, 10, 0, 13, 0, 8],
           Quantity = [0, 1049, 0, 1145, 0, 901])
6×3 DataFrame
 Row │ Date        Price  Quantity 
     │ Date        Int64  Int64    
─────┼─────────────────────────────
   1 │ 2021-05-06      0         0
   2 │ 2021-05-07     10      1049
   3 │ 2021-05-08      0         0
   4 │ 2021-05-09     13      1145
   5 │ 2021-05-10      0         0
   6 │ 2021-05-11      8       901

I can create a DataFrame in a loop row by row, but that’s probably not the easiest way…
What is the best approach here, please?

nilshg · May 10, 2021, 8:14pm

Just make a DataFrame with all dates as a column, and then leftjoin the data you actually have onto that. Where no data is available, price and quantity will be missing, but you can coalesce with zero afterwards.

stej · May 10, 2021, 8:24pm

Thank you, looks good. So the result code is

@pipe DataFrames.leftjoin(DataFrame(Date=allDates), df, on = :Date) .|>
     coalesce(_, 0) |> 
     sort(_, :Date)

stej · May 10, 2021, 8:37pm

If you don’t mind @nilshg , one more question came on my mind… What if I have a DataFrame with Date and Amount of apples. I have records only about days when somebody took an apple(s), so the amount changed. And I need to fill also the remaining days (when the amount of apples didn’t change). So I’d like to fill the gaps with “previous” value if needed.

applesAmount = DataFrame(
                    Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 13)], 
                    Amount = [10, 8, 2])
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 13)

# wanted:
# either with value from first known date
df = DataFrame(
           Date = allDates,
           Amount = [10, 10, 10, 8, 8, 8, 8, 2])
#or with zero before the first known date
df = DataFrame(
           Date = allDates,
           Amount = [0, 10, 10, 8, 8, 8, 8, 2])
# it depends on usecase

stej · May 10, 2021, 10:08pm

Ugly, but working solution. Everything solves a loop
But please… anything better than this?

function fillZerosWithNochangeValues!(indata, column)
    lastvalue = nothing
    for row = eachrow(indata)
        value = row[column]
        if ismissing(value)
            if isnothing(lastvalue)
                row[column] = 0
                lastvalue = 0
            else
                row[column] = lastvalue
            end
        else
            lastvalue = value
        end
    end
end


applesAmount = DataFrame(
                   Date = [Date(2021, 05, 07), Date(2021, 05, 09), Date(2021, 05, 13)],
                   Amount = [10, 8, 2])
                   
allDates = Date(2021, 05, 06):Dates.Day(1):Date(2021, 05, 13)

allApplesAmount = sort(DataFrames.leftjoin(DataFrame(Date=allDates), applesAmount, on = :Date), :Date)

fillZerosWithNochangeValues!(allApplesAmount, :Amount)
allApplesAmount

nilshg · May 11, 2021, 5:52am

Have a look at the Impute package, which I believe offers a fill forward imputation.

sijo · May 11, 2021, 7:42am

Indeed Impute has locf for this (though it won’t replace the first value if it’s missing).

Here’s a short (and probably slower) implementation if you want to do it without Impute:

allApplesAmount.Amount =
    let prev = 0
        map(x -> prev = coalesce(x, prev), allApplesAmount.Amount)
    end

Edit: I looked for a solution with map as an exercise but for production code I would use a for loop like you did: less hackish than a map with side effect and easier to adapt e.g. to backward imputation. Here’s a version that works for different number types:

function impute_forward(v; default=zero(eltype(v)))
    out = similar(v, nonmissingtype(eltype(v)))
    prev = default
    for (i, x) in pairs(v)
        prev = out[i] = coalesce(x, prev)
    end
    return out
end

allApplesAmount.Amount = impute_forward(allApplesAmount.Amount)

stej · May 11, 2021, 6:57pm

Thanks a lot both for the answers. I had a brief look at Impute, but hoped there is something that doesn’t require external package.

That trick with coalesce is nice. I even haven’t seen let keyword in julia, so something new again! Great

nilshg · May 11, 2021, 8:07pm

Impute is just as external as DataFrames, so there’s no real reason not to use it.

In general the Julia ecosystem is built on a number of small, modular packages which do a limited number of things very well, so trying to avoid using many packages is a losing strategy IMHO when working in Julia.

sijo · May 12, 2021, 11:50am

True, but there is still value in avoiding a dependency when the solution is just a few lines of code. Each dependency has a cost that is easy to forget, in terms of maintenance, security, etc. See this article by Russ Cox (discussed here).

Even with Julia’s ecosystem which makes heavy use of dependencies, I think it makes sense to avoid them in trivial cases. Consider examples at both ends of the spectrum:

If you limit yourself to flagship packages like DataFrames and GLM, when something goes wrong, like a change in one breaking the other, you can expect that it will be fixed quickly. Same thing when something breaks in a small dependency of DataFrames you don’t even know about.
If you use a dozen low-profile packages and you add another one that forces an update that triggers a bad interaction between two others… it’s probably on you.

stej · May 13, 2021, 5:32am

Right, using packages isn’t free of charge. I know that package dependency might be big pain in general.
I use Julia from time to time, mainly for some basic data processing of large CSVs/DataFrames, some analysis (joins, basic statistics) and plotting. So I haven’t invested enough time to studying how the package development, dependency management etc. works. Several days ago I saw some errors saying that I can’t add package because of some issues, probably because of dependencies. It might be easy to solve, I admit (as I said I still don’t know much about it). But that drives me to using just basic, idiomatic constructs as much as possible.
And this general approach has paid off in long term during all my career.

Topic		Replies	Views
Fill up and fill down rows New to Julia dataframes	11	931	February 28, 2023
Identify and add missing rows dynamically in dataframe General Usage dataframes , transformers	6	855	April 9, 2024
Replacing missing and NaN values in dataframe New to Julia question , dataframes , missing-values	6	4079	March 29, 2022
Insert dummy 0 value rows into dataframe General Usage dataframes	7	561	May 23, 2021
Julia: how to fill na in data frame like as data frame in python Community	3	3615	April 13, 2018

Fill gaps in DataFrame

Related topics