How to reindex in DataFrames?

I am looking for a reindex function for DataFrames that allows to reindex a column with respect to a new column.

Suppose I build a data frame like this

julia> ind = today():Day(2):today()+Day(10)
Date("2020-12-02"):Day(2):Date("2020-12-12")

julia> vals = rand(length(ind))
6-element Array{Float64,1}:
 0.9818950590284061
 0.2577122322179928
 0.1523866292941043
 0.25006650363899485
 0.14594687364012926
 0.8152297041275853

julia> df = DataFrame("ind" => ind, "vals" => vals)
6×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.981895 │
│ 2   │ 2020-12-04 │ 0.257712 │
│ 3   │ 2020-12-06 │ 0.152387 │
│ 4   │ 2020-12-08 │ 0.250067 │
│ 5   │ 2020-12-10 │ 0.145947 │
│ 6   │ 2020-12-12 │ 0.81523  │

As you can see, the ind column takes every other day.
And now I want to automatically reindex the vals entries to not every other day, but every day.
So, I would like to get a dataframe that looks like this (I arbitrarily set the missing values to zero)

11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.23982  │
│ 2   │ 2020-12-03 │ 0.0      │
│ 3   │ 2020-12-04 │ 0.308453 │
│ 4   │ 2020-12-05 │ 0.0      │
│ 5   │ 2020-12-06 │ 0.86393  │
│ 6   │ 2020-12-07 │ 0.0      │
│ 7   │ 2020-12-08 │ 0.596054 │
│ 8   │ 2020-12-09 │ 0.0      │
│ 9   │ 2020-12-10 │ 0.932166 │
│ 10  │ 2020-12-11 │ 0.0      │
│ 11  │ 2020-12-12 │ 0.415919 │

Is there such a reindexing function?

Thanks!

I am not sure what the general case would be, but for your specific case there is the following specific solution:

julia> df2 = DataFrame("ind" => (today()+Day(1)):Day(2):(today()+Day(9)), "vals" => zeros(5))
5×2 DataFrame
│ Row │ ind        │ vals    │
│     │ Date       │ Float64 │
├─────┼────────────┼─────────┤
│ 1   │ 2020-12-03 │ 0.0     │
│ 2   │ 2020-12-05 │ 0.0     │
│ 3   │ 2020-12-07 │ 0.0     │
│ 4   │ 2020-12-09 │ 0.0     │
│ 5   │ 2020-12-11 │ 0.0     │

julia> df3 = vcat(df, df2)
11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.780801 │
│ 2   │ 2020-12-04 │ 0.386663 │
│ 3   │ 2020-12-06 │ 0.812419 │
│ 4   │ 2020-12-08 │ 0.335888 │
│ 5   │ 2020-12-10 │ 0.198539 │
│ 6   │ 2020-12-12 │ 0.74872  │
│ 7   │ 2020-12-03 │ 0.0      │
│ 8   │ 2020-12-05 │ 0.0      │
│ 9   │ 2020-12-07 │ 0.0      │
│ 10  │ 2020-12-09 │ 0.0      │
│ 11  │ 2020-12-11 │ 0.0      │

julia> sort!(df3, :ind)
11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.780801 │
│ 2   │ 2020-12-03 │ 0.0      │
│ 3   │ 2020-12-04 │ 0.386663 │
│ 4   │ 2020-12-05 │ 0.0      │
│ 5   │ 2020-12-06 │ 0.812419 │
│ 6   │ 2020-12-07 │ 0.0      │
│ 7   │ 2020-12-08 │ 0.335888 │
│ 8   │ 2020-12-09 │ 0.0      │
│ 9   │ 2020-12-10 │ 0.198539 │
│ 10  │ 2020-12-11 │ 0.0      │
│ 11  │ 2020-12-12 │ 0.74872  │
2 Likes

Thanks @Henrique_Becker! Sure, there are always ways around, but I am interested in the general case. :slight_smile:

I ask simply because there is such a reindex function for pandas

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

I think what @Henrique_Becker has proposed is actually the best solution and also quite modular. (Depending on your use case, maybe cols=:union would help as well.) I don’t know of a better way to do it.

1 Like

My problem with a more generic solution is how many things are left to decide by the user, to the point that just passing all information to the generic solution is not much different from writing specific code. For example, from where the generic code will take the values for the other columns? It should make all of them Union{PreviousType, Missing} and fill with missing, or it should take a default value as argument for each column? The gaps in the index column should be automatically recognized and filled, or they will always be explicitly passed on? If this is a very common action then maybe it is worth creating a function for it, but for one or two uses I would follow that pattern and adapt it to the specifics of that occurrence.

1 Like

I think the comparison with pandas leads you astray - there is nothing like the idea of an “index” in pandas in DataFrames (and I’d say that’s for the better given how often I’ve struggled with this back when I was a pandas user).

I would naturally use leftjoin here - create a DataFrame with the indexing column you want, and then leftjoin onto that. This will give missing where no match is found, and you can coalesce if you want to replace the missings with some value.

4 Likes

This sounds like a job for leftjoin

julia> ind = today():Day(2):today()+Day(10);

julia> vals = rand(length(ind));

julia> df = DataFrame("ind" => ind, "vals" => vals);

julia> full_inds = today():Day(1):today()+Day(10);

julia> df_full = DataFrame(ind = full_inds);

julia> leftjoin(df_full, df, on ="ind")
11×2 DataFrame
 Row │ ind         vals            
     │ Date        Float64?        
─────┼─────────────────────────────
   1 │ 2020-12-02        0.483128
   2 │ 2020-12-03  missing         
   3 │ 2020-12-04        0.403967
   4 │ 2020-12-05  missing         
   5 │ 2020-12-06        0.0540637
   6 │ 2020-12-07  missing         
   7 │ 2020-12-08        0.951807
   8 │ 2020-12-09  missing         
   9 │ 2020-12-10        0.0238036
  10 │ 2020-12-11  missing         
  11 │ 2020-12-12        0.51351
7 Likes

This sounds like a job for leftjoin

It does! Thanks a lot for extending the example!

This is the generic solution I was looking for.

1 Like

un unidiomatic one:

sort(vcat(df,DataFrame(ind=setdiff(df_full.ind,df.ind),vals=0)))

using the variables already defined

dff = DataFrame(ind = full_inds, vals=0.);
sort(combine(groupby(vcat(df,dff),:ind), :vals=>sum=>:vals))

or

dff1 = DataFrame(ind = full_inds, vals=missing);
sort(combine(groupby(vcat(df,dff1),:ind), :vals=>(x->(coalesce(x...)))=>:vals))

or (*)

j=1
for i in 1:length(dff.ind)
    global j
    if df.ind[j]==dff.ind[i]
        dff.vals[i]=df.vals[j]
        if j<length(df.ind)
            j=j+1
        end
    end        
end

or, as very last resource :wink:,

map(x-> x.ind in df.ind ? x.vals= df[df.ind.==x.ind,:vals][1]  : x.vals, eachrow(dff))

(*)
I tried to check this code (to see how, step after step, things are going) using debug under vscode environment. But failed.
this was my first attempt at using debugging.
Where can I get help on this topic?