How to reindex in DataFrames?

timueh · December 2, 2020, 1:21pm

I am looking for a reindex function for DataFrames that allows to reindex a column with respect to a new column.

Suppose I build a data frame like this

julia> ind = today():Day(2):today()+Day(10)
Date("2020-12-02"):Day(2):Date("2020-12-12")

julia> vals = rand(length(ind))
6-element Array{Float64,1}:
 0.9818950590284061
 0.2577122322179928
 0.1523866292941043
 0.25006650363899485
 0.14594687364012926
 0.8152297041275853

julia> df = DataFrame("ind" => ind, "vals" => vals)
6×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.981895 │
│ 2   │ 2020-12-04 │ 0.257712 │
│ 3   │ 2020-12-06 │ 0.152387 │
│ 4   │ 2020-12-08 │ 0.250067 │
│ 5   │ 2020-12-10 │ 0.145947 │
│ 6   │ 2020-12-12 │ 0.81523  │

As you can see, the ind column takes every other day.
And now I want to automatically reindex the vals entries to not every other day, but every day.
So, I would like to get a dataframe that looks like this (I arbitrarily set the missing values to zero)

11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.23982  │
│ 2   │ 2020-12-03 │ 0.0      │
│ 3   │ 2020-12-04 │ 0.308453 │
│ 4   │ 2020-12-05 │ 0.0      │
│ 5   │ 2020-12-06 │ 0.86393  │
│ 6   │ 2020-12-07 │ 0.0      │
│ 7   │ 2020-12-08 │ 0.596054 │
│ 8   │ 2020-12-09 │ 0.0      │
│ 9   │ 2020-12-10 │ 0.932166 │
│ 10  │ 2020-12-11 │ 0.0      │
│ 11  │ 2020-12-12 │ 0.415919 │

Is there such a reindexing function?

Thanks!

Henrique_Becker · December 2, 2020, 1:55pm

I am not sure what the general case would be, but for your specific case there is the following specific solution:

julia> df2 = DataFrame("ind" => (today()+Day(1)):Day(2):(today()+Day(9)), "vals" => zeros(5))
5×2 DataFrame
│ Row │ ind        │ vals    │
│     │ Date       │ Float64 │
├─────┼────────────┼─────────┤
│ 1   │ 2020-12-03 │ 0.0     │
│ 2   │ 2020-12-05 │ 0.0     │
│ 3   │ 2020-12-07 │ 0.0     │
│ 4   │ 2020-12-09 │ 0.0     │
│ 5   │ 2020-12-11 │ 0.0     │

julia> df3 = vcat(df, df2)
11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.780801 │
│ 2   │ 2020-12-04 │ 0.386663 │
│ 3   │ 2020-12-06 │ 0.812419 │
│ 4   │ 2020-12-08 │ 0.335888 │
│ 5   │ 2020-12-10 │ 0.198539 │
│ 6   │ 2020-12-12 │ 0.74872  │
│ 7   │ 2020-12-03 │ 0.0      │
│ 8   │ 2020-12-05 │ 0.0      │
│ 9   │ 2020-12-07 │ 0.0      │
│ 10  │ 2020-12-09 │ 0.0      │
│ 11  │ 2020-12-11 │ 0.0      │

julia> sort!(df3, :ind)
11×2 DataFrame
│ Row │ ind        │ vals     │
│     │ Date       │ Float64  │
├─────┼────────────┼──────────┤
│ 1   │ 2020-12-02 │ 0.780801 │
│ 2   │ 2020-12-03 │ 0.0      │
│ 3   │ 2020-12-04 │ 0.386663 │
│ 4   │ 2020-12-05 │ 0.0      │
│ 5   │ 2020-12-06 │ 0.812419 │
│ 6   │ 2020-12-07 │ 0.0      │
│ 7   │ 2020-12-08 │ 0.335888 │
│ 8   │ 2020-12-09 │ 0.0      │
│ 9   │ 2020-12-10 │ 0.198539 │
│ 10  │ 2020-12-11 │ 0.0      │
│ 11  │ 2020-12-12 │ 0.74872  │

timueh · December 2, 2020, 1:57pm

Thanks @Henrique_Becker! Sure, there are always ways around, but I am interested in the general case.

I ask simply because there is such a reindex function for pandas

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

fbanning · December 2, 2020, 2:07pm

I think what @Henrique_Becker has proposed is actually the best solution and also quite modular. (Depending on your use case, maybe cols=:union would help as well.) I don’t know of a better way to do it.

Henrique_Becker · December 2, 2020, 2:22pm

My problem with a more generic solution is how many things are left to decide by the user, to the point that just passing all information to the generic solution is not much different from writing specific code. For example, from where the generic code will take the values for the other columns? It should make all of them Union{PreviousType, Missing} and fill with missing, or it should take a default value as argument for each column? The gaps in the index column should be automatically recognized and filled, or they will always be explicitly passed on? If this is a very common action then maybe it is worth creating a function for it, but for one or two uses I would follow that pattern and adapt it to the specifics of that occurrence.

nilshg · December 2, 2020, 2:30pm

I think the comparison with pandas leads you astray - there is nothing like the idea of an “index” in pandas in DataFrames (and I’d say that’s for the better given how often I’ve struggled with this back when I was a pandas user).

I would naturally use leftjoin here - create a DataFrame with the indexing column you want, and then leftjoin onto that. This will give missing where no match is found, and you can coalesce if you want to replace the missings with some value.

pdeffebach · December 2, 2020, 2:33pm

This sounds like a job for leftjoin

julia> ind = today():Day(2):today()+Day(10);

julia> vals = rand(length(ind));

julia> df = DataFrame("ind" => ind, "vals" => vals);

julia> full_inds = today():Day(1):today()+Day(10);

julia> df_full = DataFrame(ind = full_inds);

julia> leftjoin(df_full, df, on ="ind")
11×2 DataFrame
 Row │ ind         vals            
     │ Date        Float64?        
─────┼─────────────────────────────
   1 │ 2020-12-02        0.483128
   2 │ 2020-12-03  missing         
   3 │ 2020-12-04        0.403967
   4 │ 2020-12-05  missing         
   5 │ 2020-12-06        0.0540637
   6 │ 2020-12-07  missing         
   7 │ 2020-12-08        0.951807
   8 │ 2020-12-09  missing         
   9 │ 2020-12-10        0.0238036
  10 │ 2020-12-11  missing         
  11 │ 2020-12-12        0.51351

timueh · December 2, 2020, 7:14pm

This sounds like a job for leftjoin

It does! Thanks a lot for extending the example!

This is the generic solution I was looking for.

rocco_sprmnt21 · December 29, 2020, 8:46pm

un unidiomatic one:

sort(vcat(df,DataFrame(ind=setdiff(df_full.ind,df.ind),vals=0)))

rocco_sprmnt21 · December 30, 2020, 7:29am

using the variables already defined

dff = DataFrame(ind = full_inds, vals=0.);
sort(combine(groupby(vcat(df,dff),:ind), :vals=>sum=>:vals))

or

dff1 = DataFrame(ind = full_inds, vals=missing);
sort(combine(groupby(vcat(df,dff1),:ind), :vals=>(x->(coalesce(x...)))=>:vals))

or (*)

j=1
for i in 1:length(dff.ind)
    global j
    if df.ind[j]==dff.ind[i]
        dff.vals[i]=df.vals[j]
        if j<length(df.ind)
            j=j+1
        end
    end        
end

or, as very last resource ,

map(x-> x.ind in df.ind ? x.vals= df[df.ind.==x.ind,:vals][1]  : x.vals, eachrow(dff))

(*)
I tried to check this code (to see how, step after step, things are going) using debug under vscode environment. But failed.
this was my first attempt at using debugging.
Where can I get help on this topic?

Topic		Replies	Views
Impute by index New to Julia indexing	8	531	May 11, 2020
Change column to row using conditions New to Julia dataframes	4	456	June 23, 2022
Creating one column from multiple DataFrames General Usage dataframes	3	391	March 18, 2020
DataFrame array subset General Usage array	18	559	August 11, 2020
String Index for DataFrames Data question	1	2308	September 4, 2019

How to reindex in DataFrames?

Related topics