Duplicates in IndexedTables/JuliaDB

statistics
data
indexing

#1

Hi, with a 2-D ndsparse table (with one value column), I am trying to select the max value for each of the 1st dimension of the index, but I am not really sure how to do that (groupby removes the second dimension, which I need to keep). For example, with a table like this:

2-d NDSparse with 8 values (Float64):
a  b │
─────┼──────────
1  1 │ 0.196028
1  2 │ 0.0369555
1  3 │ 0.630628
2  4 │ 0.521345
2  5 │ 0.495389
2  6 │ 0.300472
3  7 │ 0.212149
3  8 │ 0.816977

I would like to produce the following:

2-d NDSparse with 8 values (Float64):
a  b │
─────┼──────────
1  3 │ 0.630628
2  4 │ 0.521345
3  8 │ 0.816977

In pandas, you can usually filter out duplicates based on one index of a multi-indexed dataframe, but I’m not sure how to closely approximate that here.
Thank you!


#2
groupreduce(max, df)

where df is your NDSparse should do the trick. You can find up to date documentation here (the API changed quite a bit recently).


#3

Thanks, however, that doesn’t seem to do the selection based on only the first index while preserving whatever the second index is at that point. For example:

julia> groupreduce(max, testtbl)
2-d NDSparse with 8 values (Float64):
a  b │
─────┼──────────
1  1 │ 0.196028
1  2 │ 0.0369555
1  3 │ 0.630628
2  4 │ 0.521345
2  5 │ 0.495389
2  6 │ 0.300472
3  7 │ 0.212149
3  8 │ 0.816977

If I try to select the first dimension to act upon, then I lose the second dimension:

julia> groupreduce(max, testtbl, :a)
1-d NDSparse with 3 values (Float64):
a │
──┼─────────
1 │ 0.630628
2 │ 0.521345
3 │ 0.816977

#4

Interesting, this just about works, I have to move the 3rd column here back into the index:

julia> groupreduce(max, testtbl, :a; select=(3,:b))
1-d NDSparse with 3 values (2-tuples):
a │ 2         3
──┼────────────
1 │ 0.630628  3
2 │ 0.521345  4
3 │ 0.816977  8

#5

Sorry, I misunderstood the question. The way to do what you want would probably be with mapslices, but it’s a bit verbose. I believe something like:

function get_max(dd)
   data = values(dd)
   n = indmax(data)
   NDSparse([keys(dd,2)[n]], [data[n]])
end
mapslices(get_max, df, 2)

should work.

I’m not sure if there’s a simple way to change which columns are index and which are data, though that’s certainly interesting. If you need that functionality, maybe you’re better off with a simple Table


#6

Thanks! I wasn’t aware of mapslices, I will definitely check that out.
Edit: That seems to throw an exception:

julia> mapslices(get_max, df, 2)
ERROR: MethodError: all(::IndexedTables.##195#200, ::Array{Any,1}) is ambiguous. Candidates:
  all(f, A::AbstractArray{#s3,N} where N where #s3>:Nulls.Null) in Nulls at /Users/christopheralexander/.julia/v0.6/Nulls/src/Nulls.jl:221
  all(f, A::AbstractArray{#s3,N} where N where #s3>:Missings.Missing) in Missings at /Users/christopheralexander/.julia/v0.6/Missings/src/Missings.jl:411
Possible fix, define
  all(::Any, ::AbstractArray{#s3,N} where N where #s3>:Union{Missings.Missing, Nulls.Null})

Edit -> just needed an update, works now!


#8

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.