Query newest entry with Query.jl

ohmsweetohm1 · July 20, 2020, 4:42pm

I have a table with multiple entries.
For each entry I have the timestamp it was fetched and the timestamp it represents, e.g. forecasts for a specific timestamp computed at different times.
I would like to query the newest entry for a given vector of timestamps.

Is there a way to do this with Query.jl?

In SQL I would do something like:

SELECT MAX(t_fetch), * 
FROM data
WHERE date
BETWEEN t_from AND t_to
GROUP BY date ORDER BY date ASC

With the following i get all forecasts that where made for the given timestamp

df2 = df |> @query(d, begin
    @orderby d.date, descending(d.fetch)
    @where d.date == timestamp
    @select d
end) |> DataFrame

from this dataframe I would have to select the first element and do this for all timestamps in the given vector.

df3 = df2 |> @take(1) |> DataFrame

Can I combine this with two nested queries somehow?

ohmsweetohm1 · July 22, 2020, 12:40pm

I have also tried:

df |> 
    @filter(_.date == time[1]) |>  
    @orderby_descending(_.fetch) |> 
    @take(1) |> 
    DataFrame

I would have to repeat this for all t in time and combine the result in a single DataFrame.

This is what I think I should do:

df |> 
    @groupby(_.date) |> 
    @orderby_descending(_.fetch) |> 
    @take(1) |> 
    @map(DataFrames.DataFrame(_)) |>
    collect |> first

However, I get the wrong result. I now have every fetch for the first date.

pdeffebach · July 22, 2020, 1:02pm

in base DataFrames I think what you want is

ulia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4))
4×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.319191 │
│ 2   │ 1     │ 0.400976 │
│ 3   │ 2     │ 0.919093 │
│ 4   │ 2     │ 0.507223 │

julia> sort!(df, [:a, :b])
4×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.319191 │
│ 2   │ 1     │ 0.400976 │
│ 3   │ 2     │ 0.507223 │
│ 4   │ 2     │ 0.919093 │

julia> combine(first, groupby(df, :a))
2×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.319191 │
│ 2   │ 2     │ 0.507223 │

But I’m not 100% sure thats what you want to do. Not sure how to do it with Query, sorry.

ohmsweetohm1 · July 22, 2020, 1:04pm

I found a working solution

dfq = vcat([
    (
        df |>
        @filter(_.date == t) |>
        @orderby_descending(_.fetch) |>
        @take(1) |>
        DataFrame
    ) for t in time
]...)

but it is not really elegant.

ohmsweetohm1 · July 22, 2020, 1:14pm

And with the solution from @pdeffebach

dfq2 =
    df |>
    x -> DataFrames.sort(x, ["fetch", "date"], rev = (true, false)) |>
    x -> DataFrames.combine(first, DataFrames.groupby(x, "date")) |>
    x -> Dataframes.filter("date" => x -> time[1] <= x <= time[end], x) |> 
    DataFrames.DataFrame

Topic		Replies	Views
JuliaDB filter: select one of multiple rows New to Julia	0	909	March 9, 2018
How would I sort user transactions by date, and only keep the N most recent entries General Usage question	8	911	February 6, 2020
Simplifying this dataframe aggregation operation General Usage	19	2264	September 9, 2018
A nice use case for DataFrames.jl - flexible dedup General Usage dataframes , tables , splitapplycombine	4	599	July 16, 2021
DataFrame group by first column, and sort by last column Data dataframes	4	1894	October 16, 2018

Query newest entry with Query.jl

Related topics