I have a table with multiple entries.
For each entry I have the timestamp it was fetched and the timestamp it represents, e.g. forecasts for a specific timestamp computed at different times.
I would like to query the newest entry for a given vector of timestamps.
Is there a way to do this with Query.jl?
In SQL I would do something like:
SELECT MAX(t_fetch), *
FROM data
WHERE date
BETWEEN t_from AND t_to
GROUP BY date ORDER BY date ASC
With the following i get all forecasts that where made for the given timestamp
df2 = df |> @query(d, begin
@orderby d.date, descending(d.fetch)
@where d.date == timestamp
@select d
end) |> DataFrame
from this dataframe I would have to select the first element and do this for all timestamps in the given vector.
df3 = df2 |> @take(1) |> DataFrame
Can I combine this with two nested queries somehow?
I have also tried:
df |>
@filter(_.date == time[1]) |>
@orderby_descending(_.fetch) |>
@take(1) |>
DataFrame
I would have to repeat this for all t in time
and combine the result in a single DataFrame
.
This is what I think I should do:
df |>
@groupby(_.date) |>
@orderby_descending(_.fetch) |>
@take(1) |>
@map(DataFrames.DataFrame(_)) |>
collect |> first
However, I get the wrong result. I now have every fetch
for the first date
.
in base DataFrames I think what you want is
ulia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4))
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.319191 │
│ 2 │ 1 │ 0.400976 │
│ 3 │ 2 │ 0.919093 │
│ 4 │ 2 │ 0.507223 │
julia> sort!(df, [:a, :b])
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.319191 │
│ 2 │ 1 │ 0.400976 │
│ 3 │ 2 │ 0.507223 │
│ 4 │ 2 │ 0.919093 │
julia> combine(first, groupby(df, :a))
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.319191 │
│ 2 │ 2 │ 0.507223 │
But I’m not 100% sure thats what you want to do. Not sure how to do it with Query, sorry.
I found a working solution
dfq = vcat([
(
df |>
@filter(_.date == t) |>
@orderby_descending(_.fetch) |>
@take(1) |>
DataFrame
) for t in time
]...)
but it is not really elegant.
And with the solution from @pdeffebach
dfq2 =
df |>
x -> DataFrames.sort(x, ["fetch", "date"], rev = (true, false)) |>
x -> DataFrames.combine(first, DataFrames.groupby(x, "date")) |>
x -> Dataframes.filter("date" => x -> time[1] <= x <= time[end], x) |>
DataFrames.DataFrame