ANN: LightQuery tutorial

I’d be happy to collaborate with @davidanthoff, or whoever else.

Took a look at the tutorial and seems like a very nice contribution to the ecosystem.

1 Like

@mkborregaard I just pushed a version with three new shortcut functions: group_by, summarize, and left_join. These work basically the way things work in dplyr so they cut down on syntax by a lot. For some reason, it takes a bit of work to get my browser to not show a cached version of github pages, but I show them off in the latest docs.

1 Like

Query and LightQuery seem to be the pandas/dplyr of Julia. Is there any Julia version of R data.table?

I think all the things @bramtayl is doing there is fantastic! The more packages, the merrier.

In my mind both packages are at quite different stages in their lifetime: Query.jl (and the whole Queryverse.jl) have been around for a while, have an active user base and so my focus for those is really mostly on adding features, not redesigning core architecture things. Backwards compatibility is something I care a lot about! We also have an endless list of features we want to implement that all work with the current design (i.e. they are all additive to the codebase), and I think our focus is really on pursuing that list right now.

I think it is great, though, that packages like LightQuery.jl can experiment with other designs. I certainly am learning a lot from those efforts. In my mind a good strategy there is to just see how those ideas work out, and once we have more data on that, we could think how some of this stuff could be integrated.

9 Likes

Yay! Great to have both of these packages in the ecosystem.
@bramtayl sweet! amazing with the swift response, will check it out straight away.

1 Like

Usage and performance notes · LightQuery.jl new tutorial. can i haz starz

Fixed a bunch of performance issues related to https://github.com/JuliaLang/julia/issues/31909 and https://github.com/JuliaLang/julia/issues/31826

2 Likes

@bramtayl, after some work with JuliaDB I wanted to try your package (and then finally Query.jl), but I can’t find any of those convenience functions you mentioned (e.g. group_by) in the tutorial (or are there docs that are not the tutorial?). Coming from JuliaDB I’ve accustomed myself to things like join(t1, t2)… I can read through the tutorial a lot more carefully, but after a few quick scans I can’t seem to find a simple join of two tables.
Please, let me know what I’ve missed :slight_smile:

@yakir12 Sounds like I need better docs… I just pushed a version with a table of contents, does that help?

I’m in the process of changing some things around to incorporate SQL, but as thing stand currently:

Group by is here:

https://bramtayl.github.io/LightQuery.jl/latest/#LightQuery.Group-Tuple{By}

with an example in the tutorial here:

https://bramtayl.github.io/LightQuery.jl/latest/#Flights-validation-1

Right now the join function is called mix… Base has already taken both join and merge. I’m open to suggestions about a better name tho.

Usage and performance notes · LightQuery.jl}

With an example in the tutorial here:

https://bramtayl.github.io/LightQuery.jl/latest/#Joining-flights-and-weather-1

@yakir12 also added an index to the Interface section. Does that help?

Sorry for the super slow response.

I appreciate what you’re doing in the docs, walking through a cohesive long example, but I’d really appreciate some super simple standalone examples. Like, how do I inner join two tables matching on specific keys?

For instance, naively, I’d think this would work:

using CSV, LightQuery
table1 = CSV.read("table1.csv")
table2 = CSV.read("table2.csv")
table12 = @name @> mix(:inner,
                 By(table1, _.key1),
                 By(table2, _.key2)
                )

But it doesn’t.

Again, I suspect that if I followed your documented monolithic example I’d “get it”. But that’s a lot harder to do than just search “join” and find a self-contained simple example on how to join two tables. My final 2-cents are, don’t remove anything from the docs, but have a couple of these ultra simple examples for the most common things lying around so people don’t have to read through the whole thing… Again, thank you for your work!!!

Feeling pretty dense here…

table1 and table2 are read in as dataframes, right? You have to convert them to a named_tuple (which will be a named tuple of columns) and then use to_rows to view them as rows.

table1 = to_rows(named_tuple(DataFrame(key1 = [1])))
table2 = to_rows(named_tuple(DataFrame(key2 = [1])))
@name mix(:inner, By(table1, :key1), By(table2, :key2))

However, CSV.read will copy/allocate, so I would suggest using CSV.File instead.

Maybe I should add a note about working with DataFrames ?

No, they are from csv files I load with CSV.File (thanks for the suggestion).

So I need to to_rows(named_tuple(CSV.File("file.csv")))?

OK, I now realized that the keys themselves need to be sorted before any join can happen… I feel a few convenience functions would be appropriate here.

So how do I sort?

# generate some example data
[(key1 = k, v1 = v) for (k,v) in zip(1:3, 'a':'c')] |> CSV.write("tmp1.csv")
[(key2 = k, v2 = v) for (k,v) in zip([1, 11, 3], 'x':'z')] |> CSV.write("tmp2.csv")
# join
x = @name mix(:inner, By(to_rows(named_tuple(CSV.File("tmp1.csv"))), :key1), By(to_rows(named_tuple(CSV.File("tmp2.csv"))), :key2))
# check the join
for i in x
    @show i
end

@yakir12 Short answer: if your data is out of order, just read it all in and sort it in memory first.

using LightQuery
import CSV

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:3, 'a':'c')])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])

source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)

source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)

@name mix(:inner, source_1, source_2)

Long answer: how is your actually data stored? I designed order very carefully so that it should be able to support out of memory sorting with a few shims.

Awesome! Short answers are great.

The data isn’t sorted so well, but because it’s pretty persistent now, I could sort it and save it sorted already. Cool! Thanks a lot!!!

Eh… It seems like it matches only on one occurrence of matching values, not all. Consider this example:

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:6, [1,1,1,2,2,4])])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])
source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :value1) |>
    By(_, :value1)
source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)
x = @name mix(:inner, source_1, source_2)
Tables.rowtable(x)

which results in:

2-element Array{NamedTuple{(Symbol("1"), Symbol("2")),Tuple{Tuple{Tuple{Name{:key1},Int64},Tuple{Name{:value1},Int64}},Tuple{Tuple{Name{:key1},Int64},Tuple{Name{:value2},String}}}},1}:
 (1 = ((`key1`, 1), (`value1`, 1)), 2 = ((`key1`, 1), (`value2`, "a")))
 (1 = ((`key1`, 4), (`value1`, 2)), 2 = ((`key1`, 2), (`value2`, "b")))

I would have expected a table with 5 rows not 2.

@yakir12 Yes that’s true. You need to group multiple matches first.

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:6, [1,1,1,2,2,4])])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])

source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    Group(By(_, :value1)) |>
    By(_, first)

source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    By(_, :key1)
    
@name mix(:inner, source_1, source_2)

I might put the package to the test and try to support the tables API instead of DataFrames for the internals of my package. I will report any issues I encounter during the process.