ANN: LightQuery tutorial

bramtayl · January 31, 2019, 8:51pm

I’d be happy to collaborate with @davidanthoff, or whoever else.

Nosferican · January 31, 2019, 10:32pm

Took a look at the tutorial and seems like a very nice contribution to the ecosystem.

bramtayl · February 1, 2019, 12:57am

@mkborregaard I just pushed a version with three new shortcut functions: group_by, summarize, and left_join. These work basically the way things work in dplyr so they cut down on syntax by a lot. For some reason, it takes a bit of work to get my browser to not show a cached version of github pages, but I show them off in the latest docs.

Yifan_Liu · February 1, 2019, 1:32am

Query and LightQuery seem to be the pandas/dplyr of Julia. Is there any Julia version of R data.table?

davidanthoff · February 1, 2019, 2:04am

I think all the things @bramtayl is doing there is fantastic! The more packages, the merrier.

In my mind both packages are at quite different stages in their lifetime: Query.jl (and the whole Queryverse.jl) have been around for a while, have an active user base and so my focus for those is really mostly on adding features, not redesigning core architecture things. Backwards compatibility is something I care a lot about! We also have an endless list of features we want to implement that all work with the current design (i.e. they are all additive to the codebase), and I think our focus is really on pursuing that list right now.

I think it is great, though, that packages like LightQuery.jl can experiment with other designs. I certainly am learning a lot from those efforts. In my mind a good strategy there is to just see how those ideas work out, and once we have more data on that, we could think how some of this stuff could be integrated.

mkborregaard · February 1, 2019, 8:01am

Yay! Great to have both of these packages in the ecosystem.
@bramtayl sweet! amazing with the swift response, will check it out straight away.

bramtayl · February 14, 2019, 3:33am

Usage and performance notes · LightQuery.jl new tutorial. can i haz starz

bramtayl · May 3, 2019, 2:35pm

Fixed a bunch of performance issues related to https://github.com/JuliaLang/julia/issues/31909 and https://github.com/JuliaLang/julia/issues/31826

yakir12 · June 11, 2019, 7:15am

@bramtayl, after some work with JuliaDB I wanted to try your package (and then finally Query.jl), but I can’t find any of those convenience functions you mentioned (e.g. group_by) in the tutorial (or are there docs that are not the tutorial?). Coming from JuliaDB I’ve accustomed myself to things like join(t1, t2)… I can read through the tutorial a lot more carefully, but after a few quick scans I can’t seem to find a simple join of two tables.
Please, let me know what I’ve missed

bramtayl · June 12, 2019, 4:27am

@yakir12 Sounds like I need better docs… I just pushed a version with a table of contents, does that help?

I’m in the process of changing some things around to incorporate SQL, but as thing stand currently:

Group by is here:

https://bramtayl.github.io/LightQuery.jl/latest/#LightQuery.Group-Tuple{By}

with an example in the tutorial here:

https://bramtayl.github.io/LightQuery.jl/latest/#Flights-validation-1

Right now the join function is called mix… Base has already taken both join and merge. I’m open to suggestions about a better name tho.

Usage and performance notes · LightQuery.jl}

With an example in the tutorial here:

https://bramtayl.github.io/LightQuery.jl/latest/#Joining-flights-and-weather-1

bramtayl · June 13, 2019, 11:35am

@yakir12 also added an index to the Interface section. Does that help?

yakir12 · July 10, 2019, 12:32pm

Sorry for the super slow response.

I appreciate what you’re doing in the docs, walking through a cohesive long example, but I’d really appreciate some super simple standalone examples. Like, how do I inner join two tables matching on specific keys?

For instance, naively, I’d think this would work:

using CSV, LightQuery
table1 = CSV.read("table1.csv")
table2 = CSV.read("table2.csv")
table12 = @name @> mix(:inner,
                 By(table1, _.key1),
                 By(table2, _.key2)
                )

But it doesn’t.

Again, I suspect that if I followed your documented monolithic example I’d “get it”. But that’s a lot harder to do than just search “join” and find a self-contained simple example on how to join two tables. My final 2-cents are, don’t remove anything from the docs, but have a couple of these ultra simple examples for the most common things lying around so people don’t have to read through the whole thing… Again, thank you for your work!!!

Feeling pretty dense here…

bramtayl · July 10, 2019, 4:31pm

table1 and table2 are read in as dataframes, right? You have to convert them to a named_tuple (which will be a named tuple of columns) and then use to_rows to view them as rows.

table1 = to_rows(named_tuple(DataFrame(key1 = [1])))
table2 = to_rows(named_tuple(DataFrame(key2 = [1])))
@name mix(:inner, By(table1, :key1), By(table2, :key2))

However, CSV.read will copy/allocate, so I would suggest using CSV.File instead.

Maybe I should add a note about working with DataFrames ?

yakir12 · July 10, 2019, 5:24pm

No, they are from csv files I load with CSV.File (thanks for the suggestion).

So I need to to_rows(named_tuple(CSV.File("file.csv")))?

yakir12 · July 10, 2019, 5:43pm

OK, I now realized that the keys themselves need to be sorted before any join can happen… I feel a few convenience functions would be appropriate here.

So how do I sort?

# generate some example data
[(key1 = k, v1 = v) for (k,v) in zip(1:3, 'a':'c')] |> CSV.write("tmp1.csv")
[(key2 = k, v2 = v) for (k,v) in zip([1, 11, 3], 'x':'z')] |> CSV.write("tmp2.csv")
# join
x = @name mix(:inner, By(to_rows(named_tuple(CSV.File("tmp1.csv"))), :key1), By(to_rows(named_tuple(CSV.File("tmp2.csv"))), :key2))
# check the join
for i in x
    @show i
end

bramtayl · July 10, 2019, 7:17pm

@yakir12 Short answer: if your data is out of order, just read it all in and sort it in memory first.

using LightQuery
import CSV

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:3, 'a':'c')])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])

source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)

source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)

@name mix(:inner, source_1, source_2)

Long answer: how is your actually data stored? I designed order very carefully so that it should be able to support out of memory sorting with a few shims.

yakir12 · July 10, 2019, 7:22pm

Awesome! Short answers are great.

The data isn’t sorted so well, but because it’s pretty persistent now, I could sort it and save it sorted already. Cool! Thanks a lot!!!

yakir12 · July 10, 2019, 8:26pm

Eh… It seems like it matches only on one occurrence of matching values, not all. Consider this example:

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:6, [1,1,1,2,2,4])])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])
source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :value1) |>
    By(_, :value1)
source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    order(_, :key1) |>
    By(_, :key1)
x = @name mix(:inner, source_1, source_2)
Tables.rowtable(x)

which results in:

2-element Array{NamedTuple{(Symbol("1"), Symbol("2")),Tuple{Tuple{Tuple{Name{:key1},Int64},Tuple{Name{:value1},Int64}},Tuple{Tuple{Name{:key1},Int64},Tuple{Name{:value2},String}}}},1}:
 (1 = ((`key1`, 1), (`value1`, 1)), 2 = ((`key1`, 1), (`value2`, "a")))
 (1 = ((`key1`, 4), (`value1`, 2)), 2 = ((`key1`, 2), (`value2`, "b")))

I would have expected a table with 5 rows not 2.

bramtayl · July 11, 2019, 2:33am

@yakir12 Yes that’s true. You need to group multiple matches first.

CSV.write("temp1.csv", [(key1 = key, value1 = value) for (key, value) in zip(1:6, [1,1,1,2,2,4])])
CSV.write("temp2.csv", [(key1 = key, value2 = value) for (key, value) in zip(1:3, 'a':'c')])

source_1 =
    @name @> "temp1.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    Group(By(_, :value1)) |>
    By(_, first)

source_2 = 
    @name @> "temp2.csv" |>
    CSV.read(_) |>
    named_tuple |>
    to_rows |>
    By(_, :key1)
    
@name mix(:inner, source_1, source_2)

Nosferican · July 11, 2019, 3:09am

I might put the package to the test and try to support the tables API instead of DataFrames for the internals of my package. I will report any issues I encounter during the process.

Topic		Replies	Views
ANN: LightQuery General Usage	2	661	January 28, 2019
LightQuery 0.7.0 Package Announcements data	11	746	August 20, 2020
“Write You A Query Language" —— To introduce the power of MLStyle.jl Package Announcements announcement , data , dataframes	1	896	February 8, 2019
Query.jl v0.11 released Package Announcements announcement	7	1115	February 4, 2019
[ANN] RowTables.jl Data announcement	6	1118	July 26, 2018

ANN: LightQuery tutorial

Related topics