Speeding Up Query with a big dataset


#1

Hi,

I am currently processing a dataset of 31 columns and 100,000 rows. Loading the data, describing it, etc. is fine, but running a query to match some rows is very slow. 100K rows is big, but still not very big (I might be mistaken here).
At first I thought the latency came from running in a Jupyter notebook and/or JupyterLab, but then I ran a *.jl script and it was still the same.
I then started Julia as julia -p 16 (on a server with 16 cores) and ran the script, but it was still very slow.

I am currently using Query.jl to do my queries, but I am assuming it is going to be the same with other packages (I need to check).

I have not played with multiprocessing in Julia yet (I thought it would do a bit of magic for me automatically)…

Does anybody have a solution for how to use multiprocessing with tables/dataframes?

My query is as follows:

@from row in df begin
    @where ((row.X < 10 && row.Y > 15) || (row.X > 15 && row.Y < 10)) && (row.O > 10)
    @select row
    @collect DataFrame
end

and it does not seem to be a complex one…

Elvis
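
For comparison, the same predicate can be sketched without any query package as a Boolean mask over plain vectors. The column data below is a hypothetical random stand-in, not the original dataset, and the variable names are chosen only for illustration.

```julia
# Hypothetical stand-in data: three columns of 100,000 random values in [0, 20).
N = 100_000
X = rand(N) .* 20
Y = rand(N) .* 20
O = rand(N) .* 20

# Elementwise version of the @where predicate. The broadcast operators .& and .|
# replace && and ||; every group is parenthesized explicitly.
mask = (((X .< 10) .& (Y .> 15)) .| ((X .> 15) .& (Y .< 10))) .& (O .> 10)

# Indices of the matching rows.
rows = findall(mask)
println(length(rows), " of ", N, " rows match")
```

With a DataFrame, `df[mask, :]` would select the matching rows. A baseline like this can help separate the cost of the predicate itself from the cost of the query machinery.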


#2

JuliaDB has support for parallel computing (see the docs) and you can use the JuliaDB “backend” via macros from JuliaDBMeta. For example your query would look like:

@filter df ((:X < 10 && :Y > 15) || (:X > 15 && :Y < 10)) && (:O > 10)

where df is your IndexedTable.

Already on a single process this should be pretty fast (not sure why Query would be slow here, to be honest; maybe the columns of the dataset are not concretely typed?).
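
One way to check whether columns are concretely typed is to look at their element types. This is a minimal sketch on hypothetical columns held in a NamedTuple, not on the original dataset:

```julia
# Hypothetical columns: X is a Vector{Float64}, Y is a Vector{Any}.
cols = (X = [1.0, 2.0, 3.0], Y = Any[1.0, 2.0, 3.0])

# Report the element type of each column and whether it is concrete.
for (name, col) in pairs(cols)
    T = eltype(col)
    println(name, ": ", T, ", concrete: ", isconcretetype(T))
end
```

For a DataFrame, `eltype.(eachcol(df))` should give the same information. A column reported as `Any` usually means the loader could not infer a type, and row-wise predicates on it tend to be much slower.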


#3

100,000 rows isn’t that large, I’m not sure you need multiprocessing for this. Can you quantify “slow”? Is the second run faster than the first one? The compilation cost may be significant.
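
A rough way to see the compilation cost is to time the same call twice in a fresh session. The sketch below uses a hypothetical `matches` function and random data as stand-ins for the real query:

```julia
# Hypothetical predicate standing in for the query condition.
matches(x, y, o) = ((x < 10 && y > 15) || (x > 15 && y < 10)) && o > 10

# Hypothetical random data of the same size as the dataset in question.
N = 100_000
X, Y, O = rand(N) .* 20, rand(N) .* 20, rand(N) .* 20

# The first call includes compilation; the second reuses the compiled code.
t_first  = @elapsed count(matches.(X, Y, O))
t_second = @elapsed count(matches.(X, Y, O))
println("first run: ", t_first, " s, second run: ", t_second, " s")
```

If the second number is far smaller than the first, most of the delay is compilation rather than the query itself.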


#4

Completely agree, this seems reasonably fast on a single core as well:

julia> using JuliaDB, JuliaDBMeta

julia> N = 100_000

julia> df = table((X=rand(N).*20, Y=rand(N)./20, O=rand(N).*20))
Table with 100000 rows, 3 columns:
X        Y           O
────────────────────────────
14.6601  0.00236458  17.2513
12.6361  0.0322326   18.9277
0.89148  0.0034154   14.4774
12.3102  0.00139073  13.7773
14.6083  0.0474375   19.7088
19.9985  0.0484838   13.7814
12.2598  0.0491725   17.5342
14.135   0.0467684   4.94136
⋮
17.374   0.00520244  2.33823
12.4243  0.0455707   8.35212
15.1256  0.0348024   19.3554
6.90056  0.0466855   17.7509
3.72646  0.0210652   7.7554
2.17125  0.0213715   10.6633

julia> @time @filter df (:X < 10 && :Y > 15) || (:X > 15 && :Y < 10) && (:O > 10) # on the second run
  0.049482 seconds (129.79 k allocations: 6.774 MiB)
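
One caveat about the timed `@filter` expression above: in Julia `&&` binds more tightly than `||`, so without an outer pair of parentheses the condition groups as `A || (B && C)` rather than the `(A || B) && C` of the original Query.jl query. A small sketch with hypothetical Boolean values shows the difference:

```julia
# Hypothetical values chosen so that the two groupings disagree.
a, b, c = true, false, false

without_parens = a || b && c      # parsed as a || (b && c)
with_parens    = (a || b) && c    # grouping used in the original query

println(without_parens)  # true
println(with_parens)     # false
```

This changes which rows are selected, although it is unlikely to change the timing much.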

#5

In the meantime I also came across JuliaDB, which seems to do the job; I ran the same query and it came back within a second.
The columns of the dataset are correctly typed. At the moment I am not sure why the previous one was so slow… However, JuliaDB seems to do the job. Having both IndexedTable and DataFrame seems useful, as DataFrame has useful functions to easily explore/summarize the data.

Thanks


#6

Using JuliaDB for this kind of dataset, which I agree is not big, was equally fast. I did not see a difference between laptop (single core) and server.

Slow means that I actually never even got the answer… the kernel kept running and I had to interrupt it. I tried on multiple machines and reran, but with the same result.


#7

I repeated your example and it is very fast… I also repeated the previous one with Query.jl and it works now… I do not know what actually went wrong. :roll_eyes:


#8

I think that until the latest DataFrames version we printed the full data frame to HTML in Jupyter, IIRC, so the hang may just have been due to displaying the result.
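
If display is the suspect, one way to rule it out in a notebook is to keep the result from being rendered and inspect only a small part of it. This is a sketch on a hypothetical random DataFrame, assuming DataFrames.jl is installed:

```julia
using DataFrames

# Hypothetical stand-in data, not the original dataset.
df = DataFrame(X = rand(100_000) .* 20, O = rand(100_000) .* 20)

# The trailing semicolon keeps the REPL or notebook from displaying the result.
result = df[df.O .> 10, :];

dims = size(result)          # (rows, columns), without rendering the table
preview = first(result, 5)   # only the first five rows
println(dims)
```

If the cell returns promptly this way but hangs when the full result is shown, the query itself is probably not the bottleneck.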


#9

Does that refer back to your original query? I.e. mostly wondering whether this is a regression/performance case I should look into, or whether everything is ok for Query.jl here :slight_smile:


#10

@davidanthoff
No, this is not related to my previous post. Query.jl seems to be working and performing OK now. I need to reproduce the results and provide more data before any action can be taken or an issue raised. Thank you.

@nalimilan,
It might have been a display issue. However, I tried the experiment on my work laptop and on a compute server, and I had the issue on both.