Speeding Up Query with a big dataset

elshera · January 11, 2019, 9:37am

Hi,

I am currently processing a dataset of 31 columns and 100,000 rows. Loading the data, describing it, etc. is fine, but running a query to match some rows is very slow. 100K rows is big, but still not very big (I might be mistaken here).
At first I thought the latency came from running in Jupyter Notebook and/or JupyterLab, but then I ran a *.jl script and it is still the same.
I then started Julia as julia -p 16 (on a server with 16 cores) and ran the script, but it is still very slow.

I am currently using Query.jl for my queries, but I assume it is going to be the same with other packages (I need to check).

I have not played with multiprocessing in Julia yet (I thought it would do a bit of magic for me automatically)…

Does anybody have a solution for using multiprocessing with tables/dataframes?

My query is as follows:

@from row in df begin
    @where ((row.X < 10 && row.Y > 15) || (row.X > 15 && row.Y < 10)) && (row.O > 10)
    @select row
    @collect DataFrame
end

and it does not seem to be a complex one…

Elvis

piever · January 11, 2019, 11:59am

JuliaDB has support for parallel computing (see the docs) and you can use the JuliaDB “backend” via macros from JuliaDBMeta. For example your query would look like:

@filter df ((:X < 10 && :Y > 15) || (:X > 15 && :Y < 10)) && (:O > 10)

where df is your IndexedTable.
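
A note on grouping when translating the Query.jl condition: in Julia, && binds more tightly than ||, so a || b && c parses as a || (b && c). The original query wraps the whole || expression in parentheses before applying the O > 10 test, so those outer parentheses are needed for both filters to select the same rows. A minimal sketch in plain Julia (the values below are made up for illustration and are not from the actual dataset):

```julia
# Hypothetical row values chosen so that the two groupings disagree.
X, Y, O = 5.0, 20.0, 3.0

a = X < 10 && Y > 15     # true
b = X > 15 && Y < 10     # false
c = O > 10               # false

without_parens = a || b && c      # parsed as a || (b && c)
with_parens    = (a || b) && c    # grouping used in the original Query.jl query

println(without_parens)  # true
println(with_parens)     # false
```

A row with a small O value passes the first form but not the second, so the two versions can return different row counts.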

Already on a single process this should be pretty fast (not sure why Query would be slow here to be honest, maybe the columns of the dataset are not concretely typed?).
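
One way to check the "concretely typed" point is to look at each column's element type. The sketch below uses plain vectors with made-up values rather than a real table; for a DataFrame, something like eltype.(eachcol(df)) should give the same information in recent DataFrames versions (treat that as an assumption and check the docs for your version).

```julia
# Made-up columns for illustration: same values, different element types.
typed_col   = [1.5, 12.0, 17.25]      # Vector{Float64}
untyped_col = Any[1.5, 12.0, 17.25]   # Vector{Any}, e.g. after a lossy import

println(eltype(typed_col), " concrete? ", isconcretetype(eltype(typed_col)))
println(eltype(untyped_col), " concrete? ", isconcretetype(eltype(untyped_col)))

# Broadcasting identity narrows the element type to the tightest common one,
# which recovers a concretely typed column when all values share a type.
fixed_col = identity.(untyped_col)
println(eltype(fixed_col))
```

Columns with element type Any force dynamic dispatch on every row comparison, which is a common reason for a simple filter to be much slower than expected.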

nalimilan · January 11, 2019, 12:49pm

100,000 rows isn’t that large, I’m not sure you need multiprocessing for this. Can you quantify “slow”? Is the second run faster than the first one? The compilation cost may be significant.
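
To separate compilation cost from actual run time, the same call can be timed twice. This is only a sketch in plain Julia: matches and count_matches are hypothetical stand-ins for the query, and the columns are random data rather than the real dataset.

```julia
# Hypothetical stand-in for the query predicate.
matches(x, y, o) = ((x < 10 && y > 15) || (x > 15 && y < 10)) && (o > 10)

# Count the rows that satisfy the predicate across three equally long columns.
count_matches(xs, ys, os) = count(i -> matches(xs[i], ys[i], os[i]), eachindex(xs))

N = 100_000
X, Y, O = rand(N) .* 20, rand(N) .* 20, rand(N) .* 20

t_first  = @elapsed count_matches(X, Y, O)   # includes JIT compilation
t_second = @elapsed count_matches(X, Y, O)   # runs already compiled code

println("first run:  ", t_first, " s")
println("second run: ", t_second, " s")
```

If the second number is far smaller than the first, most of the perceived slowness is one-off compilation rather than the query itself.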

piever · January 11, 2019, 4:11pm

Completely agree, this seems reasonably fast on a single core as well:

julia> using JuliaDB, JuliaDBMeta

julia> N = 100_000

julia> df = table((X=rand(N).*20, Y=rand(N)./20, O=rand(N).*20))
Table with 100000 rows, 3 columns:
X        Y           O
────────────────────────────
14.6601  0.00236458  17.2513
12.6361  0.0322326   18.9277
0.89148  0.0034154   14.4774
12.3102  0.00139073  13.7773
14.6083  0.0474375   19.7088
19.9985  0.0484838   13.7814
12.2598  0.0491725   17.5342
14.135   0.0467684   4.94136
⋮
17.374   0.00520244  2.33823
12.4243  0.0455707   8.35212
15.1256  0.0348024   19.3554
6.90056  0.0466855   17.7509
3.72646  0.0210652   7.7554
2.17125  0.0213715   10.6633

julia> @time @filter df ((:X < 10 && :Y > 15) || (:X > 15 && :Y < 10)) && (:O > 10) # on the second run
  0.049482 seconds (129.79 k allocations: 6.774 MiB)

elshera · January 11, 2019, 8:30pm

In the meantime I also came across JuliaDB, which seems to do the job: I ran the same query and it came back within a second.
The columns of the dataset are correctly typed. At the moment I am not sure why the previous attempt was so slow. However, JuliaDB seems to do the job. Having both IndexedTable and DataFrame seems useful, as DataFrame has handy functions to easily explore/summarize the data.

Thanks

elshera · January 11, 2019, 8:33pm

With JuliaDB, this kind of dataset, which I agree is not big, was equally fast everywhere: I did not notice a difference between the laptop (single core) and the server.

"Slow" means that I actually never got an answer… the kernel kept running and I had to interrupt it. I tried on multiple machines and reran it, but with the same result.

elshera · January 11, 2019, 8:45pm

I repeated your example and it is very fast… I also reran the previous one with Query.jl and it works now… I do not know what actually went wrong.

nalimilan · January 11, 2019, 9:43pm

I think until the last DataFrames version we printed the full data frame to HTML in Jupyter IIRC, so the hang may just have been due to displaying the result.

davidanthoff · January 11, 2019, 9:57pm

Does that refer back to your original query? I.e. mostly wondering whether this is a regression/performance case I should look into, or whether everything is ok for Query.jl here

elshera · January 12, 2019, 9:01am

@davidanthoff
No, this is not related to my previous post. Query.jl seems to be working and performing OK now. I need to reproduce the results and provide more data before an issue can be raised. Thank you.

@nalimilan,
It might have been a display issue. However, I tried the experiment on my work laptop and on a compute server, and I had the issue on both.
