Hi, all. New to Julia and trying to set up a way to run queries on SQL Server and store the results in a DuckDB database. I’ve got the basic program down: it reads all of the SQL files from a directory, runs each one in turn, and saves the results to DuckDB. I believe it works, but the structure seems a bit “clunky”. Since I’m new to Julia I don’t know all of the ins and outs of the language yet, and I’d like some advice on whether the attached logic can be improved.
Basics:
Read SQL Server script files from a directory.
Execute each, saving to DuckDB.
MultiSQLRun.jl (4.6 KB)
Thanks!
If this spends a long time waiting on SQL queries and DuckDB writes, you might try doing your queries and writes asynchronously: Asynchronous Programming · The Julia Language
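As a rough illustration (not the attached script itself), here is one way to overlap each DuckDB write with the next query using a bounded Channel; run_query, save_to_duck, and sql_files are hypothetical stand-ins for the logic you already have:

using DataFrames   # ODBC/DuckDB connections assumed to be set up elsewhere

results = Channel{Tuple{String,DataFrame}}(1)   # buffer of 1 keeps memory bounded

# Producer task: run the SQL Server queries one at a time
producer = Threads.@spawn begin
    for file in sql_files                       # hypothetical list of .sql paths
        put!(results, (file, run_query(read(file, String))))   # hypothetical helper
    end
    close(results)
end

# Consumer: write each result to DuckDB while the next query runs
for (file, df) in results
    save_to_duck(file, df)                      # hypothetical helper
end
wait(producer)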
I’ve thought of that, but haven’t tried it yet. I’m limited on memory (16 GB system), so that’s a worry. I’ve watched the memory usage go up to 3 GB for one query as a test, and I’m not sure the machine can handle all of the queries running at the same time, or even multiple if I limit it to just a few. It’s on my to-do list to test out. My current test has 7 query scripts to run, but eventually there will be 20 or more, if I can get Julia up and running without issues.
# Iterate the cursor, keeping only the last result, and build a single DataFrame at the end
lastresult = nothing
for result in results
    lastresult = result
end
df = lastresult !== nothing ? DataFrame(lastresult) : nothing
To avoid the intermediate garbage DataFrames.
It might be simpler to use an extension and connect to SQL Server directly within DuckDB without going through Julia, but I’m not sure if an extension exists that will work in your setup. Then you could issue queries to copy the data, or query it as a view, without having to copy the data through Julia.
If you need to read/write to DuckDB concurrently with many tasks, it’s very easy in Julia, and an example is explained here.
The latest version of DuckDB, 0.10, has the ability to set a temp directory to spill to disk in case a query exceeds your memory. If 0.10 has not been released to the package manager yet, you’ll have to clone and build the C++ repo to get the latest package.
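For example, something along these lines should work once you have that version (the file name, limit, and spill path below are just placeholders; the settings are DuckDB’s memory_limit and temp_directory options):

using DuckDB, DBInterface

db = DuckDB.DB("results.duckdb")                              # example file name
DBInterface.execute(db, "SET memory_limit = '8GB'")           # cap DuckDB's own memory use
DBInterface.execute(db, "SET temp_directory = 'duck_spill'")  # spill to disk past the limit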
Also, if you want, you could try using the Appender API to write to DuckDB instead of registering a DataFrame. That would allow you to iterate over a stream of SQL Server results and append to DuckDB; see the sketch below.
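A minimal sketch of the Appender API as documented for DuckDB.jl (the table name and values are just examples):

using DuckDB, DBInterface

db = DuckDB.DB("results.duckdb")
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS results(id INTEGER, value DOUBLE)")

app = DuckDB.Appender(db, "results")        # append rows without building a DataFrame
for (id, value) in ((1, 1.5), (2, 2.5))     # stand-in for a stream of query rows
    DuckDB.append(app, id)
    DuckDB.append(app, value)
    DuckDB.end_row(app)
end
DuckDB.close(app)                           # flushes any buffered rows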
Thanks! I’ll add that logic in there. I hadn’t thought of it.
Do the intermediate garbage DataFrames I had in there cause a lot of overhead?
Thanks for the advice, rdavis120.
I’ve looked for an extension for SQL Server; it doesn’t look like there is one yet. All that I’ve seen to connect to SQL Server so far are ODBC calls. That’s fine, since that’s what I’m used to doing already, but I would like something a bit faster.
I do have the latest DuckDB release, so I’ll be looking into the temp directory option. However, since the queries aren’t being run through DuckDB, I don’t think this will help with the memory usage prior to getting the data into Duck.
I’ll take a look at the DuckDB concurrency page you gave and the Appender API. The former is something I was planning to test when I try out async queries, but I wasn’t sure I’d get any benefit from the Appender API. Do you think it would reduce the memory footprint?
If the result set is 3 GB, and you are creating a DataFrame for each row but only retaining the last row’s DataFrame, then that’s a lot of allocation and garbage collection for no reason. Those intermediate DataFrames don’t reuse memory.
From the attached code:
GC.gc(); GC.gc(); GC.gc()
Manually calling the GC is rarely necessary, and when you do want to do it, one call to GC.gc() is enough. Running it more than once will not release any more memory, unless you have some code that allocates memory between the GC calls.
I’m not familiar with the Julia ODBC package, but since it supports Tables.jl you should be able to iterate over the ODBC query result cursor and write to DuckDB using the Appender API inside the results loop, so you never materialize a DataFrame.
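Roughly like this, building on the appender sketch above; the DSN, file names, and table name are placeholders, and the target table is assumed to already exist with a matching schema:

using ODBC, DuckDB, DBInterface, Tables

mssql = DBInterface.connect(ODBC.Connection, "DSN=MySqlServerDsn")  # placeholder DSN
db    = DuckDB.DB("results.duckdb")

cursor = DBInterface.execute(mssql, read("query1.sql", String))
app = DuckDB.Appender(db, "query1")           # target table assumed to exist
for row in Tables.rows(cursor)                # stream rows; no DataFrame materialized
    for name in Tables.columnnames(row)
        DuckDB.append(app, Tables.getcolumn(row, name))
    end
    DuckDB.end_row(app)
end
DuckDB.close(app)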
If you could partition your dataset and issue a task for each partition, it should be faster, since it allows some tasks to query while others are appending.
There is an option in the duckdb configuration to preserve the insertion order, and if you turn that off the writes will be faster especially with concurrent writers.
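Turning that off is a one-line setting on the connection (db here being the DuckDB database handle from the earlier sketches):

DBInterface.execute(db, "SET preserve_insertion_order = false")  # faster concurrent writes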
Good info. I’ll see if I can figure that out in my next testing steps.
Yeah, I get it. I only did it this way after seeing more than a few lines like this in forums when looking up questions about freeing memory.
If your data were in Postgres or CSV, it would be easy to just use an extension and process the data without ever having to go through Julia. But in this case, where you want to process through Julia, it would be great if you could report back on how it scales with Julia tasks, because I haven’t found the upper limit of scaling Julia async tasks for processing data through DuckDB; I’ve found that you can easily partition your data by day and spawn 10,000 tasks, each with its own connection.
Great to hear about the scaling. I’ll let you know once I actually get to the task testing.