Hi, all. New to Julia and trying to set up a way to run queries on SQL Server and store the results in a DuckDB database. I’ve got the basic program down: it reads all of the SQL files from a directory, runs each one in turn, and saves the results to DuckDB. I believe it works, but the structure seems a bit “clunky”. Since I’m new to Julia I don’t know all of the ins and outs of the language yet, and I’d like some advice on whether the attached logic can be improved.
Basics:
Read SQL Server script files from a directory.
Execute each, saving to DuckDB.
MultiSQLRun.jl (4.6 KB)
Thanks!
If this spends a long time waiting on SQL queries and DuckDB writes, you might try doing your queries and writes asynchronously: Asynchronous Programming · The Julia Language
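As a rough illustration (not the attached script itself), here is one way to overlap each DuckDB write with the next query using a bounded Channel; run_query, save_to_duck, and sql_files are hypothetical stand-ins for the logic you already have:

using DataFrames   # ODBC/DuckDB connections assumed to be set up elsewhere

results = Channel{Tuple{String,DataFrame}}(1)   # buffer of 1 keeps memory bounded

# Producer task: run the SQL Server queries one at a time
producer = Threads.@spawn begin
    for file in sql_files                       # hypothetical list of .sql paths
        put!(results, (file, run_query(read(file, String))))   # hypothetical helper
    end
    close(results)
end

# Consumer: write each result to DuckDB while the next query runs
for (file, df) in results
    save_to_duck(file, df)                      # hypothetical helper
end
wait(producer)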
I’ve thought of that, but haven’t tried it yet. I’m limited on memory (16 GB system), so that’s a worry. I’ve watched the memory usage go up to 3 GB for one query as a test, and I’m not sure the machine can handle all of the queries running at the same time, or even multiple if I limit it to just a few. It’s on my to-do list to test out. My current test has 7 query scripts to run, but eventually there will be 20 or more, if I can get Julia up and running without issues.
# Iterate the cursor, keeping only the last result, and build a single DataFrame at the end
lastresult = nothing
for result in results
    lastresult = result
end
df = lastresult !== nothing ? DataFrame(lastresult) : nothing
To avoid the intermediate garbage DataFrames.
It might be simpler to use an extension and connect to SQL Server directly within DuckDB without going through Julia, but I’m not sure if an extension exists that will work in your setup. Then you could issue queries to copy the data, or query it as a view, without having to copy the data through Julia.
If you need to read/write to DuckDB concurrently with many tasks, it’s very easy in Julia, and an example is explained here.
The latest version of DuckDB, 0.10, has the ability to set a temp directory to spill to disk in case a query exceeds your memory. If 0.10 has not been released to the package manager yet, you’ll have to clone and build the C++ repo to get the latest package.
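For example, something along these lines should work once you have that version (the file name, limit, and spill path below are just placeholders; the settings are DuckDB’s memory_limit and temp_directory options):

using DuckDB, DBInterface

db = DuckDB.DB("results.duckdb")                              # example file name
DBInterface.execute(db, "SET memory_limit = '8GB'")           # cap DuckDB's own memory use
DBInterface.execute(db, "SET temp_directory = 'duck_spill'")  # spill to disk past the limit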
Also, if you want, you could try using the Appender API to write to DuckDB instead of registering a DataFrame. That would allow you to iterate over a stream of SQL Server results and append to DuckDB; see the sketch below.
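A minimal sketch of the Appender API as documented for DuckDB.jl (the table name and values are just examples):

using DuckDB, DBInterface

db = DuckDB.DB("results.duckdb")
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS results(id INTEGER, value DOUBLE)")

app = DuckDB.Appender(db, "results")        # append rows without building a DataFrame
for (id, value) in ((1, 1.5), (2, 2.5))     # stand-in for a stream of query rows
    DuckDB.append(app, id)
    DuckDB.append(app, value)
    DuckDB.end_row(app)
end
DuckDB.close(app)                           # flushes any buffered rows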
Thanks! I’ll add that logic in there. I hadn’t thought of it.
Do the intermediate garbage DataFrames I had in there cause a lot of overhead?
Thanks for the advice, rdavis120.
I’ve looked for an extension for SQL Server; it doesn’t look like there is one yet. All that I’ve seen to connect to SQL Server so far are ODBC calls. That’s fine, since that’s what I’m used to doing already, but I would like something a bit faster.
I do have the latest DuckDB release, so I’ll be looking into the temp directory option. However, since the queries aren’t being run through DuckDB, I don’t think this will help with the memory usage prior to getting the data into Duck.
I’ll take a look at the DuckDB concurrency page you gave and the Appender API. The former is something I was planning to test when I try out async queries, but I wasn’t sure I’d get any benefit from the Appender API. Do you think it would reduce the memory footprint?
If the result set is 3 GB, and you are creating a DataFrame for each row but only retaining the last row’s DataFrame, then that’s a lot of allocation and garbage collection for no reason. Those intermediate DataFrames don’t reuse memory.
From the attached code:
GC.gc(); GC.gc(); GC.gc()
Manually calling the GC is rarely necessary, and when you do want to do it, one call to GC.gc() is enough. Running it more than once will not release any more memory, unless you have some code that allocates memory between the GC calls.
I’m not familiar with the Julia ODBC package, but since it supports Tables.jl you should be able to iterate over the ODBC query result cursor and write to DuckDB using the Appender API inside the results loop, so you never materialize a DataFrame.
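Roughly like this, building on the appender sketch above; the DSN, file names, and table name are placeholders, and the target table is assumed to already exist with a matching schema:

using ODBC, DuckDB, DBInterface, Tables

mssql = DBInterface.connect(ODBC.Connection, "DSN=MySqlServerDsn")  # placeholder DSN
db    = DuckDB.DB("results.duckdb")

cursor = DBInterface.execute(mssql, read("query1.sql", String))
app = DuckDB.Appender(db, "query1")           # target table assumed to exist
for row in Tables.rows(cursor)                # stream rows; no DataFrame materialized
    for name in Tables.columnnames(row)
        DuckDB.append(app, Tables.getcolumn(row, name))
    end
    DuckDB.end_row(app)
end
DuckDB.close(app)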
If you could partition your dataset and issue a task for each partition, it should be faster, since it allows some tasks to query while others are appending.
There is an option in the duckdb configuration to preserve the insertion order, and if you turn that off the writes will be faster especially with concurrent writers.
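Turning that off is a one-line setting on the connection (db here being the DuckDB database handle from the earlier sketches):

DBInterface.execute(db, "SET preserve_insertion_order = false")  # faster concurrent writes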
Good info. I’ll see if I can figure that out in my next testing steps.
Yeah, I get it. I only did it this way after seeing more than a few lines like this in forums when looking up questions about freeing memory.
If your data were in Postgres or CSV, it would be easy to just use an extension and process the data without ever having to go through Julia. But in this case, where you want to process through Julia, it would be great if you could report back on how it scales with Julia tasks, because I haven’t found the upper limit of scaling Julia async tasks for processing data through DuckDB; I’ve found that you can easily partition your data by day and spawn 10,000 tasks, each with its own connection.
Great to hear about the scaling. I’ll let you know once I actually get to the task testing.