Using an SQLite table as a Query.jl data source (or, best way to left-join a DataFrame or Vector with a very large sqlite table)

I have a vector (of strings) that I want to left-join with an SQLite table (to find an “id” corresponding to each string). The SQLite table is too big to load into memory - but the left join should return only as many rows as my smaller vector of strings. I think Query.jl can do this, but I’m getting stuck trying to use the SQLite table in the query.

I have:

cursor = DBInterface.execute(db, "SELECT * from paths")  # schema has (id INTEGER, path TEXT) 
q = @from p1 in my_paths begin    # my_paths is a Vector{String}
    @left_outer_join p2 in cursor on p1 equals p2.path
    @select {p2.id, p2.path}
    @collect DataFrame
end

Using @allocated, it looks like the DBInterface.execute query here doesn’t immediately load the table into memory (which is good). But my Query.jl block fails with:

The keys in the join clause have different types, String and Any

It seems the SELECT query returns columns of type Any, instead of the types in the SQLite schema

How do I get the SQLite cursor to return more specifically-typed data? Or, get Query.jl to convert the columns in the cursor from Any, Any to Int64, String ?

I think you might be able to use TidierDB.jl to do this via SqLite or duckdb.

If you convert the vector of strings to a df and copy it to your database, you should then be able to do the left join.

1 Like

The easiest is to just do N queries like "select * from paths where path = yourpath1". SQLite is specifically designed to do this performantly, compared to client-server databases.