SQLite: query with duplicate column names |> DataFrame

Hi guys,

The following query with duplicate column names works however when converting to a DataFrame I have a "not unique" error.

using SQLite
db = SQLite.DB();
Q = q -> DBInterface.execute(db, q);

DataFrame(rand(10,3),:auto) |> SQLite.load!(db, "a", ifnotexists=false);
DataFrame(rand(10,2),:auto) |> SQLite.load!(db, "b", ifnotexists=false);

Q("select * from a join b on a.x1 > b.x2")  |> DataFrame

ERROR: duplicate field name in NamedTuple: "x1" is not unique

This forces me to name each duplicate in the query but I wonder if there is a direct way to rename duplicates in the result. I have tried DataFrame(Q(…); makeunique=true) with no luck and I can’t find anything in the SQLite or DataFrames documentation. Any ideas?

with some metaprogramming tricks - it is not so hard - on the SQLite part

my proposal:

You can query the table columns - with
select name from pragma_table_info('texample1')
so you can replace the * with the column lists

So the expanded variable list (*) can be created :

  • in Julia
  • or with a simple SQL

example - variable lists with 2 tables in sql:

select group_concat(name,',')
from
(
-- columns from table1
select ' texample1.'||name as name from pragma_table_info('texample1')
union all
-- columns from table2 AND not in table1
select ' texample2.'||name as name from pragma_table_info('texample2') where name not in ( select name from pragma_table_info('texample1') )
union all
-- columns from table2 AND in table1 --> renaming 
select ' texample2.'||name||' as prefix_'||name as name from pragma_table_info('texample2') where name in ( select name from pragma_table_info('texample1') )
)
;

with these tables:

CREATE TABLE texample1 (
   t_id INTEGER ,
   name TEXT,
   x1   INTEGER,
   x2   INTEGER
);

CREATE TABLE texample2 (
   t_id INTEGER ,
   category TEXT,
   x1   INTEGER,   
   m1   INTEGER,
   m2   INTEGER
);

the select_variables results - for replacing *
texample1.t_id, texample1.name, texample1.x1, texample1.x2, texample2.category, texample2.m1, texample2.m2, texample2.t_id as prefix_t_id, texample2.x1 as prefix_x1

And your final query will be

select
  -- generated variable list start
      texample1.t_id
    , texample1.name
    , texample1.x1
    , texample1.x2
    , texample2.category
    , texample2.m1
    , texample2.m2
    , texample2.t_id as prefix_t_id
    , texample2.x1   as prefix_x1
  -- generated variable list end
from texample1 join texample2 on texample2.t_id = texample2.t_id
;

IMHO:Sometimes it is not safe to rely on DataFrame for renaming
The next version of SQLite may return the sql result in a different column order, and your results will be different.

so rename in the SQL side !

This is a good point.


I do not have SQLite database at hand to test, but something like this should work:

q = Q(...)
DataFrame([c => Tables.getcolumn(q, i) for (i, c) in enumerate(Tables.columnnames(q))], makeunique=true)

on DataFrames.jl side (we need to use another constructor as DataFrames(Q(...)) just passes the job to Tables.jl which does not have deduplication support).

1 Like

Thanks @ImreSamu and @bkamins

The SQL solution it’s quite elaborated and though I was aware of such possibilities I was truly hoping for a solution within Tables.jl or DataFrames.jl or, worst case scenario, some SQLite command that would allow for deduplication.

Unfortunately the DataFrame solution does not work:

julia> DataFrame([c => Tables.getcolumn(q, i) for (i, c) in enumerate(Tables.columnnames(q))],
                 makeunique=true)
5Γ—5 DataFrame
 Row β”‚ x1                              x2     x3      x1_1                     x2_1                         
     β”‚ Stmt…                           Int32  Symbol  Type                     Dict…                        
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚ Stmt(SQLite.DB(":memory:"), 4)    101  x1      Union{Missing, Float64}  Dict(:x2=>5, :x3=>3, :x1=>4)
   2 β”‚ Stmt(SQLite.DB(":memory:"), 4)    101  x2      Union{Missing, Float64}  Dict(:x2=>5, :x3=>3, :x1=>4)
   3 β”‚ Stmt(SQLite.DB(":memory:"), 4)    101  x3      Union{Missing, Float64}  Dict(:x2=>5, :x3=>3, :x1=>4)
   4 β”‚ Stmt(SQLite.DB(":memory:"), 4)    101  x1      Union{Missing, Float64}  Dict(:x2=>5, :x3=>3, :x1=>4)
   5 β”‚ Stmt(SQLite.DB(":memory:"), 4)    101  x2      Union{Missing, Float64}  Dict(:x2=>5, :x3=>3, :x1=>4)

At this point I am not sure if SQLite.jl has implemented Tables.jl entirely or just part of it and that’s the reason why @bkamins solution fails.

I’ve been going around this for a long time now and I can’t find a solution -other than going full meta-programming in the database which I truly want to avoid. As things stand I do not even know whose phone to ring and I wonder:

  • Should Tables.jl provide deduplication functionality?
  • Is SQLite.jl implementing Tables.jl properly?
  • Is DataFrames.jl and anybody implementing Tables expected to deduplicate Tables.jl results?
  • A combination of the three above?

If you would find it useful please open an issue in Tables.jl.

Yes. I have just checked. You need to make sure q = Tables.columns(Q(...)). That should be enough in my solution.

Currently yes. In general both Tables.jl and DataFrames.jl mostly assume that column names are not duplicated. The reason is that typical tables in Julia (e.g. Dict, NamedTuple or DataFrame) do not allow duplicates by design.

I have checked the source code in DataFrames.jl and we could add makeunique kwarg also to Tables.jl table constructor. Please open an issue if you would find it useful. (note though that as @ImreSamu commented above - this is in general not a good idea to use data with duplicate column names, so normally I think it should be fixed on source side).

1 Like

Why not just create a simple Sqlite3 wrapper view from the select * query - and read from the view.
The Sqlite3 automatically renaming the β€œduplicated” variable names - see β†’ "x1:1"

  • <varname>:1 , <varname>:2
    it is not perfect … but you have a unique column names
    β†’ view3(t_id,name,x1,x2,"t_id:1",category,"x1:1",m1,m2)
sqlite> CREATE VIEW view3 as
   ...> select
   ...>   *
   ...> from texample1 join texample2 
   ...> on texample2.t_id = texample2.t_id
   ...> ;

sqlite> .schema view3
CREATE VIEW view3 as
select
  *
from texample1 join texample2 
on texample2.t_id = texample2.t_id
/* view3(t_id,name,x1,x2,"t_id:1",category,"x1:1",m1,m2) */;
sqlite> 

sqlite> .header on
sqlite> PRAGMA table_info(view3);
cid|name|type|notnull|dflt_value|pk
0|t_id|INTEGER|0||0
1|name|TEXT|0||0
2|x1|INTEGER|0||0
3|x2|INTEGER|0||0
4|t_id:1|INTEGER|0||0
5|category|TEXT|0||0
6|x1:1|INTEGER|0||0
7|m1|INTEGER|0||0
8|m2|INTEGER|0||0
sqlite> 

1 Like

I want to thank you for the effort in your answers. Right now as a workaround I am accessing SQLite functionality via RCall since it’s well integrated in R, a SQlite3 wrapper would be another workaround and, as workaround goes, I am just find accessing SQLite via RCall until better times comes.

Could you please tell me which versions of Tables and SQLite are you using? It still does not work for me ( I am using SQLite v1.1.4 ../../dev/SQLite#master and Tables v1.4.4 on Julia 1.6.2)`

julia> query = Q("select * from a join b on a.x1 > b.x2")
SQLite.Query(SQLite.Stmt(SQLite.DB(":memory:"), 7), Base.RefValue{Int32}(100), [:x1, :x2, :x3, :x1, :x2], Type[Union{Missing, Float64}, Union{Missing, Float64}, Union{Missing, Float64}, Union{Missing, Float64}, Union{Missing, Float64}], Dict(:x2 => 5, :x3 => 3, :x1 => 4))

julia> q = Tables.columns(query)
ERROR: duplicate field name in NamedTuple: "x1" is not unique
Stacktrace:
 [1] (NamedTuple{(:x1, :x2, :x3, :x1, :x2), T} where T<:Tuple)(args::NTuple{5, Vector{Float64}})

Unfortunately every major Database engine I have used do allow for duplicates when joining tables. This is actually a very common scenario. I guess this is why R/data.frame do allow for duplicates too.

For a few columns it would not matter, but sometimes we have dozens or even hundreds of duplicates after a SQL join, if we want to transfer those join results to a DataFrame it would be really useful to have makeunique working for Tables constructors. I’ll open an issue in DataFrames.jl as you suggest.

I do not use SQLite - that is why I had problem with reproducing. The fact that:

julia> q = Tables.columns(query)
ERROR: duplicate field name in NamedTuple: "x1" is not unique

errors clearly indicates that Tables.jl does not allow duplicates in this case.

used do allow for duplicates when joining tables.

Yes - joins are problematic. In DataFrames.jl we provide an option to de-duplicate automatically.


In conclusion - given your comments we should fix it in DataFrames.jl. I will open a PR to discuss there.

1 Like

That’d be great! Thank you!

I have opened Add makeunique to Tables.jl DataFrame constructor by bkamins Β· Pull Request #2818 Β· JuliaData/DataFrames.jl Β· GitHub. Can you please comment there? (preferably with a reproducible example so that we can test it). Thank you!

3 Likes

Sorry for the slow response here; I’ve been really swamped w/ things lately. I don’t think changing things in Tables.jl/DataFrames.jl is the right way to go here; it would make those cases specifically more flexible, but there’s a whole host of sources/sinks out there that would never catch up and the inconsistency in whether makeunique was supported would cause more problems than it solves, IMO.

I think the ultimate solution here is to fix SQLite.jl to ensure queries always return a list of unique column names by default, with an option to return the β€œraw” column names if desired (but that obviously wouldn’t work to put in a DataFrame).

I’ve opened an issue here: Ensure unique column names in query result by default Β· Issue #253 Β· JuliaDatabases/SQLite.jl Β· GitHub.

1 Like

Thank you @quinnj! That would work, however, I found this workaround for SQLite and works just fine with DataFrames.jl (@bkamins), perhaps would be best to keep the default SQLite behavior.

julia> t = Q("select * from (select * from a join b on a.x1 > b.x2)") |> DataFrame
56Γ—5 DataFrame
 Row β”‚ x1         x2        x3         x1:1       x2:1      
     β”‚ Float64    Float64   Float64    Float64    Float64   
─────┼──────────────────────────────────────────────────────
   1 β”‚ 0.569232   0.677381  0.637279   0.145199   0.345888
   2 β”‚ 0.569232   0.677381  0.637279   0.507727   0.217739
   [...]

I understand the reasons not to allow makeunique in Tables.jl, would it be possible though to allow the renaming of columns for a SQLite.Query or in general any type implementing Tables interface?

julia> t.names = [:a,:b,:c,:d,:e]
ERROR: setfield! immutable struct of type Query cannot be changed
1 Like

Side question: Is there a sqlite> REPL mode? If so, how to access it? (and why isn’t it mentioned in the docs?)

my examples were based on sqlite3 ; a simple command line shell
( or sqlite3.exe on Windows )

user@ubuntu:~$ sqlite3
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> 

The Sqlitebrowser tool is also my favorite: https://sqlitebrowser.org/

Alright, gotcha. I just saw the sqlite> and thought REPL mode :grinning:

1 Like