Graph theory and SQL
I have recently finish a project on the intersection of Graph Theory and SQL. And I would like to share my experience for others.
find a valid usecase to combine traditional SQL datawarehouse with a graph layer. Whereby a valid usecase is defined as one in which the user’s understanding, the user’s reasoning and the user’s interaction is undoubtly superior to the SQL implication of said usercase.
In more laymens terms, I am looking for a situation in which a graph implementation of the data is pleasant to work with and the best SQL implementation of that usecase basically sucks…
Make a proof-of-idea concerning the inner workings needed for an touch-sensitive graph environment which can directly generate valid SQL code.
I thereby explore if there are any algorithmic or engineering blockages that would fundamentally prevent such a graph environment to be built. Let me reiterate, the focus is not on building the interactive environment but to work through the necessary code logic which enables the correctness of the to generate SQL query. Without losing sight of the research scope.
Make a proof-of-idea that has the following:
- Find an upload path that can automatically take in SQL metadata (table names and columns) and put it into a graph.
- Give the user a visually interactive space to add joins of newly uploaded tables in the right graph.
- Calculate the steiner tree of that user input.
- Give a valid SQL query back that can be directly injected into the editor.
1. Find an upload path that can automatically take in SQL metadata and put it into a graph
The testing has been done on a oracle SQL environment in this dialect of SQL there is a query “select * from user_tab_columns where xxx”, which I can use to generate the needed table name and column names. This process can be automated so that the user can provide a valid table name in the Julia environment and the system will generate the specific SQL code. Inject it in the editor take the output format it and place it as a .csv whereby it will be turned into a new star graph where the table name is in the middle and the columns are the satellites. See picture.
• A star graph structure represents a column structure best. (unless someone can give me another better idea).
• Lightgraph and Metagraph works well but sometimes seem very strangely distance from each other, at times you feel that these are distinct packages, but it shouldn’t since you need both to become productive in the graph space in julia.
• I really like the flexibility that Metagraphs give you in building your one business logic however I wish there was a more elegant way to retrieve the information than constantly iterating through the whole graph, memorization is an option but still feels a little like dodging the problem.
2. Give the user a visually interactive space to add joins of newly uploaded tables in the right graph.
This is a tricky one. What I was looking for was a interactive graphs package that is an intelligently stripped down version of neo4j. the package Makie.jl is thus far the closest but still very far off since it does not have lightgraphs implemented. This part of the project just does not existed and should be build up if I feel that the project warrants it (spoiler: I do but not for this usecase).
3. Calculate the steiner tree of that user input.
I thought that this part would be the hardest because because I had to build it. However, there already was an implementation and it works pretty good. Though a little slower than I would like and the output (sparsematrix) is not really handy but can be fixed. Another insight I gained was that because the use needs to manually make the join (automating this will not achieve the accuracy you need) you get a lot of that there will exist a 1-to-many relationship. For instance, you will likely connect an article column to the existing article column in the graph. But if multiple columns are connected to the same existing article column in the graph you can inference that they are all the same. This is important because the steiner algorithm searches for the shortest path. If you do not make those ‘inferenced’ connections explicit you will not get the right solution. See picture, node 18 and 6 are both connected to node 10. Thus there should be an edge between 18 and 6 since they are the same. A specific function adjusts this for the algorithm. Same goes for node 3 and 12 that are connected to node 21.
This seems trivial for 3 connects but it because unruly fast.
• In graph theory you only have the choice between a directed or undirected edge. This is fine until you get duplicates in your outcome if you use undirected edges, since undirected is the only logical way to represent the data (as far as I can see). This leads to annoying duplications which you have to let disappear. This is far from impossible but I wish there was a way to eliminate this redundant data before it is generated.
• To help the algorithm and you sanity it is wise to strip the satellites/leaves of the stargraph and only give the algorithm the tables of the columns you want to join. This also reduces the calculation time.
• I wasn’t planning to build in ‘inference’ and the above is by no means robust enough to be called that but some simple inference is needed to help the algorithm.
4. Give a valid SQL query back that can be directly injected into the editor.
There are 2 parts that you need to be successful here. A) a list of the columns that you want to join. This part is already provided by the user. B) generate a list of all the joins that are needed to make the join structure work. Then you plug these into a template string. The steiner tree takes care of the correct path and secondly understandable metadata. Before I explain what I mean I wouldIt would be much cooler if all the metagraph instantiations would be searchable because they exist.
To give you a feeling of why metagraphs are interesting, consider these two node. These are the only two node that can exist in an SQL represented graph.
:sourcename = this is the oracle notation for a source. E.g. “Table.Column”. If it’s a table I do a self reference but technically there is no :sourcename for tables.
:type = this give the type of node it is. Extremely handy to filter out certain nodes.
:name = the name of the column or the table.
:source = this refers to the table of which a column is connect to. Because the steiner algo calculates from table to table I need a quick reference to that.
And this is from a normal edge and an edge that is actually a join(again the only two that exist):
:joinname = this is either null of it’s an edge between a table and a column. But is it’s a connection between “Column” to “Column” it takes the :sourcename information of the connected nodes and concatenates them. This is needed to make the SQL join.
:connection = shows what kind of connection it is.
As you can see you can build certain logic inside the metadata that will help you build out more logic. You can in theory also use the eval() function in this if you wanted to. Not sure for which application this should be used but it should be possible.
Even though it has passed the proof-of-idea (sort of) I will not pursue this avenue further. This is not because there is no readily available interactive graphing environment. Though that does kinda suck. It is because it does not satisfy the research question. Building touch sensitive graph environment where you can make a join is a fun exercise but is kind of a superfluous added layer on the direct and clear join syntax that you can do in SQL.
Luckily, all is not lost. I learned a lot during this process and can probable use all the part (except mayble the steiner part) for my next proof-of- idea, I talk about this a little later. The most important thing here is that I got to take Julia and Graphs for a serious spin and that really is rewarding in and of itself.
Now what is next? The knowledge that I have gain I want to apply on an existing problem in the datawarehousing world. Simply put its lineage or the-art-of-figuring-out-where-it-comes-from. In most datawarehouses you have the option to add ETL’s. However, what tends to happen is that it becomes a warehouse filled with SQL View of which you can’t directly see which Views are relevant for your current datamodel (in your BI tool).
My next project will be to develop a parse (which is in essence a graph model) that takes an SQL statement and breaks it down. So it makes the relevant SQL Graphs (as discussed above) and does this recursively until you have a lineage chain up until the @datalink which introduces the external table to the data warehouse environment. This is going to be a deep dive and very specific, Thus right up my ally hopefully .
The detail proof-of-idea will come later but I will probably be using an python parser to start with (so pycall) and will be using this book “Pasing Techniques:A Practical Guide (Monographs in Computer Science)”.
If you have any ideas or comments or constructive critic than let me know !!!
In closing, I would like to advice any newby Julia user like me, the book “Design Patterns and best Practices with Julia” by Tom Kwong. He really helped answer some fundamental questions I had concerning Julia and the layers it has. Also if you are interested in graphs and just want to understand more of it “Introduction to graph Theory” by R.J. Trudeau is a funny intro.