@stillyslalom
Nice to meet you, I have seen your name on GitHub. I would like to defend the need for my software and database here.
The first thing to mention is my data structure, which is perhaps different from yours. I might need you to tell me more about the components of your software.
    self.network[root_node] = {
        "node_value": Node(
            root_node,
            self.__NODES__[root_node].get_smiles(),
            self.__NODES__[root_node].get_smarts()
        ),
        "children": [],
        "parents": [],
        "name": root_node
    }
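To make the parent/child bookkeeping concrete, here is a minimal sketch of how entries like the one above could be linked together. It is not the actual implementation: the `Node` dataclass, the `Network` class, the `add_node` and `descendants` helpers, and the node names are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for the Node class used in the snippet above."""
    name: str
    smiles: dict = field(default_factory=dict)
    smarts: dict = field(default_factory=dict)


class Network:
    """Hypothetical container holding the same dictionary layout."""

    def __init__(self):
        self.network = {}

    def add_node(self, name, smiles=None, smarts=None, parent=None):
        # Same layout as the entry above: value, children, parents, name.
        self.network[name] = {
            "node_value": Node(name, smiles or {}, smarts or {}),
            "children": [],
            "parents": [],
            "name": name,
        }
        if parent is not None:
            # Each node records only its direct neighbors (one level).
            self.network[parent]["children"].append(name)
            self.network[name]["parents"].append(parent)

    def descendants(self, name):
        # Deeper levels are reached by walking the one-level links.
        found = []
        stack = list(self.network[name]["children"])
        while stack:
            current = stack.pop()
            found.append(current)
            stack.extend(self.network[current]["children"])
        return found


# Illustrative node names only.
graph = Network()
graph.add_node("global_chem")
graph.add_node("medicinal_chemistry", parent="global_chem")
graph.add_node(
    "rings_in_drugs",
    smiles={"benzene": "c1ccccc1"},
    parent="medicinal_chemistry",
)
```

Because each entry stores only its immediate parents and children, anything further away has to be found by traversal, as `descendants` does here.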
Data Selection Philosophy
Under the hood, the node structure looks like this: each node holds a list of children and a list of parents, so it is aware of its neighbors, but only one level up or down. Each node is a resource named for its relevance. For example, “RingsInDrugs” contains the most popular ring systems that passed FDA phase 3 trials. I redrew each ring in ChemDraw, obtained the SMILES, and then curated them according to a set of rules until they made sense and were readable to someone like me, an organic chemist. I cannot read CAS numbers, InChI, or anything like that, but I can write SMILES. I then reached out to a lot of graduate student friends in academia who have been studying as long as I have, and we pooled the most relevant functional groups for our respective fields. We wrote the SMILES manually from the papers, checking them as we went and getting better at it. Along the way we started finding a common language, a kind of slang, that the general audience preferred and that resonated with people.
Food, Education, Environment, Drug Formulations, Interstellar Space, Materials, Medicinal Chemistry, Peptides, War, Cannabis, Lubricants, Sexual Enhancements, and moving into Makeup, Vegan Meat, Pre-workout Supplements, etc.
That is how our data is constructed: manually. Since then we have started to expand into industries, because the data turned into products, ingredients, and so on, reaching a general audience. We posted on Reddit and other social media about things relevant to where I work, which is a trauma hospital, and we learned what was relevant for people to record.
Table 1 outlines the data.
I work in force field development, more specifically on Lennard-Jones parameters. One of the force fields I help maintain is the CHARMM General Force Field (CGenFF).
We wrote a lot of tools, including data visualization, to help us select compounds that could improve the chemical space coverage of our force field.
How we arrived at our selection is described in an open-source paper that we are writing together.
Table 2 shows all our features.
https://sulstice.gitbook.io/globalchem-your-chemical-graph-network/
Here is some software documentation to show all the features implemented.
@longemen3000
I didn’t do web scraping. I have a commit history showing my contributions over the years. It has taken me a long time because most of the source files were PDF documents or tables that I had to extract into a format that is readable and can be queried easily.
SMILES can be unique, and that is the idea behind canonicalization; see Morgan’s original algorithms on bond order perception. I have some slides somewhere that I can share with you for a more simplified version, since his work can be dense for a newcomer. I know it well. The first figure he talks about is perception.
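As a rough illustration of why a canonical form is possible, the sketch below implements only the extended-connectivity step commonly associated with Morgan’s algorithm: every atom starts with its number of neighbors, and values are repeatedly replaced by the sum of the neighbors’ values until the number of distinct values stops growing. The result depends on the molecular graph, not on the order in which the atoms were written down. This is not a full canonical SMILES generator and not code from my package; the function name and the pentane adjacency lists are assumptions made for the example.

```python
def extended_connectivity(adjacency):
    """Return a dict mapping each atom index to its extended-connectivity value.

    adjacency maps an atom index to the list of its neighbor indices
    (heavy atoms only, hydrogens left implicit).
    """
    # Start from each atom's degree (number of neighbors).
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # New value of an atom = sum of its neighbors' current values.
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_values.values()))
        if new_classes <= n_classes:
            # No further discrimination gained: keep the previous values.
            return values
        values, n_classes = new_values, new_classes


# n-pentane (CCCCC), atoms numbered along the chain.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# The same chain with the atoms numbered in a scrambled order: 2-0-4-1-3.
pentane_relabeled = {2: [0], 0: [2, 4], 4: [0, 1], 1: [4, 3], 3: [1]}

values_a = extended_connectivity(pentane)
values_b = extended_connectivity(pentane_relabeled)
```

Both numberings give the same multiset of values, and the central carbon gets the highest value in each, which is the kind of labeling-independent ranking a canonical atom ordering is built on.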
I also keep track of other databases that feed into ours.
I think it is worth taking a closer look at everything that has been implemented, including connections to a lot of different software packages that were no longer maintained, which we restored and then distributed.
The full database can be found here in a TSV file with the relationships mapped out.
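For anyone who wants to query the relationships directly, a tab-separated file can be read with the Python standard library alone. The sketch below is only an assumption about the layout: the `parent` and `child` column names and the sample rows are hypothetical, so adjust them to the actual header of the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with assumed column names "parent" and "child".
SAMPLE_TSV = (
    "parent\tchild\n"
    "global_chem\tmedicinal_chemistry\n"
    "medicinal_chemistry\trings_in_drugs\n"
)


def load_relationships(handle):
    """Build child and parent lookups from a tab-separated file object."""
    children = defaultdict(list)
    parents = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        children[row["parent"]].append(row["child"])
        parents[row["child"]].append(row["parent"])
    return dict(children), dict(parents)


# With a real file, pass open(path, newline="") instead of StringIO.
children, parents = load_relationships(io.StringIO(SAMPLE_TSV))
```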
Business-wise, we are hoping to move into the Linux Foundation as our industry and academic network grows.
What do you think?