Porting Python Global-Chem Knowledge Graph into Julia

Hi Y’all,

My name is Sul and I am a PhD Student at the University of Maryland, Baltimore. We are writing the molecules to common chemical names that the general public use and keeping a record for all of us to download and maintain together. We can all manage it as an open source governing community and use tools to declare whether these chemicals are safe or not for general use.

The back-end has no dependencies, and I could see porting it to Julia being a four-month project (with testing), given that we would have to learn the platform. A lot of folks use it more than the front-end. That would be 10 to 20K. This part has a range of 7K-11K downloads per month.

The front-end carries all the layering from 1106 different software dependencies, all different restored scientific Python packages that we absorbed. It would take a year or so to rewrite the code. That’s a decent amount of work, about 150K, since we would have to include other features from other software. This part has a range of 5K-7K downloads per month.

I think this would help link our two communities as long as we are using the same data regardless of the software implementation.

To finance this project, perhaps it would be best to go through Google Summer of Code and have a Julia engineer work with both communities, including myself, on the port. Or perhaps Julia would be interested in sponsoring it through GitCoin.

Website: https://www.globalchemistry.org/
Grant: Global-Chem: A Dictionary of Chemical Names to Molecules | Grants | Gitcoin

What does the community think?

Thank you,
-Sul


I’ve read your post, gone to the website you cited, found the same wording there, and, to be honest, understood nothing.

We are writing the molecules to common chemical names that the general public use and keeping a record for all of us to download and maintain together. We can all manage it as an open source governing community and use tools to declare whether these chemicals are safe or not for general use.

Could you perhaps translate this text into simple English?


Hmm, I should maybe change it up a bit. I keep rephrasing this.

We wrote a dictionary of common chemical names to molecules like:

Vitamin C: C(C(C1C(=C(C(=O)O1)O)O)O)O
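A minimal sketch of what such a name-to-SMILES dictionary could look like in Python. The variable and function names here are hypothetical and are not taken from the Global-Chem code; both SMILES strings are the ones quoted in this thread.

```python
# Hypothetical illustration: a plain dictionary from common names to SMILES.
# The names COMMON_NAMES_TO_SMILES and lookup_smiles are assumptions.
COMMON_NAMES_TO_SMILES = {
    "vitamin c": "C(C(C1C(=C(C(=O)O1)O)O)O)O",
    "glycerol": "C(C(CO)O)O",
}


def lookup_smiles(name):
    """Return the SMILES for a common name, or None if it is not recorded."""
    # Normalize case and surrounding whitespace so "Vitamin C" matches "vitamin c".
    return COMMON_NAMES_TO_SMILES.get(name.strip().lower())


print(lookup_smiles("Vitamin C"))
```

Lower-casing the key on lookup is one possible design choice; it keeps everyday spellings such as "Vitamin C" and "vitamin c" pointing at the same entry.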

We then organized it into a knowledge graph, a “pythonic” database that’s built to scale and is downloadable. To analyze the data, we made a bunch of software tools and methods so that everyone can use this data for various purposes.

We then put it on GitHub and open sourced it all. Chemical data relevant to different industries isn’t very accessible, especially to the general public. We hope to establish a governing aspect to the data as we add more lists, aggregate more data, and build a community.

Does that make sense so far?

Thank you, it is more understandable now, or at least looks more understandable to me. So, you are building a dictionary of common names to - to what? To molecules (or, I’d prefer to say, to chemical compounds)? Or to substances (which is not the same)? Only names, or anything else?

Now that you have given a specific example, we can use it.

For me, Vitamin C is ascorbic acid. If I check Wikipedia, I find, first and foremost, the CAS No. 50-81-7, a bunch of other codes I know nothing about, and a lot of physical, chemical, and medico-biological information about the compound. The data is quite accessible to the general public, though I’m afraid most laymen lack the necessary education to understand it properly.

So my questions: What kind of data do you have for, e.g., ascorbic acid? What kind of safety data are you going to provide that is not already easily available? Why is your database easier “for general public use” than Wikipedia or, say, SigmaAldrich? Where is the data coming from? You say on your webpage that you currently have 3506 chemicals recorded - that is not really a lot. In short - what is the added value of your project?

And one more thing - could you perhaps translate the following phrase, too?


It’s worth mentioning the existence of @longemen3000’s ChemicalIdentifiers.jl, which performs identifier-to-CAS search for about 20k chemicals using Caleb Bell’s database. My own package PyThermo.jl offers access to the same database (and safety info) via PythonCall.jl:

julia> using PyThermo

julia> acetone = Species("Acetone")
Species(Acetone, 298.1 K, 1.013e+05 Pa)

julia> acetone.Carcinogen
PythonCall.PyDict{PythonCall.Py, PythonCall.Py} with 2 entries:
  'International Agency for Research on Cancer'            => 'Unlisted'
  'National Toxicology Program 13th Report on Carcinogens' => 'Unlisted'

julia> acetone.STEL # short-term exposure limit
(750.0, "ppm")

julia> acetone.LFL # lower flammability limit
0.025

Looking things up in chemical databases… is harder than it seems.
One of the main reasons that CAS numbers, PubChem CIDs, and similar identifiers were created was to uniquely identify a molecule. Those two indexes are released by organizations, and their availability for new compounds depends on those organizations assigning new identifiers.

The other option is to use SMILES (with SMARTS as an extension) or InChI. Those have their own issues: for large molecules the string length is prohibitive, and SMILES aren’t unique (InChI is somewhat better in this regard). The problem with them is that they are not user-friendly if you are not into them. (Hello, give me some C(C(CO)O)O please (glycerol).)

Caleb Bell’s chemicals package has just a plain-text database, a (sort of) CSV that holds only identifier data (plus a lot of synonyms); internally it uses CAS numbers as its primary index. My package is just a glorified index search over that identifier database :sweat_smile: (and identifiers only; thermo and PyThermo.jl load all the component data). One thing that helped me port that database easily was that the identifiers were stored in plain text. (From what I saw, the identifiers in Global-Chem are currently stored directly in Python code. Is that right? While that helps during development, an open format could help in the future.) That whole database was built with web scraping plus corrections, so it is not ideal in terms of strict correctness, but it helps anyway.
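To illustrate the open-format point, here is a rough sketch of dumping a name-to-SMILES dictionary held in Python code into tab-separated text. The function name, the column names, and the sample entry are assumptions made for illustration; this is not the layout of any existing Global-Chem or chemicals file.

```python
import csv
import io


def to_tsv(entries):
    """Serialize a {name: smiles} dict to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "smiles"])  # hypothetical column names
    # Sort by name so repeated exports produce stable, diff-friendly output.
    for name, smiles in sorted(entries.items()):
        writer.writerow([name, smiles])
    return buffer.getvalue()


print(to_tsv({"glycerol": "C(C(CO)O)O"}))
```

A file like this can be read by any language with a delimited-text reader, so a port would not have to parse Python source to get at the identifiers.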

In that regard, a curated open source chemical identifier database is something interesting that could help the chemical community. I will keep an eye on this project.

@stillyslalom

Nice to meet you, I have seen your name on GitHub. I would like to defend the need for my software and database here:

I think the first thing to mention is my data structure, which is perhaps different from yours. I might need you to tell me more about all the components of your software.

self.network[root_node] = {
    "node_value": Node(
        root_node,
        self.__NODES__[root_node].get_smiles(),
        self.__NODES__[root_node].get_smarts()
    ),
    "children": [],
    "parents": [],
    "name": root_node
}

Data Selection Philosophy

Under the hood, the node structure looks like this, with a list of children and a list of parents, so each node is aware of its neighbors, but only one level away. Each node is a resource chosen for its relevance. For example, “RingsInDrugs” holds the most popular ring systems that passed FDA phase 3 trials. I redrew each ring in ChemDraw to get the SMILES, then curated them according to a set of rules until they made sense and were readable to someone like me, an organic chemist. I cannot read CAS, InChI, or anything like that. I can write SMILES, though. I then reached out to a lot of graduate student friends in academia who have been studying as long as I have, and we pooled the most relevant functional groups for our respective fields. We wrote SMILES manually from the papers, checking them as we got better at it. We started finding a common language, a slang, that was preferred by and resonated with a general audience.
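As a rough sketch of the one-level awareness described above, the following builds a tiny network in which each node records only its direct parents and children. The Node class here is a hypothetical stand-in for the one referenced in the snippet, the helper add_node and the root name are assumptions, and the benzene entry is only an illustrative placeholder, not a record copied from the database.

```python
class Node:
    """Hypothetical stand-in for the Node class used in the snippet above."""

    def __init__(self, name, smiles, smarts):
        self.name = name
        self.smiles = smiles  # assumed: mapping of common name -> SMILES
        self.smarts = smarts  # assumed: mapping of common name -> SMARTS


def add_node(network, name, smiles, smarts, parent=None):
    """Register a node and link it to an optional, already-registered parent."""
    network[name] = {
        "node_value": Node(name, smiles, smarts),
        "children": [],
        "parents": [],
        "name": name,
    }
    if parent is not None:
        # Each side stores only the name of its direct neighbor (one level).
        network[name]["parents"].append(parent)
        network[parent]["children"].append(name)
    return network[name]


network = {}
add_node(network, "global_chem", {}, {})  # hypothetical root name
add_node(
    network,
    "rings_in_drugs",
    {"benzene": "c1ccccc1"},  # illustrative entry, not copied from the database
    {},
    parent="global_chem",
)

print(network["global_chem"]["children"])    # ['rings_in_drugs']
print(network["rings_in_drugs"]["parents"])  # ['global_chem']
```

Storing neighbor names rather than nested objects keeps the structure flat, which would also make it straightforward to mirror in another language.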

Food, Education, Environment, Drug Formulations, Interstellar Space, Materials, Medicinal Chemistry, Peptides, War, Cannabis, Lubricants, Sexual Enhancements, moving into Makeup, Vegan Meat, Pre-workout supplements, etc.

That is how our data is constructed: manually. Since then we have expanded into industries, because the data turned into products, ingredients, and so on, reaching a general audience. We posted on Reddit and other social media, and from things relevant to where I work, which is a trauma hospital, we learned what was relevant for people to record.

Table 1 outlines the data.

I work in force field development, more specifically Lennard-Jones Parameters. One of the force fields I help maintain is the CHARMM General Force Field (CGenFF).

We wrote a lot of tools to help us select compounds that could improve the chemical space coverage of our force field. In doing so, we wrote a lot of data visualization code and other tools to support that selection.
How we arrived at our selection is described in an open source paper that we are writing together.

Table 2 shows all our features.

Here is some software documentation to show all the features implemented.

@longemen3000

I didn’t do web scraping. I have a commit history showing my contributions over the years. It has taken me a long time because most of the files have been PDF documents or tables that I have to abstract out into a format that is readable and can be queried easily.

SMILES can be unique; that is the idea behind canonicalization, going back to Morgan’s original algorithms on bond order perception. I have some slides somewhere that I can share with you for a more simplified version, since his work can be dense for a newcomer. I know it. The first figure he talks about is perception.
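To make the canonicalization idea concrete, here is a small sketch of the extended-connectivity step usually attributed to Morgan's algorithm: every atom starts with its number of neighbors, and values are repeatedly replaced by the sum of the neighbors' values for as long as that produces more distinct classes. It is only that one step, under simplifying assumptions (heavy atoms only, element types and bond orders ignored), and not a full canonical SMILES writer. The function names and the hand-written bond lists are illustrative; the two bond lists are the glycerol skeleton numbered as in C(C(CO)O)O and as in the equivalent spelling OCC(O)CO.

```python
def bonds_to_adjacency(n_atoms, bonds):
    """Build an adjacency mapping from a list of (atom, atom) index pairs."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def morgan_invariants(adjacency):
    """Return extended-connectivity values for each atom in the graph."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        new_values = {
            atom: sum(values[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_n_classes = len(set(new_values.values()))
        # Stop once another round no longer separates the atoms any further.
        if new_n_classes <= n_classes:
            return values
        values, n_classes = new_values, new_n_classes


# Glycerol heavy atoms, numbered in the order they appear in "C(C(CO)O)O".
first = bonds_to_adjacency(6, [(0, 1), (1, 2), (2, 3), (1, 4), (0, 5)])
# The same molecule, numbered in the order they appear in "OCC(O)CO".
second = bonds_to_adjacency(6, [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)])

print(sorted(morgan_invariants(first).values()))   # [2, 2, 3, 4, 4, 5]
print(sorted(morgan_invariants(second).values()))  # [2, 2, 3, 4, 4, 5]
```

Both numberings give the same set of values, which is the property a canonical ordering builds on: the ranking depends on the molecular graph, not on the order in which someone happened to write the atoms.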

I do keep track of other databases as well, and that feeds into ours.
I think it would help to take a closer look at everything that has been implemented, including connections to a lot of different software that was no longer maintained, which we restored and then distributed.

The full database can be found here in a tsv file with the relationships mapped out.

Business-wise, we are hoping to move into the Linux Foundation as our industry and academic network grows.

What do you think?
