Hi everyone! I’d like to implement Crystal Graph Convolutional Neural Networks (CGCNNs) in Julia, in particular using the GeometricFlux package. CGCNNs are a method in computational materials science for representing crystal structures as undirected graphs and then predicting materials properties by training graph convolutional neural nets on data from experiments, online repositories, etc. In particular, there are features for each node (atom) of the graph corresponding to properties such as atomic number, ionization energy, etc., as well as features for the edges (bonds) of the graph, in this case just the bond length.
My starting point is the Python implementation here. I have a few questions related to featurization, and I was told that folks in this community might be able to offer answers.
In the Python implementation, the feature vectors are built by discretizing each property into bins. For categorical data (e.g. is the atom in the s-, p-, d-, or f-block of the periodic table?) this makes perfect sense, but for continuous variables (e.g. electronegativity or atomic radius) it sacrifices information. However, this “binning” allows the atomic feature vector to be represented as a long vector of zeros and ones rather than a shorter vector of floats. With default settings each feature is binned into roughly ten categories, so the feature vector is ~10x longer, and the weight matrices acting on it have ~100x as many entries compared to a float-based implementation.
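To make this concrete, here’s a minimal Julia sketch of the one-hot binning I mean (my own illustration, not the reference code; the function name, property range, and bin count are all made up):

```julia
# One-hot binning of a continuous atomic property (illustrative only).
# A value is clamped into an assumed range [lo, hi], assigned to one of
# `nbins` equal-width bins, and returned as a 0/1 vector.
function onehot_bin(x::Real, lo::Real, hi::Real, nbins::Int)
    # clamp the bin index so values at/above `hi` land in the last bin
    idx = clamp(floor(Int, (x - lo) / (hi - lo) * nbins) + 1, 1, nbins)
    v = zeros(Float32, nbins)
    v[idx] = 1f0
    return v
end

onehot_bin(1.90, 0.7, 4.0, 10)  # e.g. Pauling electronegativity of ~1.9
```

Concatenating such vectors, one per property, is what gives the long 0/1 atomic feature vector I described above.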
So my first question is: is there a big enough efficiency boost from storing/manipulating the atomic features this way to justify the longer vectors/larger matrices, plus the information loss from coarse-graining? Or would it make more sense to just use the float values directly in a Julia implementation?
On a related note, the featurization of the graph edges (bonds) is also done in a very particular way. The bonds have only one feature: their length. But they too are featurized by binning, and the result is additionally passed through a Gaussian filter, so the vector ends up mostly zeros, with float values that grow and then shrink around the slot corresponding to the bond length. I assume something about this representation improves the way information about neighboring atoms/nodes “propagates” along graph edges/bonds, but I can’t exactly understand why. So my second question is, can someone explain this and convince me whether I should keep it in my Julia implementation as well?
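For concreteness, here’s a sketch of that Gaussian smearing as I understand it (again my own illustration; I believe the reference code defaults to centers every 0.2 Å out to an 8 Å cutoff with a width equal to the step, but treat those numbers as assumptions):

```julia
# Gaussian expansion of a bond length (illustrative parameters).
# Each entry of the output is exp(-(d - μ_k)^2 / σ^2) for a grid of
# centers μ_k, so nearby bond lengths yield overlapping smooth vectors.
function gaussian_expand(d::Real; dmin=0.0, dmax=8.0, step=0.2, σ=step)
    centers = dmin:step:dmax
    return Float32.(exp.(-((d .- centers) .^ 2) ./ σ^2))
end

gaussian_expand(1.54)  # e.g. a C–C bond length in Å; peaks near the 1.54 Å slot
```

(As far as I can tell, the effect is that similar bond lengths produce overlapping vectors rather than disjoint one-hot bins, but why that matters for message passing is exactly what I’d like to understand better.)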
Thanks in advance! Please let me know if anything is unclear (I’ve added small code sketches above to illustrate, but this felt like a more conceptual question overall). For more information on CGCNNs, you can see the non-paywalled preprint here: https://arxiv.org/pdf/1710.10324.pdf (it’s published in Physical Review Letters, so you can find the final version there if you have academic access).