For a chemicals manufacturer asked to produce new substances, being able to write a commercial proposal quickly is a major challenge. For a new molecule, the price depends on many factors and is sometimes difficult to assess. We attempted to train a deep neural network able to give a quick quotation for any new molecule. The preliminary results are presented below.
State of the art
In a recent article, researchers noticed a correlation between molecular complexity and price, based on a classical statistical analysis. These correlations appear especially within homogeneous classes of chemicals. To make predictions, we developed a prototype based on a deep neural network implemented with TensorFlow.
We used the same database as the researchers in that reference. The dataset of 2.2 million chemicals, in SDF format, was parsed and reduced to remove improper data.
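The parsing and cleaning step can be sketched as follows. This is only an illustration, not the code used for the prototype: the field name `PRICE` and the cleaning rule (keep records whose price is a finite positive number) are assumptions, since the source does not say what counts as improper data. It relies only on two properties of the SDF format: records end with a `$$$$` line, and data fields are introduced by a `> <NAME>` header followed by value lines and a blank line.

```python
import math


def parse_sdf_properties(text):
    """Yield one dict of data fields per SDF record (records end with '$$$$')."""
    for record in text.split("$$$$"):
        lines = record.splitlines()
        fields = {}
        i = 0
        while i < len(lines):
            line = lines[i].strip()
            # A data header looks like "> <NAME>", optionally followed by extra text.
            if line.startswith(">") and "<" in line and line.rindex(">") > line.index("<"):
                name = line[line.index("<") + 1:line.rindex(">")]
                i += 1
                value_lines = []
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i].strip())
                    i += 1
                fields[name] = "\n".join(value_lines)
            else:
                i += 1
        if fields:
            yield fields


def clean_records(records, price_field="PRICE"):
    """Keep records with a finite, positive price (assumed cleaning rule)."""
    cleaned = []
    for fields in records:
        try:
            price = float(fields.get(price_field, ""))
        except ValueError:
            continue  # missing or non-numeric price
        if math.isfinite(price) and price > 0:
            cleaned.append({**fields, price_field: price})
    return cleaned
```

For a file of this size, the same logic would be applied while streaming the file record by record rather than loading it into memory at once.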
|Parameter|Value|
|---|---|
|Neural network layers|3|
|Optimizer|Stochastic gradient descent, learning rate 0.01|
|Loss|Mean squared error|
|Training set|779,092 molecules|
|Validation set|194,773 molecules|
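To make the configuration above concrete, here is a minimal, dependency-free sketch of a three-layer network trained by stochastic gradient descent (learning rate 0.01) on a squared-error loss. The prototype itself was built with TensorFlow, so this is not its code. The layer widths, the ReLU activation, the weight initialisation and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def build_network(sizes, rng):
    """Create dense layers as (weights, biases) pairs, e.g. sizes=[2, 8, 8, 1]."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Return the activations of every layer; ReLU on hidden layers, linear output."""
    activations = [x]
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            z = [max(0.0, v) for v in z]
        activations.append(z)
    return activations


def sgd_step(layers, x, target, learning_rate=0.01):
    """One SGD update on the squared error of a single example; returns that error."""
    activations = forward(layers, x)
    error = activations[-1][0] - target
    delta = [2.0 * error]  # gradient of the loss w.r.t. the output
    for index in reversed(range(len(layers))):
        weights, biases = layers[index]
        inputs = activations[index]
        # Gradient w.r.t. this layer's inputs, computed before the weights change.
        grad_inputs = [sum(weights[j][i] * delta[j] for j in range(len(delta)))
                       for i in range(len(inputs))]
        for j in range(len(delta)):
            for i in range(len(inputs)):
                weights[j][i] -= learning_rate * delta[j] * inputs[i]
            biases[j] -= learning_rate * delta[j]
        if index > 0:
            # ReLU derivative: pass the gradient only where the unit was active.
            delta = [g if a > 0.0 else 0.0 for g, a in zip(grad_inputs, inputs)]
    return error * error


# Toy usage: two hypothetical descriptors per molecule, synthetic "price" target.
rng = random.Random(0)
layers = build_network([2, 8, 8, 1], rng)
samples = []
for _ in range(50):
    x = [rng.random(), rng.random()]
    samples.append((x, x[0] + 2.0 * x[1]))
for _ in range(100):
    for x, y in samples:
        sgd_step(layers, x, y)
```

In practice the same configuration would be expressed in a few lines with a deep learning framework. The explicit version only shows what the optimizer and loss in the table amount to.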
We obtain interesting preliminary results: the predicted price falls within 30% of the reference price for more than 85% of the molecules.
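One way to compute such a figure is sketched below, assuming it is read as the share of molecules whose predicted price lies within a 30% relative error of the reference price. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
def fraction_within_tolerance(predicted, actual, tolerance=0.30):
    """Share of predictions whose relative error is at most `tolerance`."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= tolerance * abs(a))
    return hits / len(actual)


# Hypothetical prices: two of the four predictions are within 30% of the reference.
score = fraction_within_tolerance([100.0, 125.0, 60.0, 200.0],
                                  [100.0, 100.0, 100.0, 100.0])
```

A relative tolerance suits prices that span several orders of magnitude, where an absolute error threshold would be dominated by the most expensive molecules.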
This preliminary result is interesting, but several points have to be checked. The price dataset used by the researchers in the reference comes from a private company and is not fully documented or referenced, so the relevance of the training set is not clearly established at present. The molecules described in the dataset are made only of C, H, O, N, Cl, S, Br, F, I, B, K and Se atoms. This is sufficient for many organic chemicals; however, some price ranges are also under-represented (Fig. 2). Furthermore, we experimented with a wide variety of features to describe the molecules, including tokenization borrowed from semantic analysis. However, these quick approaches are less relevant than the one used in our reference chemical deep learning framework, AlchemAI. All these points have to be clarified before going further. Nevertheless, this type of deep learning approach can easily be applied to a smaller price dataset, such as the previous price quotations of a specific fine chemicals manufacturer, and then retrained to fit more closely the prices obtained in real manufacturing facilities.
The simplest neural network we developed should be able to start making predictions from a dataset of at least 100-150 molecules with prices or production costs, and to be fully effective from 2000 price quotation examples.