Predicting the price of chemicals with a deep neural network

analogicus / Pixabay

For a chemicals manufacturer asked to produce new substances, being able to write a commercial proposal quickly is a major concern. The price of a new molecule depends on many factors and is sometimes difficult to assess. We attempted to train a deep neural network that can give a quick quotation for any new molecule. The preliminary results are presented below.

State of the art

In a recent article [1], researchers observed a correlation between molecular complexity and price, based on a classical statistical analysis. These correlations appear especially within homogeneous classes of chemicals. To make predictions, we developed a prototype based on a deep neural network implemented with TensorFlow.

Results

We used the same database as the researchers in reference [1]. The dataset of 2.2 million chemicals, in SDF format, was parsed and reduced to remove improper data.
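As an illustration of this parsing and cleaning step, here is a minimal sketch in plain Python. It relies only on the general SDF layout (records terminated by a `$$$$` line, data items introduced by a `> <NAME>` header). The field name `PRICE` and the rule "keep only records with a finite, positive price" are assumptions made for the example; the actual field names and cleaning criteria used for the prototype are not given here.

```python
import math
from typing import Dict, Iterator, List, Optional, Tuple


def iter_sdf_records(text: str) -> Iterator[str]:
    """Yield the raw text of each record; a record ends with a '$$$$' line."""
    record: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield "\n".join(record)
            record = []
        else:
            record.append(line)


def read_data_fields(record: str) -> Dict[str, str]:
    """Collect the '> <NAME>' data items that follow the molfile block."""
    fields: Dict[str, str] = {}
    name: Optional[str] = None
    for line in record.splitlines():
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # Header line such as "> <PRICE>": extract the text between < and >.
            name = line[line.index("<") + 1 : line.rindex(">")]
            fields[name] = ""
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current data item
            else:
                fields[name] = (fields[name] + "\n" + line).strip()
    return fields


def clean_records(
    text: str, price_field: str = "PRICE"  # hypothetical field name
) -> List[Tuple[Dict[str, str], float]]:
    """Keep only records whose price field parses to a finite, positive number."""
    kept: List[Tuple[Dict[str, str], float]] = []
    for record in iter_sdf_records(text):
        fields = read_data_fields(record)
        try:
            price = float(fields.get(price_field, ""))
        except ValueError:
            continue  # missing or non-numeric price: drop the record
        if math.isfinite(price) and price > 0:
            kept.append((fields, price))
    return kept
```

For a file of this size, a streaming reader or a cheminformatics toolkit would be more appropriate than loading the whole text in memory; the sketch only shows the filtering logic.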

Tab. 1. Prototype specifications

Specification point      Value
Neural network layers    3
Optimizer                Stochastic gradient descent with learning rate 0.01
Dropout                  0.2
X normalization          MinMax [0, 1]
Y normalization          Y/100
Loss                     Mean squared error
Training set             779,092 molecules
Validation set           194,773 molecules
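The specifications above could be expressed roughly as follows with the Keras API of TensorFlow. This is a sketch under stated assumptions, not the prototype's actual code: "3 layers" is read here as two hidden dense layers plus one output layer, and the hidden width (64), the ReLU activation, the position of the dropout layers and the helper names are all illustrative choices that the table does not specify.

```python
import numpy as np
import tensorflow as tf


def fit_minmax(x_train):
    """Compute per-feature offset and span on the training set only."""
    x_train = np.asarray(x_train, dtype=float)
    lo = x_train.min(axis=0)
    span = x_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    return lo, span


def apply_minmax(x, lo, span):
    """Scale features to [0, 1] (exactly so on the data used for fitting)."""
    return (np.asarray(x, dtype=float) - lo) / span


def scale_price(y):
    """Y normalization from the table: Y/100."""
    return np.asarray(y, dtype=float) / 100.0


def unscale_price(y_scaled):
    """Invert the Y/100 normalization to recover a price."""
    return np.asarray(y_scaled, dtype=float) * 100.0


def build_price_model(n_features, hidden_units=64):
    """Three dense layers, dropout 0.2, SGD (lr 0.01), mean squared error.

    hidden_units, the activation and the dropout placement are assumptions.
    """
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),  # single regression output: scaled price
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="mse",
    )
    return model
```

Training would then call `model.fit` on the scaled features and prices, passing the validation set through `validation_data`. The set sizes in the table correspond to an 80/20 split of 973,865 molecules.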

We obtain interesting preliminary results: the predicted price is correct to within 30% for more than 85% of the molecules.
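One way to compute such a figure is to count the molecules whose predicted price lies within a given relative tolerance of the actual price. The sketch below assumes that "30%" refers to the relative error with respect to the actual price; the exact definition used for the prototype is not stated, and the function name is hypothetical.

```python
from typing import Sequence


def fraction_within_tolerance(
    actual: Sequence[float],
    predicted: Sequence[float],
    tolerance: float = 0.30,
) -> float:
    """Share of items with |predicted - actual| <= tolerance * |actual|."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one price is required")
    hits = sum(
        1
        for a, p in zip(actual, predicted)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


# Illustrative prices only, not taken from the dataset.
actual_prices = [100.0, 200.0, 50.0, 10.0]
predicted_prices = [120.0, 300.0, 40.0, 10.0]
share = fraction_within_tolerance(actual_prices, predicted_prices)
```

Evaluating the same function over a range of tolerances gives a curve of accuracy against quotation error.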

Discussion

Fig. 1. Prediction accuracy versus quotation error

This preliminary result is interesting, but several points have to be checked. The price dataset used by the authors of [1] comes from a private company and is not fully documented or referenced, so the relevance of the training set is not yet clearly established. The molecules described in the dataset are made only of C, H, O, N, Cl, S, Br, F, I, B, K and Se atoms. This is sufficient for many organic chemicals, but some price ranges are also underrepresented (Fig. 2). Furthermore, we experimented with a large variety of features to describe the molecules, including semantic analysis tokenization. However, these quick approaches are not as relevant as the one used in AlchemAI, our reference deep learning framework for chemistry. All these points have to be clarified before going further. Nevertheless, this type of deep learning approach can easily be applied to a smaller price dataset, such as the previous price quotations of a specific fine chemicals manufacturer, and then retrained to fit more closely the prices obtained in real manufacturing facilities.

Fig. 2. Price distribution (prices rounded at 250 €/10 g)

The simplest neural network we developed should be able to start making predictions from a dataset of at least 100-150 molecules with prices or production costs, and to be fully effective from 2,000 example price quotations.

[1] J. Polanski, U. Kucia, R. Duszkiewicz, A. Kurczyk, T. Magdziarz, and J. Gasteiger, "Molecular descriptor data explain market prices of a large commercial chemical compound library", Sci Rep, vol. 6, June 2016.