Benchmarking of linear and nonlinear approaches for QSPR studies of metal complexation with ionophores
Igor V. Tetko,1,2* Vitaly P. Solov'ev,3 Alexey V. Antonov,1 Xiaojun Yao,4 Jean Pierre Doucet,4 Botao Fan,4 Frank Hoonakker,5 Denis Fourches,5 Piere Jost,5 Nicolas Lachiche,5 and Alexandre Varnek,5
1 GSF National Centre for Environment and Health, Institute for Bioinformatics (MIPS), 85764 Neuherberg, Germany
2 Institute of Bioorganic & Petrochemistry, National Ukrainian Academy of Sciences, 02094, Kyiv, Ukraine, http://www.vcclab.org
3 Institute of Physical Chemistry, Russian Academy of Sciences, Leninskiy prospect 31a, 119991 Moscow, Russia
4 Université Paris 7-Denis Diderot, ITODYS-CNRS UMR 7086, 1, rue Guy de la Brosse, Paris 75005, France
5 Laboratoire d'Infochimie, UMR 7551 CNRS, Université Louis Pasteur, 4, rue B. Pascal, Strasbourg 67000, France

Objectives
Can we predict complexation constants using QSPR?
What are the best descriptors?
What are the best methods?
Do nonlinear methods add some value?
Can we compare results of different methods in an objective way?

Data Sets
logK1(Ag+): 161 molecules
logK1(Eu3+): 241 molecules
logβ2(Ag+): 112 molecules
logβ2(Eu3+): 81 molecules
[Figure: structures of example ionophores]

Descriptors
Fragment descriptors: (I) SEQUENCES and (II) AUGMENTED ATOMS
ATOMS and BONDS (AB): sequences O=C-C-N; C-C-N; C-N; O=C-C; C=O; C-C. Augmented atom C(-C)(-O)(=O)
ATOMS (A): sequences O C C N; C C N; C N; O C C; C O; C C. Augmented atom C(C)(O)(O) or, with hybridization (Hy), Csp2(Csp3)(Osp3)(Osp2)
BONDS (B): [example not legible in the source]
(C) E-state indices
(D) Atom-type E-state indices and counts
[Example tables for one molecule: (C) per-atom E-state index values and (D) atom-type E-state values and counts, for atom types dO, dO(acid), dssC, sOH, sOH(acid), ssCH2, sssN, sssN(al)]

[Figure panels: Traditional plot; Regression Error Curve]

Analyzed approaches
Associative Neural Network (ASNN) http://www.vcclab.org/lab/asnn
Singular Value Decomposition (MLRA/SVD) http://infochim.u-strasbg.fr/recherche/isida/
Radial Basis Function Network (RBFN) http://www.cs.waikato.ac.nz/~ml/weka
Maximal Margin Linear Programming Method (MMLP) http://mips.gsf/proj/mdcs
k-Nearest Neighbor Method (kNN)
Support Vector Machine (SVM) http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Data Analysis: double 5-fold cross-validation
[Workflow diagram: Generation of descriptors]
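The double 5-fold cross-validation named above can be sketched as two nested loops: an outer loop that holds out 1/5 of the molecules for prediction, and an inner loop that tunes the model on the remaining 4/5 only. The sketch below is an illustration under stated assumptions, not the procedure used on the poster: the kNN regressor, the candidate values of k, and all function names (`k_folds`, `knn_predict`, `cv_rmse`, `double_cv`) are hypothetical.

```python
import random
from statistics import mean


def k_folds(n, k, rng):
    """Split indices 0..n-1 into k shuffled, nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return mean(train_y[i] for i in order[:k])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred)) ** 0.5


def cv_rmse(X, y, indices, k_neighbors, n_folds, rng):
    """Inner loop: cross-validated RMSE computed on `indices` only."""
    truth, preds = [], []
    for fold in k_folds(len(indices), n_folds, rng):
        held = {indices[j] for j in fold}
        train = [i for i in indices if i not in held]
        tx = [X[i] for i in train]
        ty = [y[i] for i in train]
        for i in held:
            truth.append(y[i])
            preds.append(knn_predict(tx, ty, X[i], k_neighbors))
    return rmse(truth, preds)


def double_cv(X, y, candidates=(1, 3, 5), n_folds=5, seed=0):
    """Outer loop: each molecule is predicted by a model tuned without seeing it."""
    rng = random.Random(seed)
    predictions = [None] * len(X)
    for fold in k_folds(len(X), n_folds, rng):
        test = set(fold)
        train = [i for i in range(len(X)) if i not in test]
        # Model selection uses the 4/5 training part only.
        best_k = min(candidates, key=lambda k: cv_rmse(X, y, train, k, n_folds, rng))
        tx = [X[i] for i in train]
        ty = [y[i] for i in train]
        for i in fold:
            predictions[i] = knn_predict(tx, ty, X[i], best_k)
    return predictions
```

Because the test fold never enters the inner loop, the outer-loop errors are not biased by model selection, which is the reason for nesting the two loops.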
Testing of Statistical Significance
REC allows results from several methods to be compared on one plot.

[Workflow diagram: descriptors (Fragmental, E-state counts, E-state values); 4/5 training set; 1/5 test set prediction; Statistical evaluation]

Statistical assessment of results

METHOD    REC     RMSE   MAE    curve color
ASNN      [values not legible in the source]
SVD       [values not legible in the source]
SVM       0.11    2.46   1.65   green
kNN       0.124   2.79   1.85   cyan
RBFN      0.132   3.07   1.98   blue
MMLP      0.142   3.89   2.22   brown
AVERAGE   0.274   5.19   4.13   gray

[Figure] Experimental versus predicted values for models of logK1(Ag+) and logβ2(Eu3+). Despite the apparent difference in quality of the two models, the outlying molecules in each model can be easily observed.

Significance was tested with the Kolmogorov-Smirnov (KS) test and a bootstrap significance test.

Bootstrap significance test
BOOTSTRAP: weka > average p<0.001
BOOTSTRAP: svd > average p<0.001
BOOTSTRAP: mmlp > average p<0.001
[Further bootstrap entries involving asnn, svm and knn are not fully legible in the source]

Kolmogorov-Smirnov test
KS: svd != asnn 0.0081
KS: svd != svm 0.0147
KS: svd != weka 0.0258
KS: svd != average p<0.0001
KS: asnn != average p<0.0001
KS: svm != mmlp 0.0258
KS: svm != average p<0.0001
KS: knn != average p<0.0001
KS: weka != average p<0.0001
KS: mmlp != average p<0.0001

Statistical analysis provides an objective comparison of different methods.

Comparison of Methods
Comparison of Descriptors
[Figure panels: SMF fragments; E-state counts; E-state values] Percentage of best models (y axis) calculated using the corresponding descriptor system and all methods.
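The three tools named in this section (the Regression Error Curve, the Kolmogorov-Smirnov test and the bootstrap test) can all be computed from the absolute prediction errors of each method. The sketch below is a minimal illustration, not the implementation behind the reported numbers. The function names are hypothetical, the KS p-value uses the standard asymptotic series, and the paired bootstrap on summed absolute errors is an assumption, since the poster does not state which statistic was resampled.

```python
import math
import random
from bisect import bisect_right


def rec_curve(abs_errors, tolerances):
    """REC curve: fraction of samples whose absolute error is within each tolerance."""
    s = sorted(abs_errors)
    return [bisect_right(s, t) / len(s) for t in tolerances]


def ks_two_sample(a, b):
    """Two-sample KS statistic D and its asymptotic p-value."""
    sa, sb = sorted(a), sorted(b)
    # D is the largest gap between the two empirical distribution functions.
    d = max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )
    lam = d * math.sqrt(len(sa) * len(sb) / (len(sa) + len(sb)))
    if lam < 0.3:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2 * sum(
        (-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam) for j in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


def bootstrap_p(err_a, err_b, n_boot=2000, seed=0):
    """Paired bootstrap: fraction of resamples in which method A is NOT better than B.

    err_a and err_b are absolute errors for the same molecules, in the same order.
    """
    rng = random.Random(seed)
    n = len(err_a)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(err_a[i] for i in idx) >= sum(err_b[i] for i in idx):
            not_better += 1
    return not_better / n_boot
```

A small value of `bootstrap_p(err_a, err_b)` supports a statement of the form "A > B", while a small KS p-value supports "A != B" for the two error distributions. Plotting `rec_curve` for several methods against a common list of tolerances places all of them on one plot.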
Conclusions
Models based on fragments (SMF, E-state counts) > E-state indices
Nonlinear approaches > multiple linear regression (MLRA) (p<0.05)
But an ensemble of several MLRA models ≈ nonlinear approaches
No significant differences in performance of nonlinear models
SVM and ASNN provided the largest number of "best" models
kNN was the fastest method

[Figure panels: E-state indices; all descriptors] Percentage of models (y axis) as a function of the number of [caption partly illegible in the source] models contributed using each method. Calculations were performed using MLRA (1), RBF NN (2), kNN (3), MMLP (4), ASNN (5), averaging of all ISIDA models (6), averaging of five first ranked ISIDA models (7) and SVM (8).
Acknowledgement
IVT was supported by an Invited Professor position from Université Louis Pasteur. Part of this work was performed in the framework of the French-Russian collaborative project GDRE “SupraChem”.