Contribution à l’étude et implantation de systèmes intelligents modulaires auto-organisateurs, Contribution to the Study and Implementation of Intelligent Modular Self-organizing Systems
Sous la direction de Kurosh Madani Thèse soutenue le 08 décembre 2009: Paris Est Les problèmes de la classification ont reçu une attention considérable dans des différents champs d’ingénierie comme traitement des images biomédicales, identification a partir de la voix, reconnaissance d'empreinte digitale etc. Les techniques d’intelligence artificielles, incluant les réseaux de neurones artificiels, permettent de traiter des problèmes de ce type. En particulier, les problèmes rencontrés nécessitent la manipulation de bases de données de tailles très importantes. Des structures de traitement adaptatives et exploitant des ensembles de classificateurs sont utilisées. Dans cette thèse, nous décrivons principalement le développement et des améliorations apportées à un outil de classification désigné par le terme Tree-like Divide to Simplify ou T-DTS. Nos efforts se sont portés sur l’un des modules de cet outil, le module d’estimation de complexité. L’architecture de l’outil T-DTS est très flexible et nécessite le choix d’un nombre important de paramètres. Afin de simplifier l’exploitation de T-DTS, nous avons conçu et développé une procédure automatique d’optimisation d’un de ces plus importants paramètres, le seuil de décision associé à la mesure de complexité. La contribution principale de cette thèse concerne le développement de modules pouvant s’implanté sur une architecture de calcul matérielle parallèle. Ce ceci permet de se rapproché d’une implantation purement matérielle de l’outil T-DTS -Traitement de l’information -Systèmes complexes -Intelligence artificielle -Systèmes d’apprentissages artificiels modulaires -Cassification -Estimation de la complexité Classification problems deal with separating group of objects into sets of smaller classes; this set of problems have received considerable attention in diverse engineering fields such as biomedical imaging, speaker identification, fingerprint recognition, etc. Several effective approaches for automated classification were suggested based on artificial intelligence techniques, including neural networks. Still, one of the major challenges faced by these approaches is a large scale of data required for successful classification. In this thesis, we explore a possible solution to this problem based on a module-based Tree-like Divide to Simplify (T-DTS) classification model. We focus on enhancing the key module of this approach - complexity estimation module. Furthermore, we provide an automated procedure for optimizing key complexity estimation parameters of the T-DTS model; this considerably improves usability and allows for a more effective configuration of decomposition reasoning of the approach. Another major contribution of this work employs further development of T-DTS modules that could be implemented using parallel computer architecture, thereby allowing T-DTS to utilize an underlying hardware to the fullest extent -Information processing -Complex systems -Artificial intelligence -Modular artificial learning systems -Classification -Complexity estimating Source: http://www.theses.fr/2009PEST0013/document
Thèse Présentée pour l’obtention du titre de DOCTEUR DE L’UNIVERSITÉ PARIS-EST Spécialité: Sciences Informatiques par Ivan BUDNYK Contribution à l’Étude et Implantation de Systèmes Intelligents Modulaires Auto-Organisateurs Soutenue publiquement 8 décembre 2009 devant la commission d’examen composée de
Rapporteur Prof. Gilles BERNARD Université Paris 8 Rapporteur Prof. Vladimir GOLOVKO Brest State Technical University Examinateur Prof. Hichem MAAREF Université d’Évry-Val d’Essonne Examinateur Dr. Abdennasser CHEBIRA Université Paris-Est (Paris 12) Invité Dr. Semen GOROKHOVSKYI Université nationale de Kiev-Mohyla-Académie Directeur de thèse Prof. Kurosh MADANI Université Paris-Est (Paris 12) LISSI Laboratoire Images, Signaux et Systèmes Intelligents EA 3956 Université Paris-Est (Paris 12), l’IUT Sénart-Fontainebleau, Département GEII Bât A., Av. P. Point, F-77127 Lieusaint ; Tél : +33(0)164134486, Fax : +33(0)164134503, http://lissi.univ-paris12.fr
Thesis Presented to obtain the degree of DOCTOR OF UNIVERSITY PARIS-EST Topic: Computer Sciences by Ivan BUDNYK Contribution to the Study and Implementation of Intelligent Modular Self-organizing Systems Defended on 8 December 2009 in presence of commission composed by
Rapporteur Prof. Gilles BERNARD University Paris 8 Rapporteur Prof. Vladimir GOLOVKO Brest State Technical University Examiner Prof. Hichem MAAREF University of Évry Val d’Essonne Examiner Dr. Abdennasser CHEBIRA University Paris-Est (Paris 12) Invited Dr. Semen GOROKHOVSKYI National University of Kyiv-Mohyla Academy Doctorate director Prof. Kurosh MADANI University Paris-Est (Paris 12) LISSI Laboratory of Images, Signals and Intelligent Systems EA 3956 University Paris-Est (Paris 12), IUT Senart-Fontainebleau, Department GEII Bat A., Av. P. Point, F-77127 Lieusaint ; Tel : +33(0)164134486, Fax : +33(0)164134503, http://lissi.univ-paris12.fr
Abstract Classification problems deal with separating group of objects into sets of smaller classes; this set of problems have received considerable attention in diverse engineering fields such as biomedical imaging, speaker identification, fingerprint recognition, etc. Several effective approaches for automated classification were suggested based on artificial intelligence techniques, including neural networks. Still, one of the major challenges faced by these approaches is a large scale of data required for successful classification. In this thesis, we explore a possible solution to this problem based on a module-based Tree-like Divide to Simplify (T-DTS) classification model. We focus on enhancing the key module of this approach -complexity estimation module. Furthermore, we provide an automated procedure for optimizing key complexity estimation parameters of the T-DTS model; this considerably improves usability and allows for a more effective configuration of decomposition reasoning of the approach. Another major contribution of this work employs further development of T-DTS modules that could be implemented using parallel computer architecture, thereby allowing T-DTS to utilize an underlying hardware to the fullest extent. Key words: Information processing, complex systems, artificial intelligence, modular artificial learning systems, classification, complexity estimating. Résume Les problèmes de la classification ont reçu une attention considérable dans des différents champs dingénierie comme traitement des images biomédicales, identification a partir de la voix, reconnaissance d'empreinte digitale etc. Les techniques dintelligence artificielles, incluant les réseaux de neurones artificiels, permettent de traiter des problèmes de ce type. En particulier, les problèmes rencontrés nécessitent la manipulation de bases de données de tailles très importantes. Des structures de traitement adaptatives et exploitant des ensembles de classificateurs sont utilisées. Dans cette thèse, nous décrivons principalement le développement et des améliorations apportées à un outil de classification désigné par le terme Tree-like Divide to Simplify ou T-DTS. Nos efforts se sont portés sur lun des modules de cet outil, le module destimation de complexité . Larchitecture de loutil T-DTS est très flexible et nécessite le choix dun nombre important de paramètres. Afin de simplifier lexploitation de T-DTS, nous avons conçu et développé une procédure automatique doptimisation dun de ces plus importants paramètres, le seuil de décision associé à la mesure de complexité. La contribution principale de cette thèse concerne le développement de modules pouvant simplanté sur une architecture de calcul matérielle parallèle. Ce ceci permet de se rapproché dune implantation purement matérielle de loutil T-DTS. Mots clés: Traitement de linformation, systèmes complexes, intelligence artificiel, systèmes dapprentissages artificiels modulaires, classification, estimation de la complexité .
i
AcknowledgementI would like to express my deepest gratitude to Prof. Kurosh Madani for his continuous support and patience. His help and guidance throughout my studies at Laboratoire Images Signaux et Systèmes Intelligents (LISSI / EA 3956) were invaluable. His experience was truly of great benefit to me and his advice in academia and otherwise has undoubtedly left a mark on me. I would also like to thank Dr. Abdennasser Chebira for his valuable input and meticulous efforts in reviewing much of this research, as well as his patience while I was preparing this thesis. I would also like to thank my thesis committee members Prof. Gilles Bernard, Prof. Vladimir Golovko and Prof. Hichem Maaref for the useful comments, suggestions, and the references. I am grateful to the faculty members of lIUT de Sénart including Dr. Veronique Amarger, Dr. Christophe Sabourin and Dr. Amine Chohra for their assistance. My deepest thanks also go to the members and ex-members of LISSI including Dr. Nadia Kanaoui, Dr. El Khier Sofiane Bouyoucef, Dr. Mariusz Rybnik, Dr. Lamine Thiaw, Dr. Matthieu Voiry, Weiwei Yu, Dalel Kanzari, Ting Wang, Dominik Maximilián Ramík and Arash Bahrammirzaee. I also thank Frances leading international exchange operator ÉGIDE, who gave me the initial opportunity to pursue research in France. Numerous other people and lifelong friends with whom I have worked and lived in France, Czech Republic and Ukraine, deserve many thanks as well, especially Dr. Maksym Petrenko and Dr. Semen Gorokhovskyi. Finally, I would like to thank my parents, Georgiy and Ganna Budnyk, because without them I would not be the person I am today.
ii
Table of contents
List of tables............................................................................................................................v List of figures ........................................................................................................................ vi Indexofsymbol......................................................................................................................xGeneral introduction ...............................................................................................................1 I State-of-the-art of classification approaches ........................................................................6 I.1 Concepts of classification............................................................................................7 I.2 Clustering methods ......................................................................................................8 I.3 Main classification methods ......................................................................................17 I.4 T-DTS (Tree-like Divide To Simplify)approach.......................................................37 I.5 Conclusion .................................................................................................................42 II Complexity concepts .........................................................................................................44 II.1 Introduction to complexity concepts ........................................................................44 II.2 Computational complexity measurement.................................................................48 II.3 ANN-structure based classification complexity estimator.......................................75 II.4 Conclusion................................................................................................................85 III T-DTS software architecture............................................................................................86 III.1 T-DTS concept: self-tuning procedure ...................................................................94 III.2 T-DTS software architecture and realization ........................................................101 III.3Conclusion............................................................................................................118IV Validation aspects ..........................................................................................................120 IV.1 ANN-structure based complexity estimators........................................................120 IV.1.1 Hardware-based validation .........................................................................121 IV.1.2 Software-based validation ...........................................................................133 IV.1.3 Summary......................................................................................................149 IV.2 T-DTS ...................................................................................................................152 IV.2.1 ANN-structure based complexity estimator validation ...............................152 IV.2.2 T-DTS self-tuning procedure validation......................................................160 IV.2.3 Summary......................................................................................................172 IV.3 Conclusion ............................................................................................................173 General conclusion and perspectives ..................................................................................175
List of figures I.1 SVM: space mapping using linear hyperplane ................................................................21 I.2 SVM: space mapping using different space kernel functions Φ .....................................22 I.3 General block scheme diagram of the T-DTS structure constructing .............................40 II.1 Description and interpretation process ...........................................................................52 II.2 Retained adherence subset for two classes near the boundary.......................................69 II.3 Taxonomy of classification complexity (separability) measures ...................................71 II.4 Examples of Voronoy polyhedron for 2D and 3D classification problems ...................77 II.5 Q(m) indicator function behaviour.................................................................................78 III.1 Diagram of T-DTS implementation for classification tasks .........................................87 III.2 Scheme of T-DTS learning concept..............................................................................90 III.3 T-DTS operating ...........................................................................................................90 III.4 An example of maximal possible decomposition tree ..................................................95 III.5 An example of distribution of the clusters number over [Amin; Amax] complexity interval .................................................................................................................................97 III.6 T-DTS software architecture.......................................................................................102 III.7 Principal T-DTS v. 2.50 Matlab software architecture...............................................104 III.8 Detailed T-DTS v. 2.50 Matlab software architecture ...............................................107 III.9 Matlab T-DTS software realization v. 2.50, Control panel ........................................108 III.10 T-DTS GUI: Results : 2 stripe-like benchmark (576 prototypes).............................113 III.11 GUI of T-DTS, decomposition clusters chart..........................................................114 III.12 GUI of T-DTS, decomposition tree charts................................................................115 III.13 GUI of T-DTS, decomposition tree charts in 3D......................................................115 III.14 Menu of T-DTS, Configuration ................................................................................116 III.15 Menu of T-DTS, Set Constant ..................................................................................117 III.16 Menu of T-DTS, Set EC Options ..............................................................................117 III.17 Menu of T-DTS, Analysis .........................................................................................117 IV.1 Stripe classification benchmarks ...............................................................................122 IV.2 Stripe classification benchmarks : Q i (m) behavior versus learning database size m , LSUPZISC®-036mode.....................................................................................................124IV.3 Stripe classification benchmarks : Q i (m) behavior versus learning database size m , L1 ZISC®-036 mode................................................................................................................124 IV.4 Benchmarks classification rates behavior versus learning database size m , LSUP ZISC®-036 mode................................................................................................................126
vi
IV.5 Benchmarks classification rates behavior versus learning database size m , L1 ZISC®-036mode............................................................................................................................126IV.6 Q k (m) evaluation for DNA splice-junction classification problem using different ZISC®-036 k -MIF parameters ( k : 55,56, 4096) for: Q 55 (m), Q 56 (m), Q 4096 (m), m k corresponds to calculated m 0 for each k -curve....................................................................129 IV.7 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k -MIF parameter k=55 ..............................................129 IV.8 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k -MIF parameter k=56 ..............................................131 IV.9 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k -MIF parameter k=4096 ..........................................131 IV.10 Square classification benchmarks, 2 classes, 2000 prototypes.................................134 IV.11 ANN-structure based complexity estimator evaluation for: Square benchmarks, 2 classes, 2000 prototypes, MIF = 1024, 3 distance modes...................................................135 IV.12 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes, 2000 prototypes, LSUP distance mode..................................................................136 IV.13 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes (4&4 stripes), LSUP distance mode .......................................................................137 IV.14 Grid classification benchmarks in D1 D2 and D3 dimension ..................................138 IV.15 ANN-structure based complexity estimator evaluation for: Grid benchmark, 2 classes, EUCL distance mode .............................................................................................139 IV.16 ANN-structure based complexity estimator evaluation for: Grid and 8-stripe-benchmarks, 2 classes, EUCL distance mode, MIF=1024 .................................................140 IV.17 Four spiral classification benchmarks, 2 classes, 2000 prototypes ..........................142 IV.18 ANN-structure based complexity estimator evaluation for: 4 Spiral benchmark, 2 classes, EUCL distance mode .............................................................................................142 IV.19 Six classification benchmarks [from left to right, from top to down: 2 Stripes (2ST), 2 Grids (2GR), 2 Squares (2SQ), 2 Sinusoids (2SN), 2 Spirals (2SP), 2 Circles (2CR) with small overlapping zone] ......................................................................................................143 IV.20 ANN-structure based complexity estimator evaluation for: 6 classification benchmarks, 2 classes, 2000 prototypes, MIF=1024..........................................................144 IV.21 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: 2 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU CNN, PU LVQ1 ..153 IV.22 Validation ANN-structure based complexity estimator embedded into T-DTS framework: 10 Stripe benchmark, 2 classes, generalization database size 1600 prototypes, learning database size 400 prototypes, DU CNN, PU LVQ1 .......................................154 IV.23 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 479 prototypes, learning database size 479 prototypes, DU CNN, PU MLP FF GDM...................................................................................................................157 _ _
vii
IV.24 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 766 prototypes, learning database size 192 prototypes, DU CNN, PU MLP_FF_GDM ....158 IV.25 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Splice-junction DNA sequence classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU CNN, PU MLP FF GDM...................................................................................................................158 _ _ IV.26 Validation T-DTS self-tuning threshold procedure, Average learning rate (including its corridor of the standard deviations) as a function of θ - threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU CNN, PU PNN, Fisher measure based complexity estimator ................................160 IV.27 Validation T-DTS self-tuning threshold procedure, Average generalization rate (including its corridor of the standard deviations) as a function of θ -threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU CNN, PU PNN, Fisher measure based complexity estimator ......161 IV.28 Validation T-DTS self-tuning threshold procedure, Average clusters number as a function of θ -threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU CNN, PU PNN, Fisher measure based complexity estimator.................................................................................................161 IV.29 Validation T-DTS self-tuning threshold procedure, Performance estimating function P( θ ): 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU CNN, PU PNN, Fisher measure based complexity estimator..............................................................................................................................162 IV.30 Validation T-DTS self-tuning threshold procedure, Clusters number distribution: 4 Spiral benchmark, 2 classes, learning database size 500 prototypes, DU CNN, Collective entropy based complexity estimator ...................................................................................163 IV.31 Validation T-DTS self-tuning threshold procedure: 10 stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU CNN, PU LVQ1, 4 complexity estimators ......................................................................164 IV.32 Validation T-DTS self-tuning threshold procedure, Clusters number distribution: Tic-tac-toe endgame problem, 2 classes, DU CNN, Collective entropy complexity estimator..............................................................................................................................166 IV.33 Validation T-DTS self-tuning threshold procedure, Clusters number distribution: Tic-tac-toe endgame problem, 2 classes, DU CNN, ANN-structure based complexity estimator..............................................................................................................................166 IV.34 Validation T-DTS self-tuning threshold procedure, Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU CNN, PU MLP_FF_GDM, 3 complexity estimators .............................................................................................................................................168 IV.35 Validation T-DTS self-tuning threshold procedure, Clusters number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU CNN, Purity PRISM based complexity estimator.........................170 IV.36 Validation T-DTS self-tuning threshold procedure, Clusters number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size