Tree clustering for layout-based document image retrieval

Simone Marinai, Emanuele Marino, Giovanni Soda
Dipartimento di Sistemi e Informatica
Università di Firenze
Via S. Marta, 3 - 50139 Firenze - Italy
marinai@dsi.unifi.it

Abstract

We describe a system for retrieving, on the basis of layout similarity, document images belonging to collections stored in digital libraries. Layout regions are extracted and represented with the XY tree. The proposed indexing method combines a new tree clustering algorithm (based on Self-Organizing Maps) with Principal Component Analysis. The combination of these techniques allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.

... vector spaces is still the subject of active research. Some methods (e.g. [2] [15]) have been proposed to search in high-dimensional spaces; however, these methods degenerate to linear complexity when dealing with spaces having more than a few dozen dimensions. One interesting feature of the Cluster Tree approach [15] is the use of a clustering algorithm in the indexing phase, so as to capture the uneven distribution of patterns in the feature space.

In this paper we tackle document indexing by exploring the effectiveness of a tree clustering method based on Self-Organizing Maps (SOM) [10], extending an approach that we recently proposed for the efficient word retrieval from large document ...