Tree clustering for layout-based document image retrieval
Simone Marinai, Emanuele Marino, Giovanni Soda
Dipartimento di Sistemi e Informatica
Universit`a di Firenze
Via S.Marta, 3 - 50139 Firenze - Italy
marinai@dsi.unifi.it
Abstract
We describe a system for the retrieval on the basis of lay-
out similarity of document images belonging to collections
stored in digital libraries. Layout regions are extracted and
represented with the XY tree. The proposed indexing method
combines a new tree clustering algorithm (based on Self Or-
ganizing Maps) with Principal Component Analysis. The
combination of these techniques allows us to retrieve the
most similar pages from large collections without the need
for a direct comparison of the query page with each indexed
document.
1. Introduction
The use of Document Image Retrieval (DIR) techniques
is essential to build successful Digital Libraries [1]. Among
several DIR paradigms, the layout-based retrieval aims at
identifying relevant documents on the basis of similarities
in the page layout. We recently addressed the layout-based
DIR by means of an XY tree-based layout representation
and subsequent tree comparison [13]. To compare the trees
we used both tree edit distance algorithms and similarities
based on a fixed-length encoding of the tree structure by
means of N-gram inspired techniques that allow us to de-
scribe the tree with a 388-dimensional feature vector. In
these methods we used tree grammars to model the varia-
tions in the trees of similar pages which are due to segmen-
tation errors or to structural differences.
Both approaches, and in particular the method based on
the tree edit distance, have significant computational costs
when dealing with large document collections. In the re-
lated problem of exact indexing of Dynamic Time Warping
a fast lower bounding function that helps prune sequences
that could not possibly be the best matches has been re-
cently proposed [9]. However, these techniques have not
yet been adapted to the tree edit distance computation.
On the other hand, efficient search in high dimensional
vector spaces is still the subject of active research. Some
methods (e.g. [2] [15]) have been proposed to search in
high dimensional spaces, however these methods degener-
ate to the linear complexity when dealing with spaces hav-
ing more than a few dozens of dimensions. One interesting
feature of the Cluster Tree approach [15] is the use of a clus-
tering algorithm in the indexing phase, so as to capture the
uneven distribution of patterns in the feature space.
In this paper we tackle the document indexing by explor-
ing the effectiveness of a tree clustering method based on
Self-Organizing Maps (SOM) [10] extending an approach
that we recently proposed for the efficient word retrieval
from large document collections [11].
The method described in this paper groups trees con-
sidering the similarity between vectorial encodings of their
structure. To reduce the search complexity we project the
data in each cluster with Principal Component Analysis
(PCA) and then perform an efficient search in the projected
space. In the retrieval phase the user provides to the sys-
tem a prototype document (with a query by example mech-
anism) and an XY tree is extracted from the query page. A
similarity measure is then computed between the query tree
and the indexed pages by using the SOM clustering and pro-
jected spaces.
The paper is organized as follows: in Section 2 we ana-
lyze some alternatives to compute the tree clustering. The
approaches proposed to perform document indexing and re-
trieval are discussed in Sections 3 and 4, respectively. Some
experimental results are reported in Section 5, whereas a fi-
nal discussion is drawn in Section 6.
2. SOM clustering of page layout
One essential step in the proposed document image re-
trieval system is the clustering of pages on the basis of lay-
out similarity by means of Self Organizing Maps. The Self
Organizing Map (SOM [10]) is a special kind of artificial
neural network that is based on competitive learning. The
purpose of the SOM training is the computation of an optimal clustering of a collection of patterns in the input vector space.

Figure 1. Example of one trained recursive SOM. Empty neurons (not labeled) are activated by internal or leaf nodes.

In the Self-Organizing Map the neurons are typically arranged in a two dimensional lattice: the feature map. Each neuron receives inputs from the input layer (the pattern vectors) and from the other neurons in the map. During the learning the network performs clustering and the neurons are moved in the lattice so as to reflect cluster similarity by means of distances in the map. Each element in the map is associated with one real vector that can be considered as a prototype of the patterns in the cluster.
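The competitive learning just described can be illustrated with a minimal SOM training loop (a sketch with our own parameter choices and function names, not the implementation used in the paper):

```python
import numpy as np

def som_train(data, rows=15, cols=10, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: a rows x cols lattice of prototype vectors."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    protos = rng.normal(size=(rows, cols, n))      # one prototype per neuron
    # lattice coordinates, used for neighborhood distances on the map
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5    # shrinking neighborhood
        for x in data:
            # winner: the neuron whose prototype is closest to the pattern
            d = np.linalg.norm(protos - x, axis=-1)
            w = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood around the winner on the lattice
            h = np.exp(-np.sum((grid - np.array(w)) ** 2, axis=-1)
                       / (2 * sigma ** 2))
            # move the winner and its neighbors toward the pattern
            protos += lr * h[..., None] * (x - protos)
    return protos

def winner(protos, x):
    """Index of the neuron activated by pattern x."""
    d = np.linalg.norm(protos - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```

After training, the prototype of each neuron plays the role of the cluster center described above.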
The standard SOM model [10] deals with fixed-size vec-
tors that are mapped into a topological map at the end of the
learning process. To compute this map we might represent
each page with a fixed-size vector (e.g. obtained by com-
puting appropriate features in the regions defined by a grid
superimposed to the page [8, 14]). Pages in digital libraries
are frequently perceived as relevant for a query even when
the layout objects have a similar arrangement, but different
physical positions (e.g. a section title that can be in any
position in text columns). To capture these similarities the
grid representation can be less useful than structural repre-
sentations of the page layout, like the XY tree [4, 6]. In the
latter case the layout similarity can be computed by means
of an appropriate tree similarity. In our framework, we first
build a structural page representation (based on XY trees)
and then compute a SOM clustering either using appropri-
ate models designed to deal with Directed Acyclic Graphs
(DAG) [7], or using the standard algorithm with a suitable
tree encoding. These two approaches are shortly analyzed
in the following.
Figure 2. One example page together with the corresponding MXY tree, and the tree-patterns occurring in the tree. For instance there are three chain tree-patterns and in particular the pattern hs-hs-T occurs twice.
2.1 Recursive SOM

Artificial neural networks are mostly used to process
items represented with fixed-size feature vectors. The
Multi Layer Perceptron (MLP) and the Self-Organizing
Map (SOM) are typical examples of these models in the su-
pervised and non-supervised frameworks, respectively. In
recent years, both the MLP and SOM have been extended
to deal with structured data by means of the recursive model
(see [12] for additional references). In particular, the recur-
sive SOM (R-SOM) has been introduced in [7] with the aim
of processing structured data encoded as labeled Directed
Acyclic Graphs (DAGs) in an unsupervised fashion. The
extension from the standard SOM is obtained by taking into
account the unfolding approach initially proposed for recurrent MLPs and then used in recursive MLPs.
To simplify the following description we will refer to
trees instead of DAGs. In the unfolding framework each
node of the tree is encoded with a real vector containing fea-
tures of the node and information provided by its children.
This vector is used as input to a neural network (an MLP in
the case of recursive MLP) and the network output is then
used as input to its father node in a recursive way. The extension to the SOM (see [7] for details) allows the clustering of nodes of the whole training set collection into neurons of
the SOM map. At the end of the training, some neurons of
the SOM will be active mostly for leaf or internal nodes,
whereas other neurons will correspond to roots.
Given a trained R-SOM it is therefore possible to retrieve
the most similar pages by identifying the neuron that is acti-
vated by the query tree root after processing all its subtrees.
The indexed trees (corresponding to pages) associated with the
winning node will define the closest pages with respect to
the query.
To evaluate this idea we trained one R-SOM with 6,000
pages that have been manually labeled with a class name
(e.g. “title page”, “image”). The size of the R-SOM is 60 x
70, and it is significantly larger than the SOM described in
Section 2.2 so as to leave space for both internal nodes and
root nodes. At the end of the R-SOM training process, we
associated each tree with the SOM neuron activated by its
root. Lastly, we labeled the neurons with the most frequent
classes among the associated trees. The result of this pro-
cess is shown in Figure 1, where graphical symbols denote
the different classes associated with each root node.
By analyzing Figure 1, we can see that the neurons cor-
responding to each class (identified with the same sym-
bol) are spread in the map and usually are not clustered
together. Moreover, when inspecting the content of each
neuron, we noticed that pages with different layout are fre-
quently mixed together. This is probably due to the unfold-
ing process, that gives more emphasis to the information
provided by nodes close to the root. We therefore adopted a
different approach to perform the tree clustering for layout-
based DIR as described in the next section.
2.2 SOM clustering of encoded trees

To solve the above mentioned problems, we propose to
build one vector-based tree representation to be used to train
a SOM map in order to cluster the pages on the basis of
layout similarity. In this section we first describe the tree
encoding approach, and then analyze the training algorithm.
2.2.1 Vector-based tree encoding
In our DIR system we use an extension of the XY tree representation (the MXY tree) that splits the page layout also along horizontal and vertical ruling lines in addition to the usual cuts along white spaces [4]. The MXY tree nodes contain symbolic attributes describing the purpose of the node. In particular, internal nodes represent the cut strategy considered (we can have horizontal/vertical cuts along either spaces or lines), whereas leaves correspond to homogeneous blocks in the page (text, image, horizontal/vertical line).

Figure 3. Example of a SOM trained with the proposed algorithm. Each neuron is represented with the thumbnail of the page whose tree coincides with the cluster center. Pointed regions identify close neurons having homogeneous content: A) Title pages; B) Pages with equations and formulas; C) Two column text; D) Single column text; E) Two column text with a picture on the top; F) Two column text with a section title in the middle of one column; G) Similar to F but the section title is at the top of one column; H) Two column text with one picture in one of the two columns.

The MXY tree data structure is encoded into a fixed-size feature vector by taking into account occurrences of tree-patterns made by three nodes [3]. Trees composed of three nodes can have two basic structures. The first pattern has the root and two children and is called balanced tree-pattern. The second pattern is made by the root, one child, and a child of the second node and is called chain tree-pattern (see Figure 2 as an example).
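The two kinds of three-node tree-patterns can be counted with a short traversal. The sketch below assumes a node is a (label, children) pair; this representation, and the choice of counting adjacent sibling pairs as balanced patterns, are our own illustrative assumptions, not the encoding of [3]:

```python
from collections import Counter

def tree_patterns(label, children):
    """Count balanced and chain three-node tree-patterns in a labeled tree.

    'balanced': a node together with a pair of adjacent children.
    'chain':    a node, one child, and one of that child's children.
    """
    counts = Counter()
    stack = [(label, children)]
    while stack:
        lab, kids = stack.pop()
        # balanced tree-patterns rooted at this node
        for i in range(len(kids) - 1):
            counts[("balanced", lab, kids[i][0], kids[i + 1][0])] += 1
        # chain tree-patterns rooted at this node
        for k_lab, k_kids in kids:
            for g_lab, _ in k_kids:
                counts[("chain", lab, k_lab, g_lab)] += 1
            stack.append((k_lab, k_kids))
    return counts
```

For a small tree such as `("vl", [("hs", [("T", []), ("T", [])]), ("hs", [("I", [])])])` the pattern ("chain", "vl", "hs", "T") is counted twice, mirroring the repeated-occurrence behavior shown in Figure 2.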
Since node labels are in a fixed number, the number of possible tree-patterns is fixed as well. Frequently, similar pages have some tree-patterns in common and contain the same number of occurrences of a given tree-pattern (for instance table pages usually contain a large number of tree-patterns with lines). Unfortunately, some patterns appear roughly in every document, and therefore these patterns are not very useful for measuring page similarities.

These peculiarities are very similar to the use of index terms in classic Information Retrieval. We extended the vector model approach (used in IR for textual documents) to our representation based on tree-patterns. The vector values are weighted so as to provide more importance to the most discriminant terms. One common approach relies on the well known tf-idf weighting scheme. The weight w_ij assigned to the i-th tree-pattern in page j is computed by:

    w_ij = tf_ij * log(N / n_i)    (1)

where tf_ij is the frequency of the i-th tree-pattern in the tree corresponding to page j, normalized with respect to the maximum tree-pattern frequency in the tree associated to page j; N is the total number of pages; and n_i is the number of pages containing the i-th tree-pattern.
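Equation (1) can be applied to a page-by-pattern count matrix as follows (a minimal NumPy sketch; the matrix layout and function name are our assumptions):

```python
import numpy as np

def tfidf_weights(counts):
    """tf-idf weighting of tree-pattern occurrence counts, as in Eq. (1).

    counts: (pages x patterns) matrix of raw tree-pattern occurrences.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                          # total number of pages
    max_f = counts.max(axis=1, keepdims=True)    # max pattern frequency per tree
    # tf: normalize each row by its maximum frequency (0 rows stay 0)
    tf = np.divide(counts, max_f, out=np.zeros_like(counts), where=max_f > 0)
    n = (counts > 0).sum(axis=0)                 # pages containing each pattern
    idf = np.zeros(counts.shape[1])
    idf[n > 0] = np.log(N / n[n > 0])
    return tf * idf
```

Note that a pattern occurring in every page gets idf = log(1) = 0, which matches the observation above that ubiquitous patterns carry no discriminative weight.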
2.2.2 SOM training
By using the approach described in Section 2.2.1 it is possible to represent each tree in the collection with a fixed-size feature vector that corresponds to the tf-idf weighting of the occurrences of tree-patterns in the tree. These vectors can be used to compute a SOM clustering.
In our model the SOM is used to build a document im-
age database where the clusters contain similar documents
(from the page layout point of view). In Figure 3 we report
one trained SOM obtained with the approach summarized in
this section. The size of the map is 15 x 10 and it is smaller
than the R-SOM described in Section 2.1 since in this case
we do not need to leave space for the internal tree nodes. A
careful inspection of Figure 3 allows us to identify homoge-
neous regions (groups of neurons containing similar pages).
To facilitate this inspection, we marked regions with similar
layout as detailed in the figure caption. The contents of the
clusters are more homogeneous than before, even if in this
case we used a larger dataset (described in Section 5) with
a broad range of books.
3. Document image indexing
After computing a suitable SOM clustering of pages we
can use this map to index the pages. When dealing with
large collections, in order to reduce the SOM training time,
we can compute the SOM by using a sub-set of the pages to
be indexed. A segmentation algorithm is used to extract the
MXY tree from each indexed page. Next, the trees are encoded, obtaining a vectorial representation whose items are related to the occurrences of the tree-patterns in the MXY tree. One problem of this approach is the high dimensionality of the feature vector (388 dimensions) that is reflected
into a long training time for the SOM. To reduce the cost
of the search in each cluster we also compute, during the
indexing, the projection that best represents the data with
PCA.
The following procedure is repeated for each SOM cluster in order to compute a low dimensional hyperplane by means of Principal Component Analysis (see e.g. [5], page 568). We first compute the mean vector m and the covariance matrix S of the data in the cluster. The eigenvectors and eigenvalues of S are computed and the eigenvectors are sorted according to decreasing eigenvalue. The first k eigenvectors are combined as columns of the matrix A. It is now possible to project the data in the cluster (for instance one point x) onto the k-dimensional subspace according to

    y = A^T (x - m)    (2)
Summarizing, the document indexing is composed of the following steps: 1) layout analysis and tree encoding; 2) SOM training and PCA computation from a sub-set of the documents to be indexed; 3) projection of all the documents in the lower dimensional space.
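The per-cluster PCA of Eq. (2) can be sketched as follows (a NumPy illustration with our own names; not the authors' implementation):

```python
import numpy as np

def pca_index(cluster_data, k):
    """Per-cluster PCA as in Eq. (2): y = A^T (x - m).

    cluster_data: (patterns x d) vectors assigned to one SOM cluster.
    Returns the mean m, the d x k projection matrix A (top-k eigenvectors
    of the covariance matrix), and the projected patterns.
    """
    X = np.asarray(cluster_data, dtype=float)
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # covariance matrix of the cluster
    vals, vecs = np.linalg.eigh(S)         # eigh: symmetric input, ascending order
    A = vecs[:, ::-1][:, :k]               # top-k eigenvectors as columns
    Y = (X - m) @ A                        # project every indexed vector
    return m, A, Y

def project(m, A, x):
    """Project one point onto the k-dimensional subspace (Eq. (2))."""
    return A.T @ (np.asarray(x, dtype=float) - m)
```

At indexing time m, A, and the projected vectors are stored for each cluster, so that at query time only the low-dimensional projections need to be compared.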
4. Document image retrieval
Document images are retrieved by using a “query by example” approach: given a sample page, the system computes its layout similarity with all the indexed pages.
The pages in the database are afterwards ranked on the basis
of this similarity and shown to the user. The page similar-
ity is evaluated considering the distance between the feature
vectors describing the layout.
Figure 4. From top: final ranking, and contents of the ranked lists of the three closest clusters (Cluster 0, Cluster 1, Cluster 2) for the query page shown in Figure 2. The query is a page inserted during the digitization in order to separate different parts of a volume.
    Class:                      1      2      3      4
    Correct pages (avg):      14.1   11.9    5.2   13.4
    Top correct pages (avg):  10.1    6.6    2.2    8.7
    # queries:                  10     10      5     10

Table 1. Results obtained for 35 query pages belonging to four classes. The first row is the average number of correct pages in the final ranking list; the second row is the average number of subsequent correct pages at the top of the list. For both values we removed the query page (retrieved as well) from the counts.
Without taking into account efficiency issues, one simple
approach for document image retrieval would rely on the
linear comparison of each indexed page with the query doc-
ument image by measuring the distance between the cor-
responding feature vectors. Unfortunately, this approach is
not feasible when large databases are considered. The pro-
posed method takes into account the main features of SOM
clustering and PCA to efficiently address the document re-
trieval. Let us describe the four main steps of the algorithm.
1.
The query
The query page is encoded in the same way as the indexed
pages. Obviously, when the query page belongs to the
indexed set, the page itself is retrieved in the first posi-
tion.
2.
Cluster search
In the second step we identify the three clusters hav-
ing prototypes closer to the query so as to restrict the
subsequent search. Due to layout variations and to the
presence of noise in scanned pages, it is not usual that
all the most similar pages are contained in the closest
cluster. To tackle this problem we consider the three
closest clusters. In so doing we reduce on the average
the number of patterns to be processed by the subse-
quent step by a factor 100 (for a map having size of 15
x 20).
3.
Search in clusters
After selecting the three closest clusters we identify
the most similar pages by means of PCA. During the
search in the three clusters we look for the nearest
neighbors in the projected spaces.
4.
Final ranking
The previous step allows us to identify the most sim-
ilar pages in the projected spaces of the three closest
clusters. To merge the three lists of top-ranked pages
and refine the final ranking we compute the distance in
the original space between the query tree and the trees
in the three lists. It is worthwhile to remark that the
computation in the original space is performed for a
small number of pages and therefore it is not problem-
atic from the computational point of view.
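The four retrieval steps above can be sketched as follows (a simplified illustration; the cluster index structure and all names are our own assumptions):

```python
import numpy as np

def retrieve(query, centers, clusters, k=20, n_clusters=3):
    """Sketch of the retrieval algorithm, not the authors' code.

    centers:  (c x d) SOM prototype vectors.
    clusters: per-cluster dicts built at indexing time with keys
              'm', 'A', 'proj', 'orig', 'ids' (mean, PCA matrix,
              projected vectors, original vectors, page ids).
    """
    q = np.asarray(query, dtype=float)
    # step 2: the clusters whose prototypes are closest to the query
    order = np.argsort(np.linalg.norm(centers - q, axis=1))[:n_clusters]
    candidates = []
    for ci in order:
        c = clusters[ci]
        qp = c["A"].T @ (q - c["m"])             # Eq. (2) applied to the query
        # step 3: nearest neighbors in the cluster's projected space
        d = np.linalg.norm(c["proj"] - qp, axis=1)
        for i in np.argsort(d)[:k]:
            candidates.append((c["ids"][i], c["orig"][i]))
    # step 4: re-rank the merged candidates by distance in the original space
    candidates.sort(key=lambda t: np.linalg.norm(t[1] - q))
    return [pid for pid, _ in candidates[:k]]
```

Only the small merged candidate set is compared in the original 388-dimensional space, which is what keeps the final re-ranking inexpensive.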
We show in Figure 4 the first 20 pages in the three clus-
ters closest to the query image shown in Figure 2, as well as
the first 20 pages in the final ranking. In this case the first
cluster contains six matching pages, whereas the other clus-
ters contain five and four matching pages, respectively. The
final ranking (on top of the figure) contains seven matching
pages in the first positions of the ranking.
5. Experimental results
To perform our experiments on a large data-set we ex-
tracted the MXY trees and we indexed the pages from 53
books with a total of 22,523 pages. The size of the SOM
is 10 x 15, therefore on the average each neuron contains
around 150 pages. Moreover, we should notice that there
are several neurons corresponding to most common layouts
and usually those neurons are very populated. It is interesting to summarize the disk space required to store these documents. The compressed TIF images need 3.3 GBytes; the MXY trees 90 MBytes; the cluster information (centers, projected vectors and original vector-based tree encoding) 47 MBytes.

Figure 5. Four examples of retrieval results. Each answer set is sorted left-to-right, top-bottom. The first image is always the query one.
To perform some preliminary experiments with this data-
set we selected 35 pages whose layout can be grouped into
four main categories. The standard technique to evaluate in-
formation retrieval systems is the computation of precision-
recall plots. However, to build these plots we need to know
all the documents satisfying the query. Since this informa-
tion is not available for this large data-set, we identified the
number of correct answers in the final ranking proposed by
the system. Figure 5 contains four examples of queries to-
gether with the final answer set. Table 1 summarizes the
results obtained for the 35 query pages. Since Table 1 con-
tains average values, it is not possible to describe each indi-
vidual query. For instance, for the queries shown in Figure 5
several correct answers are highlighted. However, in other
cases there are few correct answers since the layout of the
query page is not very common.
6. Conclusions
In this paper we proposed a new approach to perform
document image retrieval relying on an XY tree page layout
representation. The most important feature of the proposed
method is the use of Self Organizing Maps to perform doc-
ument layout clustering so as to reduce the computational
cost during the retrieval phase. We first empirically compared
two methods to perform the tree clustering and then inte-
grated the most promising one into the query retrieval sys-
tem. Preliminary experiments with a data-set containing
more than 22,000 pages show promising results.
References
[1] H. S. Baird. Digital libraries and document image analysis. In Proc. 7th ICDAR, pages 2–14, 2003.
[2] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: an index structure for high-dimensional data. In Proc. 22nd VLDB, pages 28–39, 1996.
[3] F. Cesarini, M. Lastri, S. Marinai, and G. Soda. Page classification for meta-data extraction from digital collections. In Proc. DEXA 01, pages 82–91, 2001.
[4] F. Cesarini, S. Marinai, and G. Soda. Retrieval by layout similarity of documents represented with MXY trees. In Document Analysis Systems V, pages 353–364. Springer Verlag, LNCS 2423, 2002.
[5] R. O. Duda, P. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2001.
[6] P. Duygulu and V. Atalay. A hierarchical representation of form documents for identification and retrieval. IJDAR, 5(1):17–27, November 2002.
[7] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14(3):491–505, 2003.
[8] J. Hu, R. Kashi, and G. Wilfong. Comparison and classification of documents based on layout similarity. Information Retrieval, 2(2/3):227–243, May 2000.
[9] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7:358–386, 2005.
[10] T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences, 2001.
[11] S. Marinai, S. Faini, E. Marino, and G. Soda. Efficient word retrieval by means of SOM clustering and PCA. In DAS 2006, pages 336–347. Springer Verlag, LNCS 3872, 2006.
[12] S. Marinai, M. Gori, and G. Soda. Artificial neural networks for document analysis and recognition. IEEE Transactions on PAMI, 27(1):23–35, 2005.
[13] S. Marinai, E. Marino, and G. Soda. Layout based document image retrieval by means of XY tree reduction. In Proc. 8th ICDAR, pages 432–436, 2005.
[14] A. Tzacheva, Y. El-Sonbaty, and E. A. El-Kwae. Document image matching using a maximal grid approach. In Proceedings of the SPIE Document Recognition and Retrieval IX, pages 121–128, 2002.
[15] D. Yu and A. Zhang. ClusterTree: integration of cluster representation and nearest-neighbor search for large data sets with high dimensions. IEEE Transactions on Knowledge and Data Engineering, 15(5):1316–1337, 2003.