A Review of Branch-and-Bound Algorithms for Geometric and Statistical Layout Analysis


Thomas M. Breuel
University of Kaiserslautern and DFKI

Abstract: Many different approaches to the geometric and statistical analysis of document layouts have been proposed in the literature. The development of practical branch-and-bound algorithms for solving geometric matching problems under noise and uncertainty has enabled the formulation of new classes of geometric layout analysis methods based on globally optimal maximum likelihood interpretations for well-defined models of the spatial statistics of document images. I review this approach to geometric layout analysis using text line finding and column finding in the presence of noise and uncertainty as examples and compare the approach with selected other statistical and geometric layout analysis methods.

Keywords: document layout analysis, geometric matching, text line finding, branch-and-bound algorithms, global optimization


A Review of BranchandBound Algorithms for Geometric and Statistical Layout Analysis Thomas M. Breuel

University of Kaiserslautern and DFKI tmb@informatik.unikl.de Résumé:Many different approaches to the geometric andretrieval of documents. This information can often only be statistical analysis of document layouts have been propoderived from the layout of the text (as opposed to the textual sed in the literature. The development of practical branchcontent or even font properties). For example, titles and au andbound algorithms for solving geometric matching prothors of scientiﬁc papers tend to be printed at the top of the blems under noise and uncertainty has enabled the formulaﬁrst page, centered, with the title immediately preceding the tion of new classes of geometric layout analysis methods baauthor and separated from the rest of the text by whitespace, sed on globally optimal maximum likelihood interpretationsproperties that are recoverable by document layout analysis. for welldeﬁned models of the spatial statistics of documentAnother application of document layout analysis is image images. I review this approach to geometric layout analysisbased reformatting and reﬂow of documents, a technique that using text line ﬁnding and column ﬁnding in the presenceallows the display of scanned documents on smallscreen de of noise and uncertainty as examples and compare the apvices without OCR errors and while preserving the appea proach with selected other statistical and geometric layoutrance of the original document [BRE 02b]. analysis methods. 2 Layout Primitives Motsclés: document layout analysis, geometric matching, The actual layout of a document is the result of the applica text line ﬁnding, branchandbound algorithms, global opti tion of complex, interacting rules about where to place text mization on the page. 
Some of those rules are consequences of pro perties of the human visual system and attempts to make 1 Introduction text more readable (e.g., keeping line lengths below a cer In addition to their purely textual content, rendered docu tain number of characters per line), others are the results of ments contain a wealth of information in the geometric arran physical constraints (e.g., page size), constraints imposed by gement of the text and ﬁgures on the page–thepage layout. traditional type setting equipment (e.g., the use of straight Examples of properties encoded in the page layout are infor and parallel text lines), convention (e.g., where page numbers mation about which text corresponds to the title, author, page and titles go), as well as stylistic and artistic considerations. number, and abstract of a document, the order in which the While layouts can become enormously complex, almost all body text is to be read (thereading order), and major logical layouts tend to be composed of a number of recurring pri divisions in the body text. Recovering this information is the mitives. The most important of these are text lines, text co problem ofdocument layout analysis. lumns, sections, and paragraphs. Furthermore, these primi Document layout information has a variety of uses. It is a tives have a number of common geometric relationships bet key step in the conversion of scanned documents into ma ween them, deﬁned by their relative size, spacing, alignment, chine readable form ; that is, what we typically think of “op and justiﬁcation. We call the extraction of these primitives tical character recognition” (OCR) actually comprises both physical document layout analysis. We refer to the extraction layout analysis and recognition of individual characters. In of higherlevel properties of a document (like titles, authors, fact, even the recognition of characters in OCR depends on page numbers, etc.) aslogical document layout analysis. 
Lo correct document layout analysis, since the interpretation of gical document layout analysis generally makes use of physi certain characters is affected by their position relative to the cal document layout analysis to achieve its goals. This work text line and since statistical language models depend on the deals primarily with physical document layout analysis, that correct reading order of the text. Furthermore, the user of is, the reliable extraction of primitives like text lines and text an OCR system usually expects to obtain not just a vector columns, and the geometric relationship between those pri graphics ﬁle with thousands of characters placed at speciﬁc mitives. location in the image, but instead an editable and structured text ﬁle that contains text in its correct reading order and cor 3 Previous Methods rectly identiﬁes actual line and paragraph breaks. A large number of physical document layout analysis tech But while OCR is perhaps the most important use of docu niques have been proposed in the literature ([CAT 98] pro ment layout analysis, it is not the only one. Document data vides a good overview). In order to perform their function, bases need to extract information that permits indexing and layout analysis techniques make assumptions (explicitly or
