Association and Scalar Clusters Tutorial – Part 1:
Back Mapping Term Clusters to Documents

Dr. E. Garcia
admin@miislita.com

First Published: January 18, 2008. Last Modified: March 9, 2008.
Copyright © Dr. E. Garcia, 2008. All Rights Reserved. http://www.miislita.com

Abstract

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A
“reaction” equation approach is used to break down the classification problem to a sequence of steps. From the
initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified.
A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters.
Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation,
and document classification are discussed.

Keywords: association, scalar, clusters, back mapping, topic discovery, similarity matrix, document classification

Background

This tutorial is based on the Association and Scalar Clusters lecture given to graduate students enrolled in Web
Mining, Search Engines, and Business Intelligence (http://irthoughts.wordpress.com/category/web-mining-course/),
Polytechnic University of Puerto Rico (http://www.pupr.edu). Some portions are taken from a test given to the
students and from the January issue of IR Watch – The Newsletter (1, 2008).

Linear algebra calculations are represented as a sequence of "chemical reactions". In my experience, it is easier for
students to solve linear algebra problems if they can visualize these as stepwise processes. Using reaction
equations requires less talking and fewer redundant explanations, and it helps students organize their own thoughts
and lecture notes. Last, but not least, I like the approach since I also have a background in chemistry.

Before delving into the tutorial we need to explain the difference between association and scalar clusters and define
what a similarity matrix M is. While association clusters are based on comparing nearest neighbors (pairwise
similarity), scalar clusters are based on comparing neighborhoods (neighborhood-induced similarity). We denote
the corresponding similarity matrices by M_nn and M_n, respectively.

In general, a similarity matrix M is any array of the form



Figure 1. A similarity matrix M is a special type of array.

Note that M is an array with the following salient properties:

Equidimensionality: # rows = # columns = trace
Non-Negativity: m_ij ≥ 0
Reflexivity: m_ij = 1 for all i = j
Symmetry: m_ij = m_ji
Boundedness: 0 ≤ m_ij ≤ 1
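
These properties are easy to verify programmatically. Below is a minimal NumPy sketch; is_similarity_matrix is a hypothetical helper and the 3 x 3 matrix is a made-up example, not one from this tutorial.

import numpy as np

def is_similarity_matrix(M, tol=1e-8):
    # Return True if M satisfies the salient properties listed above.
    M = np.asarray(M, dtype=float)
    if M.ndim != 2 or M.shape[0] != M.shape[1]:
        return False                                    # equidimensionality
    return bool(np.all(M >= -tol)                       # non-negativity: m_ij >= 0
        and np.allclose(np.diag(M), 1.0, atol=tol)      # reflexivity: m_ii = 1 (so trace = n)
        and np.allclose(M, M.T, atol=tol)               # symmetry: m_ij = m_ji
        and np.all(M <= 1.0 + tol))                     # boundedness: m_ij <= 1

# Made-up example:
M = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
print(is_similarity_matrix(M))  # True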

With this in mind, let us proceed with the tutorial.


Problem

A user searches for surfing in a search engine and constructs a document collection from the top five ranked
documents. The following procedure is used.

Firstly, titles are treated as "documents" and indexed. To index these "documents", they undergo linearization,
tokenization, and filtration. Stemming is not applied. A term-document matrix A is constructed from the surviving terms.

To simplify the discussion, the a_ij elements of A are computed using FREQ, loosely known as the term count model,

a_ij = L_ij = f_ij    (Eq 1)

where

L_ij is the weight of term i in document j.
f_ij is the occurrence frequency of term i in document j.

Global and normalization weights are not used, but we could include these as well. See Figure 2.



Figure 2. This is a term-document matrix.
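
To make the indexing step concrete, here is a minimal sketch of how such a matrix could be built. The five titles are invented placeholders (Figure 2 itself is not reproduced here); only the procedure, tokenize, keep the surviving terms, and count raw frequencies f_ij per Eq 1, follows the text.

import numpy as np

# Hypothetical titles standing in for the five "documents" (d1..d5).
titles = ["hawaii web and internet services",
          "surfing the web on a fast internet connection",
          "internet surfing from a hawaii beach",
          "surfing a hawaii beach",
          "best hawaii beach surfing spots"]

# Surviving terms after filtration (t1..t5).
terms = ["internet", "web", "surfing", "hawaii", "beach"]

# a_ij = f_ij: raw frequency of term i in document j (Eq 1).
A = np.array([[title.split().count(t) for title in titles] for t in terms],
             dtype=float)
print(A)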

The problem now reduces to classifying terms into association and scalar clusters and identifying which documents
are more pertinent to the clusters. However, another part of the problem is how to deal with terms that are
ambiguous. For instance, surfing seems to be mentioned within different contexts across the collection. Should we
include surfing with clusters that mention internet and/or web, or with clusters that mention hawaii and/or beach?
This is a disambiguation problem.

Solution

Firstly, we reduce the problem to the following “reaction” equations:

A → A^T          Transpose of A
A A^T → B        Co-Weight Matrix
B → C = M_nn     Similarity Matrix (nearest neighbor)
C → D            Row-Normalized Matrix
D → D^T          Transpose of D
D D^T → E = M_n  Similarity Matrix (neighborhood)

C is a similarity matrix populated with Jaccard's Coefficients. These coefficients estimate the degree of similarity
between pairs of neighbors. Since pairs of nearest neighbors are used to generate the clusters, we could refer to C
as a nearest neighbor similarity matrix. We can denote this by stating that C is an M_nn-type matrix; i.e., C = M_nn.

By contrast, E is a similarity matrix populated with cosine angles (i.e., cosine similarities). Essentially, we treat a row
i populated with Jaccard's Coefficients as the neighborhood of term i. Then, rows are restated as unit vectors and
their Dot Products taken. Since for unit vectors cosine angles equal Dot Products, the result is a matrix of cosine
similarities. E can be called a neighborhood similarity matrix. We denote this by stating that E is an M_n-type
matrix; i.e., E = M_n.
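
Under these definitions, the whole reaction sequence takes only a few lines of NumPy. The matrix A below is a hypothetical example chosen so that the Jaccard coefficients quoted later in the text (J_11 = 1, J_12 = 0.67, J_13 = 0.40) come out right; the tutorial's actual Figure 2 may differ.

import numpy as np

# Hypothetical term-document matrix: rows t1..t5 = internet, web, surfing,
# hawaii, beach; columns d1..d5.
A = np.array([[1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 1, 1],
              [1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)

B = A @ A.T                                       # A A^T -> B (co-weight matrix)

d = np.diag(B)                                    # B -> C = M_nn, via Jaccard's
C = B / (d[:, None] + d[None, :] - B)             # J_ij = b_ij/(b_ii + b_jj - b_ij)

D = C / np.linalg.norm(C, axis=1, keepdims=True)  # C -> D (rows as unit vectors)

E = D @ D.T                                       # D D^T -> E = M_n (cosines)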

Clusters are then generated from C and E and these are mapped back to documents. Back mapping is aimed at
identifying in which documents all members of a given cluster co-occur. These documents should be more relevant
to a cluster than any other documents of the collection. The procedure is straightforward and consists of the
following steps.

Step 1: Compute the transpose of A by changing rows into columns and columns into rows.

A → A^T



Figure 3. Construction of A^T.

Step 2: Compute the Co-Weight matrix B by premultiplying A^T by A.

A A^T → B



Figure 4. Construction of B.

Step 3: Compute matrix C by transforming co-weights into pairwise similarities. Use Jaccard’s Coefficients.



B → C



Figure 5. Construction of C with Jaccard’s Coefficients.

It is easy to show that, with J_ij = b_ij/(b_ii + b_jj - b_ij),

J_11 = 3/(3 + 3 - 3) = 1
J_12 = 2/(3 + 2 - 2) = 0.67
J_13 = 2/(3 + 4 - 2) = 0.40

and so forth.

We can recognize C as a similarity matrix.
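
Continuing the NumPy sketch above (B as computed there), these values can be checked directly from the co-weight matrix; jaccard is a hypothetical helper.

def jaccard(B, i, j):
    # Jaccard's Coefficient between terms i and j (0-based indices).
    return B[i, j] / (B[i, i] + B[j, j] - B[i, j])

print(jaccard(B, 0, 0))            # J_11 = 3/(3 + 3 - 3) = 1.0
print(round(jaccard(B, 0, 1), 2))  # J_12 = 2/(3 + 2 - 2) = 0.67
print(round(jaccard(B, 0, 2), 2))  # J_13 = 2/(3 + 4 - 2) = 0.4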






Step 4: Transform C into a row-normalized matrix D by converting row vectors into unit vectors.

C → D


Figure 6. Row-normalized matrix D.
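
In the sketch above this is the single line D = C / np.linalg.norm(C, axis=1, keepdims=True); a quick check confirms that each row of D is now a unit vector.

# Continuing the sketch: verify that rows of D have unit length.
row_norms = np.linalg.norm(D, axis=1)
print(np.allclose(row_norms, 1.0))  # True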

Step 5: Compute the transpose of D by changing rows into columns and columns into rows.

D → D^T



Figure 7. Computing the transpose matrix D^T.

Step 6: Compute matrix E. Since unit vectors are used, Dot Products = Cosine similarity values.

D D^T → E



Figure 8. Construction of matrix E, populated with Dot Products equal to cosine similarity values.

If we ignore the small deviations (± 0.01) due to rounding errors, we can recognize E as a similarity matrix.
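
Continuing the sketch, this can be verified numerically; computed in floating point rather than by hand-rounding, the deviations disappear and E satisfies the similarity-matrix properties to machine precision.

# Continuing the sketch: E = D @ D.T from above.
print(np.allclose(np.diag(E), 1.0))  # reflexivity: e_ii = 1
print(np.allclose(E, E.T))           # symmetry: e_ij = e_ji
print(bool(np.all(E >= 0) and np.all(E <= 1 + 1e-12)))  # boundedness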

Extracting Association Clusters

As mentioned, we extract association clusters (ACs) with M_nn and scalar clusters (SCs) with M_n. To accomplish
this, we treat each row i of C as the neighborhood of term i. Thus, elements of row i can be related to the neighbors
of term i. Our goal is then two-fold: identify the nearest neighbor(s) of term i and compare any two neighborhoods.
The assumptions here are that any two terms are similar to one another if they are nearest neighbors or if they
have similar neighborhoods.

Since row i of C represents the neighborhood of term i, for a given row, the nearest neighbor of term i is a term
other than itself with the largest similarity value. However, it might happen that a term has more than one nearest
neighbor. In that case, we include these in the cluster. For example, from row 5 of C (see Figure 5) the nearest
neighbors of t5 are t3 and t4. These terms form the AC5 = {t5, t3, t4} association cluster.

Repeating the analysis for all rows,




t1 → t2           AC1 = {t1, t2} = {internet, web}
t2 → t1           AC2 = {t2, t1} = {web, internet}
t3 → t5           AC3 = {t3, t5} = {surfing, beach}
t4 → t5           AC4 = {t4, t5} = {hawaii, beach}
t5 → t3 and t4    AC5 = {t5, t3, t4} = {beach, surfing, hawaii}

Note that row 1 gives the AC1 = {t1, t2} cluster and row 2 gives the AC2 = {t2, t1}. If order does not matter,

AC1 = AC2

and we end with just four association clusters:

AC1 = AC2 = {t1, t2}
AC3 = {t3, t5}
AC4 = {t4, t5}
AC5 = {t5, t3, t4}
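
Continuing the NumPy sketch, the clusters above can be extracted and de-duplicated mechanically, and each cluster can then be back-mapped to the documents in which all of its member terms co-occur. The term labels are the ones assumed earlier; association_clusters and back_map are hypothetical helper names.

terms = ["internet", "web", "surfing", "hawaii", "beach"]

def association_clusters(C):
    # One candidate cluster per row i: term i plus its nearest neighbor(s),
    # i.e. the off-diagonal maximum of row i (ties are all included).
    clusters = set()
    for i in range(C.shape[0]):
        row = C[i].copy()
        row[i] = -1.0                      # exclude the term itself
        nn = np.flatnonzero(np.isclose(row, row.max()))
        clusters.add(frozenset([i, *nn]))  # frozenset: order does not matter
    return clusters

def back_map(cluster, A):
    # Documents in which every member of the cluster occurs (co-occurrence).
    return np.flatnonzero(np.all(A[list(cluster)] > 0, axis=0))

for cluster in association_clusters(C):
    names = {terms[i] for i in cluster}
    docs = [f"d{j + 1}" for j in back_map(cluster, A)]
    print(names, "->", docs)

With the hypothetical A used throughout, this prints the four de-duplicated clusters together with their matching documents.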

From the limited collection examined, only these association clusters are extractable from A. One might be tempted
to invoke co-occurrence arguments
