Document Clustering in Reduced Dimension Vector Space

Kristina Lerman
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Email: lerman@isi.edu

Abstract

Document clustering is a popular tool for automatically organizing a large collection of texts. Clustering algorithms are usually applied to documents represented as vectors in a high-dimensional term space. We investigate the use of Latent Semantic Analysis (LSA) to create a new vector space that is the optimal representation of the document collection. Documents are projected onto a small subspace of this vector space and clustered. We compare the performance of clustering algorithms when applied to documents represented in the full term space and in a reduced-dimension subspace of the LSA-generated vector space. We report significant improvements in cluster quality for LSA subspaces with optimal dimensionality. We discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when this number is small, the total running time of the clustering algorithm is comparable to that of clustering in the full term space.

Introduction

Clustering is used to partition a set of data so that objects in the same cluster are more similar to one another than they are to objects in other clusters. In the field of information retrieval (IR), document clustering is used to automatically organize large collections of retrieval results, grouping together documents that belong to the same topic in ...