60 pages

English

icdm-tutorial

Vawyor - Administrator

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

60 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Christian BöhmUniversity for Health Informatics and TechnologyPowerful Database Primitivesto Support High Performance Data MiningTutorial, IEEE Int. Conf. on Data Mining, Dec/09/2002Motivation2120Christian BöhmƒƒƒƒƒƒƒƒƒHigh Performance Data Mining Marketing Fraud Detection CRM Online Scoring OLAPFast decisions require knowledge just in time3120Previous Approaches to Fast Data MiningSamplingApproximations (grid) Loss of qualityDimensionality reduct.Expensive & complexParallelismAll approaches combinable with DB primitivesKDD appl. get parallelism for free4120Christian Böhm Christian BöhmFeature Based Similarity5120Simple Similarity Queries• Specify query object and- Find similar objects – range query- Find the k most similar objects – nearest neighbor q.6120Christian Böhm Christian BöhmÎÎMultidimensional Index Structure (R-Tree)Directory Page: Data Page: Rectangle , Address1 1Point : x , x , x , ...1 11 12 132 2Point : x , x , x , ...2 21 22 23Rectangle , Address3 3Point : x , x , x , ...3 31 32 334 47120Similarity – Range Queries• Given: Query point qMaximum distance ε• Formal definition:• Cardinality of the result set isdifficult to control:ε too small no results8 ε too large complete DB120Christian Böhm Christian BöhmIndex Based Processing of Range Queries9120Similarity – Nearest Neighbor Queries• Given: Query point q• Formal definition:• Ties must be handled:- Result ...

Informations

Publié par	Vawyor
Nombre de lectures	20
Langue	English

Extrait

Christian Böhm University for Health Informatics and Technology

Powerful Database Primitives to Support High Performance Data Mining

Tutorial, IEEE Int. Conf. on Data Mining, Dec/09/2002

Motivation



3 120

4 120

High Performance Data Mining

Marketing Fraud Detection CRM Online Scoring OLAP

Fast decisions require knowledge just in time

Previous Approaches to Fast Data Mining

Sampling of quality LossApproximations (grid) Dimensionality reduct. Parallelism Expensive & complex

All approaches combinable with DB primitives KDD appl. get parallelism for free

5 120

6 120

Feature Based Similarity

Simple Similarity Queries



Specify query object and -Find similar objects  range query -Find thekmost similar objects  nearest neighbor q.

7 120

8 120

Multidimensional Index Structure (R-Tree)

Directory Page: Rectangle1, Address1 Rectangle2, Address2 Rectangle3, Address3 Rectangle4, Address4

Similarity  Range Queries



Given: Query pointq Maximum distanceε Formal definition:

Cardinality of the result set is difficult to control: εtoo smallÎno results εtoo largeÎcomplete DB

9 120

10 120

Index Based Processing of Range Queries

Similarity  Nearest Neighbor Queries



Given: Query pointq

Formal definition:

Ties must be handled: -Result set enlargement -Non-determinism (dont care)

11 120

12 120

Index Based Processing of NN Queries

k-Nearest Neighbor Search and Ranking



k-nearest neighbor query: -Do not only search only for one nearest neighbor butk -Stop distance is the distance of thekth(last) candidate point -

Ranking-query: -Incremental version ofk-nearest neighbor search -First call ofeFNhct(txe)returns first neighbor -Second call oftcFe)t(exhNreturns second neighbor... -Typically only few results are fetchedÆDont generate all!

13 120

14 120

Advanced Applications: Duplicates



Duplicate detection -E.g. Astronomical catalogue matching

Similarity queries for large number of query obj

Advanced Applications: Data Mining



Density based clustering (DBSCAN)

15 120

16 120

What is a Similarity Join?



Given two setsR, Sof points Find all pairs of points according to similarity

Various exact definitions for the similarity join

Organization of the Tutorial



Motivation Defining the Similarity Join Applications of the Similarity Join Similarity Join Algorithms Conclusion & Future Potential

18 120

Defining the Similarity Join

What Is a Similarity Join?

Intuitive notion: 3 properties of the similarity join The similarity join is ajoinin the relational sense Two setsRandSare combined into one such that the new set contains pairs of points that fulfill a join condition

metricobjects

Vectoror rather than ordinary tuples of any type The join condition involvessimilarity