Christian BöhmUniversity for Health Informatics and TechnologyPowerful Database Primitivesto Support High Performance Data MiningTutorial, IEEE Int. Conf. on Data Mining, Dec/09/2002Motivation2120Christian BöhmƒƒƒƒƒƒƒƒƒHigh Performance Data Mining Marketing Fraud Detection CRM Online Scoring OLAPFast decisions require knowledge just in time3120Previous Approaches to Fast Data MiningSamplingApproximations (grid) Loss of qualityDimensionality reduct.Expensive & complexParallelismAll approaches combinable with DB primitivesKDD appl. get parallelism for free4120Christian Böhm Christian BöhmFeature Based Similarity5120Simple Similarity Queries• Specify query object and- Find similar objects – range query- Find the k most similar objects – nearest neighbor q.6120Christian Böhm Christian BöhmÎÎMultidimensional Index Structure (R-Tree)Directory Page: Data Page: Rectangle , Address1 1Point : x , x , x , ...1 11 12 132 2Point : x , x , x , ...2 21 22 23Rectangle , Address3 3Point : x , x , x , ...3 31 32 334 47120Similarity – Range Queries• Given: Query point qMaximum distance ε• Formal definition:• Cardinality of the result set isdifficult to control:ε too small no results8 ε too large complete DB120Christian Böhm Christian BöhmIndex Based Processing of Range Queries9120Similarity – Nearest Neighbor Queries• Given: Query point q• Formal definition:• Ties must be handled:- Result ...
Given: Query pointq Maximum distanceε Formal definition:
Cardinality of the result set is difficult to control: εtoo smallÎno results εtoo largeÎcomplete DB
9 120
10 120
Index Based Processing of Range Queries
Similarity Nearest Neighbor Queries
Given: Query pointq
Formal definition:
Ties must be handled: -Result set enlargement -Non-determinism (dont care)
11 120
12 120
Index Based Processing of NN Queries
k-Nearest Neighbor Search and Ranking
k-nearest neighbor query: -Do not only search only for one nearest neighbor butk -Stop distance is the distance of thekth(last) candidate point -
Ranking-query: -Incremental version ofk-nearest neighbor search -First call ofeFNhct(txe)returns first neighbor -Second call oftcFe)t(exhNreturns second neighbor... -Typically only few results are fetchedÆDont generate all!
Given two setsR, Sof points Find all pairs of points according to similarity
R
S
Various exact definitions for the similarity join
Organization of the Tutorial
Motivation Defining the Similarity Join Applications of the Similarity Join Similarity Join Algorithms Conclusion & Future Potential
18 120
Defining the Similarity Join
What Is a Similarity Join?
Intuitive notion: 3 properties of the similarity join The similarity join is ajoinin the relational sense Two setsRandSare combined into one such that the new set contains pairs of points that fulfill a join condition
metricobjects
Vectoror rather than ordinary tuples of any type The join condition involvessimilarity
19 120
20 120
What Is a Similarity Join?
Similarity Join
Distance Range Join NN-based Approaches
Closest Pair Query
Distance Range Join (ε-Join)
Intuitition:Given parameterε All pairs of points where distance≤ ε
Formal Definition:
k-NN Join
In SQL-like notation: SELECT*FROMR, SWHERE||R.obj−S.obj||≤ ε
21 120
22 120
Distance Range Join (ε-Join)
Most widespread and best evaluated join Often also calledthesimilarity join
Distance Range Join (ε-Join)
The distance rangeselfjoin
is of particular importance for data mining (clustering) and robust similarity search Change definition to exclude trivial results