RESEARCH IN OFFICIAL STATISTICS. 2 1998
140 pages
English

Description

Science and technology

Information

Number of reads: 13
Language: English
File size: 3 MB

Extract

ISSN 1023-098X
RESEARCH IN OFFICIAL STATISTICS
2 ■ 1998
An international journal for research in official statistics

ROS — An international journal for research in official statistics
Publisher: Eurostat (Statistical Office of the European Communities)
Editor in Chief: Photis Nanopoulos
Executive Editor: Daniel Defays

Editorial Advisory Committee
Deville, Jean-Claude (INSEE, France)
Droesbeke, Jean-Jacques (ULB, Belgium)
Fienberg, Stephen (Carnegie Mellon University, United States)
Hand, David (Open University, United Kingdom)
Keller, Wouter (Statistics Netherlands)
Klösgen, Willi (GMD, Germany)
Lauro, Carlo (University of Napoli, Italy)
Lenz, Hans-J. (Freie Universität Berlin, Germany)
Martin-Guzman, Pilar (INE, Spain)
Papageorgiou, H. (University of Athens)
Prat, Albert (Universitat Politècnica de Catalunya, Spain)
Unwin, Anthony (University of Augsburg, Germany)
(This list is currently being expanded.)

Mailing (secretariat) address
ROS - Research in Official Statistics
Eurostat BECH Building, Room A2-162
L-2920 Luxembourg
Tel. (352) 43 01-34190
Fax (352) 43 01-34149
E-mail: journal.ROS@eurostat.cec.be

Aim
Research in Official Statistics (ROS) is published by Eurostat which has the mission of
providing the European Union with a high-quality statistical information service.

ROS publishes papers of high scientific content resulting from research carried out in the
field of official statistics. By doing this, it aims to:
• promote statistical research activities;
• create a forum for scientific exchanges between researchers;
• give visibility to the results of such activities;
• enhance the scientific image of official statistics;
• thereby help Eurostat to fulfil its mission.

ROS will promote cooperation among researchers by providing an up-to-date and accelerated
reporting forum through which such activities could be brought to the general notice. It will
also create an interactive forum for the exchange of ideas and information useful to
statisticians in general and, in particular, researchers in methodological, analytical,
conceptual or organisational issues.

Subscription
Research in Official Statistics is published each year in one volume of two issues. Requests
for information on placing orders for the journal and for sample copies should be addressed to
the ROS secretariat in Luxembourg. Subscription could start from any issue in any volume
(including back issues subject to availability).

The basic subscription price is ECU 25 for single issues or ECU 45 for each volume made up of
two issues. These prices include postage (surface delivery), packing and handling charges in
most countries. Various levels of discounts are available to different categories of
subscribers, notably students, members of accredited statistical associations, staff of
national statistical institutes and individuals from some developing and central and east
European countries.

Subscription address
Office for Official Publications of the European Communities (OPOCE)
2, rue Mercier
L-2985 Luxembourg

Advertisements and book reviews
Information on advertising in the journal and on book reviews should be directed to the
executive editor.

Individuals who so wish may make single copies of any article which appeared in this journal for their own
personal scientific (strictly non-commercial) use. Nevertheless, no part of the publication may be reproduced, stored
in a retrieval system, or transmitted in any way or form, or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without prior written permission of Eurostat.
Requests for such permission should be made by completing a standard form furnished by the publisher for this
purpose. This form may be obtained from the ROS mailing address shown above.

A great deal of additional information on the European Union is available on the Internet.
It can be accessed through the Europa server (http://europa.eu.int).
Luxembourg: Office for Official Publications of the European Communities, 1999
© European Communities, 1999
Printed in Italy
PRINTED ON WHITE CHLORINE-FREE PAPER

ROS — Volume 1 — Number 2, 1998
Contents
Articles
Data mining - reaching beyond statistics
David J. Hand
A new feature selection method based on geometrical thickness
Yujiro Ono and Manabu Ichino
Sampling designs in compiling consumer price indices: current practices at EU statistical institutes
Martin Boon
Special uniques, random uniques and sticky populations: some counterintuitive effects of geographical detail on disclosure risk
M. J. Elliot, C. J. Skinner and A. Dale
Looking for efficient automated secondary cell suppression systems: a software comparison
Sarah Giessing
Re-identification methods for evaluating the confidentiality of analytically valid microdata
William E. Winkler
Forum
Preface to the papers from the statistical data protection (SDP-98) seminar
Josep Domingo-Ferrer and Josep Maria Mateo-Sanz
Preface to the papers from the KESDA-98 seminar
Monique Noirhomme-Fraiture
Statistical research activities at Statistics Finland
Risto Lehtonen
In the next issue of ROS

Data mining - reaching beyond statistics
David J. Hand
Department of Statistics, The Open University
Milton Keynes, MK7 6AA, United Kingdom
d.j.hand@open.ac.uk
Keywords: data mining; data analysis
Abstract
Data mining is a new discipline, with origins in statistics, machine learning, database management,
and associated areas. Its objective is to identify structure in large data sets. The size of the data sets
means that classical exploratory data analytic methods are often inadequate. We examine the problems
that make the development of new methods necessary and the sorts of tools which are being
used in data mining applications.
1. Introduction
The aim of this paper is to examine the new science of data mining and to show, in particular,
how it differs from the more traditional areas of statistics and exploratory data analysis.
I define 'data mining' as the secondary analysis of large databases aimed at finding unsuspected
relationships which are of interest or value to the database owners (Hand, 1998).
Each word in this definition is important in serving to characterise data mining as a discipline
in its own right, distinct from, for example, exploratory data analysis or database technology.
Of course, no discipline is entirely new and data mining is no exception: as we shall
see below, it overlaps with statistics, machine learning, database technology, and other
related areas. Data mining is fundamentally interdisciplinary.
The analysis is secondary because the data has already been collected and is stored on a
computer. It may have been collected with some particular question in mind — if so it will
presumably have been analysed in order to answer that question. Occasionally, the data
will have been collected without any particular question in mind. A good example is official
statistics. This data is collected in order to answer particular questions regarding the
economy and society, but is then also available to explore other enquiries. Another example
is the automatic recording of supermarket transactions. The secondary nature of data
mining makes it different from much of statistics, which often follows a carefully designed
data collection protocol with the objective of being able to answer specific questions —
that is, with the intention of primary analysis. The classical examples of such design technologies
are experimental design and survey design. Of course, the mere fact that the
analysis is secondary does not distinguish data mining from traditional exploratory data
analysis, which one might also regard as secondary by definition.
What does distinguish the two areas is the fact that data mining deals with large data sets.
To a classical statistician, a large data set might contain just a thousand points. While this
may be large compared with the data sets of a few tens or hundreds from which the theoretical
base of classical statistics has developed, it pales into insignificance compared with
modern 'large' data sets. For example, one of my graduate students is studying the records
of 250 000 applicants for bank loans, while another is studying the 350 million transactions
generated in a single year by a credit card company — a data set of 150 gigabytes. Even
this is small compared with the 200 million long distance phone calls carried per day by
AT&T. Clearly the word 'large' is being expected to work overtime and some extended
terminology is really necessary. This is important, as we shall see below. As one moves up
the ladder from small, v
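
The sizes quoted in this section can be put side by side with a little arithmetic. The following is only a rough sketch: the record counts and the 150-gigabyte figure come from the text, while reading a gigabyte as 10**9 bytes and a year as 365 days are assumptions, and the function and constant names are illustrative.

```python
# Back-of-envelope arithmetic on the data-set sizes quoted in the text.
# Assumptions (not from the source): 1 gigabyte = 10**9 bytes, 1 year = 365 days.

CLASSICAL_LARGE = 1_000              # "large" data set for a classical statistician
LOAN_APPLICANTS = 250_000            # bank-loan applicant records
CARD_TRANSACTIONS = 350_000_000      # credit card transactions in a single year
CARD_BYTES = 150 * 10**9             # 150 gigabytes (decimal interpretation)
CALLS_PER_DAY = 200_000_000          # long distance calls carried per day


def bytes_per_record(total_bytes: int, records: int) -> float:
    """Average storage per record."""
    return total_bytes / records


def scale_summary() -> dict:
    """Compare the quoted data sets with one another."""
    calls_per_year = CALLS_PER_DAY * 365
    return {
        "bytes_per_transaction": bytes_per_record(CARD_BYTES, CARD_TRANSACTIONS),
        "transactions_vs_classical": CARD_TRANSACTIONS / CLASSICAL_LARGE,
        "transactions_vs_loans": CARD_TRANSACTIONS / LOAN_APPLICANTS,
        "calls_per_year": calls_per_year,
        "calls_vs_transactions": calls_per_year / CARD_TRANSACTIONS,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions each credit card transaction averages roughly 430 bytes, and a year of calls at the quoted daily rate comes to about 73 billion records, some 200 times the credit card data set.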
