Data Cleaning, ebook

Association for Computing Machinery and Morgan & Claypool Publishers - Xu Chu, Ihab F. Ilyas


148 pages

English


Description

This is an overview of the end-to-end data cleaning process. Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions.

Poor data across businesses and the U.S. government are reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems. This book is about data cleaning, which is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, this book describes various error detection and repair methods, and attempts to anchor these proposals with multiple taxonomies and views. Specifically, it covers four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, it includes a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models.

This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.

Preface

Figure and Table Credits

Introduction

Outlier Detection

Data Deduplication

Data Transformation

Data Quality Rule Definition and Discovery

Rule-Based Data Cleaning

Machine Learning and Probabilistic Data Cleaning

Conclusion and Future Thoughts

References

Index

Author Biographies

Subjects

Database Administration

Computers

Computing

Database

Computer science

Information

Published by	Association for Computing Machinery and Morgan & Claypool Publishers
Publication date	June 18, 2019
EAN13	9781450371544
Language	English
File size	5 MB


Excerpt

Data Cleaning
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
ACM Books is a series of high-quality books for the computer science community, published by ACM and many in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.
Data Cleaning
Ihab F. Ilyas, University of Waterloo
Xu Chu, Georgia Institute of Technology
2019
Conversational UX Design: A Practitioner's Guide to the Natural Conversation Framework
Robert J. Moore, IBM Research-Almaden
Raphael Arar, IBM Research-Almaden
2019
Heterogeneous Computing: Hardware and Software Perspectives
Mohamed Zahran, New York University
2019
Hardness of Approximation Between P and NP
Aviad Rubinstein, Stanford University
2019
Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker
Editor: Michael L. Brodie, Massachusetts Institute of Technology
2018
The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition
Editors: Sharon Oviatt, Monash University
Björn Schuller, University of Augsburg and Imperial College London
Philip R. Cohen, Monash University
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)
2018
Declarative Logic Programming: Theory, Systems, and Applications
Editors: Michael Kifer, Stony Brook University
Yanhong Annie Liu, Stony Brook University
2018
The Sparse Fourier Transform: Theory and Practice
Haitham Hassanieh, University of Illinois at Urbana-Champaign
2018
The Continuing Arms Race: Code-Reuse Attacks and Defenses
Editors: Per Larsen, Immunant, Inc.
Ahmad-Reza Sadeghi, Technische Universität Darmstadt
2018
Frontiers of Multimedia Research
Editor: Shih-Fu Chang, Columbia University
2018
Shared-Memory Parallelism Can Be Simple, Fast, and Scalable
Julian Shun, University of California, Berkeley
2017
Computational Prediction of Protein Complexes from Protein Interaction Networks
Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience
Chern Han Yong, Duke-National University of Singapore Medical School
Limsoon Wong, National University of Singapore
2017
The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations
Editors: Sharon Oviatt, Incaa Designs
Björn Schuller, University of Passau and Imperial College London
Philip R. Cohen, Voicebox Technologies
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)
2017
Communities of Computing: Computer Science and Society in the ACM
Thomas J. Misa, Editor, University of Minnesota
2017
Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai, University of Illinois at Urbana-Champaign
Sean Massung, University of Illinois at Urbana-Champaign
2016
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Stanford University
2016
Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016
Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016
The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016
Ada's Legacy: Cultures of Computing from the Victorian to the Digital Age
Robin Hammerman, Stevens Institute of Technology
Andrew L. Russell, Stevens Institute of Technology
2016
Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015
Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015
Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University
2015
A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014
Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers
Bryan Jeffrey Parno, Microsoft Research
2014
Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014
Data Cleaning
Ihab F. Ilyas
University of Waterloo
Xu Chu
Georgia Institute of Technology
ACM Books #28
Copyright © 2019 by Association for Computing Machinery
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
http://books.acm.org
ISBN: 978-1-4503-7152-0 hardcover
ISBN: 978-1-4503-7153-7 paperback
ISBN: 978-1-4503-7154-4 ePub
ISBN: 978-1-4503-7155-1 eBook
Series ISSN: 2374-6769 print 2374-6777 electronic
DOIs:
10.1145/3310205 Book
10.1145/3310205.3310206 Preface
10.1145/3310205.3310207 Chapter 1
10.1145/3310205.3310208 Chapter 2
10.1145/3310205.3310209 Chapter 3
10.1145/3310205.3310210 Chapter 4
10.1145/3310205.3310211 Chapter 5
10.1145/3310205.3310212 Chapter 6
10.1145/3310205.3310213 Chapter 7
10.1145/3310205.3310214 Chapter 8
10.1145/3310205.3310215 References / Index / Bios
A publication in the ACM Books series, #28
Editor in Chief: M. Tamer Özsu, University of Waterloo
This book was typeset in Arnhem Pro 10/14 and Flama using ZzTeX.
Cover photo: Jason Dorfman, MIT/CSAIL
First Edition
To my family: Francis, Aida, Mirette, Andrew and Marina
To my wife Jianmei and my daughter Hannah
Contents
Preface
Figure and Table Credits
Chapter 1 Introduction
1.1 Data Cleaning Workflow
1.2 Book Scope
Chapter 2 Outlier Detection
2.1 A Taxonomy of Outlier Detection Methods
2.2 Statistics-Based Outlier Detection
2.3 Distance-Based Outlier Detection
2.4 Model-Based Outlier Detection
2.5 Outlier Detection in High-Dimensional Data
2.6 Conclusion
Chapter 3 Data Deduplication
3.1 Similarity Metrics
3.2 Predicting Duplicate Pairs
3.3 Clustering
3.4 Blocking for Deduplication
3.5 Distributed Data Deduplication
3.6 Record Fusion and Entity Consolidation
3.7 Human-Involved Data Deduplication
3.8 Data Deduplication Tools
3.9 Conclusion
Chapter 4 Data Transformation
4.1 Syntactic Data Transformations
4.2 Semantic Data Transformations
4.3 ETL Tools
4.4 Conclusion
Chapter 5 Data Quality Rule Definition and Discovery
5.1 Functional Dependencies
5.2 Conditional Functional Dependencies
5.3 Denial Constraints
5.4 Other Types of Constraints
5.5 Conclusion
Chapter 6 Rule-Based Data Cleaning
6.1 Violation Detection
6.2 Error Repair
6.3 Conclusion
Chapter 7 Machine Learning and Probabilistic Data Cleaning
7.1 Machine Learning for Data Deduplication
7.2 Machine Learning for Data Repair
7.3 Data Cleaning for Analytics and Machine Learning
Chapter 8 Conclusion and Future Thoughts
References
Index
Author Biographies
Preface
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions. Poor data across businesses and the U.S. government are reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems.
Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, in this book, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views. Specifically, we cover four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, we include a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models.
This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.
Ihab Ilyas
Xu Chu
March 2019
Figure and Table Credits
Figures
Figure 2.3 Based on: Patrick Wessa. Free Statistics Software, Office for Research Development and Education, version 1.1.23-r7. http://www.wessa.net, 2012
Figure 2.4 Mar