Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
January 19, 2011 Data Mining: Concepts and Techniques 1
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
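The three kinds of dirty data above can be checked for mechanically. The following is a minimal sketch rather than a definitive implementation: the record layout, the helper name `find_problems`, the MM/DD/YYYY reading of the birthday, and the reference date are all assumptions made for illustration. The sample values come from the examples on this slide.

```python
from datetime import date, datetime


def find_problems(record, today):
    """Return a list of data-quality problems found in one record.

    The field names and the MM/DD/YYYY birthday format are assumptions.
    """
    problems = []

    # Incomplete: an attribute value is missing or empty.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is missing")

    # Noisy: the value is present but cannot be right (negative salary).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: two attributes of the same record contradict each other.
    birthday = datetime.strptime(record["birthday"], "%m/%d/%Y").date()
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    expected_age = today.year - birthday.year - (0 if had_birthday else 1)
    if record["age"] != expected_age:
        problems.append("inconsistent: age does not match birthday")

    return problems


# Values taken from the slide; the reference date is an assumption.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": "03/07/1997"}
problems = find_problems(record, today=date(2011, 1, 19))
for problem in problems:
    print(problem)
```

Each check returns a description instead of silently repairing the value, since deciding how to repair a record belongs to the data cleaning step.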
Why Is Data Dirty?
Incomplete data come from
“Not applicable” data value when collected
Different considerations between the time when the data was
collected and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
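Functional dependency violations and duplicate records can both be found with a single pass over the data. The sketch below assumes a hypothetical table in which `zip` should determine `city`; the column names, the sample rows, and the helper names `fd_violations` and `exact_duplicates` are illustrative and do not come from the slides.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    that is, the places where the dependency lhs -> rhs is violated."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


def exact_duplicates(rows):
    """Return every row that repeats an earlier row exactly."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


# Hypothetical sample data: the third row breaks zip -> city,
# and the fourth row repeats the first.
rows = [
    {"id": 1, "zip": "61801", "city": "Urbana"},
    {"id": 2, "zip": "61820", "city": "Champaign"},
    {"id": 3, "zip": "61801", "city": "Champaign"},
    {"id": 1, "zip": "61801", "city": "Urbana"},
]

violations = fd_violations(rows, "zip", "city")
duplicates = exact_duplicates(rows)
print(violations)
print(duplicates)
```

Only exact duplicates are detected here. Records that describe the same entity with slightly different values need approximate matching, which this sketch does not attempt.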
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse
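The claim that duplicate or missing data can produce misleading statistics can be illustrated with a small worked example. This is a sketch under stated assumptions: the `(customer_id, purchase_amount)` layout and the values are invented for illustration, and counting a missing amount as zero is just one common mistake, not something the slides prescribe.

```python
from statistics import mean

# Hypothetical (customer_id, purchase_amount) records.
# Customer 2 was entered twice and customer 3 has no amount recorded.
records = [(1, 100), (2, 300), (2, 300), (3, None)]

# Naive statistic: keep the duplicate and count the missing value as 0.
naive_mean = mean(amount if amount is not None else 0 for _, amount in records)

# Cleaned statistic: keep one record per customer and skip missing amounts.
by_customer = {}
for customer_id, amount in records:
    if amount is not None:
        by_customer[customer_id] = amount
clean_mean = mean(by_customer.values())

print("naive mean:", naive_mean)
print("clean mean:", clean_mean)
```

The two results differ even though they summarize the same underlying customers, which is why the cleaning step has to come before any summary is trusted.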