Data Mining: Concepts and Techniques


Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline Kamber, Jian Pei, ISBN 9780123814791


Morgan Kaufmann




240 X 197

A comprehensive and practical look at the concepts and techniques you need in the area of data mining and knowledge discovery


eBook Overview

EPUB format

PDF format

VST (VitalSource Bookshelf) format


Key Features

    * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
    * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
    * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.


    Data Mining: Concepts and Techniques provides the concepts and techniques for processing gathered data or information for use in various applications. Specifically, it explains data mining and the tools used to discover knowledge from collected data, a process also referred to as knowledge discovery from data (KDD). It focuses on the feasibility, usefulness, effectiveness, and scalability of techniques for large data sets. After describing data mining, this edition explains the methods of knowing, preprocessing, processing, and warehousing data. It then presents information about data warehouses, online analytical processing (OLAP), and data cube technology. Next, it describes the methods involved in mining frequent patterns, associations, and correlations in large data sets. The book details the methods for data classification and introduces the concepts and methods for data clustering. The remaining chapters discuss outlier detection and the trends, applications, and research frontiers in data mining. This book is intended for computer science students, application developers, business professionals, and researchers who seek information on data mining.


    Data warehouse engineers, data mining professionals, database researchers, statisticians, data analysts, data modelers, and other data professionals working on data mining at the R&D and implementation levels, as well as upper-level undergraduates and graduate students studying data mining in computer science programs.

    Jiawei Han

    Jiawei Han is Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Well known for his research in the areas of data mining and database systems, he has received many awards for his contributions in the field, including the 2004 ACM SIGKDD Innovations Award. He has served as Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and on editorial boards of several journals, including IEEE Transactions on Knowledge and Data Engineering and Data Mining and Knowledge Discovery.

    Affiliations and Expertise

    University of Illinois at Urbana-Champaign


    Micheline Kamber

    Micheline Kamber is a researcher with a passion for writing in easy-to-understand terms. She has a master's degree in computer science (specializing in artificial intelligence) from Concordia University, Canada.

    Affiliations and Expertise

    Simon Fraser University, Burnaby, Canada


    Jian Pei

    Jian Pei is Associate Professor of Computing Science and the director of Collaborative Research and Industry Relations at the School of Computing Science at Simon Fraser University, Canada. From 2002 to 2004, he was an Assistant Professor of Computer Science and Engineering at the State University of New York (SUNY) at Buffalo. He received a Ph.D. degree in Computing Science from Simon Fraser University in 2002, under Dr. Jiawei Han's supervision.

    Affiliations and Expertise

    Simon Fraser University, Burnaby, Canada

    Data Mining: Concepts and Techniques, 3rd Edition


    Foreword to Second Edition



    About the Authors

    Chapter 1 Introduction

        1.1 Why Data Mining?

             1.1.1 Moving toward the Information Age

             1.1.2 Data Mining as the Evolution of Information Technology

        1.2 What Is Data Mining?

        1.3 What Kinds of Data Can Be Mined?

             1.3.1 Database Data

             1.3.2 Data Warehouses

             1.3.3 Transactional Data

             1.3.4 Other Kinds of Data

        1.4 What Kinds of Patterns Can Be Mined?

             1.4.1 Class/Concept Description: Characterization and Discrimination

             1.4.2 Mining Frequent Patterns, Associations, and Correlations

             1.4.3 Classification and Regression for Predictive Analysis

             1.4.4 Cluster Analysis

             1.4.5 Outlier Analysis

             1.4.6 Are All Patterns Interesting?

        1.5 Which Technologies Are Used?

             1.5.1 Statistics

             1.5.2 Machine Learning

             1.5.3 Database Systems and Data Warehouses

             1.5.4 Information Retrieval

        1.6 Which Kinds of Applications Are Targeted?

             1.6.1 Business Intelligence

             1.6.2 Web Search Engines

        1.7 Major Issues in Data Mining

             1.7.1 Mining Methodology

             1.7.2 User Interaction

             1.7.3 Efficiency and Scalability

             1.7.4 Diversity of Database Types

             1.7.5 Data Mining and Society

        1.8 Summary

        1.9 Exercises

        1.10 Bibliographic Notes

    Chapter 2 Getting to Know Your Data

        2.1 Data Objects and Attribute Types

             2.1.1 What Is an Attribute?

             2.1.2 Nominal Attributes

             2.1.3 Binary Attributes

             2.1.4 Ordinal Attributes

             2.1.5 Numeric Attributes

             2.1.6 Discrete versus Continuous Attributes

        2.2 Basic Statistical Descriptions of Data

             2.2.1 Measuring the Central Tendency: Mean, Median, and Mode

             2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range

             2.2.3 Graphic Displays of Basic Statistical Descriptions of Data

        2.3 Data Visualization

             2.3.1 Pixel-Oriented Visualization Techniques

             2.3.2 Geometric Projection Visualization Techniques

             2.3.3 Icon-Based Visualization Techniques

             2.3.4 Hierarchical Visualization Techniques

             2.3.5 Visualizing Complex Data and Relations

        2.4 Measuring Data Similarity and Dissimilarity

             2.4.1 Data Matrix versus Dissimilarity Matrix

             2.4.2 Proximity Measures for Nominal Attributes

             2.4.3 Proximity Measures for Binary Attributes

             2.4.4 Dissimilarity of Numeric Data: Minkowski Distance

             2.4.5 Proximity Measures for Ordinal Attributes

             2.4.6 Dissimilarity for Attributes of Mixed Types

             2.4.7 Cosine Similarity

        2.5 Summary

        2.6 Exercises

        2.7 Bibliographic Notes

    Chapter 3 Data Preprocessing

        3.1 Data Preprocessing: An Overview

             3.1.1 Data Quality: Why Preprocess the Data?

             3.1.2 Major Tasks in Data Preprocessing

        3.2 Data Cleaning

             3.2.1 Missing Values

             3.2.2 Noisy Data

             3.2.3 Data Cleaning as a Process

        3.3 Data Integration

             3.3.1 Entity Identification Problem

             3.3.2 Redundancy and Correlation Analysis

             3.3.3 Tuple Duplication

             3.3.4 Data Value Conflict Detection and Resolution

        3.4 Data Reduction

             3.4.1 Overview of Data Reduction Strategies

             3.4.2 Wavelet Transforms

             3.4.3 Principal Components Analysis

             3.4.4 Attribute Subset Selection

             3.4.5 Regression and Log-Linear Models: Parametric Data Reduction

             3.4.6 Histograms

             3.4.7 Clustering

             3.4.8 Sampling

             3.4.9 Data Cube Aggregation

        3.5 Data Transformation and Data Discretization

             3.5.1 Data Transformation Strategies Overview

             3.5.2 Data Transformation by Normalization

             3.5.3 Discretization by Binning

             3.5.4 Discretization by Histogram Analysis

             3.5.5 Discretization by Cluster, Decision Tree, and Correlation Analyses

             3.5.6 Concept Hierarchy Generation for Nominal Data

        3.6 Summary

        3.7 Exercises

        3.8 Bibliographic Notes

    Chapter 4 Data Warehousing and Online Analytical Processing

        4.1 Data Warehouse: Basic Concepts

             4.1.1 What Is a Data Warehouse?

             4.1.2 Differences between Operational Database Systems and Data Warehouses

             4.1.3 But, Why Have a Separate Data Warehouse?

             4.1.4 Data Warehousing: A Multitiered Architecture

             4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

             4.1.6 Extraction, Transformation, and Loading

             4.1.7 Metadata Repository

        4.2 Data Warehouse Modeling: Data Cube and OLAP

             4.2.1 Data Cube: A Multidimensional Data Model

             4.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models

             4.2.3 Dimensions: The Role of Concept Hierarchies

             4.2.4 Measures: Their Categorization and Computation

             4.2.5 Typical OLAP Operations

             4.2.6 A Starnet Query Model for Querying Multidimensional Databases

        4.3 Data Warehouse Design and Usage

             4.3.1 A Business Analysis Framework for Data Warehouse Design

             4.3.2 Data Warehouse Design Process

             4.3.3 Data Warehouse Usage for Information Processing

             4.3.4 From Online Analytical Processing to Multidimensional Data Mining

        4.4 Data Warehouse Implementation

             4.4.1 Efficient Data Cube Computation: An Overview

             4.4.2 Indexing OLAP Data: Bitmap Index and Join Index

             4.4.3 Efficient Processing of OLAP Queries

             4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP

        4.5 Data Generalization by Attribute-Oriented Induction

             4.5.1 Attribute-Oriented Induction for Data Characterization

             4.5.2 Efficient Implementation of Attribute-Oriented Induction

             4.5.3 Attribute-Oriented Induction for Class Comparisons

        4.6 Summary

        4.7 Exercises

        4.8 Bibliographic Notes

    Chapter 5 Data Cube Technology

        5.1 Data Cube Computation: Preliminary Concepts

             5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell

             5.1.2 General Strategies for Data Cube Computation

        5.2 Data Cube Computation Methods

             5.2.1 Multiway Array Aggregation for Full Cube Computation

             5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward

             5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure

             5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP

        5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology

             5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data

             5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries

        5.4 Multidimensional Data Analysis in Cube Space

             5.4.1 Prediction Cubes: Prediction Mining in Cube Space

             5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities

             5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration

        5.5 Summary

        5.6 Exercises

        5.7 Bibliographic Notes

    Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

        6.1 Basic Concepts

             6.1.1 Market Basket Analysis: A Motivating Example

             6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules

        6.2 Frequent Itemset Mining Methods

             6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

             6.2.2 Generating Association Rules from Frequent Itemsets

             6.2.3 Improving the Efficiency of Apriori

             6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets

             6.2.5 Mining Frequent Itemsets Using Vertical Data Format

             6.2.6 Mining Closed and Max Patterns

        6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods

             6.3.1 Strong Rules Are Not Necessarily Interesting

             6.3.2 From Association Analysis to Correlation Analysis

             6.3.3 A Comparison of Pattern Evaluation Measures

        6.4 Summary

        6.5 Exercises

        6.6 Bibliographic Notes

    Chapter 7 Advanced Pattern Mining

        7.1 Pattern Mining: A Road Map

        7.2 Pattern Mining in Multilevel, Multidimensional Space

             7.2.1 Mining Multilevel Associations

             7.2.2 Mining Multidimensional Associations

             7.2.3 Mining Quantitative Association Rules

             7.2.4 Mining Rare Patterns and Negative Patterns

        7.3 Constraint-Based Frequent Pattern Mining

             7.3.1 Metarule-Guided Mining of Association Rules

             7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space

        7.4 Mining High-Dimensional Data and Colossal Patterns

             7.4.1 Mining Colossal Patterns by Pattern-Fusion

        7.5 Mining Compressed or Approximate Patterns

             7.5.1 Mining Compressed Patterns by Pattern Clustering

             7.5.2 Extracting Redundancy-Aware Top-k Patterns

        7.6 Pattern Exploration and Application

             7.6.1 Semantic Annotation of Frequent Patterns

             7.6.2 Applications of Pattern Mining

        7.7 Summary

        7.8 Exercises

        7.9 Bibliographic Notes

    Chapter 8 Classification: Basic Concepts

        8.1 Basic Concepts

             8.1.1 What Is Classification?

             8.1.2 General Approach to Classification

        8.2 Decision Tree Induction

             8.2.1 Decision Tree Induction

             8.2.2 Attribute Selection Measures

             8.2.3 Tree Pruning

             8.2.4 Scalability and Decision Tree Induction

             8.2.5 Visual Mining for Decision Tree Induction

        8.3 Bayes Classification Methods

             8.3.1 Bayes’ Theorem

             8.3.2 Naïve Bayesian Classification

        8.4 Rule-Based Classification

             8.4.1 Using IF-THEN Rules for Classification

             8.4.2 Rule Extraction from a Decision Tree

             8.4.3 Rule Induction Using a Sequential Covering Algorithm

        8.5 Model Evaluation and Selection

             8.5.1 Metrics for Evaluating Classifier Performance

             8.5.2 Holdout Method and Random Subsampling

             8.5.3 Cross-Validation

             8.5.4 Bootstrap

             8.5.5 Model Selection Using Statistical Tests of Significance

             8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC Curves

        8.6 Techniques to Improve Classification Accuracy

             8.6.1 Introducing Ensemble Methods

             8.6.2 Bagging

             8.6.3 Boosting and AdaBoost

             8.6.4 Random Forests

             8.6.5 Improving Classification Accuracy of Class-Imbalanced Data

        8.7 Summary

        8.8 Exercises

        8.9 Bibliographic Notes

    Chapter 9 Classification: Advanced Methods

        9.1 Bayesian Belief Networks

             9.1.1 Concepts and Mechanisms

             9.1.2 Training Bayesian Belief Networks

        9.2 Classification by Backpropagation

             9.2.1 A Multilayer Feed-Forward Neural Network

             9.2.2 Defining a Network Topology

             9.2.3 Backpropagation

             9.2.4 Inside the Black Box: Backpropagation and Interpretability

        9.3 Support Vector Machines

             9.3.1 The Case When the Data Are Linearly Separable

             9.3.2 The Case When the Data Are Linearly Inseparable

        9.4 Classification Using Frequent Patterns

             9.4.1 Associative Classification

             9.4.2 Discriminative Frequent Pattern–Based Classification

        9.5 Lazy Learners (or Learning from Your Neighbors)

             9.5.1 k-Nearest-Neighbor Classifiers

             9.5.2 Case-Based Reasoning

        9.6 Other Classification Methods

             9.6.1 Genetic Algorithms

             9.6.2 Rough Set Approach

             9.6.3 Fuzzy Set Approaches

        9.7 Additional Topics Regarding Classification

             9.7.1 Multiclass Classification

             9.7.2 Semi-Supervised Classification

             9.7.3 Active Learning

             9.7.4 Transfer Learning

        9.8 Summary

        9.9 Exercises

        9.10 Bibliographic Notes

    Chapter 10 Cluster Analysis: Basic Concepts and Methods

        10.1 Cluster Analysis

             10.1.1 What Is Cluster Analysis?

             10.1.2 Requirements for Cluster Analysis

             10.1.3 Overview of Basic Clustering Methods

        10.2 Partitioning Methods

             10.2.1 k-Means: A Centroid-Based Technique

             10.2.2 k-Medoids: A Representative Object-Based Technique

        10.3 Hierarchical Methods

             10.3.1 Agglomerative versus Divisive Hierarchical Clustering

             10.3.2 Distance Measures in Algorithmic Methods

             10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees

             10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling

             10.3.5 Probabilistic Hierarchical Clustering

        10.4 Density-Based Methods

             10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

             10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure

             10.4.3 DENCLUE: Clustering Based on Density Distribution Functions

        10.5 Grid-Based Methods

             10.5.1 STING: STatistical INformation Grid

             10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method

        10.6 Evaluation of Clustering

             10.6.1 Assessing Clustering Tendency

             10.6.2 Determining the Number of Clusters

             10.6.3 Measuring Clustering Quality

        10.7 Summary

        10.8 Exercises

        10.9 Bibliographic Notes

    Chapter 11 Advanced Cluster Analysis

        11.1 Probabilistic Model-Based Clustering

             11.1.1 Fuzzy Clusters

             11.1.2 Probabilistic Model-Based Clusters

             11.1.3 Expectation-Maximization Algorithm

        11.2 Clustering High-Dimensional Data

             11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies

             11.2.2 Subspace Clustering Methods

             11.2.3 Biclustering

             11.2.4 Dimensionality Reduction Methods and Spectral Clustering

        11.3 Clustering Graph and Network Data

             11.3.1 Applications and Challenges

             11.3.2 Similarity Measures

             11.3.3 Graph Clustering Methods

        11.4 Clustering with Constraints

             11.4.1 Categorization of Constraints

             11.4.2 Methods for Clustering with Constraints

        11.5 Summary

        11.6 Exercises

        11.7 Bibliographic Notes

    Chapter 12 Outlier Detection

        12.1 Outliers and Outlier Analysis

             12.1.1 What Are Outliers?

             12.1.2 Types of Outliers

             12.1.3 Challenges of Outlier Detection

        12.2 Outlier Detection Methods

             12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods

             12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods

        12.3 Statistical Approaches

             12.3.1 Parametric Methods

             12.3.2 Nonparametric Methods

        12.4 Proximity-Based Approaches

             12.4.1 Distance-Based Outlier Detection and a Nested Loop Method

             12.4.2 A Grid-Based Method

             12.4.3 Density-Based Outlier Detection

        12.5 Clustering-Based Approaches

        12.6 Classification-Based Approaches

        12.7 Mining Contextual and Collective Outliers

             12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection

             12.7.2 Modeling Normal Behavior with Respect to Contexts

             12.7.3 Mining Collective Outliers

        12.8 Outlier Detection in High-Dimensional Data

             12.8.1 Extending Conventional Outlier Detection

             12.8.2 Finding Outliers in Subspaces

             12.8.3 Modeling High-Dimensional Outliers

        12.9 Summary

        12.10 Exercises

        12.11 Bibliographic Notes

    Chapter 13 Data Mining Trends and Research Frontiers

        13.1 Mining Complex Data Types

             13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences

             13.1.2 Mining Graphs and Networks

             13.1.3 Mining Other Kinds of Data

        13.2 Other Methodologies of Data Mining

             13.2.1 Statistical Data Mining

             13.2.2 Views on Data Mining Foundations

             13.2.3 Visual and Audio Data Mining

        13.3 Data Mining Applications

             13.3.1 Data Mining for Financial Data Analysis

             13.3.2 Data Mining for Retail and Telecommunication Industries

             13.3.3 Data Mining in Science and Engineering

             13.3.4 Data Mining for Intrusion Detection and Prevention

             13.3.5 Data Mining and Recommender Systems

        13.4 Data Mining and Society

             13.4.1 Ubiquitous and Invisible Data Mining

             13.4.2 Privacy, Security, and Social Impacts of Data Mining

        13.5 Data Mining Trends

        13.6 Summary

        13.7 Exercises

        13.8 Bibliographic Notes



    Quotes and reviews

    "[A] well-written textbook (2nd ed., 2006; 1st ed., 2001) on data mining or knowledge discovery. The text is supported by a strong outline. The authors preserve much of the introductory material, but add the latest techniques and developments in data mining, thus making this a comprehensive resource for both beginners and practitioners. The focus is data, in all its aspects. The presentation is broad, encyclopedic, and comprehensive, with ample references for interested readers to pursue in-depth research on any technique. Summing Up: Highly recommended. Upper-division undergraduates through professionals/practitioners."--CHOICE

    "This interesting and comprehensive introduction to data mining emphasizes the interest in multidimensional data mining--the integration of online analytical processing (OLAP) and data mining. Some chapters cover basic methods, and others focus on advanced techniques. The structure, along with the didactic presentation, makes the book suitable for both beginners and specialized readers."--ACM’s Computing Reviews.com

    We are living in the data deluge age. Data Mining: Concepts and Techniques shows us how to find useful knowledge in all that data. This Third Edition significantly expands the core chapters on data preprocessing, frequent pattern mining, classification, and clustering. It also comprehensively covers OLAP and outlier detection, and examines mining networks, complex data types, and important application areas. The book, with its companion website, would make a great textbook for analytics, data mining, and knowledge discovery courses.--Gregory Piatetsky, President, KDnuggets

    Jiawei, Micheline, and Jian give an encyclopaedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (association rules, data cubes) to more recent and advanced topics (SVD/PCA, wavelets, support vector machines)…. Overall, it is an excellent book on classic and modern data mining methods alike, and it is ideal not only for teaching, but as a reference book.--From the foreword by Christos Faloutsos, Carnegie Mellon University

    "A very good textbook on data mining, this third edition reflects the changes that are occurring in the data mining field. It adds cited material from about 2006, a new section on visualization, and pattern mining with the more recent cluster methods. It’s a well-written text, with all of the supporting materials an instructor is likely to want, including Web material support, extensive problem sets, and solution manuals. Though it serves as a data mining text, readers with little experience in the area will find it readable and enlightening. That being said, readers are expected to have some coding experience, as well as database design and statistics analysis knowledge… Two additional items are worthy of note: the text’s bibliography is an excellent reference list for mining research; and the index is very complete, which makes it easy to locate information. Also, researchers and analysts from other disciplines--for example, epidemiologists, financial analysts, and psychometric researchers--may find the material very useful."--Computing Reviews

    "Han (engineering, U. of Illinois-Urbana-Champaign), Micheline Kamber, and Jian Pei (both computer science, Simon Fraser U., British Columbia) present a textbook for an advanced undergraduate or beginning graduate course introducing data mining. Students should have some background in statistics, database systems, and machine learning and some experience programming. Among the topics are getting to know the data, data warehousing and online analytical processing, data cube technology, cluster analysis, detecting outliers, and trends and research frontiers. Chapter-end exercises are included."--SciTech Book News

    "This book is an extensive and detailed guide to the principal ideas, techniques and technologies of data mining. The book is organised in 13 substantial chapters, each of which is essentially standalone, but with useful references to the book’s coverage of underlying concepts. A broad range of topics are covered, from an initial overview of the field of data mining and its fundamental concepts, to data preparation, data warehousing, OLAP, pattern discovery and data classification. The final chapter describes the current state of data mining research and active research areas."--BCS.org

