A Theoretical Introduction to Data Mining

This article introduces the aim of data mining and explains basic concepts and terms.

Data Mining (i.e. knowledge discovery from data): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

Data Warehouse: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin] Data warehouses are used as a data source for data mining.

Potential Uses: Web information mining, spam filtering, medical data mining, weather data mining, market sales strategies etc.

Data Mining Related Operations
Handling Noisy Data: Handling missing, duplicate or erroneous data before data mining. Noisy data can be removed, or corrected by a specific approach (e.g. correlation analysis).
Integration: Combining data from multiple sources.
Normalization: Scaling data to a specified range. For example, scaling 750 from the range [500, 1000] to the range [0, 1] gives 0.5.
Feature Selection : Selecting only useful features (i.e. attributes for record data) of data.
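The cleaning and normalization steps above can be sketched in a few lines of Python. This is only an illustration: the function names `clean_records` and `min_max_scale` are made up for this example, records are assumed to be tuples, and a missing value is assumed to be represented as `None`.

```python
def clean_records(records):
    """Drop records with missing (None) fields and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        if any(field is None for field in record):
            continue  # missing data: here we simply remove the record
        if record in seen:
            continue  # duplicate data
        seen.add(record)
        cleaned.append(record)
    return cleaned


def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


print(clean_records([(1, 2), (1, 2), (None, 3), (4, 5)]))  # [(1, 2), (4, 5)]
print(min_max_scale(750, 500, 1000))  # 0.5
```

Removing incomplete records is the simplest choice; correcting them instead would need a domain-specific rule.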

Data Mining:
Classification: Finding a model that predicts the value of a class attribute from the values of the other attributes. (An example class attribute: CustomerBuysProduct (bool))
Different methods can be used for classification:
  • Decision Trees: Builds a decision tree as the model and evaluates new data on the tree.
  • Rule-Based Classifying: Deduces rules from the data (e.g. if X = Y and Z < T, then the result is W).
  • Bayes Classifying: Uses probabilities estimated from previous data to classify.
  • K-Nearest Neighbor Classifying: Classifies new data by its distances to previous data.
  • ...
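As one concrete case of the methods listed above, here is a minimal sketch of k-nearest neighbor classification. The function name `knn_classify`, the choice of Euclidean distance, and the toy customer data (age, purchases per year) are assumptions made for this example only.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Predict a label for query by majority vote among its k nearest neighbors.

    training: list of (features, label) pairs, features being numeric tuples.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data: (age, purchases per year) -> CustomerBuysProduct
training = [
    ((25, 2), False),
    ((30, 3), False),
    ((22, 1), False),
    ((45, 12), True),
    ((50, 15), True),
    ((48, 10), True),
]

print(knn_classify(training, (47, 11)))  # True
print(knn_classify(training, (24, 2)))   # False
```

An odd k with two classes avoids tied votes. In practice the features would usually be normalized first so that no single attribute dominates the distance.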
Clustering: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Different methods can be used for clustering:
  • K-means Clustering: Splits data into a number of clusters (k) that is chosen in advance.
  • Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree.
  • ...
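The k-means idea can be sketched as follows. This is a simplified version under stated assumptions: the initial centroids are passed in explicitly so the run is reproducible (implementations usually pick them randomly), distance is Euclidean, and the name `k_means` and the sample points are invented for the example.

```python
import math


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = assign(points, centroids)
    for _ in range(max_iterations):
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the coordinate-wise mean of the cluster.
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
        clusters = assign(points, centroids)
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(1, 1), (9, 8)])  # k = 2
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

The number of clusters is simply the number of initial centroids supplied, which reflects the requirement that k be known beforehand.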
Association (Rule) Discovery: Producing dependency rules which will predict occurrence of a feature (i.e. attribute) of data based on occurrences of other features.
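A dependency rule such as "bread implies milk" is commonly scored by its support and confidence. These two measures are not named in the text above; they are the usual way to rate such rules, and the functions and the shopping-basket data below are a hypothetical illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset <= set(t))
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule is typically kept only if both values exceed thresholds chosen by the analyst.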
Pattern Discovery: Deducing patterns from the results of classification, clustering, association discovery etc.

Postprocessing: Evaluating and selecting interesting patterns, interpreting and visualizing them as an information report.


Responses to A Theoretical Introduction to Data Mining

  1. Nice sum-up of a lot of otherwise confusing or hard-to-grasp concepts. Also I would like to add data profiling as a discipline in-between preprocessing and data mining. In general profiling is about applying standard metrics to help you discover where to look when you want to do a deeper analysis or processing of data.

  2. Data mining can be defined as the process of harvesting and discovering useful and valuable information by analyzing enormous amounts of data found in databases, websites or data warehouses, using techniques such as artificial intelligence, statistics and machine learning. It is a relatively new and promising technology.


