
Data Mining Techniques


There are many different applications of data mining, although most applications fit into a small set of known, well-defined, scientific techniques.

Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamental algorithmic tasks. In most of these analytic problems, the business need is to find “correlations” or “patterns” between a particular variable describing an individual and other variables pertaining to that same individual.

For example, in historical data we may want to know which customers defected from the company after their contracts expired.  Therefore, we have …

.    The target variable in this case is “defection” or “churn”.

.    We want to find out which variable or set of variables is correlated with this pattern of defection.  For example, is it “age”, “sex”, “income bracket”, “tenure”, etc., or is it more likely a combination of those variables?

One of the fundamental ideas of data mining is finding or selecting important, informative attributes or “variables” (age, sex, income, tenure, etc.), also called “independent variables” or “predictors”, of entities within the data which have an impact on or correlation with a given “target attribute” (defection), also called the “dependent variable”.  Informative means containing information.

Information is a quantity (or quality) that reduces the uncertainty about something.  The better the information, the more uncertainty it reduces.  Finding this correlation could help the company alter its future actions, such as its customer selling strategies, in order to eliminate or reduce customer defection and thereby improve overall company profitability.
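One common way to make “reducing uncertainty” concrete is entropy-based information gain, which the text above does not name explicitly; it is used here only as an illustration. The sketch below measures how much knowing one attribute reduces the uncertainty about the churn target. The toy customer list, the attribute values and the function names are hypothetical and invented for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, attribute, target):
    """Reduction in target entropy after splitting the records on one attribute."""
    parent = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical toy data, invented for illustration only.
customers = [
    {"tenure": "short", "sex": "F", "churn": "yes"},
    {"tenure": "short", "sex": "M", "churn": "yes"},
    {"tenure": "long",  "sex": "F", "churn": "no"},
    {"tenure": "long",  "sex": "M", "churn": "no"},
]

gain_tenure = information_gain(customers, "tenure", "churn")
gain_sex = information_gain(customers, "sex", "churn")
print("gain from tenure:", gain_tenure)
print("gain from sex:   ", gain_sex)
```

In this made-up data, “tenure” separates the churners perfectly while “sex” tells us nothing, so the first attribute would be the more informative predictor.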

Some of the well-known  data mining techniques are:

.   Naïve Bayes Classifier

.   Regression

.   Classification and Segmentation

.   Clustering

.   Association

.   Similarity Matching

.   etc.

Terminology

Data mining has many terms that mean the same thing.  This has come about because portions of many similar disciplines, such as statistics, operational research, machine learning, artificial intelligence, database technology, pattern recognition, and others, have converged into what is now called data mining.

Within each group below, the terms all mean the same thing:

.    A Dataset, a File, a Table in a database (or a database/DW query), a Worksheet in Excel.

This is the set of data to be examined or data mined.   It could be coming  from a flat dataset, from a table in a database, from a query of a database (perhaps from multiple   tables), or from a query   of a data warehouse or data mart dimension and fact tables, or from a worksheet, etc.

.    An Instance , a Record, a Row, a Data Point, a Feature Vector, a Tuple , a Case.

This is a single record of a dataset, or a single row from a database table or query.  It is all the data pertaining to one transactional event.  Be aware that a data point is not a single piece of data.  Rather, it is all the data relevant to an individual transaction.

.    A Variable, an Attribute , a Property, a Field, a Column in a database table.

This is a single variable or a single piece of information.   An attribute  could be a predictor  attribute, a target attribute  or it could be neither.

.    A Predictor variable or attribute, an Independent variable, an Explanatory variable.

This is a variable that can predict (to some degree) the outcome or target variable.

.    A Target variable or attribute,  a Dependent variable , an Outcome variable.

This outcome variable is to be predicted.  For past instances, this attribute  exists. For future instances, this is the variable  to be predicted.

1. Prediction (Naïve Bayes classifier):

The Naïve Bayes classifier is based on Bayesian theory, where the prediction for the probability of occurrence of an event is computed based on an associated (or related) event.

Unlike simple probability, which simply relies on computing the frequency of occurrence of an event by taking the ratio of “positive” instances to all instances, Bayesian probability (based on Thomas Bayes' theorem) is a probability formula for determining the frequency of occurrence of an event given the frequency (or some knowledge) of another related event.

The Bayes theorem formula  is …

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$

where

A         – the event of interest  (the target variable)

B         – the related event  (the independent variable)

P(A|B) – probability of A given B is true

P(B|A) – probability of B given A is true

P(A), P(B) – probability of A or B independently

Example 1

What is the probability of fire if we see smoke?

.    Assume that fires are rare: they happen in only 1% of cases.

.    Assume that smoke is more common, occurring in about 15% of cases (cooking, construction, etc.).

.    Assume that the probability of smoke given fire (i.e. fire generating smoke) is 90%.

Using the formula

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$

where

A         – fire  (the target variable)

B         – smoke  (the independent variable)

P(A|B) – probability of fire given smoke

P(B|A) – probability of smoke given fire

P(A), P(B) – probability of fire or smoke independently

$$P(\text{fire} \mid \text{smoke}) = \frac{P(\text{fire}) \times P(\text{smoke} \mid \text{fire})}{P(\text{smoke})} = \frac{0.01 \times 0.90}{0.15} = 0.06 = 6\%$$

Example 2

75% of the children in a school have a dog, and 30% have a cat. Of those that have cats, 60% also have a dog.

P(dog) = 75%,  P(cat) = 30%,  P(dog|cat) = 60%

What is the probability that if I have a dog, I also have a cat?

A         – having a cat  (the target variable)

B         – having a dog  (the independent variable)

$$P(\text{cat} \mid \text{dog}) = \frac{P(\text{cat}) \times P(\text{dog} \mid \text{cat})}{P(\text{dog})} = \frac{0.30 \times 0.60}{0.75} = 0.24 = 24\%$$
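The arithmetic in the two examples above can be checked with a few lines of Python. This is a minimal sketch; the function name `bayes` and its parameter names are chosen here for illustration and are not part of the original material.

```python
def bayes(p_a, p_b_given_a, p_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than zero")
    return p_a * p_b_given_a / p_b

# Example 1: A = fire, B = smoke
p_fire_given_smoke = bayes(p_a=0.01, p_b_given_a=0.90, p_b=0.15)

# Example 2: A = having a cat, B = having a dog
p_cat_given_dog = bayes(p_a=0.30, p_b_given_a=0.60, p_b=0.75)

print(f"P(fire|smoke) = {p_fire_given_smoke:.0%}")
print(f"P(cat|dog)    = {p_cat_given_dog:.0%}")
```

The two printed values should agree with the 6% and 24% worked out by hand.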

2. Regression:

Regression is a statistical process for estimating a “target variable” given one or more independent variable(s).  Regression analysis can only be performed on numerical data.  Regression is widely used for prediction and forecasting.  It is the act of predicting continuous numerical values.

The best-known example of regression analysis is linear regression. This is where known (historical) data points are plotted on x and y axes.  The idea is to derive a function (in linear regression, it is a line) that will help predict (or at least provide a good estimate of) a target value, given a new independent data point.

The goal is to find the “best fit” line, where the delta (difference) between the actual value and the predicted value is minimized.  The best regression line will produce the smallest number when summing the squares of the deltas:

$$\sum (\text{actual} - \text{predicted})^2$$

 

[Figure: Linear regression. X-axis: annual income (predictor). The prediction model is the line drawn across the many data points, found by minimizing the total of the squared distances from the actual data points to the prediction line.]

The formula for linear regression for X and Y data points is:

$$Y' = a + bX$$

( Y' = intercept + slope * X )

where

Y'  is the predicted value for Y

a  is the intercept of the regression line with the Y axis

b  is the slope of the regression line

X  is the actual value for X

The formula for a (the regression line intercept with Y) is:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

where:

a  is the intercept of the regression line with the Y axis

Σ  is the sum of …

n  is the number of data points

The formula for b (the slope of the regression line) is:

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

where:

b  is the slope of the regression line

Σ  is the sum of …

n  is the number of data points
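The intercept and slope formulas above translate directly into code. The following is a small sketch that uses only the sums named in the formulas; the function names `fit_line` and `sum_squared_error` and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y' = a + bX using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b

def sum_squared_error(xs, ys, a, b):
    """Sum of (actual - predicted)^2 for the line Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical points that lie exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a, b = fit_line(xs, ys)
print("intercept a =", a)
print("slope b     =", b)
print("sum of squared deltas =", sum_squared_error(xs, ys, a, b))
```

Because the sample points were chosen to lie on a straight line, the fitted line passes through all of them and the sum of squared deltas is zero; with real data it would be the smallest achievable positive value.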

3. Classification and Segmentation:

Classification or “class probability  estimation”  (also called Segmentation) attempts to predict for an individual  in a population,  which of a set of known classes or segments this individual belongs to.

By knowing  which class (or segment) an individual  belongs  to, we can predict an outcome or behavior of a future instance based on previously  known outcomes  of similarly  classified individuals.

For classification algorithms, a model is created that will predict which class, from a number of known classes, a new individual will belong to.  This could be done via decision trees.

Classification  is considered a “supervised” data mining exercise.

One of the best ways of performing data mining is to segment the population into different groups with respect to one or many attributes.  An attribute is a property having some quantity or quality. Examples: “income”, “age”, “race”, “sex”, “education level”, “home ownership”, “geography”, etc. Our job is to keep segmenting the population set until each segment is as pure as possible.

With “scoring” or “class probability estimation”, the model produces a score representing the probability (or some other measure of the likelihood) of an individual belonging to a particular segment or class.
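As a rough illustration of segmentation and class probability scoring, the sketch below segments historical records on a single attribute and uses the share of each class inside a segment as the score for a new individual. It is a deliberately simplified stand-in for a real classifier such as a decision tree; the data, attribute names and function names are hypothetical.

```python
from collections import defaultdict

def class_probabilities(records, segment_key, target_key):
    """For each segment, return the share of historical records in each class."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[r[segment_key]][r[target_key]] += 1
    probs = {}
    for segment, class_counts in counts.items():
        total = sum(class_counts.values())
        probs[segment] = {c: n / total for c, n in class_counts.items()}
    return probs

def score(probs, segment, cls):
    """Class probability score for a new individual who falls in `segment`."""
    return probs.get(segment, {}).get(cls, 0.0)

# Hypothetical training set with known outcomes (supervised).
history = [
    {"income": "low",  "churn": "yes"},
    {"income": "low",  "churn": "yes"},
    {"income": "low",  "churn": "no"},
    {"income": "high", "churn": "no"},
    {"income": "high", "churn": "no"},
]

probs = class_probabilities(history, "income", "churn")
print("P(churn | low income)  =", score(probs, "low", "yes"))
print("P(churn | high income) =", score(probs, "high", "yes"))
```

Here the “high” segment is pure (every member has the same outcome), while the “low” segment is mixed and would be a candidate for further segmentation on another attribute.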

4. Clustering:

Clustering is similar to classification or segmentation.  The difference is that in clustering the process does not have pre-determined target groups.  The algorithm tries to find relationships or common attributes (patterns) within the data by which to group, or cluster, the set of instances or individuals, without being given a training set (historical data) with pre-classified outcomes or targets.

Clustering is much more difficult, as it is considered “un-supervised” data mining.

Classification or segmentation, on the other hand, is given a set of known groups/classifications (a.k.a. target groups) as part of a training set, and the classification model tries to predict which of the groups/classes a new individual belongs to based on that pre-classified training set.
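To make the unsupervised idea concrete, the sketch below uses k-means, one widely used clustering algorithm that the text above does not name; it is swapped in here purely as an example. It groups one-dimensional values without being told any target groups. The function name, the deterministic choice of starting centers and the sample ages are all assumptions made for this illustration.

```python
def k_means_1d(values, k, iterations=20):
    """Tiny 1-D k-means sketch: returns (centers, assignments)."""
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    if k == 1:
        centers = [ordered[len(ordered) // 2]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Step 1: assign each value to its nearest center.
        assignments = [
            min(range(k), key=lambda j: abs(v - centers[j])) for v in values
        ]
        # Step 2: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, a in zip(values, assignments) if a == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # nothing moved, so the clusters are stable
        centers = new_centers
    return centers, assignments

# Hypothetical customer ages; no class labels are supplied.
ages = [22, 25, 24, 61, 64, 66]
centers, assignments = k_means_1d(ages, k=2)
print("cluster centers:", centers)
print("assignments:    ", assignments)
```

The algorithm discovers a “younger” and an “older” group on its own; what those groups mean is left for the analyst to interpret, which is part of why clustering is harder than classification.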

5. Association:

Association (or co-occurrence grouping) attempts to find associations between entities (e.g. products) based on historical transactions. “Market-basket analysis” is a classic case of association.

Market-basket analysis  attempts to answer the question “What  items are commonly purchased together” in the same basket (e.g. shopping  cart).

Segmentation and clustering look at the similarity between objects based on the various object variables or attributes, and attempt to group those objects based on those similarities.  Association, in contrast, looks at similarities of objects based on their appearing together in the same transaction.

Association is the data mining technique most often exploited by selling organizations to perform cross-selling.  Cross-selling is the act of product recommendation.  It is the act of recommending a second product that is often purchased along with the product you are interested in.  This recommendation is often made after you decide to purchase the first product.  Examples: game console and game software, electronic/electrical machines and multi-year product protection warranty, printer and printer ink, laptop and laptop bag, etc.
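A very small version of market-basket analysis can be written by counting how often pairs of items appear in the same transaction. The sketch below is only a co-occurrence count, not a full association-rule miner; the baskets and the function names `pair_counts` and `recommend` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def recommend(item, baskets, top_n=3):
    """Return the items most often bought together with `item`."""
    related = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]

# Hypothetical historical transactions.
baskets = [
    ["printer", "printer ink"],
    ["printer", "printer ink", "laptop bag"],
    ["laptop", "laptop bag"],
    ["game console", "game software"],
]

print(recommend("printer", baskets))
```

For a customer who has just chosen a printer, the top suggestion from this made-up history is printer ink, which is the cross-selling pattern described above.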

6. Similarity  Matching:

Similarity matching attempts to identify similar individuals (or organizations) based on data known about them.

Data about the individuals from internally assembled sources can be combined with lifestyle segmentation data obtained from external sources, such as PRIZM® data from Claritas, Personicx® from Acxiom, Mosaic® from Experian, or census data from the federal government, to create customer profiles.

These profiles are often based on

- demographic (age, race, sex, marital status, education, occupation, income, family size, religion, etc.),

- geographic (country, state, county, city, zip code, community, urban/rural, etc.),

- behavioral (brand awareness, brand loyalty, price sensitivity, shopping experience, usage rate, etc.),

- media usage (TV, radio, theater, social media, internet usage, internet searches, books, magazines, etc.),

- interests (hobbies, social events, vacations, entertainment, club membership, recreation, food, etc.),

- personality (achiever, emulator, belonger, savior, doomsdayer, survivalist, philanthropist, etc.).

Similarity matching is the basis for one of the most popular methods for product recommendation.
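One simple way to match similar individuals is to describe each one as a numeric profile and compare profiles by distance. The sketch below uses Euclidean distance, which is one of several possible measures and is an assumption here; the customer names, the three numeric features and their scaling are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def most_similar(target_id, profiles, top_n=1):
    """Return the ids of the profiles closest to the target profile."""
    target = profiles[target_id]
    others = [
        (euclidean_distance(target, vec), cid)
        for cid, vec in profiles.items()
        if cid != target_id
    ]
    others.sort()  # smallest distance first
    return [cid for _, cid in others[:top_n]]

# Hypothetical profiles: (age in decades, income in $10k, online hours per day).
# The features are pre-scaled so that no single one dominates the distance.
profiles = {
    "alice": (3.0, 6.0, 1.5),
    "bob":   (3.2, 5.8, 1.4),
    "carol": (6.5, 3.0, 0.2),
}

print(most_similar("alice", profiles))
```

A recommender built on this idea would suggest to one customer the products bought by the customers whose profiles are closest to theirs.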

7. Outlier detection:

This type of data mining technique relates to the observation of data items in the dataset that do not match an expected pattern or expected behavior.  This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.  An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers.  Outlier detection plays a significant role in the data mining field.  It is valuable in numerous fields such as network interruption identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.

8. Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns.  It consists of finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.  In other words, this technique helps to discover or recognize similar patterns in transaction data over a period of time.
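As a minimal illustration, the sketch below looks for the simplest kind of sequential pattern: ordered pairs of items where the first appears before the second in a customer's purchase sequence, kept only if enough sequences contain them. Occurrence frequency (support) is used as the measure of interest. The sequences, the threshold and the function name are hypothetical, and real sequential pattern miners handle longer subsequences.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support=2):
    """Return ordered pairs (a, b), a before b, found in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        # combinations() preserves positional order, so a always precedes b.
        # The set ensures each pair is counted at most once per sequence.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        support.update(pairs)
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical purchase sequences, one per customer, in time order.
sequences = [
    ["laptop", "laptop bag", "mouse"],
    ["laptop", "mouse"],
    ["mouse", "laptop"],
]

patterns = frequent_ordered_pairs(sequences, min_support=2)
print(patterns)
```

Unlike the co-occurrence counting used for association, order matters here: “laptop then mouse” and “mouse then laptop” are counted as different patterns.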