Data Mining in a Nutshell

Data mining is the process of discovering interesting knowledge, such as patterns, associations, changes, anomalies, and significant structures, from large amounts of data stored in databases, data warehouses, or other information repositories. Due to the wide availability of huge amounts of data in electronic form, and the imminent need to turn such data into useful information and knowledge for broad applications including market analysis, business management, and decision support, data mining has attracted a great deal of attention in the information industry in recent years. Data mining has popularly been treated as a synonym of knowledge discovery in databases, although some researchers view data mining as one essential step of knowledge discovery.


In general, a knowledge discovery process consists of an iterative sequence of steps, including:
Data cleaning, which handles noisy, erroneous, missing, or irrelevant data.
Data integration, where multiple, heterogeneous data sources may be integrated into one.
Data selection, where data relevant to the analysis task are retrieved from the database.
Data transformation, where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data mining, which is an essential process where intelligent methods are applied in order to extract data patterns otherwise known as knowledge.
Pattern evaluation, which identifies the truly interesting patterns representing knowledge, based on interestingness measures.
Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

Data mining steps

With today’s widely available data storage and data handling techniques, the first four processes
(data cleaning, data integration, data selection, and data transformation) can be performed at the data warehouse/data mart layer using popular ETL (Extract, Transform, Load) tools. The data mining, pattern evaluation, and knowledge presentation processes are sometimes integrated into one (possibly iterative) process, referred to as data mining.
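To make these four preprocessing steps concrete, here is a minimal R sketch on an invented toy data frame (the column names, values, and the choice to simply drop incomplete rows are all assumptions made for the example, not features of any particular ETL tool):

# Hypothetical raw data from two sources sharing a customer_id key
sales   <- data.frame(customer_id = 1:5,
                      region = c("EU", "EU", "ASIA", NA, "ASIA"),
                      amount = c(120, 80, NA, 150, 95))
profile <- data.frame(customer_id = 1:5,
                      segment = c("retail", "retail", "corp", "corp", "retail"))

clean    <- na.omit(sales)                              # data cleaning: drop rows with missing values
merged   <- merge(clean, profile, by = "customer_id")   # data integration: combine the two sources
selected <- merged[, c("region", "segment", "amount")]  # data selection: keep relevant attributes
aggregate(amount ~ region + segment, data = selected, FUN = sum)  # data transformation: aggregate/summarize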

Major tasks of data mining

A data mining system may accomplish one or more of the following data mining tasks.

1. Summarization or Class description. Class description provides a concise and succinct summarization of a collection of data and distinguishes it from others. The summarization of a collection of data is called class characterization; whereas the comparison between two or more collections of data is called class comparison or discrimination. Class description should cover not only its summary properties, such as count, sum, and average, but also its properties on data dispersion, such as variance, quartiles, etc. For example, class description can be used to compare European versus Asian sales of a company, identify the important factors which discriminate the two classes, and present a summarized overview.

2. Association. Association is the discovery of association relationships or correlations among a set of items. These are often expressed in rule form, showing attribute-value conditions that occur frequently together in a given set of data. An association rule of the form X ⇒ Y is interpreted as “database tuples that satisfy X are likely to satisfy Y”. Association analysis is widely used in transaction data analysis for direct marketing, catalog design, and other business decision-making processes. Substantial research has been performed on association analysis, with efficient algorithms proposed, including the level-wise Apriori search, mining multiple-level and multi-dimensional associations, mining associations for numerical, categorical, and interval data, meta-pattern-directed or constraint-based mining, and mining correlations.

3. Classification. Classification analyzes a set of training data (i.e., a set of objects whose class label is known) and constructs a model for each class based on the features in the data. Such a classification process generates a decision tree or a set of classification rules, which can be used both for better understanding of each class in the database and for classifying future data. For example, one may classify diseases and help predict the kind of disease based on the symptoms of patients. Many classification methods have been developed in the fields of machine learning, statistics, databases, neural networks, rough sets, and others. Classification has been used in customer segmentation, business modelling, and credit analysis.
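As a minimal illustration only (not from the original text), a decision-tree classifier can be trained in R with the rpart package on the built-in iris data set:

library(rpart)

# Learn a decision tree from training data whose class labels (Species) are known
model <- rpart(Species ~ ., data = iris, method = "class")

# Use the learned model to classify (future) observations
predict(model, newdata = iris[c(1, 51, 101), ], type = "class")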

4. Prediction. This mining function predicts the possible values of some missing data, or the value distribution of certain attributes in a set of objects. It involves finding the set of attributes relevant to the attribute of interest (e.g., by some statistical analysis) and predicting the value distribution based on the set of data similar to the selected object(s). For example, an employee’s potential salary can be predicted based on the salary distribution of similar employees in the company. Regression analysis, generalized linear models, correlation analysis, and decision trees are useful tools in such prediction; genetic algorithms and neural network models are also popularly used.
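A hedged sketch of the salary example in R, using a simple regression model on made-up figures:

# Invented employee data: salary (in thousands) versus years of experience
emp <- data.frame(experience = c(1, 3, 5, 7, 10),
                  salary     = c(40, 52, 61, 72, 90))

fit <- lm(salary ~ experience, data = emp)   # fit a regression model

# Predict the likely salary of a similar employee with 6 years of experience
predict(fit, newdata = data.frame(experience = 6))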

5. Clustering. Cluster analysis identifies clusters embedded in the data, where a cluster is a collection of data objects that are “similar” to one another. Similarity can be expressed by distance functions, specified by users or experts. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high. For example, one may cluster the houses in an area according to their category, floor area, and geographical location. Data mining research has focused on high-quality and scalable clustering methods for large databases and multidimensional data warehouses.
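A minimal k-means sketch of the house example in R (the attributes, values, and the choice of two clusters are invented for illustration):

# Invented house data: floor area and geographical coordinates
houses <- data.frame(floor_area = c(55, 60, 58, 120, 130, 125),
                     latitude   = c(11.10, 11.20, 11.15, 12.80, 12.90, 12.85),
                     longitude  = c(76.90, 76.80, 76.95, 77.60, 77.70, 77.65))

set.seed(42)
km <- kmeans(scale(houses), centers = 2)   # standardize the attributes, then ask for 2 clusters
km$cluster                                 # cluster label assigned to each house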

6. Time-series analysis. Time-series analysis analyzes large sets of time-series data to find regularities and interesting characteristics, including searching for similar sequences or sub-sequences and mining sequential patterns, periodicities, trends, and deviations. For example, one may predict the trend of a company’s stock values based on its stock history, business situation, competitors’ performance, and the current market. There are also other data mining tasks, such as outlier analysis. Identifying new data mining tasks that make better use of the collected data is itself an interesting research topic.

Data mining approaches
Data mining is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, probabilistic graph theory, and inductive logic programming. Data mining needs the integration of approaches from multiple disciplines.

A large set of data analysis methods has been developed in statistics over many years of study. Machine learning has also contributed substantially to classification and induction problems. Neural networks have shown their effectiveness in classification, prediction, and clustering tasks. However, with increasingly large amounts of data stored in databases for data mining, these methods face challenges of efficiency and scalability. Efficient data structures, indexing, and data access techniques developed in database research contribute to high-performance data mining. Many data analysis methods developed in statistics, machine learning, and other disciplines need to be re-examined, and set-oriented, scalable algorithms should be developed for effective data mining.

Another difference between traditional data analysis and data mining is that traditional data analysis is assumption-driven, in the sense that a hypothesis is formed and validated against the data, whereas data mining is discovery-driven, in the sense that patterns are automatically extracted from data, which requires substantial search effort. Therefore, high-performance computing will play an important role in data mining.

Parallel, distributed, and incremental data mining methods should be developed, and parallel computer architectures and other high performance computing techniques should be explored in data mining as well. Since it is easy for human eyes to identify patterns and regularities in data sets or data mining results, data and knowledge visualization is an effective approach for presentation of data and knowledge, exploratory data analysis, and interactive data mining.

With the construction of large data warehouses, data mining in a data warehouse is one step beyond on-line analytical processing (OLAP) of data warehouse data. By integrating with OLAP and data cube technologies, an on-line analytical mining mechanism supports interactive mining at multiple levels of abstraction in data cubes.

Data mining is not confined to relational, transactional, and data warehouse data. There are high demands for mining spatial, text, multimedia, and time-series data, and mining complex, heterogeneous, semi-structured and unstructured data, including the Web-based information repositories. Complex types of data may require advanced data mining techniques.

For example, for object-oriented and object-relational databases, object cube based generalization techniques can be developed for handling complex structured objects, methods, class/subclass hierarchies, etc. Mining can then be performed on the multi-dimensional abstraction spaces provided by object-cubes.

A spatial database stores both spatial data, which represents points, lines, and regions, and non-spatial data, which represents other properties of spatial objects and their non-spatial relationships. A spatial data cube can be constructed that consists of both spatial and non-spatial dimensions and/or measures. Since a spatial measure may represent a group of aggregated spatial objects, and multi-dimensional spatial aggregation may produce a great number of such aggregated objects, it is infeasible to precompute and store all such spatial aggregations. Therefore, selective materialization of aggregated spatial objects is a good trade-off between storage space and on-line computation time.

Spatial data mining can be performed in a spatial data cube as well as directly in a spatial database. Because of the high cost of spatial computation, a multi-tier computation technique can be adopted. For example, when mining spatial association rules, one can first apply rough spatial computation, such as the minimal-bounding-rectangle method, to filter out most of the spatial objects that should be excluded from further consideration (e.g., those not spatially close enough), and then apply relatively costly, refined spatial computation only to the remaining promising candidates.

Text analysis methods and content-based image retrieval techniques play an important role in mining text and multimedia data, respectively. These techniques can be integrated with data cube and data mining techniques for effective mining of such data.

It is challenging to mine knowledge from the World Wide Web because of the huge amount of unstructured and semi-structured data. However, Web access patterns can be mined from preprocessed and cleaned Web log records, and hot Web sites can be identified based on their access frequencies and the number of links pointing to them.

Research 
Many data mining systems have been developed in recent years, and research and development on data mining is expected to continue flourishing, because huge amounts of data have been collected in databases and the necessity of understanding and making good use of such data in decision making serves as the driving force behind data mining.

The diversity of data, data mining tasks, and data mining approaches poses many challenging research issues. The design of data mining languages, the development of efficient and effective data mining methods and systems, the construction of interactive and integrated data mining environments, and the application of data mining techniques to solving large application problems are important tasks for data mining researchers and for data mining system and application developers.

Moreover, with the fast computerization of society, the social impact of data mining should not be underestimated. When large amounts of interrelated data are effectively analyzed from different perspectives, it can pose threats to the goal of protecting data security and guarding against the invasion of privacy. It is a challenging task to develop effective techniques for preventing the disclosure of sensitive information in data mining, especially as the use of data mining systems rapidly increases.


ROC-Receiver Operating Characteristic

Two days ago I had a healthy conversation about classification models with someone I had only just met. Interestingly, we were discussing potential techniques and approaches for selecting an optimal classification model, and at some point I suggested ROC curves for that purpose.

A receiver operating characteristic curve, ROC curve for short, is a graphical chart representing the performance of a classifier at varying threshold cut-offs. In general, ROC curves are used to compare classification models and help select the model that produces better results than the others.


Before attempting to make such ROC curves, let’s first understand how to interpret these graphs and choose the best model. The area under the curve (AUC) is the area below the model’s ROC curve; in practice, a model with a larger AUC is considered the better model.

An ROC graph is simply a plot of TPR against FPR. TPR runs along the y-axis and FPR runs along the x-axis. TPR (true positive rate) is also known as sensitivity or recall. FPR (false positive rate) is equivalent to 1 - specificity.

The concepts of TPR and FPR can be easily understood with a simple example. Say you have a classification model that classifies good customers and bad customers based on some business data. In this example, TPR is the fraction of good customers that the model correctly predicts as good. Conversely, FPR is the fraction of bad customers that the model wrongly predicts as good.

In practice, we expect a good model to have a high TPR and a low FPR.

Drawing the ROC graph.

Let’s say we have two models, M1 and M2, and out of these we would like to find the model with the better TPR and FPR values, which influence the accuracy of the results.


Drawing an ROC curve starts with computing a confusion matrix. Say for model M1 our initial cut-off threshold is 0.5, meaning a hypothesis output below 0.5 is considered class 0 (e.g., bad customer) and a hypothesis output of 0.5 or above is considered class 1 (e.g., good customer). Based on the predicted class and the actual class, the confusion matrix is built.

The formulas for TPR and FPR are:

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)


From the confusion matrix above, we have all the information required to calculate TPR and FPR. We then change the cut-off threshold (say, to 0.6) and again compute the confusion matrix and the resulting TPR and FPR values. The computed (FPR, TPR) pairs are then plotted onto a graph for model M1.
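Putting the whole procedure together, here is a minimal base-R sketch (the scores and actual classes are made up for illustration):

# Invented model scores and actual classes (1 = good customer, 0 = bad customer)
score  <- c(0.95, 0.85, 0.80, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.10)
actual <- c(1,    1,    0,    1,    1,    0,    1,    0,    0,    0)

roc_point <- function(threshold) {
  predicted <- as.integer(score >= threshold)   # class 1 if score >= cut-off
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(FPR = fp / (fp + tn), TPR = tp / (tp + fn))
}

# Sweep the threshold, collect the (FPR, TPR) pairs, and plot the ROC curve
roc_pts <- t(sapply(seq(0, 1, by = 0.1), roc_point))
plot(roc_pts[, "FPR"], roc_pts[, "TPR"], type = "b",
     xlab = "FPR (1 - specificity)", ylab = "TPR (sensitivity)")
abline(0, 1, lty = 2)   # the diagonal corresponds to a random classifier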

Distance Measures – What is Similarity, Dissimilarity and Correlation

The term distance measure has a wide variety of definitions among math and data mining practitioners. As a result, the terms, concepts, and their usage can go over the head of a beginner trying to understand them for the first time. I am writing this post to give simplified and intuitive definitions for things that shouldn’t be hard to understand and follow.

In the simplest form, distance measures are mathematical approaches to measuring the distance between objects. Computing distance measures helps us compare objects from three different standpoints:

  1. Similarity
  2. Dissimilarity
  3. Correlation

Similarity is a measure that ranges from 0 to 1 ([0, 1]).

Dissimilarity is a measure that ranges from 0 to infinity ([0, Infinity]).

Correlation is a measure that ranges from -1 to +1 ([-1, +1]).

More detailed explanation on these measures is covered in the later part of this post.

Application & motivation

Distance measures such as similarity, dissimilarity, and correlation are basic building blocks for activities such as clustering, classification, and anomaly detection. So becoming familiar with these metrics and understanding them in depth gives better insight into more advanced data mining algorithms and analyses.

For example, given data on objects representing fruits, a distance measure will help classify them into apples and oranges.

Vocabulary

When talking about distance measures, we cannot avoid introducing the term distance transformation, or simply transformation. Distance transformation refers to converting a similarity score into a dissimilarity score or vice versa. The need for it arises because a developer or practitioner may be comfortable computing a dissimilarity score while the algorithm they are building expects a similarity score for further processing. Distance transformation lets the developer convert the dissimilarity score into a similarity score and pass it on to the algorithm.

Another common term heard when discussing distance measures is proximity. In simple terms, proximity is just another name used interchangeably to refer to a similarity or dissimilarity score.

Let’s consider simple objects with a single attribute and discuss similarity, dissimilarity, and correlation measures.

Similarity measure – a score that describes how similar two objects are to each other; it ranges from 0 to 1 [0, 1].

Dissimilarity measure – a score that describes how dissimilar two objects are to each other; it ranges from 0 to infinity [0, Infinity].

Today there are a variety of formulas for computing similarity and dissimilarity for simple objects, and the choice of formula is determined by the type of attribute (nominal, ordinal, interval, or ratio) in the objects.

The table below summarizes the similarity and dissimilarity formulas for simple objects.

Attribute: Nominal
    Similarity (S):    S = 1 if X = Y; S = 0 if X ≠ Y
    Dissimilarity (D): D = 0 if X = Y; D = 1 if X ≠ Y

Attribute: Ordinal
    Similarity (S):    S = 1 - D
    Dissimilarity (D): D = |X - Y| / (n - 1), where n is the number of values

Attribute: Interval or Ratio
    Similarity (S):    S = 1 / (1 + D), or S = 1 - (D - min(D)) / (max(D) - min(D))
    Dissimilarity (D): D = |X - Y|

Note: As mentioned earlier, in some situations it is easier to compute dissimilarity first and then convert it to a similarity measure (for example, for ordinal attributes) for further processing.
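A small R sketch of these formulas for single-attribute objects (the example values and the ordinal level ordering are made up):

nominal_sim <- function(x, y) as.numeric(x == y)           # 1 if equal, 0 otherwise

ordinal_dissim <- function(x, y, levels) {                 # map values to ranks, then D = |X - Y| / (n - 1)
  n <- length(levels)
  abs(match(x, levels) - match(y, levels)) / (n - 1)
}

interval_dissim <- function(x, y) abs(x - y)               # D = |X - Y|
interval_sim    <- function(x, y) 1 / (1 + interval_dissim(x, y))   # S = 1 / (1 + D)

nominal_sim("apple", "orange")                                               # 0
ordinal_dissim("good", "excellent", c("poor", "fair", "good", "excellent"))  # 1/3
interval_sim(3.2, 5.0)                                                       # 1 / (1 + 1.8)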

Now let’s consider more complex objects, that is, objects with multiple attributes, and discuss the various formulas and methods that come into the picture when calculating distance measures.

Euclidean distance – Euclidean distance is a classical method for computing the distance between two objects A and B in Euclidean space (of 1, 2, or n dimensions). In Euclidean geometry, the distance between two points is found by traveling along the straight line connecting them; the calculation is essentially the Pythagorean Theorem.

D(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + … + (an - bn)^2)

Taxicab or Manhattan distance – similar to the Euclidean distance between points A and B, but the distance is calculated by traversing the vertical and horizontal lines of a grid-based system. For example, the Manhattan distance can be used to calculate the distance between two points in a city that are separated by building blocks.

D(A, B) = |a1 - b1| + |a2 - b2| + … + |an - bn|

The difference between these two distance calculations is best seen visually; the figure below illustrates it.

(Figure: taxicab/Manhattan path versus straight-line Euclidean distance on a grid)

Minkowski

The Minkowski distance is a metric on Euclidean space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance.

D(X, Y) = (|x1 - y1|^r + |x2 - y2|^r + … + |xn - yn|^r)^(1/r)

where r is a parameter.

When r = 1, the Minkowski formula gives the Manhattan distance.

When r = 2, the Minkowski formula gives the Euclidean distance.

When r → ∞, the Minkowski formula gives the supremum (Chebyshev) distance.
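These three distances can be computed in R with one small helper function (the points below are made up):

a <- c(1, 2, 3)
b <- c(4, 6, 8)

minkowski <- function(x, y, r) sum(abs(x - y)^r)^(1 / r)

minkowski(a, b, r = 1)   # Manhattan (taxicab) distance: 3 + 4 + 5 = 12
minkowski(a, b, r = 2)   # Euclidean distance: sqrt(9 + 16 + 25)
max(abs(a - b))          # limit r -> infinity: supremum (Chebyshev) distance = 5

# Base R's dist() gives the same answers
dist(rbind(a, b), method = "manhattan")
dist(rbind(a, b), method = "euclidean")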

Cosine similarity

The cosine similarity between two vectors (or two documents in the vector space model) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation rather than magnitude; it can be seen as a comparison between documents in terms of the angle between them.

cos(A, B) = (A · B) / (||A|| ||B||)
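A minimal R sketch of cosine similarity between two term-frequency vectors (the vectors are made up):

doc1 <- c(2, 1, 0, 3)
doc2 <- c(1, 1, 1, 1)

cosine_similarity <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine_similarity(doc1, doc2)   # close to 1 = same orientation, 0 = orthogonal (unrelated)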

Mahalanobis distance

Mahalanobis distance is one such measure, used to measure the distance between two groups of objects. The idea of a distance measure between two groups of objects is easiest to understand graphically, with each group forming a cloud of points in attribute space.


Given such data, the Mahalanobis distance can be used to calculate the distance between Group 1 and Group 2. This type of distance measure is helpful in classification and clustering.

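In R, the built-in mahalanobis() function computes the squared Mahalanobis distance (x - mu)^t S^(-1) (x - mu) of a point x from a group with mean vector mu and covariance matrix S. A hedged sketch on made-up data:

set.seed(1)
group1 <- matrix(rnorm(100), ncol = 2)     # invented Group 1: 50 two-dimensional observations

mu    <- colMeans(group1)                  # centre of Group 1
sigma <- cov(group1)                       # covariance matrix of Group 1

new_point <- c(3, 3)                       # e.g. the centroid of Group 2
mahalanobis(new_point, center = mu, cov = sigma)   # squared distance of the point from Group 1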

What is correlation ?

Correlation is a statistical technique that gives a number telling how strongly or weakly objects are related. It is not a measure of distance but a measure of the bond between the objects. The correlation value is usually represented by the small letter ‘r’, and r can range from -1 to +1.

If r is close to 0, it means there is no relationship between the objects.

If r is positive, it means that as one object gets larger the other gets larger.

If r is negative it means that as one gets larger, the other gets smaller.

A rough guide to interpreting the value of r:

+.70 or higher: very strong positive relationship

+.40 to +.69: strong positive relationship

+.30 to +.39: moderate positive relationship

+.20 to +.29: weak positive relationship

+.01 to +.19: no or negligible relationship

0: no relationship

-.01 to -.19: no or negligible relationship

-.20 to -.29: weak negative relationship

-.30 to -.39: moderate negative relationship

-.40 to -.69: strong negative relationship

-.70 or lower: very strong negative relationship

Today there are several proven methods (formulas) for computing the correlation measure r, of which Pearson’s correlation coefficient is the most commonly used.

Using Pearson’s correlation coefficient, the correlation can be calculated for objects that have a linear relationship.
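A minimal R sketch (the paired measurements are made up):

x <- c(2, 4, 6, 8, 10)
y <- c(1.9, 4.2, 5.8, 8.5, 9.7)

cor(x, y, method = "pearson")   # r close to +1: strong positive linear relationship

lm(y ~ x)                       # regression goes one step further and predicts y from x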

It may be helpful to see graphically what these correlations look like:

(Figure: example scatter plots for different values of the Pearson correlation r)

It is sometimes easy to confuse correlation with regression analysis. To keep the terms straight: regression analysis helps predict a value, assuming a relationship between the objects exists, whereas correlation helps check whether a relationship between the objects exists at all.

Thus a wise statistician always computes the correlation first before doing any prediction using regression analysis.

What is the difference between Artificial Intelligence, Machine Learning, Statistics, and Data Mining

A few days ago I saw an interesting question on stats.stackexchange.com that caught my attention for a while. After spending a few minutes reading and analyzing all the answers there, I felt like writing down what I would have answered if I really had to.

What is the difference between Artificial Intelligence, Machine Learning, Statistics, and Data Mining ?

 

Would it be accurate to say that they are 4 fields attempting to solve very similar problems but with different approaches? What exactly do they have in common and where do they differ? If there is some kind of hierarchy between them, what would it be?

I assume the author of that question is trying to get a clear picture of the line of separation that distinguishes each field from the others. So here is my attempt to explain it as simply as I can.

Machine learning is a science that involves the development of self-learning algorithms. These algorithms are generic in nature, so they can be applied to problems in various domains.


Data mining is the practice of applying algorithms (mostly machine learning algorithms) to the data available from a domain in order to solve domain-related problems.

Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data. Statistics splits into two branches: descriptive statistics and inferential statistics. Descriptive statistics involves methods of organizing, summarizing, and picturing information from data. Inferential statistics involves methods of using information from a sample to draw conclusions about the population.


Machine learning uses statistics (mostly inferential statistics) to develop self-learning algorithms.

Data mining uses statistics (mostly descriptive statistics) on the results obtained from the algorithms it applies to solve the problem.

Data mining as a field emerged to solve problems in miscellaneous domains (particularly in business), acquiring techniques and practices used in different fields of study.

In 1960, practitioners who solved (mostly business) problems used the term data fishing for the work they did. In 1989, Gregory Piatetsky-Shapiro used the term Knowledge Discovery in Databases (KDD). In 1990, a company used the term data mining as a trademark to represent its work. Today data mining and KDD are used interchangeably.

Artificial Intelligence is a science aimed at developing systems or software that mimic how humans respond and behave in a given circumstance. As a field with an extremely broad scope, AI has divided its goal into multiple chunks, and each chunk later became a separate field of study solving its own problem.


Here is a list of the major AI goals (a.k.a. AI problems):

1. Reasoning
2. Knowledge representation
3. Automated planning and scheduling
4. Machine learning
5. Natural language processing
6. Computer vision
7. Robotics
8. General intelligence, or strong AI

As mentioned in the list, machine learning is a field that emerged from one of the AI goals: to help a machine or piece of software learn on its own to solve the problems it comes across.

Natural language processing is another field that emerged from an AI goal: to help machines communicate with real humans.

Computer vision is a field that emerged from an AI goal: to identify and distinguish the objects a machine can see.

Robotics is a field that emerged from an AI goal: to give a machine a physical form with which to perform physical actions.

If there is some kind of hierarchy between them, what would it be?

One way of representing the hierarchical relationship between these fields is to look at the historical facts of when they emerged.

Origin of each field of study:


Statistics – 1749
Artificial Intelligence – 1940
Machine learning – 1946
Data mining – 1980

The history of statistics is believed to have started around 1749 as a way of representing information. Practitioners used statistics to represent the economic status of states and the material resources put to military use. Later, the use of statistics was broadened to include data analysis and organization.

The history of Artificial Intelligence has two strands, classical and modern. Classical Artificial Intelligence can be seen in ancient stories and writings. Modern AI, however, emerged around 1940 with descriptions of the idea of machines that mimic humans.

In 1946, machine learning emerged as a branch of Artificial Intelligence with the goal of making machines learn by themselves, without being explicitly programmed or hard-wired.

Would it be accurate to say that they are 4 fields attempting to solve very similar problems but with different approaches?


It would be appropriate to say that statistics, Artificial Intelligence, and machine learning are highly interdependent fields that cannot get along without help from one another. It is also useful to see these three fields as one global field instead of three different subjects.

From this perception of one global field, the three fields have each contributed their strengths to solving common goals. As a result, the resulting solutions are applicable in many different domains where the core problem is the same under the hood.

This is where data mining comes into the picture: it takes the solutions obtained from this global field and applies them to different domains (business, military, medicine, space) to solve problems that share the same nature under the hood. This is also when data mining expanded in popularity.

I hope my explanation has everything needed to answer the author’s question, and I believe it will help anyone who is trying to understand the sweet spot of each field and how they are related. If you have anything to say or share about the article, please let me know your thoughts in the comments section.

 

 

 

Low-rank matrix approximation with SVD – Part 2

In part 1 we revised the internal workings of singular value decomposition. Here in part 2, let’s see a practical use of SVD.

All source code and corpus used in this exercise can be found in my GitHub account.

Singular value decomposition (SVD) has a wide range of uses in statistics, but it is best known for its use in low-rank matrix approximation (dimension reduction).

Singular value decomposition

Low rank matrix approximation

SVD can be used to approximate a high-dimensional matrix with a low-dimensional one by lowering the rank of the original matrix.

In other words, keeping the first k singular vectors of U and V and the k corresponding singular values produces the best least-squares approximation of rank k. This approximate matrix is close to the original and often yields nearly the same results as processing the original high-dimensional matrix, so working with the approximated low-dimensional matrix helps reduce computation and processing resources.

To visualize low-rank matrix approximation, let’s work with a matrix that represents an image: we reduce the rank of the matrix, reproduce the image from the low-rank matrix, and check its quality. For this exercise I used R to compute the SVD and play with the matrix representing the image.

All source code and corpus used in this exercise can be found in my GitHub account.
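Purely as an illustration (a random matrix stands in for the image, and the rank k = 10 is an arbitrary choice), the rank-k approximation looks like this in R:

set.seed(7)
A <- matrix(rnorm(100 * 80), nrow = 100)   # stand-in for the image matrix

s <- svd(A)                                # A = U D V^t: s$u, s$d (singular values), s$v

k   <- 10                                  # keep only the first k singular values/vectors
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Relative error of the rank-k approximation (Frobenius norm)
norm(A - A_k, type = "F") / norm(A, type = "F")

# For a real image, A_k can be rendered again with image() or written back to an image file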

(Figure: the original image alongside its low-rank SVD approximations)


With only 10% of the real data, we are able to create a very good approximation of it. Moreover, with this method we can remove noise and linearly dependent elements by keeping only the most important singular values. This is very useful in data mining problems, where it is hard to identify which elements in the data set will be useful for statistical analysis and which will not.

Other notable uses of SVD include computing the pseudoinverse, solving homogeneous linear equations, and latent semantic indexing. In fact, I have written another blog post on Latent Semantic Analysis that uses the SVD technique.

Revising singular value decomposition – Part 1

A few days ago I had some trouble understanding the internal workings of SVD. The worst part is that I have already done a few projects with SVD as a major module. So I decided to revisit and recall everything I need to know about decomposition techniques (this is also when I realized why I should have documented SVD long ago).


Today there are many decomposition techniques available for solving statistical problems; the most popular and widely used are:

  1.  Eigenvalue decomposition
  2.  Singular value decomposition (SVD)
  3.  Principal component analysis (PCA)

This post is dedicated to laymen who would like to understand the internal workings of singular value decomposition.

The word decomposition means reduction or breaking down; as the name indicates, SVD is one of the best factorization (decomposition) techniques available for matrices.

In simpler terms, factorization breaks a bigger matrix into smaller entities such that multiplying those entities gives back the original matrix.

For example, a matrix A of size MxN can be broken down and written in the form A = U D V^t, such that multiplying U * D * V^t gives back the same matrix A.


In the representation A = U D V^t, U and V are orthogonal matrices and D is a diagonal matrix whose diagonal entries (d1, d2, ..., dn) are the singular values of A, i.e. the square roots of the eigenvalues of A^t A.

Proof

To convince ourselves that A = U D V^t makes sense, take the transpose of both sides: A^t = (U D V^t)^t = V D U^t.

Now we have two equations:

A = U D V^t

A^t = V D U^t

Multiplying them, and using the fact that U is orthogonal (U^t U = I):

A^t A = V D U^t U D V^t = V D^2 V^t

A^t A V = V D^2

So the columns of V are eigenvectors of A^t A, and the diagonal entries of D^2 are the corresponding eigenvalues. A symmetric argument with A A^t shows that the columns of U are its eigenvectors, with the same non-zero eigenvalues.

Algorithm

To compute the SVD, find the eigenvalues and eigenvectors of A A^t and A^t A.

Compute U: find the eigenvectors of the matrix A A^t; they form the columns of U.

Compute V: find the eigenvectors of the matrix A^t A; they form the columns of V (and V^t is its transpose).

Compute D: take the square roots of the common positive eigenvalues of A A^t and A^t A and place them on the diagonal of D.

Assign the computed values to U, V, and D.

Let’s try an example. Let A be a small matrix. First form A A^t and A^t A. The eigenvectors of A A^t (ordered by eigenvalue) give the columns of U, the eigenvectors of A^t A give the columns of V, and S (the diagonal matrix, called D above) is built from the square roots of the common positive eigenvalues of A A^t and A^t A.
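The matrices of the original worked example were shown as images; as a stand-in, here is a small R sketch that follows the same steps on an arbitrary matrix and checks them against the built-in svd():

A <- matrix(c(3, 1, 1, 3, 1, 1), nrow = 2)   # a small 2 x 3 example matrix

eig_u <- eigen(A %*% t(A))    # eigenvectors of A A^t -> columns of U
eig_v <- eigen(t(A) %*% A)    # eigenvectors of A^t A -> columns of V

sqrt(eig_u$values)            # square roots of the common positive eigenvalues -> diagonal of S
svd(A)$d                      # R's built-in SVD returns the same singular values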

Continue with part 2 –  Low-rank-matrix-approximation-svd


Dynamic scoping rules

R uses lexical scoping rules; unlike R, many other languages use dynamic scoping to resolve the values of their free variables.

It is good to know what dynamic scoping is and how it differs from lexical scoping. So, following up on my previous post on lexical scoping, I took the liberty of writing a short post on dynamic scoping too.


In simple terms, under dynamic scoping rules the search for a free variable’s value begins in the calling function’s environment instead of the parent (defining) environment.

For example, in R the following code will print apple as red, because the value of apple is assigned by applying lexical scoping rules.
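The original post showed the code as an image; here is a minimal reconstruction of the kind of example described (the helper function foo is an assumption, while goo, apple, red, and green follow the surrounding text):

apple <- "red"            # apple defined in the global (defining) environment

goo <- function() {
  print(apple)            # apple is a free variable inside goo
}

foo <- function() {
  apple <- "green"        # apple defined in the calling function's environment
  goo()
}

foo()                     # lexical scoping (R): prints "red"
                          # dynamic scoping would print "green"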


A language that follows dynamic scoping would print apple as green. This is because, as soon as the goo function starts executing, the value of apple is searched for in the current environment and then in the environment from which goo was called (the calling environment). Note that in dynamic scoping the search extends from the current environment to the calling environment rather than to the parent environment. So, in the example above, we would get apple as green.

This difference in scoping rules paints a completely different picture of how programs execute. Even though dynamic scoping makes certain tasks easier, it brings a comparable amount of difficulty by introducing bugs and debugging headaches for the programmer community.

Lexical scoping rule in R

I am getting familiar with R, and along the way I came across the concept of lexical scoping rules in R functions. I had a slightly difficult time understanding the concept, as I came from a language background with dynamic scoping. The post below (Lexical scoping rule in R) is a short summary of my understanding of lexical scoping.

Generally, scoping rules determine how a value is bound to the free variables in a function. Now, what are free variables? Free variables are variables that are neither formal arguments of the function nor local variables initialized inside the function body.
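The original post illustrated this with a small function shown as an image; here is a minimal sketch along the lines described (the body of f is an assumption):

f <- function(x, y) {     # x and y are formal arguments
  q <- 2                  # q is a local variable, initialized inside the body
  x * y + q / z           # z is a free variable: neither an argument nor defined locally
}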


Here x and y are formal arguments, q is a local variable, and z is a free variable.

In R, lexical scoping comes into the picture when a value needs to be looked up for a free variable like z in the function above.

Lexical scoping rule.

The search begins in the current environment. If the value is not found, the search extends from the current environment to its parent environment. If the value is not found there either, the search extends further to that environment’s parent, and the process continues until the empty environment is reached.

The empty environment does not have a parent environment, so the search stops there.

R uses lexical (static) scoping. In R, every function, workspace, namespace, etc. has something called an environment. Every environment is a collection of (symbol, value) pairs. An environment may have one parent environment and more than one child environment; the only environment without a parent is the empty environment.

So when executing the above function, R uses lexical scoping rules for finding the value of z. Below are two other examples that illustrate the lexical scoping concept even more precisely.
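The first example was shown as an image in the original post; here is a minimal reconstruction of the kind of code described (the function name and the values other than a = 10 are assumptions):

a <- 10                   # 'a' lives in the global environment

f <- function() {
  b <- 2
  a + b                   # 'a' is a free variable inside f
}

f()                       # returns 12: 'a' is found in the parent (global) environment, where a = 10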


In the function above, the value of ‘a’ is not defined in the current environment, i.e. inside the function, so the search for ‘a’ extends to its parent environment, the global environment, where it finds the value 10.
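The second example was also shown as an image; here is a reconstruction along the lines of the explanation below (the exact bodies are assumptions, but the names goo and p1 and the values a = 5 and a = 10 follow the text):

a <- 10                          # global environment

p1 <- function() {               # p1's environment becomes the parent of goo's
  a <- 5
  goo <- function() {
    a * 2                        # 'a' is a free variable inside goo
  }
  goo()
}

p1()                             # returns 10: 'a' is found in p1's environment, where a = 5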


In this slightly more complicated example, the value of ‘a’ is not defined in the current environment, i.e. inside the goo function, so the search for ‘a’ extends to its parent environment, p1, where it finds a = 5 and the search stops.

If the value of ‘a’ had not been found in p1, the search would have extended further to the parent of p1, which is the global environment in this case, and the value of ‘a’ would have been 10.

That’s it! Lexical scoping really is that simple.
I have heard the data mining community say that R’s data processing is limited by the size of physical memory. Is this limitation related to lexical scoping? Let me know your thoughts in the comments.

Word Cloud on Thirukural

Building a word cloud isn’t all that scary once you know you can do it yourself with some of the statistical packages provided in R.

For this expedition I decided to build a word cloud on the Thirukural. The Thirukural is one of the finest masterpieces of Tamil literature, believed to have been written during the Tamil Sangam period. It consists of 133 chapters, each containing 10 couplets, so we get 133 * 10 = 1330 couplets in total. Depending on their message or meaning, the couplets are grouped into three sections: Aram, Porul, and Inbam. Now let me stop giving background facts about the Thirukural and jump to the actual point of this write-up.


The post below explains the process I followed to build a word cloud on the Thirukural. All source code and corpus used in this exercise can be found in my GitHub account.

Steps to build the word cloud on the Thirukural:

  1. Document preparation.

  2. Building Term Frequency matrix.

  3. Constructing the word cloud.

Document preparation.

I quickly jumped on the Internet to find archives (text, .pdf) of the Thirukural couplets in Tamil only. To my surprise, I couldn’t find a single corpus that suited my needs. After surfing the Internet for a few more minutes, I decided to crawl some web pages and create the documents I was interested in myself.

Crawling a web page is a trivial job only if the DOM structure is neatly arranged and the target text is reachable by HTML or CSS selectors. I picked the website http://www.gokulnath.com/thirukurals/1 to crawl, for two major reasons:

1. URL pattern

2. HTML selectors

URL pattern:

The URL pattern of gokulnath.com is very convenient for a web crawler, i.e. I can loop through the URL pattern to extract the couplets of each chapter just by increasing the number at the end of the URL.

http://www.gokulnath.com/thirukurals/1 points to chapter 1
http://www.gokulnath.com/thirukurals/2 points to chapter 2
http://www.gokulnath.com/thirukurals/133 points to chapter 133

HTML selectors.

After reaching a chapter page, it is time to extract the couplets inside it. The HTML selectors of gokulnath.com are not as clean as I had hoped, but they are reasonably better than those of the other sites I looked at. Here the couplets are placed inside a table along with other information I am not looking for, so I made my crawl rules focus on the target text and extract only what I want.

After analyzing the DOM structure, I wrote a simple web crawler that iterates through the URL pattern and extracts the couplets of each chapter. After extraction, I programmed the crawler to write all the couplets of a given chapter to a separate file. Thus, at the end of the crawler’s run I had 133 files on my machine, each containing 10 couplets.

I took 11 minutes to write the crawler. 

It took less than a minute for the crawler to finish the entire crawl job.
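The actual crawler is in my GitHub account; purely as an illustration, a crawler along these lines could be written with the rvest package (the CSS selector "table td" and the output file naming are placeholders, not the selectors I actually used):

library(rvest)

base_url <- "http://www.gokulnath.com/thirukurals/"

for (chapter in 1:133) {
  page <- read_html(paste0(base_url, chapter))

  # Placeholder selector: pick the table cells holding the Tamil couplets
  couplets <- html_text(html_nodes(page, "table td"))

  # Write the couplets of this chapter to their own file
  writeLines(couplets, paste0("chapter-", chapter, ".txt"))
}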


Term Frequency Document (TFD) matrix.

The next step in this expedition is to build the term frequency document (TFD) matrix from the corpus generated by the web crawler. Here rows correspond to terms and columns correspond to documents, and each cell holds the frequency of the term in the corresponding document. The TFD matrix thus gives us the ability to sum along each row, which is nothing but the total frequency of a particular term in the overall corpus.

A more detailed explanation of the TFD matrix can be found in my other post.

In this case, after building the TFD matrix we have 133 columns and 6566 rows. After constructing it, we sum each row to find the total frequency of each term across all documents, and these totals go as input to the word cloud function.

Sample TFD matrix (excerpt): the rows are terms such as அகர and ஆதி, the columns are the documents D1, D2, …, and each cell holds the term’s frequency in that document (mostly 0s and 1s here); the full matrix has 6566 term rows.
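As a hedged sketch of this step (the directory name is an assumption; the actual script is in my GitHub account), the TFD matrix and the per-term totals can be built with the tm package roughly like this:

library(tm)

# Read the 133 chapter files produced by the crawler into a corpus
corpus <- Corpus(DirSource("thirukural-corpus", encoding = "UTF-8"))
corpus <- tm_map(corpus, stripWhitespace)

# Term frequency document matrix: rows = terms, columns = documents
tfd <- TermDocumentMatrix(corpus)

# Total frequency of each term across all 133 documents
term_freq <- sort(rowSums(as.matrix(tfd)), decreasing = TRUE)

# term_freq (term names and their counts) is what feeds the wordcloud() call shown below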

Constructing the word cloud.

As I said in the introduction, building a word cloud isn’t all that scary once you know you can do it yourself with the statistical packages provided in R. Here we use the packages "tm" and "wordcloud". The "tm" package is used for text mining activities such as cleaning up text data; it is typically used to strip whitespace, stem the text, and build the TFD matrix. Since my corpus is in Tamil, I do not have to worry about stemming. The "wordcloud" package is the presentation layer, used to create the word cloud charts and graphs. Using the "wordcloud" parameters, I took the liberty of customizing the colours and the number of words drawn: only words with a frequency of at least 3 are drawn, and at most 133 words can appear in the cloud.

wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

Important parameters in the above code snippet:

scale controls the difference between the largest and smallest font sizes.

max.words limits the number of words in the cloud.

rot.per controls the percentage of vertically rotated text.

That’s it, the word cloud is created. All source code and corpus used in this exercise can be found in my GitHub account.


Strangely, I could not find a word cloud for the Thirukural anywhere on the Internet, although I searched a lot. If that is true, I may be the first person in the world to create a word cloud on the Thirukural! Isn’t that cool? Let me know your thoughts in the comments.

 

Mining Associations with Apriori using R – Part 2

Prologue: I have been working on and practicing various skills and algorithms as progress along my road map to becoming a mature data scientist. As part of this expedition I have decided to document everything I go through, so whatever you read under this column will be either a summary of my understanding or a post explaining the details of an experiment.

In part 1 we discussed the introduction, terminology, algorithm, and applications of Apriori. Now let’s discuss implementing Apriori in R.

In R, to play with association mining we use two important packages called arules and arulesViz, written by Michael Hahsler. The arules package is used for mining transaction records and extracting association rules; the arulesViz package is used as a presentation layer to visualize the association rules.

The data set used in the experiment contains 1 lakh (100,000) records.
Each record represents a transaction containing different items.
To keep it simple, the items in each transaction record are represented as numeric values.
The data set and the entire source code of the program can be found in my GitHub repository.

Usually in R we use a data.frame to hold data for any data mining activity. However, for association mining with the arules package we need to hold the data in transaction form, so our first step is to convert the data into transactions.

trans = read.transactions(file="transaction-dataset.txt", format="basket", sep=" ")

The input file can be in either a normalized 'single' form or a flat-file 'basket' form. When the file is in basket form, each record represents one whole transaction; when the data set is in single form, each record represents one single item and carries a transaction id.
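For illustration only (the item numbers are made up), the two layouts look like this:

# basket format: each line is one whole transaction
#   12 5 9
#   3 12
#   7 5 1 9

# single format: each line is one (transaction id, item) pair
#   1 12
#   1 5
#   1 9
#   2 3
#   2 12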


Once we have loaded the transaction data into the workspace, we call the apriori function to extract association rules.

rules <- apriori(trans, parameter=list(support=0.01, confidence=0.5))

In the above snippet we are telling the algorithm to return only the rules that have support greater than 0.01 and confidence greater than 0.5; the resulting rules are stored in the variable rules. To take a look at the generated rules we can use the inspect function.

inspect(head(sort(rules, by="lift")))

Visualization of rules

Even after filtering the rules using the constraint measures we discussed in part 1, we might end up with a large list of interesting rules output by the apriori function, and it is not viable to go through all the rules one by one. We can use visualization techniques to get deeper insight into the rules: the arulesViz package gives us the ability to draw different charts and graphs without getting our hands too dirty with coding.

Scatter plot: the default plot of the arulesViz package is the scatter plot, in which the association rules are plotted against two axes. Usually the x-axis corresponds to support and the y-axis to confidence; however, these can be changed via the measure parameter.

plot(rules, measure=c("support", "confidence"), shading="lift")

Shading adds an additional coordinate that can be represented in this two-dimensional space; here the colour represents the lift value of each point on the scatter plot.

Two-key plot: the two-key plot is similar to the scatter plot. Here the x-axis corresponds to support and the y-axis to confidence, and the colour of each point represents the order, i.e. the number of items in the rule.

plot(rules, shading="order", control=list(main = "Two-key plot"))

Matrix-based plot: in these graphs we can see the two parts of an association rule: the antecedent (IF) and the consequent (THEN). These patterns are found by determining frequent patterns in the data, identified by their support and confidence.

plot(rules, method="matrix", measure="lift")
plot(rules, method="matrix3D", measure=c("lift", "confidence"))
plot(rules, method="matrix", measure="lift", interactive=TRUE)

Grouped matrix: the grouped matrix is similar to the matrix-based plot; here the rules are grouped and presented as aggregates in the matrix, which can be explored interactively by zooming into and out of groups. The matrix-based and grouped matrix plots are mostly used in interactive mode to understand the data points in the visualization area.
plot(rules, method="grouped")
plot(rules, method="grouped", interactive=TRUE)

Graph plot: used to visualize association rules using vertices and edges, where vertices typically represent items or item-sets and edges indicate the relationships in the rules.
plot(rules, method="graph")
plot(rules, method="graph", control=list(type="items"))

You can download the data set and the entire source code of the program from my GitHub repository. You are also welcome to modify the code to match your expectations. Thank you.
