## Distance Measures – What is Similarity, Dissimilarity and Correlation

The term distance measure has a wide variety of definitions among math and data mining practitioners. As a result, the terms, the concepts and their usage can go over the head of a beginner trying to understand them for the first time. In this post I give simplified, intuitive definitions for things that shouldn’t be hard to understand and follow.

In the simplest form, distance measures are mathematical approaches for measuring the distance between objects. Computing them lets us compare objects from three different standpoints:

1. Similarity
2. Dissimilarity
3. Correlation

Similarity is a measure that typically ranges from 0 to 1, written [0, 1].

Dissimilarity is a measure that ranges from 0 to infinity, written [0, ∞).

Correlation is a measure that ranges from -1 to +1, written [-1, +1].

Each of these measures is explained in more detail later in this post.

Application & motivation

Distance measures such as similarity, dissimilarity and correlation are basic building blocks for activities such as clustering, classification and anomaly detection. Becoming familiar with these metrics and understanding them in depth gives better insight into more advanced data mining algorithms and analyses.

For example, given data on objects that represent fruits, a distance measure helps classify them into apples and oranges.

Vocabulary

When talking about distance measures we cannot avoid introducing the term Distance Transformation, or simply Transformation. Distance Transformation refers to converting a similarity score into a dissimilarity score or vice versa. The need for it arises because a developer or practitioner may be familiar with computing a dissimilarity score while the algorithm they are working with expects a similarity score. Distance Transformation then lets the developer convert the dissimilarity score into a similarity score and pass it on to the algorithm.

Another term that often comes up when discussing distance measures is Proximity. Put simply, proximity is just another name used interchangeably for either a similarity score or a dissimilarity score.

Let’s first consider simple objects with a single attribute and discuss the similarity and dissimilarity measures for them.

Similarity measure – a score that describes how similar objects are to each other. Similarity typically ranges from 0 to 1, written [0, 1].

Dissimilarity measure – a score that describes how dissimilar objects are to each other. Dissimilarity ranges from 0 to infinity, written [0, ∞).

There is a variety of formulas for computing similarity and dissimilarity for simple objects, and the choice of formula is determined by the type of attribute (Nominal, Ordinal, Interval or Ratio) in the objects.

The table below summarizes the similarity and dissimilarity formulas for simple objects, where X and Y are the attribute values of the two objects.

| Attribute type | Similarity (S) | Dissimilarity (D) |
| --- | --- | --- |
| Nominal | S = 1 if X = Y; S = 0 if X ≠ Y | D = 0 if X = Y; D = 1 if X ≠ Y |
| Ordinal | S = 1 – D | D = \|X – Y\| / (n – 1), where n is the number of values |
| Interval or Ratio | S = 1 / (1 + D), or S = 1 – (D – min(D)) / (max(D) – min(D)) | D = \|X – Y\| |

Note: As mentioned earlier, in some situations it is easier to compute the dissimilarity first and then convert it to a similarity measure (for example, for Ordinal attributes).

Now let’s consider more complex objects, that is, objects with multiple attributes, and discuss the formulas and methods that come into play when calculating the distance between them.
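The formulas in the table can be sketched in a few lines of Python. This is only an illustration: the function names, the `RATING` scale and the sample values are made up for this example, and ordinal values are assumed to be already mapped to integer ranks 0 to n – 1.

```python
def nominal_dissimilarity(x, y):
    """D = 0 if the two values match, 1 otherwise."""
    return 0 if x == y else 1


def nominal_similarity(x, y):
    """S = 1 if the two values match, 0 otherwise."""
    return 1 - nominal_dissimilarity(x, y)


def ordinal_dissimilarity(x, y, n):
    """D = |x - y| / (n - 1), with x and y given as ranks 0..n-1."""
    return abs(x - y) / (n - 1)


def ordinal_similarity(x, y, n):
    """S = 1 - D: compute the dissimilarity first, then transform it."""
    return 1 - ordinal_dissimilarity(x, y, n)


def interval_dissimilarity(x, y):
    """D = |x - y|."""
    return abs(x - y)


def interval_similarity(x, y):
    """S = 1 / (1 + D), which maps [0, infinity) onto (0, 1]."""
    return 1 / (1 + interval_dissimilarity(x, y))


def min_max_similarity(dissimilarities):
    """S = 1 - (D - min(D)) / (max(D) - min(D)) for a list of D values."""
    lo, hi = min(dissimilarities), max(dissimilarities)
    if hi == lo:
        # Assumption for this sketch: if all distances are equal,
        # treat every pair as maximally similar.
        return [1.0 for _ in dissimilarities]
    return [1 - (d - lo) / (hi - lo) for d in dissimilarities]


# Hypothetical ordinal scale, mapped to ranks 0..3.
RATING = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}

print(nominal_similarity("red", "green"))                                 # 0
print(ordinal_similarity(RATING["poor"], RATING["good"], n=len(RATING)))  # about 0.33
print(interval_similarity(20.0, 23.0))                                    # 0.25
print(min_max_similarity([0, 5, 10]))                                     # [1.0, 0.5, 0.0]
```

The last two functions are both Distance Transformations: they take a dissimilarity and turn it into a similarity in [0, 1].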

Euclidean distance – Euclidean distance is the classical method for computing the distance between two objects A and B in Euclidean space (of 1, 2 or n dimensions). In Euclidean geometry, the distance between two points is the length of the straight line connecting them. The calculation inherently uses the Pythagorean theorem.

Taxicab or Manhattan distance – Similar to the Euclidean distance between points A and B, except that the distance is calculated by travelling only along the vertical and horizontal lines of a grid. For example, Manhattan distance is used to calculate the distance between two points that are separated by building blocks in a city.

The difference between these two distance calculations is best seen visually. The figure illustrates the difference.
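The difference can also be seen numerically. Below is a minimal sketch, assuming each object is represented as a sequence of numeric attributes of equal length; the function names and the sample points are chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = (0, 0), (3, 4)
print(euclidean(a, b))  # 5.0 (the hypotenuse of a 3-4-5 triangle)
print(manhattan(a, b))  # 7   (3 blocks across plus 4 blocks up)
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, because the grid route cannot be shorter than the straight line.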

Minkowski

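The Minkowski distance is a metric on Euclidean space which can be considered a generalization of both the Euclidean distance and the Manhattan distance. For two objects A and B with n attributes it is defined as

$$ d(A, B) = \left( \sum_{k=1}^{n} |a_k - b_k|^{r} \right)^{1/r} $$

where r is a parameter, and a_k and b_k are the k-th attributes of A and B.

When r = 1, the Minkowski formula reduces to the Manhattan distance.

When r = 2, the Minkowski formula reduces to the Euclidean distance.

When r → ∞, the Minkowski formula converges to the Supremum distance, that is, the largest difference found in any single attribute.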
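The three special cases can be checked with a small sketch. It assumes the usual Minkowski definition (the sum of the absolute attribute differences raised to the power r, then the r-th root of that sum) and handles r = ∞ explicitly as the maximum difference; the function name and sample points are illustrative only.

```python
import math


def minkowski(a, b, r):
    """Minkowski distance between equal-length sequences a and b.

    r = 1 gives Manhattan, r = 2 gives Euclidean, and r = math.inf
    gives the supremum distance (largest single-attribute difference).
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))         # 7.0 (Manhattan)
print(minkowski(a, b, 2))         # 5.0 (Euclidean)
print(minkowski(a, b, math.inf))  # 4   (supremum)
```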

Cosine similarity

The cosine similarity between two vectors (or two documents in the vector space model) is the cosine of the angle between them. This metric measures orientation, not magnitude; it can be seen as a comparison between documents in terms of the angle between them.
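As a minimal sketch, cosine similarity can be computed as the dot product of the two vectors divided by the product of their lengths. The vectors below are made-up term counts, used only to show that magnitude does not matter while direction does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between equal-length vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


doc1 = [1, 2, 0]
doc2 = [2, 4, 0]  # same direction as doc1, twice the magnitude
doc3 = [0, 0, 5]  # shares no terms with doc1

print(cosine_similarity(doc1, doc2))  # approximately 1: same orientation
print(cosine_similarity(doc1, doc3))  # 0.0: orthogonal vectors
```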

Mahalanobis distance

Mahalanobis distance is a measure used to compute the distance between two groups of objects. Unlike the Euclidean distance, it takes into account how the attributes vary and co-vary within the data (the covariance). The idea of a distance between two groups of objects can be represented graphically for better understanding.

Given data like that depicted in the picture, Mahalanobis distance can calculate the distance between Group1 and Group2. This type of distance measure is helpful in classification and clustering.
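A rough sketch of the idea for two-dimensional data is shown below. It rests on several assumptions that are not stated in this post: the distance is taken between the two group means, the two groups share a pooled sample covariance matrix (one common choice), and the points in `group1` and `group2` are invented for illustration. The 2×2 matrix inverse is written out by hand so the example needs only the standard library.

```python
import math


def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def covariance_2d(points):
    """Sample covariance matrix (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def pooled_covariance(g1, g2):
    """Covariance shared by both groups, weighted by degrees of freedom."""
    c1, c2 = covariance_2d(g1), covariance_2d(g2)
    w1, w2 = len(g1) - 1, len(g2) - 1
    return [[(w1 * c1[i][j] + w2 * c2[i][j]) / (w1 + w2) for j in range(2)]
            for i in range(2)]


def mahalanobis_2d(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)) for 2-D vectors u and v."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = u[0] - v[0], u[1] - v[1]
    # The inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Invented sample data for two groups of 2-D objects.
group1 = [(1, 2), (2, 3), (3, 5), (4, 4)]
group2 = [(6, 7), (7, 9), (8, 8), (9, 10)]

cov = pooled_covariance(group1, group2)
print(mahalanobis_2d(mean_vector(group1), mean_vector(group2), cov))
```

With an identity covariance matrix the result equals the Euclidean distance; the covariance is what rescales each direction by how much the data spreads along it.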

What is correlation?

Correlation is a statistical technique that gives a number telling how strong or weak the relationship between objects is. It is not a measure of distance but a measure of the bond between the objects. The correlation value is usually denoted by the lowercase letter ‘r’, and ‘r’ ranges from -1 to +1.

If r is close to 0, it means there is little or no linear relationship between the objects.

If r is positive, it means that as one object gets larger, the other gets larger.

If r is negative, it means that as one gets larger, the other gets smaller.

| r value | Interpretation |
| --- | --- |
| +.70 or higher | Very strong positive relationship |
| +.40 to +.69 | Strong positive relationship |
| +.30 to +.39 | Moderate positive relationship |
| +.20 to +.29 | Weak positive relationship |
| +.01 to +.19 | No or negligible relationship |
| 0 | No relationship |
| -.01 to -.19 | No or negligible relationship |
| -.20 to -.29 | Weak negative relationship |
| -.30 to -.39 | Moderate negative relationship |
| -.40 to -.69 | Strong negative relationship |
| -.70 or lower | Very strong negative relationship |

There are several proven methods (formulas) for computing the correlation measure ‘r’, of which Pearson’s correlation coefficient is a commonly used one.

Pearson’s correlation coefficient can be used to calculate the correlation between objects that have a linear relationship.
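As a minimal sketch, Pearson’s r can be computed as the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name and the sample data are chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # close to +1: perfect positive line
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # close to -1: perfect negative line

# A perfect but non-linear relationship (y = x squared) still gives r = 0,
# which is why Pearson's r is only meaningful for linear relationships.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0
```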

It may be helpful to see graphically what these correlations look like:

It is sometimes easy to confuse correlation with regression analysis. To tell them apart: correlation helps us check whether a relationship exists between the objects, whereas regression analysis helps us predict a value once such a relationship exists.

Thus a wise statistician always computes the correlation first, before doing any prediction using regression analysis.