Prologue: I have been practicing various skills and algorithms as part of my road map to becoming a mature data scientist. As part of this expedition I have decided to document everything I work through. So whatever you read in this column will be either a summary of my understanding or a post explaining the details of an experiment.

The source code and data set used in this post can be found in my GitHub repository: https://github.com/shakthydoss/simple-liner-regression-in-r

The objective of this post is to show how to perform regression analysis in R.

Regression analysis is a mathematical approach used in statistics and machine learning to describe the relationship between two variables, X and Y.

X is the independent variable.

Y is the dependent variable.

Once we have described the relationship between the variables, we can predict the unknown variable Y from the known variable X.

Let's assume we have been given a problem and a sample data set, from which we have to come up with a learning algorithm and a regression model that predicts the unknown variable Y from the known variable X.

The given data set contains 30 rows and 3 columns: index, age of patient, and blood pressure.

| Index | Age of patient | Blood Pressure |
|---|---|---|
| 1 | 53 | 134 |
| 2 | 45 | 125 |
| 3 | 41 | 144 |
| .. | .. | .. |

Using the sample data set above, we have to predict the unknown blood pressure of a patient whose age is known (e.g. age = 43).

**Representation of the regression model.**

The first step in regression analysis is to represent the relationship between the variables.

Here the variables are

X –> Age

Y –> Blood pressure

This type of problem is called univariate regression analysis (because Y is determined by a single variable X).

Let's say you have plotted the data set, with age on the X axis and blood pressure on the Y axis.

Now what you want is to predict the blood pressure of a patient whose age is 43. So what our learning algorithm has to do is fit a line to the data set. We then predict the unknown blood pressure by reading off the value of the line at the given X.

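The line we have fitted to the data set is called the model; by convention it is also called the hypothesis function. (The cost function is a different thing: it measures how far the line is from the data points.)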

The mathematical representation of the model is

$$h(x) = \theta_0 + \theta_1 x$$

The values of the parameters theta-zero ($\theta_0$, the intercept) and theta-one ($\theta_1$, the slope) determine how well the line fits the data set.

There are many ways to find and adjust the values of theta-zero and theta-one for a given problem, but let's not discuss them here. Let's assume we have already obtained the values of theta-zero and theta-one, and proceed to understand the big picture of the prediction mechanism.

Now let's try to predict the unknown blood pressure for a given age.

Substitute the values of theta and x.

$$h(x) = \theta_0 + \theta_1 x$$

$$Y = h(x)$$

$$Y = \theta_0 + \theta_1 x$$

$$Y = 98.7147 + 0.9709 \times 43 \approx 140.46$$

Where

$\theta_0 = 98.7147$

$\theta_1 = 0.9709$

$x = 43$
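The substitution above can be sketched in R as a small function. This is only an illustration of the arithmetic: the names `theta0`, `theta1` and `h` are my own choices, and the two numbers are simply the intercept and slope quoted above.

```r
# Intercept and slope quoted in the text (assumed to be already estimated)
theta0 <- 98.7147  # intercept
theta1 <- 0.9709   # slope: change in blood pressure per year of age

# Hypothesis function h(x) = theta0 + theta1 * x
h <- function(x) theta0 + theta1 * x

# Predicted blood pressure for a patient aged 43
predicted_bp <- h(43)
print(round(predicted_bp, 2))  # about 140.46
```

Because `h` is vectorised, a call such as `h(c(43, 50))` would return one prediction per age.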

**The ‘R’ way**

Now let's get our hands dirty and start coding the regression in R.

R is a language with excellent support for statistical computing and graphics. It is widely used among statisticians and data miners for developing statistical software and for data analysis.

What we will do in the code:

1. Load the data set.

2. Plot the data set on a graph for visual interpretation.

3. Determine the model and fit it to the data set.

4. Predict the unknown using the model.

**1. Loading the dataset.**

Function *read.csv*

Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

By default the first line of the file is treated as the header. If the data set has no header, pass header = FALSE. Also by default, read.csv assumes the values are separated by commas; if they are not, the delimiter has to be passed to the function (the sep argument).
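A minimal sketch of loading a CSV file is shown below. To keep it runnable without the real data file, it first writes the three rows shown in the table earlier to a temporary file; the column names `index`, `age` and `blood_pressure` are my assumption and may differ from the file in the repository.

```r
# Write the three sample rows to a temporary CSV file (stand-in for the real file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("index,age,blood_pressure",
             "1,53,134",
             "2,45,125",
             "3,41,144"),
           csv_path)

# header = TRUE and sep = "," are the defaults; they are spelled out for clarity
patients <- read.csv(csv_path, header = TRUE, sep = ",")

# Inspect the structure of the resulting data frame
str(patients)
```

With the real data set, `csv_path` would simply be replaced by the path to the downloaded file.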

**2. Plotting the graph.**

Function *plot*

Generic function for plotting R objects. For a scatter plot, plot takes the x and y coordinates.

x: the x coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided.

y: the y coordinates of points in the plot, optional if x is an appropriate structure.
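As a rough illustration, a scatter plot of age against blood pressure could be drawn as follows. The data frame holds only the three rows shown in the table earlier, and the column names and axis labels are my own choices.

```r
# Three sample rows from the table (the full data set has 30 rows)
patients <- data.frame(age = c(53, 45, 41),
                       blood_pressure = c(134, 125, 144))

# x = age, y = blood pressure; xlab, ylab and main set the labels and title
plot(patients$age, patients$blood_pressure,
     xlab = "Age of patient",
     ylab = "Blood pressure",
     main = "Blood pressure against age")
```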

**3. Model representation**

Function *lm*

lm (linear model) is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance.
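A small sketch of fitting the model with lm follows. It uses only the three rows shown in the table earlier, so the fitted numbers will not match the intercept and slope quoted in this post, which presumably come from the full 30-row data set. The column names are my assumption. Since lm fits by ordinary least squares, the sketch also recomputes the slope and intercept from the usual closed-form expressions as a sanity check.

```r
# Three sample rows from the table (not the full data set)
patients <- data.frame(age = c(53, 45, 41),
                       blood_pressure = c(134, 125, 144))

# Fit blood_pressure as a linear function of age
model <- lm(blood_pressure ~ age, data = patients)
print(coef(model))  # named vector: (Intercept), age

# Closed-form least-squares estimates for a single predictor
slope <- cov(patients$age, patients$blood_pressure) / var(patients$age)
intercept <- mean(patients$blood_pressure) - slope * mean(patients$age)

# Both routes should agree up to floating-point error
print(all.equal(unname(coef(model)), c(intercept, slope)))
```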

**4. Predict the unknown using the model.**

The lm function returns the *model*, from which the unknown variable can be predicted. The model object has a vector called coefficients. The values in this vector correspond to theta-zero and theta-one. Substituting these values into the model equation therefore gives the unknown variable Y.
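The prediction step might look like the sketch below. It again uses only the three rows shown in the table earlier (with column names I have assumed), so the predicted value is illustrative rather than the result for the full data set. It computes the prediction by hand from the coefficients and, as an alternative, with R's built-in predict function.

```r
# Three sample rows from the table (not the full data set)
patients <- data.frame(age = c(53, 45, 41),
                       blood_pressure = c(134, 125, 144))
model <- lm(blood_pressure ~ age, data = patients)

# Extract intercept and slope from the fitted model
theta <- coef(model)

# Substitute into Y = intercept + slope * x for age = 43
by_hand <- theta[["(Intercept)"]] + theta[["age"]] * 43

# The same prediction using predict(); newdata must use the same column name
by_predict <- predict(model, newdata = data.frame(age = 43))

print(by_hand)
print(by_predict)
```

In practice predict is usually the more convenient route, because it keeps working unchanged when the model has more than one predictor.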

Continue with part-2 for multivariate regression analysis: http://shakthydoss.com/prediction-using-simple-liner-regression-r-part-2/

## 4 Comments

at 10:48 PM - 12th January 2014 Permalink

sakthi, have you learnt how linear regression modeling works? consider a multivariate example and see how many challenges you will have to face.

at 10:57 PM - 12th January 2014 Permalink

yeah sure, I will definitely give a try 🙂
