An Introduction to Linear Regression

We present a short introduction to one of the most common regression approaches – a linear regression model.

Author: Goran Sukovic, PhD in Mathematics, Faculty of Natural Sciences and Mathematics, University

Linear regression is a special case of regression analysis, a statistical technique for modeling the relationship between variables. The main idea underlying regression analysis is simple: we want to describe how an output feature (the response variable of special interest) depends on one or several input features, also called predictors or explanatory variables. Examples of applied problems and questions in which regression might be useful:

1. predict the happiness, on a scale from 1 to 10, based on the annual income
2. predict forced exhalation volume (FEV), a measure of how much air somebody can forcibly exhale from their lungs, based on the age in years. (“An Exhalent Problem for Teaching Statistics”, The Journal of Statistical Education, 13(2)).
3. determine the sales (in thousands of units) for a product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media (Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani – “An Introduction to Statistical Learning with Applications in R”, 7th Edition, Springer, 2014)
4. determine the apartment price based on size, location, floor, closeness to bus station/subway station.
5. predict the growth of the plants based on soil quality, humidity, and fertilizer
6. find the height of a child if you know the heights of the parents and nutrition
7. predict the weight, blood pressure, and cholesterol level of students based on their eating habits (e.g., number of ounces/grams of red meat, chicken, fish, and dairy consumed during one week)

In this article, we review some of the key ideas underlying the linear regression model. Linear regression is one of the simplest parametric approaches for predicting a quantitative response. More formally, a linear regression model with k predictor variables X₁, X₂, …, X_k and a response Y can be written as Y = β₀ + β₁X₁ + β₂X₂ + · · · + β_kX_k + ε, where ε is the error term of the model. If we use a single predictor variable to predict the value of the output variable, we are talking about simple linear regression. Simple linear regression models are suitable for examples one and two above. The term multiple regression is reserved for models with two or more predictors and one response, such as examples three through six above. Example seven, which has two or more outputs, is usually called a multivariate regression model.

Although there are many “fancier” machine learning and statistical approaches, linear regression is still a very useful and widely used method in medicine, engineering, the social sciences, and economics, to mention just a few areas. As we will see in the following articles, many other methods were developed as generalizations or extensions of the linear regression model.

Even if you are not good at math, it is a good idea to develop some geometric intuition about the linear regression model. A model with one predictor variable and a response variable, Y = β₀ + β₁X₁ + ε, can be understood as a line in the Cartesian coordinate system. The intercept β₀ and the slope β₁ are unknown constants, and ε is a random error component. In other words, we assume that there is approximately a linear relationship between X₁ and Y, written as Y ≈ β₀ + β₁X₁, where “≈” can be read as “is approximately modeled as”. The observations (x_i, y_i) are points in the plane, and the line is “fitted” to best approximate them. In this case, we interpret Y as a continuous dependent variable and X₁ as an independent variable assumed to be non-random; the random error ε has the properties E(ε) = 0 and Var(ε) = σ². To make a prediction using this formula, we must use a data set to estimate the unknown coefficients β₀ and β₁. Let (x₁, y₁), (x₂, y₂), …, (x_n, y_n) denote a data set with n observation pairs. The following figure presents a data set for n = 200 different markets in the Advertising example, where the input variable X₁ is the TV advertising budget and the output variable is product sales in thousands of units.
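To make the role of the error term concrete, here is a minimal Python sketch that generates artificial data from a simple linear model. Everything specific in it is an assumption made for illustration: the function name `simulate_simple_linear_data`, the coefficient values, the range of the inputs, and the choice of normally distributed errors (the model itself only requires E(ε) = 0 and Var(ε) = σ²). It is not the Advertising data set.

```python
import random


def simulate_simple_linear_data(n, beta0, beta1, sigma, seed=0):
    """Draw n pairs (x_i, y_i) with y_i = beta0 + beta1 * x_i + eps_i."""
    rng = random.Random(seed)
    # Inputs are generated once on a made-up range and then treated as fixed.
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Gaussian errors with mean 0 and standard deviation sigma (an assumption;
    # any error distribution with mean 0 and variance sigma^2 fits the model).
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


# Hypothetical "true" coefficients, chosen only for this illustration.
xs, ys = simulate_simple_linear_data(n=200, beta0=2.0, beta1=0.5, sigma=1.0)

# The errors scatter around zero, so the points scatter around the line.
errors = [y - (2.0 + 0.5 * x) for x, y in zip(xs, ys)]
print("average error:", sum(errors) / len(errors))
```

In a real problem the coefficients are unknown, and only the pairs (x_i, y_i) are observed; estimating β₀ and β₁ from such pairs is the fitting task.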

Our goal is to find the line that best fits the data set. In other words, our line should be as close as possible to the given data points. There are many ways to measure “closeness”. One of the most common is the least-squares criterion, which minimizes the sum of squared errors. Soon we will discuss alternative methods.

Let f(x_i) = β₀ + β₁x_i be the prediction for the output Y based on the ith value of X. Then, for each i, calculate the difference between the observed response value and the value predicted by our model: e_i = y_i − f(x_i). This value is known as the ith residual. The residual sum of squares (RSS) is defined as e₁² + e₂² + … + e_n², or equivalently as (y₁ − f(x₁))² + (y₂ − f(x₂))² + … + (y_n − f(x_n))². Using standard methods for minimizing a real-valued function of two variables (setting the partial derivatives with respect to β₀ and β₁ to zero), we can find closed-form formulas for the unknown coefficients.
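For reference, minimizing the RSS leads to the well-known closed-form estimates β₁ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and β₀ = ȳ − β₁x̄, where x̄ and ȳ are the sample means. The Python sketch below applies these formulas; the function names `fit_simple_ols` and `rss` and the five data points are made up for illustration and are not taken from the Advertising data set.

```python
def fit_simple_ols(xs, ys):
    """Return the least-squares estimates (beta0, beta1) for y ≈ beta0 + beta1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def rss(xs, ys, beta0, beta1):
    """Residual sum of squares of the line beta0 + beta1 * x on the data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_ols(xs, ys)
print("intercept:", beta0, "slope:", beta1)
print("RSS at the fitted line:", rss(xs, ys, beta0, beta1))
# Any other line gives a larger RSS, for example a slightly steeper one:
print("RSS at a perturbed line:", rss(xs, ys, beta0, beta1 + 0.1))
```

For these five points the formulas give a slope close to 1.99 and an intercept close to 0.05, and moving either coefficient away from those values increases the RSS.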

Example: the Advertising data set from example three above, from the book by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, 7th Edition, Springer, 2014. The following figure presents three different simple regression models, based respectively on the variables TV, radio, and newspaper:

A multiple regression model with two predictor variables X₁ and X₂ and a response variable Y, written as Y = β₀ + β₁X₁ + β₂X₂ + ε, can be understood as a plane in space. As in the case of simple linear regression, the observations are points in space, and the plane is “fitted” to best approximate them.

Example: from the lecture notes of M. Bremer, Math 261 A, 2012. For the model y = 50 + 10X₁ + 7X₂ (that is, β₀ = 50, β₁ = 10, and β₂ = 7), the plane looks like this:

Let f(x_i) = β₀ + β₁x_i1 + β₂x_i2 + … + β_kx_ik be the prediction for the output Y based on the ith observation, where x_ij denotes the value of the jth predictor in that observation. The formula for the residual sum of squares (RSS) for multiple regression is the same as in the case of simple linear regression. As before, we can use some calculus to determine the values of the model parameters: find the derivatives of the RSS with respect to β₀, …, β_k, set them equal to zero, and derive the equations that our parameters have to fulfill.
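Setting those derivatives to zero gives a system of k + 1 linear equations, usually called the normal equations and written in matrix form as (XᵀX)β = Xᵀy, where X is the matrix of observations with an extra leading column of ones for the intercept. The following Python sketch builds and solves this system with plain Gaussian elimination. The function names `solve_linear_system` and `fit_multiple_ols` and the small noise-free data set are assumptions made for illustration; in practice one would normally rely on a numerical library routine such as NumPy's `numpy.linalg.lstsq` rather than hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_ols(rows, ys):
    """Return [beta0, beta1, ..., beta_k] minimizing the RSS.

    rows holds one tuple of k predictor values per observation.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up, noise-free data generated from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
print(fit_multiple_ols(rows, ys))
```

Because the example data contain no noise, the recovered coefficients match the generating values 1, 2, and −3 up to floating-point rounding; with noisy data the estimates would only be close to them.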

We gave a short introduction to one of the most common regression approaches, the linear regression model. The model belongs to the family of parametric supervised machine learning algorithms. Soon we will discuss how to assess the accuracy of the coefficient estimates.

Thanks for reading this article. If you like it, please recommend and share it.

Uhura Solutions

Uhura Solutions LTD
6th Floor 9 Appold Street, London EC2A 2AP, United Kingdom
