Beginning The Machine Learning Journey With Linear Regression

Linear Regression
Linear regression is a statistical technique that models the relationship between numerical variables. It is a linear model: it assumes that a linear relationship exists between the input variable and the output variable.
It takes into account the input variable (x) and the output variable (y), and implies that y can be calculated from a linear combination of the input variables.
Linear Regression Model Representation
Linear regression can be expressed in terms of an equation as:
y = B0 + B1 * x
Here x is the input variable and y is the output. B stands for the Greek letter beta; the coefficients are scalar factors assigned to each input variable. An additional coefficient, B0, is included to incorporate the intercept, or bias.
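To make the representation concrete, here is a minimal sketch in Python; the coefficient values B0 and B1 are hypothetical, chosen purely for illustration:

```python
# A minimal sketch of the model equation y = B0 + B1 * x,
# using hypothetical coefficient values.
B0 = 50.0  # intercept (bias)
B1 = 2.5   # coefficient (slope) assigned to the input variable

def predict(x):
    """Return the model's prediction for a single input value x."""
    return B0 + B1 * x

print(predict(10))  # 50.0 + 2.5 * 10 = 75.0
```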
Types of Linear Regression
Simple Linear Regression: It takes into account a single input variable (x) and uses it to predict the output variable (y).
Example: predicting the price of a house based on its square footage. Here, the square footage of the house is the input variable and the price of the house is the output variable.
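A minimal sketch of this example with scikit-learn; the square footage and price figures are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: square footage (input) and house price (output).
square_feet = np.array([[800], [1000], [1200], [1500], [1800]])
price = np.array([150_000, 180_000, 210_000, 265_000, 320_000])

model = LinearRegression()
model.fit(square_feet, price)

print("B1 (slope):", model.coef_[0])
print("B0 (intercept):", model.intercept_)
print("Predicted price for 1300 sq ft:", model.predict([[1300]])[0])
```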
Multiple Regression: More than one input variable is involved in predicting the output variable (y).
Example: using the area of a house, the number of rooms, and the house style to predict the house price. Here, multiple input variables (area of the house, number of rooms, house style) are used to predict the house price, which is the output variable.
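The multiple-regression case looks almost identical in code; the three input columns (area, number of rooms, and a numeric house-style code) are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [area in sq ft, number of rooms, house style code].
# HouseStyle is categorical, so it is encoded as a number here
# (0 = one-storey, 1 = two-storey) purely for illustration.
X = np.array([
    [800, 2, 0],
    [1200, 3, 0],
    [1500, 3, 1],
    [1800, 4, 1],
    [2200, 5, 1],
])
y = np.array([150_000, 210_000, 270_000, 330_000, 400_000])

model = LinearRegression().fit(X, y)
print("Coefficients (B1, B2, B3):", model.coef_)
print("Intercept (B0):", model.intercept_)
```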
Regularization
It is a technique where we add information to the regression equation, shrinking the coefficients toward (or to) zero, to avoid overfitting and reduce the complexity of the problem. It is used when there is collinearity among the input values.
Types of Regularization-Based Regression
Lasso Regression: It is also known as L1 regularization. It is a procedure where Ordinary Least Squares is modified to also penalize the sum of the absolute values of the coefficients.
Example: if there are 10,000 features available to predict the output variable, the Lasso model selects only a few of them and shrinks the coefficients of the rest to exactly zero.
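A small sketch of this selection effect, using scikit-learn's Lasso on synthetic (hypothetical) data where only the first three of fifty features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 50 features, but only the first
# three features actually influence the target.
X = rng.normal(size=(100, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)

# L1 regularization drives the coefficients of most irrelevant
# features to exactly zero.
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))
```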
Ridge Regression: It is also known as L2 regularization. It is a procedure where Ordinary Least Squares is modified to also penalize the sum of the squares of the coefficients. When the coefficients used in the regression are unbalanced, we introduce an alpha value to improve the model; a larger alpha shrinks the coefficients more strongly.
Example: when we are trying to predict the sales of outlets and the type of outlet carries a much higher weight than the weight of the items sold there, we introduce alpha, which shrinks the coefficients.
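A brief sketch of the effect of alpha, using scikit-learn's Ridge on hypothetical data with two collinear features:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data with two highly correlated (collinear) features.
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=100)])
y = 2 * x1 + rng.normal(scale=0.1, size=100)

# A larger alpha shrinks the coefficients more strongly toward zero.
for alpha in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficients={ridge.coef_}")
```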
Gradient Descent
It is a process of optimizing the coefficients by iteratively minimizing the error of the model on the training data. On each iteration, the coefficients are updated in proportion to a learning rate so as to reduce the error. The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible.
The learning rate (alpha) is the size of the improvement step taken on each iteration of the procedure and should be chosen carefully.
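A minimal from-scratch sketch of this procedure for simple linear regression; the data and learning rate are hypothetical values chosen for illustration:

```python
import numpy as np

# Gradient descent for y = B0 + B1 * x, minimizing the mean squared
# error on hypothetical training data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

B0, B1 = 0.0, 0.0
learning_rate = 0.05  # the step size discussed above

for _ in range(1000):
    error = (B0 + B1 * x) - y                 # per-sample prediction error
    B0 -= learning_rate * error.mean()        # gradient step for the intercept
    B1 -= learning_rate * (error * x).mean()  # gradient step for the slope

print(B0, B1)  # approaches the least-squares fit (about 1.0 and 2.0)
```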
Types of Gradient Descent
Stochastic Gradient Descent: This method iterates through the training set and, for each training example it comes across, updates the parameters according to the error gradient of that single example only.
Example: if the training data has 200 samples, then the parameters are updated 200 times per pass, once for every individual sample used in the model.
Batch Gradient Descent: This method looks at every example in the entire training set on every step and performs one update based on the error gradient aggregated over all of them.
Example: if the training set has 100 samples, then the parameters of the model are updated only once per pass, based on all 100 examples.
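Note that the loop in the earlier sketch is effectively batch gradient descent: it averages the error over all samples before each update. A stochastic variant, updating once per individual sample, might look like this (same hypothetical data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

B0, B1 = 0.0, 0.0
learning_rate = 0.01

for _ in range(1000):             # 1000 passes (epochs) over the data
    for xi, yi in zip(x, y):      # one update per sample: 5 updates/pass
        error = (B0 + B1 * xi) - yi
        B0 -= learning_rate * error
        B1 -= learning_rate * error * xi

print(B0, B1)  # hovers near the batch solution (about 1.0 and 2.0)
```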
Regression Line Properties
Considering regression coefficients as B0 and B1, the line has the following properties:
- The line minimizes the sum of squared differences between the actual values and predicted values.
- The regression line passes through the point (mean of X, mean of Y).
- B0 is the y-intercept of the regression line.
- B1 is the average change in Y for a 1-unit change in X; it is also known as the slope of the regression line.
The least-squares regression line is the only straight line that has all of these properties.
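These properties are easy to check numerically; here is a small sketch using NumPy's polyfit on hypothetical data:

```python
import numpy as np

# Fit the least-squares line on hypothetical data and verify that it
# passes through the point (mean of X, mean of Y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

B1, B0 = np.polyfit(x, y, deg=1)  # slope and intercept

print(B0 + B1 * x.mean())  # prediction at the mean of X...
print(y.mean())            # ...matches the mean of Y
```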
Defining The Relationship Between Input And Output Variable
When B1 > 0, the x and y variables have a positive relationship: an increase in x will increase y.
When B1 < 0, the x and y variables have a negative relationship: they are inversely related, so if x increases, y will decrease.
For example, when we are trying to predict the house price, the house type and the number of rooms used to define the model are the input variables, and the house price is the output variable.
How To Check Model Performance?
We plot the actual values and the predicted values on a graph. The main idea is to find the line that best fits the data; the best line is the one for which the total prediction error is smallest. The error is the distance between a data point and the regression line.
[Figure: data points and the best-fit regression line. Source: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86]
The error is squared so that positive and negative differences do not cancel each other out.
R-Squared Value
This value ranges from 0 to 1, where 0 means the predictor X has no effect on y and 1 means the predictor fully explains the changes in y. It is computed from three sums of squares, via R-squared = SSR / SSTO = 1 - SSE / SSTO; a short numerical sketch follows the list below.
1. Regression sum of squares(SSR)
It tells us how far the regression line is from the mean line: the sum of squared differences between the predicted values and the mean of y.
2. Sum of Squared Error(SSE)
It tells how much the actual y values differ from the predicted values: the sum of squared differences between the two.
3. The total sum of squares (SSTO)
It explains how far the data points are from their mean: the sum of squared differences between the actual y values and the mean of y.
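Putting the three quantities together, a small numerical sketch on hypothetical data:

```python
import numpy as np

# Compute SSR, SSE, and SSTO for a least-squares fit on hypothetical
# data, and derive R-squared from them.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

B1, B0 = np.polyfit(x, y, deg=1)
y_pred = B0 + B1 * x

ssto = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_pred - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_pred) ** 2)         # sum of squared errors

print("R-squared:", ssr / ssto)         # identical to 1 - sse / ssto
```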
Conclusion
We covered the grounds of linear regression in this article. We learned about its model representation, about the various types of regression, and how we can use them in a machine learning solution to predict values. We went through how we can predict based on one or more independent variables. Finally, we saw how to check the model's performance to know how much the predictions vary from the actual values.
Frequently Asked Questions
Can we use categorical variables in linear regression?
No, we can only consider numerical values to find trends in the data. To consider categorical variables, we assign them numbers.
How many variables can we use?
We can use more than one independent variable to predict a single dependent variable, but we can't predict more than one dependent variable.
What kind of data does linear regression need?
The data required for the analysis should be linear. You can plot the data using a scatter plot to see if it can roughly fit around a line.
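A minimal sketch of such a check with matplotlib, on hypothetical data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; the scatter plot shows whether the points
# roughly fit around a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

plt.scatter(x, y)
plt.xlabel("input variable (x)")
plt.ylabel("output variable (y)")
plt.title("Checking for a roughly linear relationship")
plt.show()
```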
How can we obtain a better model?
- Remove the outliers from the data.
- Choose the features/independent variables wisely; avoid redundant features.
- Make sure that linear relationships exist between the dependent and independent variables.