# Introduction to Machine Learning by WWCode Data Science, Part 4 Recap

Women Who Code is a fantastic non-profit organization whose goal is to expose more women to tech-related careers. Their Women Who Code Data Science chapter is putting on a six-week introduction to machine learning course on Saturdays from 5/16/20 – 6/20/20. Each week, on the following Monday, I will be posting a recap of the course on my blog. Here’s the recap of Part 4; the full video is available on YouTube.

## Part 4 Focus: Linear Regression

1. Linear Regression

• Linear regression is a form of supervised learning that makes predictions by establishing a linear relationship between a target response (y) and continuous features (xi).
• Since the relationships are linear, the basic equation used for predictions is:

#### y = mx + b

• Simple linear regression is a form of linear regression with only one feature (x variable), and can be visualized as a single straight line through the data. This equation lives up to its name, as it is simply:

#### y = b1x1 + b0

• Multiple linear regression is a more versatile form of regression, as it allows two or more features to help make the prediction for the target variable. The equation for multiple linear regression is only slightly different from the simpler version:

#### y = b1x1 + b2x2 + … + bnxn + b0

• In both formulas, the x term(s) will be input feature(s) given to the model.
• The y term will be the prediction made by the model.
• The b0 term and the b1 through bn terms (I’ll call them bi) are numerical coefficients that the model learns from the training data; we’ll see how they are calculated when we get to gradient descent below.
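To make this concrete, here’s a minimal sketch (my own illustration, not code from the course) that fits a simple linear regression by hand using the closed-form least-squares formulas for the slope and intercept:

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b1, b0) for the line y = b1*x + b0 that best fits the data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares estimates: slope = covariance(x, y) / variance(x)
    b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
         sum((x - x_mean) ** 2 for x in xs)
    b0 = y_mean - b1 * x_mean
    return b1, b0

# Toy data lying close to y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 11.1]
b1, b0 = fit_simple_linear_regression(xs, ys)
print(f"y = {b1:.2f}x + {b0:.2f}")  # prints y = 2.01x + 1.03
```

The same idea generalizes to multiple linear regression, where each feature gets its own coefficient.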

2. Residuals

• The residual of a prediction is the difference between the actual outcome and the expected outcome predicted by the linear regression model.
• Smaller residuals mean that the model you have developed makes predictions that more closely fit the dataset you are training it with.
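In code, the residuals are just the element-wise differences between what actually happened and what the model predicted (a quick illustrative sketch):

```python
def residuals(actual, predicted):
    # residual = actual outcome - predicted outcome, one per observation
    return [a - p for a, p in zip(actual, predicted)]

actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.5, 6.0]
print(residuals(actual, predicted))  # [0.5, -0.5, 1.0]
```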

3. Loss Functions

• A loss function is a function that measures how well a model performs by assigning a “cost” to its prediction errors.
• When evaluating a regression model, we can consider the model performance as a loss function, and attempt to optimize the model by minimizing that loss.
• There are three possible loss functions that can be used to evaluate the performance of a linear regression model: mean absolute error, mean squared error, and root mean squared error.
• Mean absolute error (MAE) measures the average of the absolute value of the residuals.
• Mean squared error (MSE) measures the average of the squared value of the residuals.
• Root mean squared error (RMSE) measures the square root of the average of the squared value of the residuals.
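All three metrics are straightforward to compute from the residuals; here’s one way to write them in plain Python:

```python
import math

def mae(actual, predicted):
    # Mean absolute error: average of |residual|
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    # Mean squared error: average of residual^2
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error: square root of the MSE
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0]
predicted = [2.0, 7.0]
print(mae(actual, predicted))  # 1.5
print(mse(actual, predicted))  # 2.5
```

Note that MSE penalizes large residuals more heavily than MAE, while RMSE brings the error back into the same units as the target variable.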

4. Gradient Descent

Before we get into gradient descent, let’s touch on a couple of key terms you’ll need:

• Global maximum: the highest point on a complete graph of data
• Global minimum: the lowest point on a complete graph of data
• Local maximum: the highest point on a subsection of a graph of data
• Local minimum: the lowest point on a subsection of a graph of data
• Gradient descent is an optimization algorithm that iteratively steps in the direction of steepest descent to find a minimum of a function.
• Basically, the goal of gradient descent is to get to the bottom of the deepest “valley” in the graph (the global minimum), though in practice it can settle into a local minimum along the way.
• There are three types of gradient descent: batch, stochastic, and mini-batch.
• Batch gradient descent is used to iteratively optimize a given function by using the whole training dataset each iteration for its calculations.
• Stochastic gradient descent is used to approximate the optimization by randomly selecting a point from the training dataset each iteration for its calculations.
• Mini-batch gradient descent is used to approximate the optimization by randomly selecting a subset of data from the training dataset each iteration for its calculations (think of it as a balance between batch and stochastic).
• You can determine which algorithm to use based on the type of dataset you have, as well as the amount of computational power you have.
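As a rough sketch (again my own illustration, not code from the course), here is batch gradient descent minimizing the MSE for a simple linear regression, using the whole dataset on every iteration:

```python
def batch_gradient_descent(xs, ys, lr=0.01, epochs=5000):
    """Learn b1 and b0 for y = b1*x + b0 by minimizing MSE."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Errors for the full batch (every training point)
        errors = [(b1 * x + b0) - y for x, y in zip(xs, ys)]
        # Gradients of the MSE with respect to each coefficient
        grad_b0 = 2 * sum(errors) / n
        grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill, scaled by the learning rate
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b1, b0

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly y = 2x + 1
b1, b0 = batch_gradient_descent(xs, ys)
print(round(b1, 2), round(b0, 2))  # approaches 2.0 and 1.0
```

Swapping the full-batch error computation for a single randomly chosen point each iteration would turn this into stochastic gradient descent; using a random subset gives mini-batch.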

5. Sparse Learning

• Given the sheer amount of data available in this day and age, it is common for datasets to have hundreds or even thousands of features – which can lead to sparse data (data with many gaps).
• To deal with sparse data, there are sparse learning algorithms such as feature subsetting, regularization, and dimension reduction.
• Subsetting is determining the optimal way to model a dataset using only a subset of all of the features available in the dataset.
• Regularization assigns weights to each feature and shrinks those weights toward zero, which prevents overfitting to insignificant, noisy features.
• Dimension reduction is literally the process of reducing the dimension of your dataset using methodologies such as linear transformations.
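To illustrate regularization concretely, here is a hypothetical sketch of gradient descent with an L2 (ridge-style) penalty added to the loss; the penalty pulls the slope coefficient toward zero:

```python
def ridge_gradient_descent(xs, ys, alpha=1.0, lr=0.01, epochs=5000):
    """Minimize MSE + alpha * b1**2 (the intercept is left unpenalized)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(b1 * x + b0) - y for x, y in zip(xs, ys)]
        grad_b0 = 2 * sum(errors) / n
        # The extra 2*alpha*b1 term shrinks the slope toward zero
        grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * alpha * b1
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b1, b0

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly y = 2x + 1
slope_plain, _ = ridge_gradient_descent(xs, ys, alpha=0.0)
slope_ridge, _ = ridge_gradient_descent(xs, ys, alpha=1.0)
print(slope_ridge < slope_plain)  # True: the penalty shrinks the slope
```

With many noisy features, this shrinkage keeps the model from leaning too hard on any single one of them; an L1 penalty (lasso) goes further and can drive weights exactly to zero.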

Wow, that was a lot of information! Thanks for sticking with me, and I really hope you learned something new about linear regression. If you enjoyed this week’s blog, consider signing up for the last two weeks of the course, or, you can check out last week’s blog on classification. Now, go out there and regress some lines!