Bias vs. Variance

Bias and variance are concepts deeply rooted in statistics and essential for data scientists. A
significant reason to understand these terms is that the right balance between the two is essential to
building machine-learning models that produce accurate results. Every learning algorithm exhibits
some degree of both bias and variance.

We define bias as the difference between the average prediction from our model and the actual
value that we are trying to predict. A model with high bias pays very little attention to the training
data and oversimplifies the problem. It typically leads to a high error on both training and test data.
Variance is the variability of the model's prediction for a given data point, that is, how much the
prediction changes when the model is trained on a different training set. A model with high variance
pays a lot of attention to the training data and does not generalize to data which it hasn’t seen
before. As a result, such models perform well on training data but have high error rates on test data.
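
The two definitions above can be checked numerically. The following is a minimal sketch, not a definitive procedure: it assumes a made-up ground-truth function (true_fn), a made-up noise level, and two illustrative models that do not come from the text (a predictor that always returns the mean label, and a 1-nearest-neighbor predictor). It retrains each model on many freshly sampled training sets and measures the bias and variance of the prediction at a single point x0.

```python
import math
import random
import statistics

def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise=0.3):
    # Draw n noisy observations of true_fn on [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [true_fn(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    # Very simple model: ignores x and always predicts the mean label.
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    # Very flexible model: copies the label of the nearest training point.
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def bias_and_variance(fit, x0, trials=500, seed=0):
    # Retrain on many independent training sets and collect predictions at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        preds.append(fit(xs, ys)(x0))
    bias = statistics.fmean(preds) - true_fn(x0)  # average prediction minus truth
    variance = statistics.pvariance(preds)        # spread of predictions across training sets
    return bias, variance

x0 = 0.25  # a point where true_fn is far from its overall average
bias_mean, var_mean = bias_and_variance(fit_mean, x0)
bias_1nn, var_1nn = bias_and_variance(fit_1nn, x0)
print(f"mean predictor: bias={bias_mean:+.3f}, variance={var_mean:.3f}")
print(f"1-NN predictor: bias={bias_1nn:+.3f}, variance={var_1nn:.3f}")
```

Under these assumptions the mean predictor should show the larger bias and the 1-nearest-neighbor predictor the larger variance, which is the trade-off described above.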

In supervised learning, a model is said to be underfitting when it doesn’t capture the underlying
pattern of the data. These models usually have high bias and low variance. We may encounter
underfitting when there is not enough data to build an accurate model or when we try to fit a linear
model to nonlinear data. Models of this kind, such as linear and logistic regression, are too simple
to capture complex patterns in the data.
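
As an illustration of fitting a linear model to nonlinear data, here is a minimal sketch. The quadratic target, the noise level, and the helper names (fit_line, mse, make_data) are assumptions made for this example only. It fits a straight line by ordinary least squares and compares the resulting errors with the noise floor, which is the best error any model could reach on this data.

```python
import random

def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = a*x + b.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mse(model, xs, ys):
    # Mean squared error of the model on the given points.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n, noise=0.1):
    # Hypothetical nonlinear target: y = x^2 plus a little Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 200)

line = fit_line(train_x, train_y)
train_mse = mse(line, train_x, train_y)
test_mse = mse(line, test_x, test_y)
noise_floor = 0.1 ** 2  # variance of the added noise

print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}, noise floor={noise_floor:.3f}")
```

Both errors should come out similar to each other and well above the noise floor: the line cannot represent the curve no matter how it is fitted, which is the signature of high bias.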

In supervised learning, overfitting happens when our model captures the noise along with the
underlying data pattern. It tends to happen when we train a flexible model extensively on a noisy
dataset. These models have low bias and high variance. They are very complex, like decision trees,
which are prone to overfitting.
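
The effect of fitting noise can be seen in a small experiment. This sketch uses k-nearest-neighbor regression rather than a decision tree, purely because it is short to write in plain Python; the dataset, the noise level, and the two values of k are assumptions chosen for illustration. With k=1 the model memorizes every noisy training label, while a larger k averages the noise out.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    # Average the labels of the k training points closest to x.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def knn_mse(train_x, train_y, xs, ys, k):
    # Mean squared error of the k-NN predictor on the points (xs, ys).
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

def make_data(rng, n, noise=0.5):
    # Hypothetical noisy dataset: a sine wave plus strong Gaussian noise.
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 500)

results = {}
for k in (1, 15):
    train_mse = knn_mse(train_x, train_y, train_x, train_y, k)
    test_mse = knn_mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_mse, test_mse)
    print(f"k={k:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

For k=1 the training error is zero by construction, because each training point is its own nearest neighbor, yet the test error should be noticeably worse than for k=15. A large gap between training and test error is the pattern to look for.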

High Variance
Symptoms are as follows, where ϵ denotes the acceptable error level:
Training error is much lower than test error
Training error is lower than ϵ
Test error is above ϵ
Remedies:
Add more training data
Reduce model complexity, since complex models are prone to high variance
Bagging (will be covered later in the course)

High Bias
Unlike the first regime, the second regime indicates high bias: the model being used is not
expressive enough to produce an accurate prediction.
Symptoms are as follows:
Training error is higher than ϵ
Remedies:
Use a more complex model (e.g. kernelize, use nonlinear models)
Add features
Boosting
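
The symptoms of both regimes can be turned into a small check. This is only a sketch: the function name diagnose, the returned labels, and the REMEDIES table are hypothetical, and the check treats ϵ as a hard threshold without measuring how large the gap between training and test error is.

```python
# Remedies as listed in the text, keyed by the (hypothetical) regime labels.
REMEDIES = {
    "high variance": ["Add more training data", "Reduce model complexity", "Bagging"],
    "high bias": ["Use a more complex model", "Add features", "Boosting"],
    "ok": [],
}

def diagnose(train_error, test_error, epsilon):
    """Map measured errors to a regime, given the acceptable error epsilon."""
    if train_error > epsilon:
        # The model cannot even fit the training data well enough.
        return "high bias"
    if test_error > epsilon:
        # Training error is acceptable but the model does not generalize.
        return "high variance"
    return "ok"

# Example error values are made up for illustration.
for train_error, test_error in [(0.02, 0.30), (0.25, 0.27), (0.05, 0.08)]:
    regime = diagnose(train_error, test_error, epsilon=0.10)
    print(train_error, test_error, regime, REMEDIES[regime])
```

Checking the training error first mirrors the lists above: a training error above ϵ points to high bias regardless of the test error, and only a model that fits the training data acceptably is then examined for high variance.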