# Machine Learning Notebook: Linear and Ridge Regression, Hyperparameter Tuning


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import make_regression, load_diabetes
from sklearn.model_selection import KFold

# Set plot style
sns.set(style='whitegrid')

# Introduction




In [None]:
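# Generate a synthetic high-dimensional dataset (100 samples, 400 features)
X, y = make_regression(n_samples=100, n_features=400, n_informative=1, noise=0.5, random_state=42)

# Load the diabetes dataset.
# Note: this reassigns X and y, so the synthetic dataset above is discarded.
# Comment out the two lines below to work with the synthetic data instead.
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target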

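# Add a column of ones to X for the intercept (needed for the NumPy implementations;
# it is redundant with scikit-learn's default fit_intercept=True)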
X = np.c_[np.ones((X.shape[0], 1)), X]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression

Linear regression is a method used to model the relationship between a dependent variable $y$ and one or more independent variables $X$. The model assumes a linear relationship between the inputs and the output:

$$
y = X\beta + \epsilon
$$

Where:

- $X$ is the matrix of input features (with each row representing a data point).
    
- $\beta$ are the coefficients (parameters) we want to estimate.
    
- $\epsilon$ is the error term (assumed to be normally distributed).

The goal of linear regression is to find the parameters $\beta$ that minimize the sum of squared errors (SSE):

$$
\text{SSE} = \sum_{i=1}^n (y_i - X_i \beta)^2
$$

When $X^T X$ is invertible, this minimization problem has a closed-form solution, the ordinary least squares (OLS) estimate:

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$

This equation gives the parameters $\hat{\beta}$ that minimize the SSE on the training data.
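
As an illustration of the closed-form estimate, here is a minimal NumPy sketch on hypothetical toy data (the names `X_toy`, `y_toy` and `beta_true` are assumptions made for this example, not part of the notebook's dataset). It solves the normal equations $(X^T X)\beta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is usually more stable numerically.

```python
import numpy as np

# Hypothetical toy data (assumption for illustration): 50 samples, 3 features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
X_toy = np.c_[np.ones(X_toy.shape[0]), X_toy]   # prepend the intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y_toy = X_toy @ beta_true + 0.1 * rng.normal(size=50)

# Solve the normal equations (X^T X) beta = X^T y without computing the inverse.
beta_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# np.linalg.lstsq minimizes the same SSE and also copes with a rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Sum of squared errors at the estimate.
sse = np.sum((y_toy - X_toy @ beta_hat) ** 2)
print("beta_hat:", beta_hat)
print("SSE:", sse)
```

Both routes should agree up to floating-point error when $X^T X$ is well conditioned; `lstsq` is the safer choice when features are nearly collinear.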

**Question 1** : Implement the linear regression using the class sklearn.linear_model.LinearRegression.

**Question 2** : Implement the linear regression by hand (using Numpy functions only).

# Ridge Regression

Ridge Regression is a regularized version of linear regression that addresses multicollinearity and prevents overfitting by adding a penalty term to the cost function. It modifies the linear regression objective by introducing an L2 regularization term to penalize large coefficients:

$$
\text{Ridge Cost Function} = \sum_{i=1}^n (y_i - X_i \beta)^2 + \alpha \sum_{j=1}^p \beta_j^2
$$

Where:

- $\alpha$ is the regularization parameter controlling the penalty strength.
    
- $\beta_j$ are the regression coefficients.

The second term $\alpha \sum_{j=1}^p \beta_j^2$ discourages large values of $\beta$, which helps prevent overfitting in high-dimensional or multicollinear datasets. The solution to the ridge regression is given by:

$$
\hat{\beta}_{\text{ridge}} = (X^T X + \alpha I)^{-1} X^T y
$$

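Where $I$ is the identity matrix. For any $\alpha > 0$, the matrix $X^T X + \alpha I$ is positive definite and therefore invertible, even when $X^T X$ is singular because of multicollinearity or because there are more features than samples.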
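
The closed-form ridge solution can be sketched in NumPy as follows. The helper `ridge_closed_form` and the toy data are hypothetical names introduced for this example. The data deliberately has more features than samples, so $X^T X$ is singular and plain OLS via the normal equations would fail, while the ridge system remains solvable. The comparison with `Ridge(fit_intercept=False)` assumes no intercept is fitted, so that scikit-learn minimizes the same objective as the formula above.

```python
import numpy as np
from sklearn.linear_model import Ridge


def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: solve (X^T X + alpha I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


# Toy data (assumption for illustration): 20 samples, 40 features.
rng = np.random.default_rng(0)
n, p = 20, 40
X_toy = rng.normal(size=(n, p))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=n)

beta_small = ridge_closed_form(X_toy, y_toy, alpha=1.0)
beta_large = ridge_closed_form(X_toy, y_toy, alpha=10.0)

# Reference fit with scikit-learn, without an intercept to match the formula.
sk_ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X_toy, y_toy)

print("max |difference| vs sklearn:", np.max(np.abs(beta_small - sk_ridge.coef_)))
print("coefficient norms:", np.linalg.norm(beta_small), np.linalg.norm(beta_large))
```

Increasing $\alpha$ shrinks the coefficient vector, which is visible in the two printed norms. Note that if $X$ contains a column of ones, this formula penalizes the intercept as well.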

**Question 1** : Implement the Ridge regression using the class sklearn.linear_model.Ridge.

**Question 2** : Implement the Ridge regression by hand (using Numpy functions only).

**Question 3** : Plot the train and test errors of the model as a function of $\alpha$.

# Hyperparameter Tuning 1 : Optimization over the validation set

Hold-out validation (a simple train/validation split) is a common approach for evaluating a machine learning model. The dataset is split into two sets:

- Training set: Used to train the model.

- Validation set: Used to evaluate the model's performance on unseen data.

Mathematically, this can be represented as:

$$
X_{\text{train}}, y_{\text{train}} \quad \text{and} \quad X_{\text{val}}, y_{\text{val}}
$$

The model is trained on $(X_{\text{train}}, y_{\text{train}})$ and evaluated on $(X_{\text{val}}, y_{\text{val}})$. This process helps detect overfitting because the model is tested on data that it hasn't seen during training.

The performance metric (e.g., mean squared error) is calculated on the validation set:

$$
\text{MSE}_{\text{val}} = \frac{1}{n_{\text{val}}} \sum_{i=1}^{n_{\text{val}}} (y_{\text{val}_i} - \hat{y}_{\text{val}_i})^2
$$

This score is then used in hyperparameter tuning and model selection.
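
One possible way to use the validation MSE for tuning is sketched below, under stated assumptions: the diabetes data from the setup cell, a 75/25 split of the training set, and a logarithmic grid of $\alpha$ values. The split ratio, the grid, and the names `X_sub`, `X_val`, `alphas` and `best_alpha` are choices made for this example only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the training set (assumed 75/25 split).
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Candidate regularization strengths (assumed grid).
alphas = np.logspace(-4, 2, 13)
val_mse = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_sub, y_sub)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

# Keep the alpha with the lowest validation MSE.
best_alpha = alphas[int(np.argmin(val_mse))]

# Refit on the full training set and report the test error once, at the end.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("best alpha:", best_alpha)
print("test MSE:", test_mse)
```

The test set is touched only after $\alpha$ has been chosen, so the reported test MSE is not biased by the tuning step.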

**Question** : Split the train set into a smaller train set and a validation set, then tune the Ridge parameter $\alpha$ on the validation set.

# Hyperparameter Tuning 2 : Cross-Validation

Cross-validation estimates how well a model generalizes to unseen data more reliably than a single train/validation split, at the cost of more computation. The most common form is $k$-fold cross-validation, where the dataset is split into $k$ subsets (folds). For each fold, the model is trained on the other $k-1$ folds and evaluated on the held-out fold, so each fold serves as the validation set exactly once.

Mathematically, for each fold $i$:

- Train the model on $k-1$ folds

- Validate the model on the $i$-th fold

The performance metric (e.g., MSE) is computed for each fold, and the scores are then averaged (other statistics, such as the median or percentiles, may be used instead):

$$
\text{MSE}_{\text{cv}} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_{\text{val}_i}
$$

Cross-validation helps in reducing the variability of the validation scores and ensures the model is tested on multiple subsets of data, leading to more robust model selection.
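
The $k$-fold procedure and the averaged score $\text{MSE}_{\text{cv}}$ can be sketched with `KFold` as follows. The choice of $k = 5$, the fixed `alpha=0.1`, and the use of the diabetes data are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

# Shuffle before splitting so that folds do not depend on the row order.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
val_counts = np.zeros(len(y), dtype=int)   # how often each sample is validated on
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=0.1).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    val_counts[val_idx] += 1

# Average of the per-fold validation errors.
mse_cv = np.mean(fold_mse)
print("per-fold MSE:", np.round(fold_mse, 1))
print("MSE_cv:", mse_cv)
```

Repeating this loop for each candidate hyperparameter value and keeping the one with the lowest `mse_cv` gives a cross-validated alternative to the single validation split.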


**Question :** Implement cross-validation using KFold from sklearn.model_selection and use it to determine a good projection dimension with random projections.
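
As a starting point, here is one possible sketch of this procedure, not a reference solution. It assumes the synthetic dataset from the setup cell, `GaussianRandomProjection` for the projection, an ordinary `LinearRegression` fitted on the projected features, and a small candidate list `dims` chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.random_projection import GaussianRandomProjection

X, y = make_regression(n_samples=100, n_features=400, n_informative=1,
                       noise=0.5, random_state=42)

dims = [5, 10, 20, 40]   # candidate projection dimensions (assumed grid)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for d in dims:
    scores = []
    for train_idx, val_idx in kf.split(X):
        # Fit the projection on the training fold only, then reuse it on the
        # validation fold so both are mapped with the same random matrix.
        proj = GaussianRandomProjection(n_components=d, random_state=0)
        Z_train = proj.fit_transform(X[train_idx])
        Z_val = proj.transform(X[val_idx])
        model = LinearRegression().fit(Z_train, y[train_idx])
        scores.append(mean_squared_error(y[val_idx], model.predict(Z_val)))
    cv_mse[d] = float(np.mean(scores))

# Dimension with the lowest cross-validated MSE.
best_dim = min(cv_mse, key=cv_mse.get)
print("CV MSE per dimension:", cv_mse)
print("selected dimension:", best_dim)
```

Because the projection is random, the selected dimension can change with `random_state`; averaging over several seeds is one way to make the choice less noisy.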