05 - supervised learning II – prof. Helon Hultmann Ayala

Author

Rodrigo Hermont Ozon

Published

August 1, 2024

Exercise Codes Quarto Document

This is a small Quarto document that follows the script provided:

Take-home exercise

Apply the same procedure with randomized search and cross validation to obtain decision trees and random forest models.

To that end, add to your previous code the new models on sklearn and the respective hyperparameters

Instructions

Send me a link to your GitHub repository (free to register) with a Jupyter notebook that I can access
- Something like This example notebook
Delivery: Before the next meeting, by email with the subject [HIML]
Instructions:
- Send a PDF file with the code when applicable
- If you need feedback, ask
- If you are late, try to submit as soon as possible

Introduction

In this document, we will use tree-based models, including Decision Trees and Random Forests, with hyperparameter tuning using RandomizedSearchCV. We will also apply repeated cross-validation to evaluate the model performance and compare it with the default settings.

Key steps:

Model training with Decision Trees and Random Forests.
Hyperparameter tuning using RandomizedSearchCV.
Comparison of tuned models with default configurations.

Solution

Code

# Import necessary libraries
import os
import warnings
from datetime import datetime
from os import getcwd
from os.path import join

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import requests
import scipy.io as sio
from scipy.stats import randint, uniform
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     RepeatedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from statsmodels.tsa.ar_model import AutoReg

# Record the start time so the total compile time can be reported at the end
start_time = datetime.now()

warnings.filterwarnings('ignore', message='DataFrame is highly fragmented.')

Code

# Download the data file
url = 'http://helon.usuarios.rdc.puc-rio.br/data/data3SS2009.mat'
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed
with open('data3SS2009.mat', 'wb') as f:
    f.write(response.content)

# Load the data
fname = join(getcwd(), 'data3SS2009.mat')
mat_contents = sio.loadmat(fname)
dataset = mat_contents['dataset']

# Display the shape of the dataset
N, Chno, Nc = dataset.shape
print(f"Dataset shape: {dataset.shape}")

# Reshape labels
labels = mat_contents['labels'].reshape(Nc)
#print(f"Labels shape: {labels.shape}")

# Separate the data by channel
Ch1 = dataset[:, 0, :] # load cell: shaker force
Ch2 = dataset[:, 1, :] # accelerometer: base
Ch3 = dataset[:, 2, :] # accelerometer: 1st floor
Ch4 = dataset[:, 3, :] # accelerometer: 2nd floor
Ch5 = dataset[:, 4, :] # accelerometer: 3rd floor

# Display the shapes of each channel
#print(f"Ch1 shape: {Ch1.shape}")
#print(f"Ch2 shape: {Ch2.shape}")
#print(f"Ch3 shape: {Ch3.shape}")
#print(f"Ch4 shape: {Ch4.shape}")
#print(f"Ch5 shape: {Ch5.shape}")

# Create a DataFrame for a better overview
data = {
    'Ch1': [Ch1[:, i] for i in range(Nc)],
    'Ch2': [Ch2[:, i] for i in range(Nc)],
    'Ch3': [Ch3[:, i] for i in range(Nc)],
    'Ch4': [Ch4[:, i] for i in range(Nc)],
    'Ch5': [Ch5[:, i] for i in range(Nc)],
    'Label': labels
}
df = pd.DataFrame(data)

# Use pandas to get a glimpse of the dataset
print(df.info())
#print(df.head())

Dataset shape: (8192, 5, 850)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Ch1     850 non-null    object
 1   Ch2     850 non-null    object
 2   Ch3     850 non-null    object
 3   Ch4     850 non-null    object
 4   Ch5     850 non-null    object
 5   Label   850 non-null    uint8 
dtypes: object(5), uint8(1)
memory usage: 34.2+ KB
None

Explanation of Dataset Contents

Dataset Shape:
- dataset.shape returns (8192, 5, 850), indicating the dataset has 8192 samples, 5 channels, and 850 cases.
Labels Shape:
- labels.shape returns (850,), indicating there are 850 labels corresponding to the 850 cases.
Channels:
- Ch1 (Shape: (8192, 850)): Represents the force measured by the load cell (shaker force).
- Ch2 (Shape: (8192, 850)): Represents the acceleration measured at the base of the structure.
- Ch3 (Shape: (8192, 850)): Represents the acceleration measured at the 1st floor of the structure.
- Ch4 (Shape: (8192, 850)): Represents the acceleration measured at the 2nd floor of the structure.
- Ch5 (Shape: (8192, 850)): Represents the acceleration measured at the 3rd floor of the structure.
DataFrame Overview:
- A pandas DataFrame is created where each column represents one of the channels (Ch1 to Ch5) and the labels.
- The df.info() function provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
- The df.head() call (commented out in the code above) would display the first few rows of the DataFrame to give a preview of the data.
Data Visualization:
- For a visual preview, a time vector time can be built from the number of samples (N) and the sampling time (Ts).
- For the first two cases, the force data (Ch1) and acceleration data (Ch2 to Ch5) can then be plotted against time. The plotting code is not included in this document.
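The channel slicing and time vector described above can be sketched as follows. This is a minimal illustration on a small synthetic array with the same axis order as the real data; the sampling time Ts is a hypothetical value, since it is not stated in this document.

```python
import numpy as np

# Small synthetic stand-in with the same axis order as the real array:
# (samples, channels, cases). The real dataset is (8192, 5, 850).
N, Chno, Nc = 16, 5, 3
rng = np.random.default_rng(0)
dataset = rng.normal(size=(N, Chno, Nc))

Ts = 0.01  # hypothetical sampling time in seconds (assumption, not from the source)
time = np.arange(N) * Ts  # one time stamp per sample

# Slicing the channel axis yields one (samples, cases) matrix per channel
Ch1 = dataset[:, 0, :]
force_case0 = Ch1[:, 0]  # force signal of the first case

print(time.shape, Ch1.shape, force_case0.shape)
```

Each column of a channel matrix is one case, so plotting force_case0 against time would give the preview of the first case.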

Detailed Description

Channels:
- Ch1 (Load Cell - Shaker Force): This channel captures the force applied by the shaker to the structure. It is essential for understanding the input excitation.
- Ch2 (Accelerometer - Base): This channel measures the acceleration at the base of the structure. It helps in understanding the base motion response.
- Ch3 (Accelerometer - 1st Floor): This channel measures the acceleration at the 1st floor, providing insights into the structural response at this level.
- Ch4 (Accelerometer - 2nd Floor): This channel measures the acceleration at the 2nd floor, which is useful for analyzing the dynamic behavior at this level.
- Ch5 (Accelerometer - 3rd Floor): This channel measures the acceleration at the 3rd floor, giving information about the response at the top of the structure.
Labels:
- The labels array contains the labels for each case, which might represent different conditions or states of the structure during the experiments.

Code

# Feature extraction: compute the mean of each channel sequence for each sample
df_agg = pd.DataFrame({
    'Ch1_mean': df['Ch1'].apply(np.mean),
    'Ch2_mean': df['Ch2'].apply(np.mean),
    'Ch3_mean': df['Ch3'].apply(np.mean),
    'Ch4_mean': df['Ch4'].apply(np.mean),
    'Ch5_mean': df['Ch5'].apply(np.mean),
    'Label': df['Label']
})

# Split the dataset into features and labels
X = df_agg.drop('Label', axis=1)  # Features
y = df_agg['Label']  # Labels

# Split the dataset into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create a RepeatedKFold cross-validator
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# Define hyperparameter distributions for Decision Tree Regressor (DTR) and Random Forest Regressor (RFR)
param_dist_dtr = {
    "criterion": ["squared_error", "friedman_mse"],  
    "splitter": ["best", "random"],
    "max_depth": randint(2, 20),
    "max_features": uniform(0.1, 0.9)
}

param_dist_rfr = {
    "n_estimators": randint(20, 100),
    "criterion": ["squared_error", "absolute_error", "friedman_mse"],  
    "max_depth": randint(2, 20),
    "max_features": uniform(0.1, 0.9)
}

# Define the models, each paired with its hyperparameter distributions
models = [
    ("DTR", DecisionTreeRegressor(), param_dist_dtr),
    ("RFR", RandomForestRegressor(random_state=42), param_dist_rfr),
]

# Run RandomizedSearchCV for each model
for name, model, param_dist in models:

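    # Perform randomized search; with no `scoring` argument, candidates are
    # ranked by the estimator's default score (R² for regressors)
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_dist,
        n_iter=50,  # Number of sampled configurations; adjust if necessary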
        cv=cv,
        random_state=42,
        n_jobs=-1
    )
    
    # Fit the model
    random_search.fit(X_train, y_train)
    
    # Get the best model
    best_model = random_search.best_estimator_
    
    # Make predictions on the test set
    y_pred = best_model.predict(X_test)
    
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} Best Hyperparameters: {random_search.best_params_}")
    print(f"{name} Mean Squared Error: {mse}")

DTR Best Hyperparameters: {'criterion': 'friedman_mse', 'max_depth': 3, 'max_features': 0.7158097238609412, 'splitter': 'best'}
DTR Mean Squared Error: 19.717782164120994

RFR Best Hyperparameters: {'criterion': 'squared_error', 'max_depth': 7, 'max_features': 0.5108811134346193, 'n_estimators': 63}
RFR Mean Squared Error: 16.52601075476645
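The Label column is stored as uint8 and, as noted above, may encode discrete structural conditions. If that is the case, the same procedure can be run with classifiers instead of regressors. The sketch below is an assumption-laden illustration, not the code used for the results in this document: it uses synthetic data from make_classification as a stand-in for the five mean features, stratified repeated cross-validation, and a small search budget to keep it fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (RandomizedSearchCV,
                                     RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five mean features and the class labels (assumption)
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# Each classifier is paired with its own hyperparameter distributions
searches = {
    "DTC": (DecisionTreeClassifier(random_state=42), {
        "criterion": ["gini", "entropy"],
        "max_depth": randint(2, 20),
        "max_features": uniform(0.1, 0.9),
    }),
    "RFC": (RandomForestClassifier(random_state=42), {
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 20),
        "max_features": uniform(0.1, 0.9),
    }),
}

results = {}
for name, (model, param_dist) in searches.items():
    search = RandomizedSearchCV(model, param_dist, n_iter=5, cv=cv,
                                scoring="accuracy", random_state=42)
    search.fit(X_train, y_train)
    # The refit best estimator is used for prediction on the held-out test set
    results[name] = accuracy_score(y_test, search.predict(X_test))
    print(name, search.best_params_, round(results[name], 3))
```

With classifiers, accuracy and a confusion matrix replace MSE as the natural evaluation metrics.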

Results Interpretation:

In the machine learning exercise, we performed hyperparameter tuning using RandomizedSearchCV on both Decision Tree Regressor (DTR) and Random Forest Regressor (RFR). The goal was to optimize both models by searching over a range of hyperparameters and evaluating their performance using the Mean Squared Error (MSE) on a test set.
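To make the searched ranges concrete, the short sketch below samples from the same scipy distributions used in the search. Note that uniform(0.1, 0.9) takes loc and scale arguments, so it covers the interval [0.1, 1.0], and randint(2, 20) excludes its upper bound.

```python
from scipy.stats import randint, uniform

depth_dist = randint(2, 20)    # integers 2, 3, ..., 19 (upper bound excluded)
feat_dist = uniform(0.1, 0.9)  # loc=0.1, scale=0.9 -> floats in [0.1, 1.0]

depths = depth_dist.rvs(size=1000, random_state=0)
fracs = feat_dist.rvs(size=1000, random_state=0)

print(depths.min(), depths.max())
print(round(fracs.min(), 3), round(fracs.max(), 3))
```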

Here’s a detailed interpretation of the results:

Decision Tree Regressor (DTR)

Best Hyperparameters:
- criterion: 'friedman_mse'
- max_depth: 3
- max_features: 0.7158
- splitter: 'best'

The search selected the 'friedman_mse' criterion, which measures split quality using mean squared error combined with Friedman's improvement score. The selected depth (max_depth = 3) is relatively shallow, which limits model complexity and helps avoid overfitting. With max_features = 0.7158 and five input features, scikit-learn considers max(1, int(0.7158 × 5)) = 3 features at each split. The 'best' splitter chooses the best split among the candidate features according to the criterion, rather than the best of a set of randomly drawn splits.
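A fractional max_features is converted to a feature count per split. The helper below is a hypothetical function written for illustration; it mirrors the rule stated in the scikit-learn documentation, max(1, int(max_features * n_features_in_)), applied to the five mean features used here.

```python
def features_per_split(max_features: float, n_features: int) -> int:
    """Mirror scikit-learn's documented rule for a fractional max_features."""
    return max(1, int(max_features * n_features))

n_features = 5  # Ch1_mean ... Ch5_mean

print(features_per_split(0.7158097238609412, n_features))  # 3 (tuned DTR)
print(features_per_split(0.5108811134346193, n_features))  # 2 (tuned RFR)
```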

Performance:
- Mean Squared Error: 19.72

The MSE of 19.72 is the average squared difference between the predicted and actual label values on the test set, which corresponds to a root mean squared error of about 4.44 label units. Whether this is acceptable depends on the range of the labels; either way, there is room for improvement, particularly by leveraging ensemble methods that can reduce model variance.

Random Forest Regressor (RFR)

Best Hyperparameters:
- criterion: 'squared_error'
- max_depth: 7
- max_features: 0.5109
- n_estimators: 63

The Random Forest Regressor outperformed the Decision Tree Regressor on the test set. The search selected the 'squared_error' criterion, the default, which minimizes the squared differences between the actual and predicted values. The selected depth of the trees was 7 (max_depth = 7), deeper than that of the single decision tree, allowing the model to capture more complex interactions between features. With max_features = 0.5109 and five input features, each split considers max(1, int(0.5109 × 5)) = 2 features, and the forest consists of 63 trees (n_estimators = 63), a moderate size that balances computational cost and predictive accuracy.

Performance:
- Mean Squared Error: 16.53

The Random Forest model achieved a lower MSE of 16.53 compared to the Decision Tree Regressor. This shows that the Random Forest model, being an ensemble method, is better at generalizing to the test data due to its ability to reduce overfitting by averaging predictions from multiple trees, thus reducing the overall model variance.
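To put the two test-set errors on the scale of the labels, the sketch below converts the printed MSE values to root mean squared error and computes the relative reduction. The inputs are copied from the output above; nothing else is assumed.

```python
import math

# MSE values copied from the printed results above
mse_dtr = 19.717782164120994
mse_rfr = 16.52601075476645

rmse_dtr = math.sqrt(mse_dtr)
rmse_rfr = math.sqrt(mse_rfr)
relative_reduction = (mse_dtr - mse_rfr) / mse_dtr

print(f"RMSE DTR: {rmse_dtr:.2f}")                      # 4.44
print(f"RMSE RFR: {rmse_rfr:.2f}")                      # 4.07
print(f"MSE reduction: {100 * relative_reduction:.1f}%")  # 16.2%
```

An RMSE in label units is easier to judge against the spread of the labels than the raw MSE.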

Key Takeaways

Model Performance: The Random Forest Regressor outperformed the Decision Tree Regressor on the test set, reducing the MSE from 19.72 (DTR) to 16.53 (RFR), a reduction of about 16%. This is expected, as Random Forests typically offer better performance by aggregating predictions from multiple decision trees, reducing overfitting and improving generalization to unseen data.
Hyperparameter Selection:
- The Decision Tree Regressor performed best with a shallow tree (max_depth = 3) and the 'best' splitter, which limits model complexity and helps prevent overfitting.
- The Random Forest Regressor performed best with moderately deeper trees (max_depth = 7), averaging across 63 trees, which is consistent with the strength of ensemble methods in improving predictive accuracy.
Error Reduction: The MSE difference between the two models (19.72 for DTR vs. 16.53 for RFR) highlights the advantage of using ensemble methods like Random Forests in regression tasks, especially when the data may have complex interactions or noise.

The experiment indicates that the Random Forest Regressor is more accurate than the Decision Tree Regressor on this dataset. Its lower test-set Mean Squared Error suggests that it generalizes better to unseen data, and supports preferring ensemble methods such as Random Forests when predictive accuracy is the priority in regression tasks.

Further improvements might be achieved by experimenting with additional feature engineering techniques, increasing the number of estimators in the Random Forest, or trying other advanced models such as Gradient Boosting Machines or XGBoost.

Conclusion

In this exercise, we applied RandomizedSearchCV with repeated 5-fold cross-validation to tune the hyperparameters of Decision Tree Regressor (DTR) and Random Forest Regressor (RFR) models, and evaluated the tuned models on a held-out test set. The default configurations were not evaluated in the code above, so the gain from tuning relative to the defaults remains to be quantified.

The Decision Tree Regressor was tuned to a relatively shallow tree (max_depth = 3), reducing its complexity while still providing meaningful predictions. However, the performance of this single-tree model was surpassed by the Random Forest Regressor, which used deeper trees (max_depth = 7) and an ensemble of 63 trees (n_estimators = 63). The Random Forest achieved a lower Mean Squared Error (MSE) of 16.53, compared to 19.72 for the Decision Tree, indicating that the ensemble method was better able to generalize to the test data.

The Random Forest’s superior performance is expected, as ensemble models generally reduce overfitting by averaging the predictions of multiple trees, leading to a more robust and stable prediction. Additionally, RandomizedSearchCV provided a practical way to explore the hyperparameter space of both models within a fixed budget of 50 sampled configurations each.
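The cost of the search follows directly from the cross-validation setup. The sketch below counts the fits implied by RepeatedKFold(n_splits=5, n_repeats=5) and n_iter=50, using a dummy array with the 510 training rows that a 60% split of 850 cases produces.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

n_train = 510  # 60% of the 850 cases
X_dummy = np.zeros((n_train, 5))  # placeholder with the training-set shape

splits_per_candidate = cv.get_n_splits(X_dummy)  # 5 folds x 5 repeats
n_iter = 50                                      # sampled configurations per model
total_fits = n_iter * splits_per_candidate       # excludes the final refit

train_idx, val_idx = next(cv.split(X_dummy))
print(splits_per_candidate, total_fits, len(train_idx), len(val_idx))
# → 25 1250 408 102
```

Each model therefore requires 1250 cross-validation fits plus one final refit on the full training set, which is why n_jobs=-1 is useful here.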

Key Insights:

RandomizedSearchCV Effectiveness: The search explored 50 sampled configurations per model under repeated cross-validation. A direct comparison against the default configurations would be needed to quantify how much tuning helped, but the exercise shows how model parameters can be selected systematically.
Random Forest Superiority: The Random Forest Regressor outperformed the Decision Tree Regressor on the test set. Its ensemble nature allowed it to reduce variance, which contributed to better generalization and lower MSE.
Decision Tree as a Simple Baseline: While the Decision Tree Regressor provided reasonable predictions with fewer resources, a more complex model like Random Forest appears preferable for this dataset, which likely contains interactions and complexity that a single tree cannot capture effectively.
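A comparison between a default and a tuned model could be set up along the following lines. This is a minimal sketch on synthetic regression data from make_regression (an assumption, standing in for the real features), with a reduced search budget; it makes no claim about which configuration wins on the actual dataset.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): 5 features, noisy linear target
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Baseline: default hyperparameters
default_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
mse_default = mean_squared_error(y_test, default_model.predict(X_test))

# Tuned: randomized search ranked by (negative) MSE
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    {
        "n_estimators": randint(20, 100),
        "max_depth": randint(2, 20),
        "max_features": uniform(0.1, 0.9),
    },
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
mse_tuned = mean_squared_error(y_test, search.predict(X_test))

print(f"Default MSE: {mse_default:.2f}")
print(f"Tuned MSE:   {mse_tuned:.2f}")
```

Reporting both numbers side by side on the same test split is what allows a statement about the benefit of tuning.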

Future Directions

Further Feature Engineering: One way to potentially improve the models is to explore additional feature extraction techniques, such as generating statistical features (variance, standard deviation) or applying transformations (PCA or wavelet transforms) to better capture the underlying patterns in the data.
Advanced Models: Exploring more advanced ensemble methods, such as Gradient Boosting Machines (GBM) or XGBoost, could provide even better performance by combining the strengths of boosting and tree-based models.
Model Complexity vs. Interpretability: While Random Forest performed better, it is also more complex. In scenarios where interpretability is key, simpler models like Decision Trees may still be valuable despite their lower accuracy.
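The statistical features mentioned under Further Feature Engineering could be computed along these lines. The function name summary_features and the chosen statistics are illustrative assumptions; the sketch uses a short synthetic signal rather than the real channel data.

```python
import numpy as np

def summary_features(signal):
    """Return simple per-signal statistics as a dict (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    return {
        "mean": signal.mean(),
        "std": signal.std(),
        "rms": np.sqrt(np.mean(signal ** 2)),
        "peak_to_peak": signal.max() - signal.min(),
    }

# Synthetic example: an alternating signal with zero mean and unit amplitude
example = summary_features([1.0, -1.0, 1.0, -1.0])
print(example)
```

Applied to each channel of each case, such a function would yield several features per channel instead of the single mean used above.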

In summary, the Random Forest Regressor was the better model for this task. These results illustrate the strength of tree-based ensembles and the value of careful hyperparameter selection through RandomizedSearchCV in regression tasks.

References

Ayala, H. V. H. 05 supervised learning II, Lecture Notes, Machine Learning Class, Industrial and Systems Engineering Graduate Program (PPGEPS), Pontifical Catholic University of Paraná (PPGEPS/PUCPR), 2024.

Code

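# Total time to compile this Quarto document
# (start_time is assumed to have been set with datetime.now() at the top of the document)
from datetime import datetime

end_time = datetime.now()
time_diff = end_time - start_time

print(f"Total Quarto document compiling time: {time_diff}")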

Total Quarto document compiling time: 0:01:44.911285