In this document, we will use tree-based models, Decision Tree and Random Forest regressors, with hyperparameter tuning via RandomizedSearchCV. Repeated k-fold cross-validation will be applied during the search to evaluate model performance, and the tuned models will be compared with their default configurations.
Key steps:
Model training with Decision Trees and Random Forests.
Hyperparameter tuning using RandomizedSearchCV.
Comparison of tuned models with default configurations.
# Import necessary libraries
import os
from os import getcwd
from os.path import join
import warnings

import requests
import numpy as np
import pandas as pd
import scipy.io as sio
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px
from scipy.stats import randint, uniform

from statsmodels.tsa.ar_model import AutoReg
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import (train_test_split, cross_val_score, RepeatedKFold,
                                     RandomizedSearchCV, GridSearchCV)
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             mean_squared_error)

warnings.filterwarnings('ignore', message='DataFrame is highly fragmented.')
Code
# Download the data file
url = 'http://helon.usuarios.rdc.puc-rio.br/data/data3SS2009.mat'
response = requests.get(url)
with open('data3SS2009.mat', 'wb') as f:
    f.write(response.content)

# Load the data
fname = join(getcwd(), 'data3SS2009.mat')
mat_contents = sio.loadmat(fname)
dataset = mat_contents['dataset']

# Display the shape of the dataset
N, Chno, Nc = dataset.shape
print(f"Dataset shape: {dataset.shape}")

# Reshape labels
labels = mat_contents['labels'].reshape(Nc)
# print(f"Labels shape: {labels.shape}")

# Separate the data by channel
Ch1 = dataset[:, 0, :]  # load cell: shaker force
Ch2 = dataset[:, 1, :]  # accelerometer: base
Ch3 = dataset[:, 2, :]  # accelerometer: 1st floor
Ch4 = dataset[:, 3, :]  # accelerometer: 2nd floor
Ch5 = dataset[:, 4, :]  # accelerometer: 3rd floor

# Display the shapes of each channel
# print(f"Ch1 shape: {Ch1.shape}")
# print(f"Ch2 shape: {Ch2.shape}")
# print(f"Ch3 shape: {Ch3.shape}")
# print(f"Ch4 shape: {Ch4.shape}")
# print(f"Ch5 shape: {Ch5.shape}")

# Create a DataFrame for a better overview: one row per case, each cell holds a channel's time series
data = {'Ch1': [Ch1[:, i] for i in range(Nc)],
        'Ch2': [Ch2[:, i] for i in range(Nc)],
        'Ch3': [Ch3[:, i] for i in range(Nc)],
        'Ch4': [Ch4[:, i] for i in range(Nc)],
        'Ch5': [Ch5[:, i] for i in range(Nc)],
        'Label': labels}
df = pd.DataFrame(data)

# Use pandas to get a glimpse of the dataset
print(df.info())
# print(df.head())
Dataset Shape:
dataset.shape returns (8192, 5, 850), indicating 8192 time samples per signal, 5 channels, and 850 cases.
Labels Shape:
labels.shape returns (850,), indicating there are 850 labels corresponding to the 850 cases.
Channels:
Ch1 (Shape: (8192, 850)): Represents the force measured by the load cell (shaker force).
Ch2 (Shape: (8192, 850)): Represents the acceleration measured at the base of the structure.
Ch3 (Shape: (8192, 850)): Represents the acceleration measured at the 1st floor of the structure.
Ch4 (Shape: (8192, 850)): Represents the acceleration measured at the 2nd floor of the structure.
Ch5 (Shape: (8192, 850)): Represents the acceleration measured at the 3rd floor of the structure.
DataFrame Overview:
A pandas DataFrame is created where each column represents one of the channels (Ch1 to Ch5) and the labels.
The df.info() function provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
The df.head() call (commented out in the code above) can be used to display the first few rows of the DataFrame and preview the data.
Data Visualization:
The time vector time is created based on the number of samples (N) and the sampling time (Ts).
For the first two cases, the force data (Ch1) and acceleration data (Ch2 to Ch5) are plotted against time to provide a visual preview of the data.
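The plotting code itself is not shown in this section. A minimal sketch of the visualization step described above, with Ts as a placeholder sampling time (an assumption, not necessarily the value used in the original experiment), could look like this:
Code
# Sketch of the data preview plots: force and accelerations vs. time for the first two cases
Ts = 3.125e-3  # placeholder sampling time in seconds (assumption; adjust to the experiment's value)
time = np.arange(N) * Ts

channel_data = [Ch1, Ch2, Ch3, Ch4, Ch5]
channel_names = ['Ch1 (force)', 'Ch2 (base)', 'Ch3 (1st floor)', 'Ch4 (2nd floor)', 'Ch5 (3rd floor)']

fig, axes = plt.subplots(5, 1, figsize=(10, 12), sharex=True)
for ax, ch, name in zip(axes, channel_data, channel_names):
    for case in range(2):  # first two cases
        ax.plot(time, ch[:, case], label=f'Case {case + 1}')
    ax.set_ylabel(name)
    ax.legend(loc='upper right')
axes[-1].set_xlabel('Time [s]')
plt.tight_layout()
plt.show()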
Detailed Description
Channels:
Ch1 (Load Cell - Shaker Force): This channel captures the force applied by the shaker to the structure. It is essential for understanding the input excitation.
Ch2 (Accelerometer - Base): This channel measures the acceleration at the base of the structure. It helps in understanding the base motion response.
Ch3 (Accelerometer - 1st Floor): This channel measures the acceleration at the 1st floor, providing insights into the structural response at this level.
Ch4 (Accelerometer - 2nd Floor): This channel measures the acceleration at the 2nd floor, which is useful for analyzing the dynamic behavior at this level.
Ch5 (Accelerometer - 3rd Floor): This channel measures the acceleration at the 3rd floor, giving information about the response at the top of the structure.
Labels:
The labels array contains the labels for each case, which might represent different conditions or states of the structure during the experiments.
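As a quick check of what these labels contain, their distinct values and counts can be inspected; a minimal sketch reusing labels from the loading code above:
Code
# Inspect the distinct state labels and how many cases belong to each
unique_labels, counts = np.unique(labels, return_counts=True)
for lab, cnt in zip(unique_labels, counts):
    print(f"State {lab}: {cnt} cases")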
Code
# Feature extraction: compute the mean of each channel's time series for each case
df_agg = pd.DataFrame({'Ch1_mean': df['Ch1'].apply(np.mean),
                       'Ch2_mean': df['Ch2'].apply(np.mean),
                       'Ch3_mean': df['Ch3'].apply(np.mean),
                       'Ch4_mean': df['Ch4'].apply(np.mean),
                       'Ch5_mean': df['Ch5'].apply(np.mean),
                       'Label': df['Label']})

# Split the dataset into features and labels
X = df_agg.drop('Label', axis=1)  # Features
y = df_agg['Label']               # Labels

# Split the dataset into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create a RepeatedKFold cross-validator
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# Define hyperparameter distributions for the Decision Tree Regressor (DTR) and Random Forest Regressor (RFR)
param_dist_dtr = {"criterion": ["squared_error", "friedman_mse"],
                  "splitter": ["best", "random"],
                  "max_depth": randint(2, 20),
                  "max_features": uniform(0.1, 0.9)}
param_dist_rfr = {"n_estimators": randint(20, 100),
                  "criterion": ["squared_error", "absolute_error", "friedman_mse"],
                  "max_depth": randint(2, 20),
                  "max_features": uniform(0.1, 0.9)}

# Define the models
models = []
models.append(("DTR", DecisionTreeRegressor()))
models.append(("RFR", RandomForestRegressor(random_state=42)))

# Run RandomizedSearchCV for each model
for name, model in models:
    if name == "DTR":
        param_dist = param_dist_dtr
    elif name == "RFR":
        param_dist = param_dist_rfr

    # Perform the randomized search with repeated k-fold cross-validation
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_dist,
        n_iter=50,  # This number can be changed if necessary
        cv=cv,
        random_state=42,
        n_jobs=-1
    )

    # Fit the model
    random_search.fit(X_train, y_train)

    # Get the best model
    best_model = random_search.best_estimator_

    # Make predictions on the test set
    y_pred = best_model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} Best Hyperparameters: {random_search.best_params_}")
    print(f"{name} Mean Squared Error: {mse}")
DTR Best Hyperparameters: {'criterion': 'friedman_mse', 'max_depth': 3, 'max_features': 0.7158097238609412, 'splitter': 'best'}
DTR Mean Squared Error: 19.717782164120994
RFR Best Hyperparameters: {'criterion': 'squared_error', 'max_depth': 7, 'max_features': 0.5108811134346193, 'n_estimators': 63}
RFR Mean Squared Error: 16.52601075476645
Results Interpretation:
In this machine learning exercise, we performed hyperparameter tuning with RandomizedSearchCV on both the Decision Tree Regressor (DTR) and the Random Forest Regressor (RFR). The goal was to optimize each model by searching over a range of hyperparameters and evaluating performance with the Mean Squared Error (MSE) on a held-out test set.
Here’s a detailed interpretation of the results:
Decision Tree Regressor (DTR)
Best Hyperparameters:
criterion: 'friedman_mse'
max_depth: 3
max_features: 0.7158
splitter: 'best'
The Decision Tree Regressor selected the 'friedman_mse' criterion, which evaluates candidate splits using mean squared error with Friedman's improvement score and is well suited to regression. The optimal tree is relatively shallow (max_depth = 3), indicating that limiting the depth helps avoid overfitting, and approximately 71.58% of the features are considered at each split (max_features = 0.7158, i.e. 3 of the 5 channel-mean features). Finally, the 'best' splitter, which selects the best split under the chosen criterion rather than a random one, further stabilizes the model's performance.
Performance:
Mean Squared Error: 19.72
The MSE of 19.72 is the average squared difference between the predicted values and the actual values. This is a reasonable result for a single Decision Tree, but there is room for improvement, particularly by using ensemble methods that reduce model variance.
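Because the tuned tree is only three levels deep, its full structure can be printed for inspection. A minimal sketch, assuming the tuned DTR estimator from the search above was saved in a hypothetical variable best_dtr (e.g. via best_dtr = random_search.best_estimator_ during the DTR iteration):
Code
# Print the structure of the shallow tuned Decision Tree (best_dtr assumed to hold the tuned DTR)
from sklearn.tree import export_text
print(export_text(best_dtr, feature_names=list(X.columns)))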
Random Forest Regressor (RFR)
Best Hyperparameters:
criterion: 'squared_error'
max_depth: 7
max_features: 0.5109
n_estimators: 63
The Random Forest Regressor outperformed the Decision Tree Regressor. It selected the 'squared_error' criterion, the standard choice for minimizing the squared differences between actual and predicted values. The optimal tree depth was 7 (max_depth = 7), deeper than the single decision tree, allowing the ensemble to capture more complex interactions between features. The model considered about 51.09% of the features at each split (max_features = 0.5109, i.e. 2 of the 5 channel-mean features), and the forest consisted of 63 trees (n_estimators = 63), striking a balance between computational cost and predictive accuracy.
Performance:
Mean Squared Error: 16.53
The Random Forest model achieved a lower MSE of 16.53 compared to the Decision Tree Regressor. This shows that the Random Forest model, being an ensemble method, is better at generalizing to the test data due to its ability to reduce overfitting by averaging predictions from multiple trees, thus reducing the overall model variance.
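To see which channel means drive the forest's predictions, the tuned model's feature importances can be inspected. A minimal sketch, assuming the tuned RFR estimator was saved in a hypothetical variable best_rfr:
Code
# Rank the channel-mean features by importance in the tuned Random Forest (best_rfr assumed)
importances = pd.Series(best_rfr.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))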
Key Takeaways
Model Performance: The Random Forest Regressor clearly outperformed the Decision Tree Regressor, reducing the MSE from 19.72 (DTR) to 16.53 (RFR). This is expected, as Random Forests typically perform better by aggregating predictions from multiple decision trees, reducing overfitting and improving generalization to unseen data.
Hyperparameter Selection:
The Decision Tree Regressor performed best with a shallow tree depth and a fixed splitting criterion, which limits model complexity to prevent overfitting.
The Random Forest Regressor performed best with a moderately deeper tree depth, averaging across 63 trees, confirming the strength of ensemble methods in improving predictive accuracy.
Error Reduction: The MSE difference between the two models (19.72 for DTR vs. 16.53 for RFR) highlights the advantage of ensemble methods like Random Forests in regression tasks, especially when the data contain complex interactions or noise.
The experiment shows that Random Forest Regressor is a more robust and accurate model compared to the Decision Tree Regressor for this dataset. By reducing the Mean Squared Error, the Random Forest model proves its effectiveness in generalizing better to unseen data. This outcome suggests that ensemble methods, such as Random Forests, should be preferred when high predictive accuracy is required in regression tasks.
Further improvements might be achieved by experimenting with additional feature engineering techniques, increasing the number of estimators in the Random Forest, or trying other advanced models such as Gradient Boosting Machines or XGBoost.
Conclusion
In this exercise, we applied RandomizedSearchCV to tune the hyperparameters of Decision Tree Regressor (DTR) and Random Forest Regressor (RFR) models. By performing cross-validation and testing various configurations, we observed significant improvements in both models compared to their default configurations.
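The default-configuration results themselves are not printed in this section. A minimal sketch of how that baseline comparison can be reproduced, reusing the train/test split from the code above:
Code
# Baseline: models with default hyperparameters on the same train/test split
for name, model in [("DTR (default)", DecisionTreeRegressor(random_state=42)),
                    ("RFR (default)", RandomForestRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} Mean Squared Error: {mse:.2f}")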
The Decision Tree Regressor was optimized with a relatively shallow tree (max_depth = 3), reducing its complexity while still providing meaningful predictions. However, this single-tree model was surpassed by the Random Forest Regressor, which used deeper trees (max_depth = 7) in an ensemble of 63 trees (n_estimators = 63). The Random Forest achieved a lower Mean Squared Error (MSE) of 16.53, compared to 19.72 for the Decision Tree, indicating that the ensemble method generalized better to the test data.
The Random Forest’s superior performance is expected, as ensemble models generally reduce overfitting by averaging the predictions of multiple trees, leading to a more robust and stable prediction. Additionally, RandomizedSearchCV was effective in finding optimal hyperparameters for both models, highlighting the importance of hyperparameter tuning in improving model accuracy.
Key Insights:
RandomizedSearchCV Effectiveness: The hyperparameter tuning process allowed both models to perform better than their default configurations. This shows the importance of carefully selecting model parameters in machine learning tasks.
Random Forest Superiority: The Random Forest Regressor consistently outperformed the Decision Tree Regressor. Its ensemble nature allowed it to reduce variance, which contributed to better generalization and lower MSE.
Decision Tree as a Simple Baseline: While the Decision Tree Regressor provided reasonable predictions with fewer resources, it was clear that a more complex model like Random Forest is preferable for this dataset, which likely contains interactions and complexity that a single tree cannot capture effectively.
Future Directions
Further Feature Engineering: One way to potentially improve the models is to explore additional feature extraction techniques, such as generating further statistical features (variance, standard deviation) or applying transformations (PCA or wavelet transforms) to better capture the underlying patterns in the data; a small sketch of this idea follows this list.
Advanced Models: Exploring more advanced ensemble methods, such as Gradient Boosting Machines (GBM) or XGBoost, could provide even better performance by combining the strengths of boosting and tree-based models.
Model Complexity vs. Interpretability: While Random Forest performed better, it is also more complex. In scenarios where interpretability is key, simpler models like Decision Trees may still be valuable despite their lower accuracy.
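As a starting point for the feature-engineering direction above, the per-channel mean could be replaced by a richer set of statistics. A minimal sketch (not part of the original analysis), reusing df from the loading code:
Code
# Richer per-channel statistical features: mean, standard deviation, variance, RMS and peak amplitude
def channel_stats(sig):
    return pd.Series({'mean': np.mean(sig),
                      'std': np.std(sig),
                      'var': np.var(sig),
                      'rms': np.sqrt(np.mean(np.square(sig))),
                      'peak': np.max(np.abs(sig))})

feature_frames = [df[ch].apply(channel_stats).add_prefix(f'{ch}_')
                  for ch in ['Ch1', 'Ch2', 'Ch3', 'Ch4', 'Ch5']]
df_features = pd.concat(feature_frames + [df['Label']], axis=1)
print(df_features.shape)  # expected (850, 26): 5 channels x 5 statistics + label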
In summary, Random Forest Regressor was the better model for this task, but both models benefited greatly from hyperparameter tuning through RandomizedSearchCV. These results demonstrate the power of tree-based models and the importance of careful hyperparameter selection in regression tasks.
Code
# Total timing to compile this Quarto document
# (datetime and start_time are assumed to be defined at the beginning of the document)
end_time = datetime.now()
time_diff = end_time - start_time
print(f"Total Quarto document compiling time: {time_diff}")
Total Quarto document compiling time: 0:01:44.911285