06 - supervised learning II – prof. Helon Hultmann Ayala

Author

Rodrigo Hermont Ozon

Published

August 1, 2024

Exercise Codes Quarto Document

This document covers Lesson 06 on unsupervised learning. We apply clustering, visualization with t-SNE, and one-class classification for anomaly detection to the 3-storey dataset.

Key tasks include:

Clustering using K-means.
Visualizing clusters using t-SNE.
Building a one-class classification SVM for anomaly detection in the data.

Take-home exercise

Using the 3-storey dataset from previous activities, perform the following (having features constructed with AR models and PCA):

Apply the k-means algorithm with the reduced dimensions;
Check the results visually by plotting side by side your clustering results and the correct labels in 3D (using only 3 dimensions of the PCA in the input space);
Optional: try to assign to each cluster the label of the class with the most instances in it, and build the confusion matrices based on this information (this will require some coding).
Visualize your dataset with t-SNE;
Build a one-class classification SVM for nominal and failure modes.

Instructions

Send me a link to your GitHub repository (free to register) with a Jupyter notebook that I can access
- Something like This example notebook
Delivery: Before the next meeting, by email with the subject [HIML]
Instructions:
- Send a PDF file with the code when applicable
- If you need feedback, ask
- If you are late, try to submit as soon as possible

Introduction

This document covers Lesson 06, where we use unsupervised learning techniques on the 3-storey dataset, reduced through AR model features and PCA. The tasks involve applying k-means clustering, t-SNE visualization, and One-Class SVM for anomaly detection.

Tasks:

Apply K-means clustering using reduced dimensions (PCA).
Compare clustering results with actual labels in a 3D plot.
(Optional) Assign labels to clusters and compute confusion matrices.
Visualize data using t-SNE.
Build a One-Class SVM for nominal and failure modes detection.

Solution

Code

# Import necessary libraries
import requests
import scipy.io as sio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from os import getcwd
from os.path import join
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report, confusion_matrix
from statsmodels.tsa.ar_model import AutoReg

#warnings.filterwarnings('ignore', message='DataFrame is highly fragmented.')

Code

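# Download the data file (stop early on HTTP errors instead of saving an error page)
url = 'http://helon.usuarios.rdc.puc-rio.br/data/data3SS2009.mat'
response = requests.get(url, timeout=60)
response.raise_for_status()
with open('data3SS2009.mat', 'wb') as f:
    f.write(response.content)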

# Load the data
fname = join(getcwd(), 'data3SS2009.mat')
mat_contents = sio.loadmat(fname)
dataset = mat_contents['dataset']

# Display the shape of the dataset
N, Chno, Nc = dataset.shape
print(f"Dataset shape: {dataset.shape}")

# Reshape labels
labels = mat_contents['labels'].reshape(Nc)

# Separate the data by channel
Ch1 = dataset[:, 0, :]
Ch2 = dataset[:, 1, :]
Ch3 = dataset[:, 2, :]
Ch4 = dataset[:, 3, :]
Ch5 = dataset[:, 4, :]

# Fit an AR model to each case and use the fitted parameters as features.
# AutoReg includes a constant by default, so lags=5 yields 6 parameters
# per case (intercept + 5 AR coefficients).
def apply_ar_model(data, lags=5):
    ar_features = []
    for sample in data.T:  # data is (N, Nc); transposing iterates over the Nc cases
        model_fitted = AutoReg(sample, lags=lags).fit()
        ar_features.append(model_fitted.params)
    return np.array(ar_features)

# Apply AR model to all channels
ar_features_ch1 = apply_ar_model(Ch1)
ar_features_ch2 = apply_ar_model(Ch2)
ar_features_ch3 = apply_ar_model(Ch3)
ar_features_ch4 = apply_ar_model(Ch4)
ar_features_ch5 = apply_ar_model(Ch5)

# Combine AR features into a single dataset
ar_features = np.hstack([ar_features_ch1, ar_features_ch2, ar_features_ch3, ar_features_ch4, ar_features_ch5])
print(f"AR Features shape: {ar_features.shape}")

# ar_features should have shape (Nc, 5 channels * 6 parameters) = (850, 30),
# with one row per case so that it lines up with the Nc labels
print(f"Labels shape: {labels.shape}")

Dataset shape: (8192, 5, 850)

AR Features shape: (850, 30)
Labels shape: (850,)
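
The shape (850, 30) follows from 5 channels times 6 parameters per channel (an intercept plus 5 lag coefficients). The sketch below illustrates this with a plain NumPy least-squares fit on a synthetic AR(2) signal. It is a stand-in for illustration only: the function name ar_params_least_squares and the synthetic signal are assumptions, not part of the original code, which uses statsmodels' AutoReg.

```python
import numpy as np

def ar_params_least_squares(series, lags=5):
    """Return [intercept, phi_1, ..., phi_lags] fitted by ordinary least squares."""
    series = np.asarray(series, dtype=float)
    n = series.size
    # Column k holds the series delayed by k steps, aligned with the targets series[lags:]
    lagged = np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])
    design = np.column_stack([np.ones(n - lags), lagged])
    params, *_ = np.linalg.lstsq(design, series[lags:], rcond=None)
    return params

# Synthetic stand-in for one channel of one case (NOT the 3-storey data): an AR(2) process
rng = np.random.default_rng(0)
signal = np.zeros(4096)
for t in range(2, signal.size):
    signal[t] = 0.6 * signal[t - 1] - 0.3 * signal[t - 2] + rng.normal()

params = ar_params_least_squares(signal, lags=5)
n_channels = 5
print(params.shape)              # (6,)
print(n_channels * params.size)  # 30
```

The first entry of params is the intercept, so only the remaining five entries are AR coefficients in the strict sense.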

Explanation of Dataset Contents

Dataset Shape:
- dataset.shape returns (8192, 5, 850): 8192 time samples per channel, 5 channels, and 850 cases.
Labels Shape:
- labels.shape returns (850,), indicating there are 850 labels corresponding to the 850 cases.
Channels:
- Ch1 (Shape: (8192, 850)): Represents the force measured by the load cell (shaker force).
- Ch2 (Shape: (8192, 850)): Represents the acceleration measured at the base of the structure.
- Ch3 (Shape: (8192, 850)): Represents the acceleration measured at the 1st floor of the structure.
- Ch4 (Shape: (8192, 850)): Represents the acceleration measured at the 2nd floor of the structure.
- Ch5 (Shape: (8192, 850)): Represents the acceleration measured at the 3rd floor of the structure.
DataFrame Overview (exploratory step, not reproduced in the code above):
- A pandas DataFrame can be built in which each column holds one of the channels (Ch1 to Ch5), alongside the labels.
- df.info() gives a concise summary of the DataFrame: column names, non-null counts, and data types.
- df.head() displays the first few rows as a preview of the data.
Data Visualization (exploratory step, not reproduced in the code above):
- The time vector time is built from the number of samples (N) and the sampling time (Ts).
- For the first two cases, the force data (Ch1) and the acceleration data (Ch2 to Ch5) are plotted against time as a visual preview.
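
As a minimal sketch of the time vector described above: the helper make_time_vector is a hypothetical name, and the value of Ts below is a placeholder assumption, since the sampling time is not stated in this document.

```python
import numpy as np

def make_time_vector(n_samples, ts):
    """Time stamps in seconds for n_samples points spaced ts seconds apart, starting at 0."""
    return np.arange(n_samples) * ts

N = 8192    # samples per case, taken from dataset.shape
Ts = 0.005  # HYPOTHETICAL sampling time in seconds; replace with the dataset's actual value
time = make_time_vector(N, Ts)

print(time.shape)  # (8192,)
print(time[:3])
```

With this vector, one case of one channel could be plotted with a call such as plt.plot(time, Ch1[:, 0]).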

Detailed Description

Channels:
- Ch1 (Load Cell - Shaker Force): This channel captures the force applied by the shaker to the structure. It is essential for understanding the input excitation.
- Ch2 (Accelerometer - Base): This channel measures the acceleration at the base of the structure. It helps in understanding the base motion response.
- Ch3 (Accelerometer - 1st Floor): This channel measures the acceleration at the 1st floor, providing insights into the structural response at this level.
- Ch4 (Accelerometer - 2nd Floor): This channel measures the acceleration at the 2nd floor, which is useful for analyzing the dynamic behavior at this level.
- Ch5 (Accelerometer - 3rd Floor): This channel measures the acceleration at the 3rd floor, giving information about the response at the top of the structure.
Labels:
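- The labels array holds one label per case. The confusion matrix later in this document has 17 rows of 50 cases each, so the labels take 17 distinct values (17 × 50 = 850), each identifying a condition or state of the structure during the experiments.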

Code

# Standardize the features
scaler = StandardScaler()
ar_features_scaled = scaler.fit_transform(ar_features)

# Apply PCA to reduce to 3 dimensions
pca = PCA(n_components=3)
pca_features = pca.fit_transform(ar_features_scaled)

# Report the fraction of variance explained by each component
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by 3 components: {explained_variance}")

Explained Variance by 3 components: [0.41750604 0.22289578 0.09093992]
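
To see how much information the three components retain together, the printed ratios can be accumulated. The short sketch below copies the values from the output above rather than recomputing the PCA.

```python
import numpy as np

# Ratios copied from the printed output above
explained_variance = np.array([0.41750604, 0.22289578, 0.09093992])

cumulative = np.cumsum(explained_variance)
print(np.round(cumulative, 4))  # [0.4175 0.6404 0.7313]
print(f"Variance not captured by 3 components: {1 - cumulative[-1]:.1%}")  # 26.9%
```

Roughly a quarter of the variance is discarded by keeping only three components, which is worth keeping in mind when reading the 3D plots.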

Code

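# Apply K-means clustering on the 3 PCA components
# n_init is set explicitly to its current default (10) to avoid the FutureWarning
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(pca_features)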

# Add cluster labels to the dataframe
df_pca = pd.DataFrame(pca_features, columns=['PC1', 'PC2', 'PC3'])
df_pca['Cluster'] = kmeans.labels_
df_pca['Label'] = labels

# 3D Plot for K-means clusters vs true labels
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(14, 6))

# K-means clusters plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(df_pca['PC1'], df_pca['PC2'], df_pca['PC3'], c=df_pca['Cluster'], cmap='viridis')
ax1.set_title('K-means Clusters')

# True labels plot
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(df_pca['PC1'], df_pca['PC2'], df_pca['PC3'], c=df_pca['Label'], cmap='coolwarm')
ax2.set_title('True Labels')


plt.show()

Code

# Assign labels to clusters based on the majority class in each cluster
cluster_to_label = {}
for cluster in np.unique(df_pca['Cluster']):
    true_labels_in_cluster = df_pca[df_pca['Cluster'] == cluster]['Label']
    most_common_label = true_labels_in_cluster.value_counts().idxmax()
    cluster_to_label[cluster] = most_common_label

# Assign the new labels based on cluster assignments
df_pca['Predicted_Label'] = df_pca['Cluster'].map(cluster_to_label)

# Compute the confusion matrix
conf_matrix = confusion_matrix(df_pca['Label'], df_pca['Predicted_Label'])
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 50  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [49  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 50  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 50  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 50  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 50  0  0  0  0  0  0]
 [50  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [49  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 50  0  0  0  0  0  0]]
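
The matrix above can be summarized numerically. The sketch below rebuilds it by hand from the printed output (rows and columns are indexed from 0 in sorted label order) and derives the overall accuracy and the purity of each true label. The variable names are illustrative assumptions.

```python
import numpy as np

# Rebuild the printed 17 x 17 confusion matrix
conf = np.zeros((17, 17), dtype=int)
for row in [0, 1, 2, 3, 5, 6, 7, 8, 14]:
    conf[row, 0] = 50               # rows assigned entirely to the first predicted label
conf[4, 4] = 50                     # the fifth label has a cluster of its own
for row in [10, 11, 12, 13, 16]:
    conf[row, 10] = 50              # rows assigned entirely to the eleventh predicted label
for row in [9, 15]:
    conf[row, 0], conf[row, 10] = 49, 1  # the two rows that are split 49 / 1

accuracy = np.trace(conf) / conf.sum()
used_columns = np.flatnonzero(conf.sum(axis=0))
# Purity: largest share of each true label that lands in a single predicted label
row_purity = conf.max(axis=1) / conf.sum(axis=1)

print(conf.sum())             # 850
print(used_columns.tolist())  # [0, 4, 10]
print(round(accuracy, 4))     # 0.1765
print(row_purity.min())       # 0.98
```

Only three columns are ever predicted because K-means was run with three clusters, so accuracy against 17 label values is necessarily low even though each label is grouped almost perfectly.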

Visualize with t-SNE

Code

# Apply t-SNE for dimensionality reduction and visualization
tsne = TSNE(n_components=2, random_state=42)
tsne_results = tsne.fit_transform(ar_features_scaled)

# Plot the t-SNE results
plt.figure(figsize=(8,6))
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title('t-SNE Visualization of Clusters')
plt.show()

One-Class SVM for Anomaly Detection

Code

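# Fit a One-Class SVM on all cases, nominal and failure alike (no labels are used),
# so the model flags the cases that are least typical of the whole dataset
ocsvm = OneClassSVM(kernel='rbf', gamma='auto')
ocsvm.fit(ar_features_scaled)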

# Predict anomalies (1: normal, -1: anomaly)
df_pca['Anomaly'] = ocsvm.predict(ar_features_scaled)
df_pca['Anomaly'] = df_pca['Anomaly'].map({1: 'Normal', -1: 'Anomaly'})

# Visualize anomalies in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df_pca['PC1'], df_pca['PC2'], df_pca['PC3'], c=df_pca['Anomaly'].map({'Normal': 0, 'Anomaly': 1}), cmap='coolwarm')
plt.title("One-Class SVM: Anomaly Detection")
plt.show()
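
A common alternative setup for one-class classification is to fit the model on nominal cases only and then score held-out nominal and failure cases. The sketch below shows that variant on synthetic data, not on the 3-storey dataset. The data, the split sizes, and the choices nu=0.05 and gamma='scale' are assumptions made for illustration, and this is not the approach used in the code above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Synthetic stand-ins (NOT the 3-storey data): nominal cases near the origin,
# failure cases shifted away from it
X_nominal = rng.normal(loc=0.0, scale=1.0, size=(300, 3))
X_failure = rng.normal(loc=6.0, scale=1.0, size=(100, 3))

# Hold out part of the nominal data to estimate the false-alarm rate
X_train, X_nominal_test = X_nominal[:200], X_nominal[200:]

# Fit the scaler and the model on nominal training cases only
scaler = StandardScaler().fit(X_train)
ocsvm_nominal = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
ocsvm_nominal.fit(scaler.transform(X_train))

# predict returns +1 for inliers and -1 for outliers
pred_nominal = ocsvm_nominal.predict(scaler.transform(X_nominal_test))
pred_failure = ocsvm_nominal.predict(scaler.transform(X_failure))

false_alarm_rate = np.mean(pred_nominal == -1)
detection_rate = np.mean(pred_failure == -1)
print(f"False alarms on held-out nominal cases: {false_alarm_rate:.1%}")
print(f"Detected failure cases: {detection_rate:.1%}")
```

Applying this to the real data would require knowing which label values correspond to the nominal condition, which this document does not state.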

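Results Interpretation:

The three principal components retain about 73% of the variance of the standardized AR features (0.418 + 0.223 + 0.091). With k = 3, K-means can reproduce at most 3 of the 17 label values, and the confusion matrix confirms this: only the first, fifth, and eleventh label values lie on the diagonal, so 150 of the 850 cases (about 17.6%) are assigned to their own label. The cluster structure is still informative, since almost every label value falls entirely into a single cluster; only two cases in total are separated from the rest of their label.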

Conclusion

References

Ayala, H. V. H. 06 supervised learning II. Lecture notes, Machine Learning class, Industrial and Systems Engineering Graduate Program (PPGEPS), Pontifical Catholic University of Paraná (PUCPR), 2024.

Code

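# Total time to compile this Quarto document
# Assumes start_time = datetime.now() was recorded in the first cell of the document
from datetime import datetime

end_time = datetime.now()
time_diff = end_time - start_time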

print(f"Total Quarto document compiling time: {time_diff}")

Total Quarto document compiling time: 0:01:07.183028