06 - supervised learning II – prof. Helon Hultmann Ayala
Author
Rodrigo Hermont Ozon
Published
August 1, 2024
Exercise Codes Quarto Document
This document accompanies Lesson 06 on unsupervised learning. We apply clustering methods, visualization techniques such as t-SNE, and one-class classification for anomaly detection. The data come from the 3-storey dataset.
Key tasks include:
Clustering using K-means.
Visualizing clusters using t-SNE.
Building a one-class classification SVM for anomaly detection in the data.
Take-home exercise
Using the 3-storey dataset from previous activities, with features constructed from AR models and PCA, perform the following:
Apply the k-means algorithm with the reduced dimensions;
Check the results visually by plotting side by side your clustering results and the correct labels in 3D (using only 3 dimensions of the PCA in the input space);
Optional: assign to each cluster the label of the class with the most instances in it, and build the confusion matrices from this assignment (this will require some coding).
Visualize your dataset with t-SNE;
Build a one-class classification SVM for nominal and failure modes.
Instructions
Send me a link to your GitHub repository (free to register) with a Jupyter notebook that I can access
This document covers Lesson 06, where we use unsupervised learning techniques on the 3-storey dataset, reduced through AR model features and PCA. The tasks involve applying k-means clustering, t-SNE visualization, and One-Class SVM for anomaly detection.
Tasks:
Apply K-means clustering using reduced dimensions (PCA).
Compare clustering results with actual labels in a 3D plot.
(Optional) Assign labels to clusters and compute confusion matrices.
Visualize data using t-SNE.
Build a One-Class SVM for nominal and failure modes detection.
# Import necessary libraries
import requests
import scipy.io as sio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from os import getcwd
from os.path import join
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report, confusion_matrix
from statsmodels.tsa.ar_model import AutoReg
#warnings.filterwarnings('ignore', message='DataFrame is highly fragmented.')
Code
# Download the data file
url = 'http://helon.usuarios.rdc.puc-rio.br/data/data3SS2009.mat'
response = requests.get(url)
with open('data3SS2009.mat', 'wb') as f:
    f.write(response.content)

# Load the data
fname = join(getcwd(), 'data3SS2009.mat')
mat_contents = sio.loadmat(fname)
dataset = mat_contents['dataset']

# Display the shape of the dataset
N, Chno, Nc = dataset.shape
print(f"Dataset shape: {dataset.shape}")

# Reshape labels
labels = mat_contents['labels'].reshape(Nc)

# Separate the data by channel
Ch1 = dataset[:, 0, :]
Ch2 = dataset[:, 1, :]
Ch3 = dataset[:, 2, :]
Ch4 = dataset[:, 3, :]
Ch5 = dataset[:, 4, :]

# Apply AR model to each channel to extract features (AR coefficients)
def apply_ar_model(data, lags=5):
    ar_features = []
    for sample in data.T:  # Iterate over the cases, transpose needed for correct iteration
        model = AutoReg(sample, lags=lags)
        model_fitted = model.fit()
        ar_features.append(model_fitted.params)
    return np.array(ar_features)

# Apply AR model to all channels
ar_features_ch1 = apply_ar_model(Ch1)
ar_features_ch2 = apply_ar_model(Ch2)
ar_features_ch3 = apply_ar_model(Ch3)
ar_features_ch4 = apply_ar_model(Ch4)
ar_features_ch5 = apply_ar_model(Ch5)

# Combine AR features into a single dataset
ar_features = np.hstack([ar_features_ch1, ar_features_ch2, ar_features_ch3, ar_features_ch4, ar_features_ch5])
print(f"AR Features shape: {ar_features.shape}")

# The number of AR features now should be (Nc, total_ar_features)
# Ensure the AR features match the number of cases Nc
print(f"Labels shape: {labels.shape}")
Dataset shape: (8192, 5, 850)
AR Features shape: (850, 30)
Labels shape: (850,)
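The 30 columns in the AR feature matrix can be accounted for as follows: with lags=5 and a constant term (the AutoReg default, as far as I know), each fit returns 6 parameters, and 5 channels times 6 parameters gives 30 features per case. The sketch below illustrates this with a NumPy-only least-squares fit on a synthetic signal. The helper fit_ar_ols and the synthetic AR(2) process are hypothetical stand-ins for illustration, not the code used in this document.

```python
import numpy as np

def fit_ar_ols(signal, lags=5):
    """Least-squares AR(lags) fit with an intercept (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    # Column k holds the signal delayed by k + 1 samples.
    lagged = np.column_stack(
        [signal[lags - k - 1 : n - k - 1] for k in range(lags)]
    )
    design = np.column_stack([np.ones(n - lags), lagged])
    target = signal[lags:]
    params, *_ = np.linalg.lstsq(design, target, rcond=None)
    return params  # [intercept, phi_1, ..., phi_lags]

# Synthetic stand-in for one channel of one case (ASSUMPTION: not real data).
rng = np.random.default_rng(0)
n_samples, n_channels = 8192, 5
noise = rng.standard_normal(n_samples)
x = np.zeros(n_samples)
for t in range(2, n_samples):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + noise[t]

params = fit_ar_ols(x, lags=5)
features_per_channel = params.size                     # intercept + 5 lags
features_per_case = n_channels * features_per_channel  # columns after hstack
print(features_per_channel, features_per_case)         # prints: 6 30
```

The first two lag coefficients should come out close to 0.6 and -0.3, the values used to generate the signal, which is what makes AR coefficients usable as features of the underlying dynamics.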
Explanation of Dataset Contents
Dataset Shape:
dataset.shape returns (8192, 5, 850): 8192 time samples per channel, 5 channels, and 850 cases.
Labels Shape:
labels.shape returns (850,), indicating there are 850 labels corresponding to the 850 cases.
Channels:
Ch1 (Shape: (8192, 850)): Represents the force measured by the load cell (shaker force).
Ch2 (Shape: (8192, 850)): Represents the acceleration measured at the base of the structure.
Ch3 (Shape: (8192, 850)): Represents the acceleration measured at the 1st floor of the structure.
Ch4 (Shape: (8192, 850)): Represents the acceleration measured at the 2nd floor of the structure.
Ch5 (Shape: (8192, 850)): Represents the acceleration measured at the 3rd floor of the structure.
DataFrame Overview:
A pandas DataFrame is created where each column represents one of the channels (Ch1 to Ch5) and the labels.
The df.info() function provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
The df.head() function displays the first few rows of the DataFrame to give a preview of the data.
Data Visualization:
The time vector time is created based on the number of samples (N) and the sampling time (Ts).
For the first two cases, the force data (Ch1) and acceleration data (Ch2 to Ch5) are plotted against time to provide a visual preview of the data.
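The plotting code described above is not included in this document. A minimal sketch of how the time vector and the per-channel plots could be built is given below. The sampling period Ts = 1/320 s is an assumption that should be checked against the dataset documentation, and the random array stands in for one case of the real data (dataset[:, :, 0]).

```python
import numpy as np
import matplotlib.pyplot as plt

N = 8192          # samples per case, from dataset.shape
Ts = 1 / 320      # ASSUMPTION: sampling period in seconds, verify for your data
time = np.arange(N) * Ts

# Stand-in for dataset[:, :, case]; replace with the real array.
rng = np.random.default_rng(0)
case_data = rng.standard_normal((N, 5))

channel_names = [
    "Ch1 force",
    "Ch2 base acc.",
    "Ch3 1st floor acc.",
    "Ch4 2nd floor acc.",
    "Ch5 3rd floor acc.",
]

# One subplot per channel, sharing the time axis.
fig, axes = plt.subplots(5, 1, sharex=True, figsize=(10, 8))
for ax, name, column in zip(axes, channel_names, case_data.T):
    ax.plot(time, column, linewidth=0.5)
    ax.set_ylabel(name)
axes[-1].set_xlabel("Time (s)")
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.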
Detailed Description
Channels:
Ch1 (Load Cell - Shaker Force): This channel captures the force applied by the shaker to the structure. It is essential for understanding the input excitation.
Ch2 (Accelerometer - Base): This channel measures the acceleration at the base of the structure. It helps in understanding the base motion response.
Ch3 (Accelerometer - 1st Floor): This channel measures the acceleration at the 1st floor, providing insights into the structural response at this level.
Ch4 (Accelerometer - 2nd Floor): This channel measures the acceleration at the 2nd floor, which is useful for analyzing the dynamic behavior at this level.
Ch5 (Accelerometer - 3rd Floor): This channel measures the acceleration at the 3rd floor, giving information about the response at the top of the structure.
Labels:
The labels array contains the labels for each case, which might represent different conditions or states of the structure during the experiments.
Code
# Standardize the features
scaler = StandardScaler()
ar_features_scaled = scaler.fit_transform(ar_features)

# Apply PCA to reduce to 3 dimensions
pca = PCA(n_components=3)
pca_features = pca.fit_transform(ar_features_scaled)

# Visualize the explained variance
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by 3 components: {explained_variance}")
Explained Variance by 3 components: [0.41750604 0.22289578 0.09093992]
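To read these ratios as a single number, they can be accumulated: the three components together retain roughly 73% of the variance of the standardized AR features, so about a quarter of the variance is discarded before clustering. A small worked check, using the values printed above:

```python
import numpy as np

# Ratios copied from the printed output above.
explained_variance = np.array([0.41750604, 0.22289578, 0.09093992])

# Running total of variance retained as components are added.
cumulative = np.cumsum(explained_variance)
retained = cumulative[-1]
print(f"Variance retained by 3 components: {retained:.1%}")
```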
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning.
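The K-means cell that produced the n_init warning above is not included in this document, yet the next cell relies on a df_pca table with Label and Cluster columns. A minimal sketch of that step, under stated assumptions, could look like the following. The make_blobs data stand in for pca_features and labels, the number of clusters is an assumption to be set to the number of structural states in the real labels, and the column names PC1 to PC3 are hypothetical. Setting n_init explicitly avoids the warning.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the real pca_features (Nc x 3) and labels (Nc,).
n_states = 4  # ASSUMPTION: use the number of distinct labels in your data
pca_features, labels = make_blobs(
    n_samples=200, centers=n_states, n_features=3, random_state=0
)

# Explicit n_init silences the FutureWarning about the changing default.
kmeans = KMeans(n_clusters=n_states, n_init=10, random_state=0)
clusters = kmeans.fit_predict(pca_features)

# Table with the columns used by the cluster-to-label assignment.
df_pca = pd.DataFrame(pca_features, columns=["PC1", "PC2", "PC3"])
df_pca["Label"] = labels
df_pca["Cluster"] = clusters

# Side-by-side 3D views: clustering result vs. true labels.
fig = plt.figure(figsize=(12, 5))
panels = [("Cluster", "K-means clusters"), ("Label", "True labels")]
for i, (column, title) in enumerate(panels, start=1):
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.scatter(
        df_pca["PC1"], df_pca["PC2"], df_pca["PC3"],
        c=df_pca[column], cmap="tab10", s=15,
    )
    ax.set_title(title)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
```

Cluster indices are arbitrary, so the colors in the two panels need not match even when the partition is the same; that is why a cluster-to-label mapping is needed before computing a confusion matrix.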
Code
# Assign labels to clusters based on the majority class in each cluster
cluster_to_label = {}
for cluster in np.unique(df_pca['Cluster']):
    true_labels_in_cluster = df_pca[df_pca['Cluster'] == cluster]['Label']
    most_common_label = true_labels_in_cluster.value_counts().idxmax()
    cluster_to_label[cluster] = most_common_label

# Assign the new labels based on cluster assignments
df_pca['Predicted_Label'] = df_pca['Cluster'].map(cluster_to_label)

# Compute the confusion matrix
conf_matrix = confusion_matrix(df_pca['Label'], df_pca['Predicted_Label'])
print("Confusion Matrix:\n", conf_matrix)
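The last two tasks, the t-SNE visualization and the One-Class SVM, have no code in this document. The sketch below shows one way they could be approached, on synthetic stand-in data rather than the real AR features. Treating class 0 as the nominal condition, training only on nominal cases, and the values perplexity=30 and nu=0.05 are all assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Stand-in features: class 0 = nominal, classes 1 and 2 = failure modes.
centers = np.array([[0.0] * 6, [8.0] * 6, [-8.0] * 6])
features, state = make_blobs(
    n_samples=300, centers=centers, cluster_std=1.0, random_state=0
)
is_nominal = state == 0

# --- t-SNE: 2D embedding for visual inspection only ---
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(StandardScaler().fit_transform(features))

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(embedding[:, 0], embedding[:, 1], c=state, cmap="tab10", s=15)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("t-SNE embedding colored by state")

# --- One-Class SVM: fit on nominal cases only, flag the rest ---
rng = np.random.default_rng(0)
nominal_idx = np.flatnonzero(is_nominal)
rng.shuffle(nominal_idx)
train_idx = nominal_idx[:50]  # half of the nominal cases for training

test_mask = np.ones(len(features), dtype=bool)
test_mask[train_idx] = False  # everything else is held out

# Scale with statistics from the nominal training cases only.
scaler = StandardScaler().fit(features[train_idx])
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(scaler.transform(features[train_idx]))

# predict returns +1 for inliers (nominal) and -1 for outliers (failure).
pred = ocsvm.predict(scaler.transform(features[test_mask]))
truth = np.where(is_nominal[test_mask], 1, -1)

# Rows: true nominal, true failure; columns: predicted nominal, failure.
cm = confusion_matrix(truth, pred, labels=[1, -1])
print("One-Class SVM confusion matrix:\n", cm)
```

The parameter nu bounds the fraction of training cases treated as outliers, so it trades false alarms on nominal data against missed failures and would need tuning on the real features.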
Ayala, H. V. H. 06 supervised learning II. Lecture notes, Machine Learning class, Industrial and Systems Engineering Graduate Program (PPGEPS), Pontifical Catholic University of Paraná (PPGEPS/PUCPR), 2024.
Code
# Total time taken to compile this Quarto document
# (start_time is assumed to have been set with datetime.now() in an earlier cell not shown here)
end_time = datetime.now()
time_diff = end_time - start_time
print(f"Total Quarto document compiling time: {time_diff}")
Total Quarto document compiling time: 0:01:07.183028