Spectral Classification#

# common
import warnings
warnings.filterwarnings('ignore')
import os
import os.path as op
import sys
import pickle as pk

# pip
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# DEV: bluemath
sys.path.insert(0, op.join(op.abspath(''), '..', '..', '..', '..'))

# bluemath modules 
from bluemath.binwaves.plotting.classification import axplot_spectrum
from bluemath.binwaves.plotting.classification import Plot_spectra_pcs, Plot_kmeans_clusters, Plot_specs_inside_cluster
from bluemath.binwaves.reconstruction import calculate_mean_clusters

from bluemath.kma import kmeans_clustering_pcs

from bluemath.plotting.classification import Plot_kmeans_hs_stats, Plot_pc_space
Warning: cannot import cf-json, install "metocean" dependencies for full functionality

Classification

Since individually reconstructing the hindcast in every point of the grid following BinWaves would be computationally very expensive, we reduce the dimensionality of the problem by clustering the spectra into N number of clusters, which will be the ones reconstructed at every point of the grid, for later on being able of reconstructing the full hindcast in an eficient way.

Database and site parameters#

# database
p_data = '/media/administrador/HD2/SamoaTonga/data'
site = 'Tongatapu'
p_site = op.join(p_data, site)

# deliverable folder
p_deliv = op.join(p_site, 'd05_swell_inundation_forecast')

p_kma = op.join(p_deliv, 'kma')
p_spec = op.join(p_deliv, 'spec')

# superpoint
deg_sup = 15
p_superpoint = op.join(p_spec, 'super_point_superposition_{0}_.nc'.format(deg_sup))


# output files
num_clusters = 2000
p_spec_kma = op.join(p_kma, 'spec_KMA_Samoa_NS.pkl')
p_spec_pca = op.join(p_kma, 'spec_PCA_Samoa_NS.pkl')

p_superpoint_kma = op.join(p_kma, 'Spec_KMA_{0}_NS.nc'.format(num_clusters)) 

if not op.isdir(p_kma):
    os.makedirs(p_kma)
    

Spectral data

# Load superpoint
sp = xr.open_dataset(p_superpoint)
print(sp)
<xarray.Dataset>
Dimensions:  (dir: 24, freq: 29, time: 366477)
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 1979-01-01T01:00:00 ... 2020-10-01
  * dir      (dir) float32 262.5 247.5 232.5 217.5 ... 322.5 307.5 292.5 277.5
  * freq     (freq) float32 0.035 0.0385 0.04235 ... 0.4171 0.4589 0.5047
Data variables:
    efth     (time, freq, dir) float64 ...
    Wspeed   (time) float32 ...
    Wdir     (time) float32 ...
    Depth    (time) float32 ...

Principal component analysis#

PCA

Principal component analysis (PCA), is employed to reduce the high dimensionality of the original data space and simplify the classification process, transforming the information of the full directional spectra into its spatial and temporal modes.
PCA projects the original data on a new space searching for the maximum variance of the sample data.

  • The PCs represent the temporal variability.

  • The Empirical Orthogonal Functions, (EOFs) of the data covariance matrix define the vectors of the new space.

  • The EOFs represent the spatial variability.

In order to classify the spectra following KMA, we keep the n PCs that represent 95% of the explained variance (EV).


# Convert spec data into a matrix (time x (freq * dir))
m = np.reshape(np.sqrt(sp.efth.values), (len(sp.time), len(sp.freq) * len(sp.dir)))
p = 95 # explained variance

ipca = PCA(n_components = np.shape(m)[1])
pk.dump(ipca.fit(m), open(p_spec_pca, 'wb'))

# add PCA data to superpoint
sp['PCs'] =  (('time','pc'), ipca.fit_transform(m))
sp['EOFs'] = (('pc','components'), ipca.components_)
sp['pc'] =   range(np.shape(m)[1])
sp['EV'] =   (('pc'), ipca.explained_variance_)

APEV = np.cumsum(sp.EV.values) / np.sum(sp.EV.values) * 100.0
n_pcs = np.where(APEV > p)[0][0]
sp['n_pcs'] = n_pcs

print('Number of PCs explaining {0}% of the EV is: {1}'.format(p, n_pcs))
Number of PCs explaining 95% of the EV is: 60
# Plot PCs spectra
Plot_spectra_pcs(sp, n_pcs, figsize = [14, 12]);
_images/03_Spectral_Classification_12_0.png

K-Means Clustering#

K-Means

Spectral clusters are obtained using the K-means clustering algorithm. We divide the data space into 2000 clusters, a number that must be a compromise between the available data to populate the different clusters, and a number large enough for capturing the full variability of the spectra.

Each cluster is defined by a prototype and formed by the data for which the prototype is the nearest. The centroid, which is the mean of all the data that falls within the same cluster, is then represented into a lattice following a geometric criteria, so that similar clusters are placed next to each other for a more intuitive visualization


# minimun data for each cluster
min_data_frac = 100
min_data = np.int(len(sp.time) / num_clusters / min_data_frac)

# KMeans clustering for spectral data
kma, kma_order, bmus, sorted_bmus, centers = kmeans_clustering_pcs(
    sp.PCs.values[:], n_pcs, num_clusters, min_data = min_data,
    kma_tol = 0.001, kma_n_init = 5,
)

# store kma output
pk.dump([kma, kma_order, bmus, sorted_bmus, centers], open(p_spec_kma, 'wb'))


# load kma output
#kma, kma_order, bmus, sorted_bmus, centers = pk.load(open(p_spec_kma, 'rb'))
Iteration: 0
Number of sims: 115
Minimum number of data: 3
# add KMA data to superpoint
sp['kma_order'] =  kma_order
sp['n_pcs'] =  n_pcs
sp['bmus'] = (('time'), sorted_bmus)
sp['centroids'] = (('clusters', 'n_pcs_'), centers)

The resulting classification can be seen in the first PCs space, through a scatter plot of PCs and the representative KMA centroids. The obtained centroids (red dots), span the wide variability of the data.

Scatter plot of PCs and the representative KMA centroids:

Plot_pc_space(sp);
_images/03_Spectral_Classification_20_0.png

Mean#

The mean spectra associated to each of the different clusters is shown below.

Plot_kmeans_clusters(sp,  vmax=0.6, num_clusters=225, ylim=0.3, figsize=[22, 14]);
_images/03_Spectral_Classification_23_0.png

Anomaly#

This is the mean from which annomalies are calculated

z = np.sqrt(np.mean(sp.efth.values, axis=0))
fig = plt.figure(figsize = [8, 8])
ax = fig.add_subplot(1,1,1, projection = 'polar')

y = sp.freq.values
x = np.deg2rad(sp.dir.values)

axplot_spectrum(
    ax,
    x, y, z,
    vmin = 0, vmax = 0.2,
    ylim = np.nanmax(sp.freq),
    remove_axis = 1,
    cmap = 'magma_r',
);
_images/03_Spectral_Classification_27_0.png
Plot_kmeans_clusters(sp, annomaly=True, vmax=0.4,  ylim=0.3, num_clusters=225, figsize=[22, 14]);
_images/03_Spectral_Classification_28_0.png
# calculate mean from KMeans clusters
efth_clust = calculate_mean_clusters(sp, annomaly=False)

# add mean to superpoint
sp['efth_cluster'] = (('freq', 'dir', 'clusters'), efth_clust)
# Plot counts of data inside clusters
values, counts = np.unique(sp.bmus.values, return_counts=True)

plt.figure(figsize = [25, 6])
plt.bar(values, counts, color='darkmagenta')
plt.grid('both', color='plum')
_images/03_Spectral_Classification_30_0.png

Cluster homogeneity

In the following figures, the spectra that fall in the same cluster have been plotted, to validate the homogeneity within each cluster.
As it can be observed, the classification correctly groups the different spectra and the differences within groups are almost negligible

cluster = 1
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
_images/03_Spectral_Classification_32_0.png
cluster = 16
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
_images/03_Spectral_Classification_33_0.png
cluster = 40
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
_images/03_Spectral_Classification_34_0.png
cluster = 600
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
_images/03_Spectral_Classification_35_0.png
cluster = 1800
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
_images/03_Spectral_Classification_36_0.png

Statistics of Hs within each group#

A quantitative analysis of those differences has been made, analyzing the standard deviation of the Hs representative of each cluster against the mean of the cluster. The results indicate a very low relationship, which means that the clusters are homogeneus.

import wavespectra

sp['Hs'] = sp.efth.spec.hs()
sp
<xarray.Dataset>
Dimensions:       (clusters: 2000, components: 696, dir: 24, freq: 29, kma_order: 2000, n_pcs_: 60, pc: 696, time: 366477)
Coordinates:
  * time          (time) datetime64[ns] 1979-01-01 ... 2020-10-01
  * dir           (dir) float32 262.5 247.5 232.5 217.5 ... 307.5 292.5 277.5
  * freq          (freq) float32 0.035 0.0385 0.04235 ... 0.4171 0.4589 0.5047
  * pc            (pc) int64 0 1 2 3 4 5 6 7 ... 688 689 690 691 692 693 694 695
  * kma_order     (kma_order) int64 318 187 1289 761 1648 ... 222 565 1690 841
Dimensions without coordinates: clusters, components, n_pcs_
Data variables:
    efth          (time, freq, dir) float64 0.0 0.0 0.0 ... 1.464e-05 1.636e-05
    Wspeed        (time) float32 ...
    Wdir          (time) float32 ...
    Depth         (time) float32 ...
    PCs           (time, pc) float64 -0.1756 -0.2968 ... 1.249e-06 2.14e-07
    EOFs          (pc, components) float64 -6.575e-05 -0.0001366 ... -7.586e-05
    EV            (pc) float64 0.1626 0.1375 0.09958 ... 1.052e-10 1.589e-11
    n_pcs         int64 60
    bmus          (time) float64 1.68e+03 1.68e+03 ... 1.803e+03 1.803e+03
    centroids     (clusters, n_pcs_) float64 0.006685 -0.0671 ... 0.03855
    efth_cluster  (freq, dir, clusters) float64 0.0 9.745e-06 ... 1.55e-05
    Hs            (time) float64 0.0 0.1002 0.17 0.2146 ... 2.524 2.499 2.474
Plot_kmeans_hs_stats(sp, figsize=[24, 10]);
_images/03_Spectral_Classification_40_0.png
# store modified superpoint with KMA data
sp.to_netcdf(p_superpoint_kma)