Spectral Classification
Contents
Spectral Classification#
# common
import warnings
warnings.filterwarnings('ignore')
import os
import os.path as op
import sys
import pickle as pk
# pip
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# DEV: bluemath
sys.path.insert(0, op.join(op.abspath(''), '..', '..', '..', '..'))
# bluemath modules
from bluemath.binwaves.plotting.classification import axplot_spectrum
from bluemath.binwaves.plotting.classification import Plot_spectra_pcs, Plot_kmeans_clusters, Plot_specs_inside_cluster
from bluemath.binwaves.reconstruction import calculate_mean_clusters
from bluemath.kma import kmeans_clustering_pcs
from bluemath.plotting.classification import Plot_kmeans_hs_stats, Plot_pc_space
Warning: cannot import cf-json, install "metocean" dependencies for full functionality
Classification
Since individually reconstructing the hindcast in every point of the grid following BinWaves would be computationally very expensive, we reduce the dimensionality of the problem by clustering the spectra into N number of clusters, which will be the ones reconstructed at every point of the grid, for later on being able of reconstructing the full hindcast in an eficient way.
Database and site parameters#
# database
p_data = '/media/administrador/HD2/SamoaTonga/data'
site = 'Samoa'
p_site = op.join(p_data, site)
# deliverable folder
p_deliv = op.join(p_site, 'd05_swell_inundation_forecast')
p_kma = op.join(p_deliv, 'kma')
p_spec = op.join(p_deliv, 'spec')
# superpoint
deg_sup = 15
p_superpoint = op.join(p_spec, 'super_point_superposition_{0}_.nc'.format(deg_sup))
# output files
num_clusters = 2000
p_spec_kma = op.join(p_kma, 'spec_KMA_Samoa_NS.pkl')
p_spec_pca = op.join(p_kma, 'spec_PCA_Samoa_NS.pkl')
p_superpoint_kma = op.join(p_kma, 'Spec_KMA_{0}_NS.nc'.format(num_clusters))
if not op.isdir(p_kma):
os.makedirs(p_kma)
Spectral data
# Load superpoint
sp = xr.open_dataset(p_superpoint)
print(sp)
<xarray.Dataset>
Dimensions: (dir: 24, freq: 29, time: 366477)
Coordinates:
* time (time) datetime64[ns] 1979-01-01 1979-01-01T01:00:00 ... 2020-10-01
* dir (dir) float32 262.5 247.5 232.5 217.5 ... 322.5 307.5 292.5 277.5
* freq (freq) float32 0.035 0.0385 0.04235 ... 0.4171 0.4589 0.5047
Data variables:
efth (time, freq, dir) float64 ...
Wspeed (time) float32 ...
Wdir (time) float32 ...
Depth (time) float32 ...
Principal component analysis#
PCA
Principal component analysis (PCA), is employed to reduce the high dimensionality of the original data space and simplify the classification process, transforming the information of the full directional spectra into its spatial and temporal modes.
PCA projects the original data on a new space searching for the maximum variance of the sample data.
The PCs represent the temporal variability.
The Empirical Orthogonal Functions, (EOFs) of the data covariance matrix define the vectors of the new space.
The EOFs represent the spatial variability.
In order to classify the spectra following KMA, we keep the n PCs that represent 95% of the explained variance (EV).
# Convert spec data into a matrix (time x (freq * dir))
m = np.reshape(np.sqrt(sp.efth.values), (len(sp.time), len(sp.freq) * len(sp.dir)))
p = 95 # explained variance
ipca = PCA(n_components = np.shape(m)[1])
pk.dump(ipca.fit(m), open(p_spec_pca, 'wb'))
# add PCA data to superpoint
sp['PCs'] = (('time','pc'), ipca.fit_transform(m))
sp['EOFs'] = (('pc','components'), ipca.components_)
sp['pc'] = range(np.shape(m)[1])
sp['EV'] = (('pc'), ipca.explained_variance_)
APEV = np.cumsum(sp.EV.values) / np.sum(sp.EV.values) * 100.0
n_pcs = np.where(APEV > p)[0][0]
sp['n_pcs'] = n_pcs
print('Number of PCs explaining {0}% of the EV is: {1}'.format(p, n_pcs))
Number of PCs explaining 95% of the EV is: 68
# Plot PCs spectra
Plot_spectra_pcs(sp, n_pcs, figsize = [14, 12]);
K-Means Clustering#
K-Means
Spectral clusters are obtained using the K-means clustering algorithm. We divide the data space into 2000 clusters, a number that must be a compromise between the available data to populate the different clusters, and a number large enough for capturing the full variability of the spectra.
Each cluster is defined by a prototype and formed by the data for which the prototype is the nearest. The centroid, which is the mean of all the data that falls within the same cluster, is then represented into a lattice following a geometric criteria, so that similar clusters are placed next to each other for a more intuitive visualization
# minimun data for each cluster
min_data_frac = 100
min_data = np.int(len(sp.time) / num_clusters / min_data_frac)
# KMeans clustering for spectral data
kma, kma_order, bmus, sorted_bmus, centers = kmeans_clustering_pcs(
sp.PCs.values[:], n_pcs, num_clusters, min_data = min_data,
kma_tol = 0.001, kma_n_init = 5,
)
# store kma output
pk.dump([kma, kma_order, bmus, sorted_bmus, centers], open(p_spec_kma, 'wb'))
# load kma output
#kma, kma_order, bmus, sorted_bmus, centers = pk.load(open(p_spec_kma, 'rb'))
Iteration: 0
Number of sims: 186
Minimum number of data: 7
# add KMA data to superpoint
sp['kma_order'] = kma_order
sp['n_pcs'] = n_pcs
sp['bmus'] = (('time'), sorted_bmus)
sp['centroids'] = (('clusters', 'n_pcs_'), centers)
The resulting classification can be seen in the first PCs space, through a scatter plot of PCs and the representative KMA centroids. The obtained centroids (red dots), span the wide variability of the data.
Scatter plot of PCs and the representative KMA centroids:
Plot_pc_space(sp);
Mean#
The mean spectra associated to each of the different clusters is shown below.
Plot_kmeans_clusters(sp, vmax=0.6, num_clusters=225, ylim=0.3, figsize=[22, 14]);
Anomaly#
This is the mean from which annomalies are calculated
z = np.sqrt(np.mean(sp.efth.values, axis=0))
fig = plt.figure(figsize = [8, 8])
ax = fig.add_subplot(1,1,1, projection = 'polar')
y = sp.freq.values
x = np.deg2rad(sp.dir.values)
axplot_spectrum(
ax,
x, y, z,
vmin = 0, vmax = 0.2,
ylim = np.nanmax(sp.freq),
remove_axis = 1,
cmap = 'magma_r',
);
Plot_kmeans_clusters(sp, annomaly=True, vmax=0.4, ylim=0.3, num_clusters=225, figsize=[22, 14]);
# calculate mean from KMeans clusters
efth_clust = calculate_mean_clusters(sp, annomaly=False)
# add mean to superpoint
sp['efth_cluster'] = (('freq', 'dir', 'clusters'), efth_clust)
# Plot counts of data inside clusters
values, counts = np.unique(sp.bmus.values, return_counts=True)
plt.figure(figsize = [25, 6])
plt.bar(values, counts, color='darkmagenta')
plt.grid('both', color='plum')
Cluster homogeneity
In the following figures, the spectra that fall in the same cluster have been plotted, to validate the homogeneity within each cluster.
As it can be observed, the classification correctly groups the different spectra and the differences within groups are almost negligible
cluster = 1
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
cluster = 16
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
cluster = 40
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
cluster = 600
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
cluster = 1800
Plot_specs_inside_cluster(sp, cluster, gr=5, vmax=0.7);
Statistics of Hs within each group#
A quantitative analysis of those differences has been made, analyzing the standard deviation of the Hs representative of each cluster against the mean of the cluster. The results indicate a very low relationship, which means that the clusters are homogeneus.
import wavespectra
sp['Hs'] = sp.efth.spec.hs()
sp
<xarray.Dataset> Dimensions: (clusters: 2000, components: 696, dir: 24, freq: 29, kma_order: 2000, n_pcs_: 68, pc: 696, time: 366477) Coordinates: * time (time) datetime64[ns] 1979-01-01 ... 2020-10-01 * dir (dir) float32 262.5 247.5 232.5 217.5 ... 307.5 292.5 277.5 * freq (freq) float32 0.035 0.0385 0.04235 ... 0.4171 0.4589 0.5047 * pc (pc) int64 0 1 2 3 4 5 6 7 ... 688 689 690 691 692 693 694 695 * kma_order (kma_order) int64 100 448 916 719 1043 ... 1496 1374 1431 1631 Dimensions without coordinates: clusters, components, n_pcs_ Data variables: efth (time, freq, dir) float64 0.0 0.0 0.0 ... 2.09e-08 1.097e-09 Wspeed (time) float32 ... Wdir (time) float32 ... Depth (time) float32 ... PCs (time, pc) float64 0.3274 -0.1723 ... -5.763e-08 -2.713e-07 EOFs (pc, components) float64 -2.045e-05 -0.0001816 ... -1.263e-06 EV (pc) float64 0.1249 0.08618 0.07879 ... 2.135e-11 3.458e-12 n_pcs int64 68 bmus (time) float64 1.999e+03 1.999e+03 ... 1.872e+03 1.872e+03 centroids (clusters, n_pcs_) float64 0.3912 -0.05225 ... -0.00468 efth_cluster (freq, dir, clusters) float64 3.948e-11 1.37e-10 ... 5.094e-06 Hs (time) float64 0.0 0.1399 0.2343 0.2396 ... 1.789 1.775 1.763
- clusters: 2000
- components: 696
- dir: 24
- freq: 29
- kma_order: 2000
- n_pcs_: 68
- pc: 696
- time: 366477
- time(time)datetime64[ns]1979-01-01 ... 2020-10-01
array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000', '1979-01-01T02:00:00.000000000', ..., '2020-09-30T22:00:00.000000000', '2020-09-30T23:00:00.000000000', '2020-10-01T00:00:00.000000000'], dtype='datetime64[ns]')
- dir(dir)float32262.5 247.5 232.5 ... 292.5 277.5
array([262.5, 247.5, 232.5, 217.5, 202.5, 187.5, 172.5, 157.5, 142.5, 127.5, 112.5, 97.5, 82.5, 67.5, 52.5, 37.5, 22.5, 7.5, 352.5, 337.5, 322.5, 307.5, 292.5, 277.5], dtype=float32)
- freq(freq)float320.035 0.0385 ... 0.4589 0.5047
array([0.035 , 0.0385 , 0.04235 , 0.046585, 0.051244, 0.056368, 0.062005, 0.068205, 0.075026, 0.082528, 0.090781, 0.099859, 0.109845, 0.12083 , 0.132912, 0.146204, 0.160824, 0.176907, 0.194597, 0.214057, 0.235463, 0.259009, 0.28491 , 0.313401, 0.344741, 0.379215, 0.417136, 0.45885 , 0.504735], dtype=float32)
- pc(pc)int640 1 2 3 4 5 ... 691 692 693 694 695
array([ 0, 1, 2, ..., 693, 694, 695])
- kma_order(kma_order)int64100 448 916 719 ... 1374 1431 1631
array([ 100, 448, 916, ..., 1374, 1431, 1631])
- efth(time, freq, dir)float640.0 0.0 0.0 ... 2.09e-08 1.097e-09
array([[[0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], [0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], ..., [0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], [0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00]], [[0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], [0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], ..., [9.232702e-09, 0.000000e+00, ..., 1.081292e-07, 1.041479e-08], [5.232688e-06, 0.000000e+00, ..., 6.966572e-05, 1.061838e-05]], ..., [[0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], [3.142042e-13, 4.318506e-08, ..., 6.079282e-15, 4.215339e-15], ..., [4.090678e-11, 3.683937e-12, ..., 4.999912e-08, 3.891374e-09], [1.562913e-11, 1.990030e-12, ..., 2.910472e-08, 1.551509e-09]], [[0.000000e+00, 0.000000e+00, ..., 0.000000e+00, 0.000000e+00], [4.786974e-13, 4.474837e-08, ..., 4.855222e-15, 5.386854e-15], ..., [2.826256e-11, 4.906279e-12, ..., 3.826593e-08, 2.913403e-09], [9.936315e-12, 2.689866e-12, ..., 2.090017e-08, 1.097412e-09]]])
- Wspeed(time)float32...
[366477 values with dtype=float32]
- Wdir(time)float32...
[366477 values with dtype=float32]
- Depth(time)float32...
[366477 values with dtype=float32]
- PCs(time, pc)float640.3274 -0.1723 ... -2.713e-07
array([[ 3.27448177e-01, -1.72276065e-01, -8.28960831e-02, ..., 2.82528804e-07, 2.03106170e-07, -4.57840552e-07], [ 3.27695468e-01, -1.72209913e-01, -8.29016079e-02, ..., -8.94648555e-07, 2.61810487e-07, -9.21739393e-07], [ 3.28423909e-01, -1.72054071e-01, -8.29255632e-02, ..., 3.68597396e-07, 6.42990951e-07, -8.24674218e-07], ..., [-3.58162384e-02, -4.10074697e-01, 2.69400227e-02, ..., -1.80273613e-07, 6.34945624e-08, -2.56879341e-07], [-3.53542994e-02, -4.11126737e-01, 1.88344958e-02, ..., -1.83965624e-07, 1.55952117e-08, -2.66102486e-07], [-3.60740405e-02, -4.14010379e-01, 1.00173801e-02, ..., -1.68777389e-07, -5.76312058e-08, -2.71267122e-07]])
- EOFs(pc, components)float64-2.045e-05 ... -1.263e-06
array([[-2.04485264e-05, -1.81604290e-04, -2.58336027e-04, ..., 4.02094123e-03, 3.69450037e-03, 3.17081664e-03], [-3.17177370e-05, -1.16084513e-04, -1.67386716e-04, ..., 1.51940364e-03, 1.56269235e-03, 1.39033803e-03], [-1.67195564e-05, -6.87637811e-05, -1.07792129e-04, ..., 3.80776034e-05, 1.84159301e-04, 2.80521457e-05], ..., [-2.49852639e-04, 1.42706916e-04, -4.22297207e-05, ..., -1.05831570e-04, 1.68106922e-04, 1.96373768e-04], [-8.97180695e-04, 3.29137481e-04, 1.88734198e-04, ..., -1.65342263e-05, 8.04399586e-05, -7.48425889e-05], [-4.44482605e-04, -7.20495043e-05, 3.87336319e-05, ..., 3.80322736e-06, -9.85434207e-06, -1.26308008e-06]])
- EV(pc)float640.1249 0.08618 ... 3.458e-12
array([1.24916525e-01, 8.61752042e-02, 7.87903247e-02, 5.32653141e-02, 4.35218977e-02, 3.47865075e-02, 3.26410756e-02, 2.67025158e-02, 2.43823413e-02, 2.10839109e-02, 1.94264615e-02, 1.88564954e-02, 1.61055599e-02, 1.41692458e-02, 1.26856589e-02, 1.18782212e-02, 1.10390958e-02, 1.02529363e-02, 9.48877671e-03, 8.96130861e-03, 8.09406769e-03, 7.87476476e-03, 7.51331302e-03, 7.29261514e-03, 6.40022401e-03, 6.12247905e-03, 5.30104467e-03, 5.10138017e-03, 4.91186922e-03, 4.66896945e-03, 4.34139674e-03, 4.08520132e-03, 3.88851408e-03, 3.80354977e-03, 3.48568589e-03, 3.42183950e-03, 3.28543618e-03, 3.19048576e-03, 2.94615789e-03, 2.81276608e-03, 2.74572633e-03, 2.69399570e-03, 2.62400899e-03, 2.34744960e-03, 2.28109232e-03, 2.15971699e-03, 2.14072459e-03, 2.13803488e-03, 2.00381565e-03, 1.88485555e-03, 1.83926092e-03, 1.80430714e-03, 1.70966443e-03, 1.63062235e-03, 1.56548335e-03, 1.48051140e-03, 1.47652777e-03, 1.36351657e-03, 1.35052662e-03, 1.31291308e-03, 1.23586168e-03, 1.21413680e-03, 1.18773037e-03, 1.18080078e-03, 1.14549815e-03, 1.10704391e-03, 1.06858211e-03, 1.05955483e-03, 1.02092694e-03, 9.64327844e-04, 9.37558960e-04, 8.94457450e-04, 8.71437370e-04, 8.69284913e-04, 8.39011206e-04, 8.04801652e-04, 7.76971696e-04, 7.59115208e-04, 7.31877491e-04, 7.20088786e-04, ... 1.32324681e-08, 1.25694623e-08, 1.25032877e-08, 1.24170735e-08, 1.22269222e-08, 1.21186672e-08, 1.19405211e-08, 1.14553188e-08, 1.11090355e-08, 1.09436730e-08, 1.03479788e-08, 1.02200887e-08, 1.01028106e-08, 9.93759398e-09, 9.88180043e-09, 9.47512064e-09, 9.18511540e-09, 9.00985089e-09, 8.64485423e-09, 8.51863941e-09, 7.95758474e-09, 7.89543335e-09, 7.33522876e-09, 7.21278955e-09, 7.09133270e-09, 6.52490341e-09, 6.50617311e-09, 6.26505348e-09, 6.08434429e-09, 5.90581343e-09, 5.74757509e-09, 5.50445007e-09, 5.23620764e-09, 5.03082588e-09, 4.91049204e-09, 4.64960029e-09, 4.54923146e-09, 4.47517726e-09, 4.43539619e-09, 4.30288350e-09, 3.94356981e-09, 3.84809815e-09, 3.67462417e-09, 3.63557813e-09, 3.38863627e-09, 3.33478052e-09, 3.18810954e-09, 3.14483379e-09, 3.03450787e-09, 3.00264922e-09, 2.86833804e-09, 2.62724569e-09, 2.57690477e-09, 2.43779312e-09, 2.40710353e-09, 2.20027981e-09, 2.16804924e-09, 1.89257129e-09, 1.80226568e-09, 1.75256207e-09, 1.50003401e-09, 1.45541436e-09, 1.35724168e-09, 1.29458134e-09, 1.27902522e-09, 1.20210752e-09, 1.16011430e-09, 1.14042671e-09, 9.96420390e-10, 7.01898610e-10, 6.16833224e-10, 5.62916430e-10, 5.13426345e-10, 3.14820409e-10, 2.10625091e-10, 1.52531837e-10, 7.70162368e-11, 2.93887845e-11, 2.13503378e-11, 3.45831106e-12])
- n_pcs()int6468
array(68)
- bmus(time)float641.999e+03 1.999e+03 ... 1.872e+03
array([1999., 1999., 1999., ..., 1872., 1872., 1872.])
- centroids(clusters, n_pcs_)float640.3912 -0.05225 ... -0.00468
array([[ 0.39116207, -0.05225096, 0.20392263, ..., -0.02325803, -0.03261734, 0.02304597], [-0.14339167, -0.00637115, -0.22882482, ..., -0.01019995, 0.00913081, -0.00286359], [ 0.29918755, -0.04716066, 0.03720763, ..., -0.00554972, -0.01771303, -0.00741939], ..., [ 0.04998288, 0.25896646, 0.07031031, ..., -0.01942799, -0.00196507, -0.00432398], [-0.13266474, -0.19656658, 0.1925747 , ..., -0.00777675, -0.00716334, 0.01248819], [-0.12860088, 0.40282264, 0.04884855, ..., -0.010403 , 0.01272119, -0.00467977]])
- efth_cluster(freq, dir, clusters)float643.948e-11 1.37e-10 ... 5.094e-06
array([[[3.94830090e-11, 1.36953640e-10, 5.76533650e-09, ..., 7.09611565e-09, 1.26149532e-09, 1.23860490e-16], [9.43017622e-11, 1.58768791e-09, 1.73395103e-10, ..., 2.17439369e-07, 1.45152021e-07, 1.98967499e-15], [1.85241226e-10, 2.90258264e-09, 1.98959373e-11, ..., 5.85739104e-07, 5.82321409e-07, 4.83521459e-15], ..., [5.50534024e-10, 8.34145909e-11, 8.41187894e-08, ..., 5.30725231e-09, 1.46331941e-12, 1.38668340e-15], [1.53615529e-09, 2.43887751e-10, 3.75255436e-07, ..., 1.85973578e-10, 1.24981019e-12, 2.09119865e-16], [7.65975740e-10, 1.39561960e-09, 6.17123999e-08, ..., 2.56356919e-11, 1.01481923e-10, 1.06869655e-16]], [[2.97286578e-09, 1.82379192e-08, 7.38181465e-07, ..., 4.67739093e-08, 7.59805017e-08, 9.80924323e-16], [2.89448315e-09, 4.54416774e-09, 1.10143833e-08, ..., 2.39436747e-06, 6.40555291e-06, 3.01808240e-13], [6.36934900e-09, 1.03557316e-08, 1.13392053e-08, ..., 6.84200764e-06, 2.27902474e-05, 7.40592839e-13], ... [2.34988569e-04, 6.16282752e-04, 3.23272932e-06, ..., 3.83457074e-06, 5.22872659e-06, 1.71719776e-05], [1.11552605e-04, 5.19462331e-04, 8.31262709e-06, ..., 2.68787813e-06, 2.86992054e-06, 1.20102199e-05], [4.13116234e-06, 2.46885578e-04, 2.22164782e-06, ..., 1.65848429e-06, 5.02096919e-06, 4.44775231e-06]], [[9.77302374e-08, 9.32897987e-05, 1.21220692e-05, ..., 9.35185038e-07, 2.60400184e-06, 4.25459152e-06], [2.74627713e-14, 1.26854250e-05, 6.26808865e-06, ..., 1.79792200e-06, 3.93399077e-06, 1.04025097e-05], [5.49924684e-11, 4.55199044e-06, 2.31911547e-05, ..., 4.76406526e-06, 5.93532464e-06, 1.37050001e-05], ..., [1.52647337e-04, 3.98368888e-04, 1.54658211e-06, ..., 2.56501081e-06, 5.17313888e-06, 1.11623020e-05], [7.29097960e-05, 3.36352547e-04, 6.53760436e-06, ..., 1.62683170e-06, 3.12612689e-06, 8.79066118e-06], [2.66267917e-06, 1.58943585e-04, 2.01114183e-06, ..., 1.04009022e-06, 3.82340172e-06, 5.09368488e-06]]])
- Hs(time)float640.0 0.1399 0.2343 ... 1.775 1.763
- standard_name :
- sea_surface_wave_significant_height
- units :
- m
array([0. , 0.13988716, 0.23426723, ..., 1.78942364, 1.77536693, 1.7631081 ])
Plot_kmeans_hs_stats(sp, figsize=[24, 10]);
# store modified superpoint with KMA data
sp.to_netcdf(p_superpoint_kma)