Model Validation#

# common
import os
import os.path as op
import sys
import pickle as pk
#import pickle5 as pk5

# pip
import numpy as np
import pandas as pd
import xarray as xr

import warnings
warnings.filterwarnings("ignore")

# DEV: bluemath
sys.path.insert(0, op.join(op.abspath(''), '..', '..', '..', '..'))

# bluemath modules 
from bluemath.seasonal_forecast_tcs.validation import change_sst_resolution, \
ds_index_over_time, ds_monthly_probabilities, PCA_k_means_val

from bluemath.seasonal_forecast_tcs.pca import PCA_k_means

from bluemath.seasonal_forecast_tcs.plotting.validation import Plot_bmus_chronology, Plot_scatter_kmeans, \
Plot_bmus_comparison_validation_calibration, Plot_validation_year, Plot_validation_full_season

Database and site parameters#

# mld and sst
p_mld = '/media/administrador/HD1/DATABASES/SeasonalForecast_TCs/CFS/ocnmld'
p_sst = '/media/administrador/HD1/DATABASES/SeasonalForecast_TCs/SST'
p_mld_1982 = op.join(p_mld+'/ocnmld.day.mean.1982.nc')

# database

p_data = r'/media/administrador/HD2/SamoaTonga/data'
site = 'SamoaTonga'
p_site = op.join(p_data, site)

# deliverable
p_deliv = op.join(p_site, 'seasonal_forecast_tcs')

# IBTrACS database
p_ibtracs = op.join(p_data, 'IBTrACS.ALL.v04r00.nc')

#KMA model
p_kma = op.join(p_deliv, 'kma')
p_kma_model = op.join(p_deliv, 'kma', 'kma_index_okb.pkl')

# index predictor values
p_sst_mld_slp_calibration = op.join(p_deliv, 'sst_mld_slp_calibration.nc')

# precipitation in target area
p_xs_trmm = op.join(p_deliv, 'xs_trmm.nc')
xs = xr.open_dataset(p_sst_mld_slp_calibration)

xds_kma =  xr.open_dataset(p_kma+'/xds_kma_index_vars.nc')
xds_kma_sel =  xr.open_dataset(p_kma+'/xds_kma_sel.nc')
xds_kma_ord = xr.open_dataset(p_kma+'/xds_kma_ord.nc')
    
xds_count_tcs_8 = xr.open_dataset(p_deliv+'/xds_count_tcs_8.nc')
xds_count_tcs_8_964 = xr.open_dataset(p_deliv+'/xds_count_tcs_8_964.nc')
xds_count_tcs_8_979 = xr.open_dataset(p_deliv+'/xds_count_tcs_8_979.nc')

xds_PCA = xr.open_dataset(p_deliv+'/xds_pca.nc')
xds_timeM = xr.open_dataset(p_deliv+'/xds_timeM.nc')

df_2021 = pd.read_pickle(p_deliv+'/df_coordinates_pmin_sst_mld_2021.pkl')
df_2019 = pd.read_pickle(p_deliv+'/df_coordinates_pmin_sst_mld.pkl')
# predictor parameters
lo_area = [160, 210]
la_area = [-30, 0]

#lo_SP, la_SP = [130, 250], [-60, 0]
lop1 = 160.25
lop2 = 211
lap1 = -0.25
lap2 = -32
delta = 2

After analizing the tailor-made predictor along the hindcast data for the calibration period (1982-2019), the performace of the model will be validated for year 2020, which has not been included in the predictor calibration process.

Steps:

  • 1. Download and preprocess (file conversion and resolution interpolation) SST and MLD data for the validation time period.

  • 2. Generation of the index predictor based on the index function obtained at the calibration period.

  • 3. The fitted Principal Component Analysis for the calibration is used to predict the index principal components in that same temporal-spatial space.

  • 4. The predicted PCs are assigned to the best match unit group from the fitted K-means clustering -> based on the index predictor a DWT is assigned to each day.

  • 5. From the DWT the expected daily mean number of TCs in 8x8º cells map in the target area is known.

Index predictor and DWTs#

Download and preprocess (file conversion and resolution interpolation) SST and MLD data for the validation time period.

# load mld and sst data for preprocessing
mld_ref = xr.open_dataset(p_mld_1982)

year_val = 2020
p_mld_val  = op.join(p_mld+'/ocnmld.day.mean.{0}.nc'.format(year_val))
p_sst_val  = op.join(p_sst+ '/sst.day.mean.{0}.nc'.format(year_val))

sst_val = xr.open_dataset(p_sst_val)
mld_val = xr.open_dataset(p_mld_val)
sst_val_0505 = change_sst_resolution(mld_ref, sst_val)
sst_val_0505.to_netcdf(p_deliv+'/sst.day.mean.{0}.0505.nc'.format(year_val))

Generation of the index predictor based on the index function obtained at the calibration period.

# generate index for validation period
xs_val = ds_index_over_time(df_2019, sst_val_0505, mld_val, lop1, lop2, lap1, lap2, delta)
xs_val
<xarray.Dataset>
Dimensions:  (lat: 16, lon: 26, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-12-31
  * lat      (lat) float64 -0.25 -2.25 -4.25 -6.25 ... -26.25 -28.25 -30.25
  * lon      (lon) float64 160.2 162.2 164.2 166.2 ... 204.2 206.2 208.2 210.2
Data variables:
    index    (time, lat, lon) float64 0.0 0.05833 0.0 0.0 ... 0.0 0.0 0.0 0.0
    sst      (time, lat, lon) float32 30.49 29.95 29.35 ... 23.67 23.79 24.02
    dbss     (time, lat, lon) float32 98.42 106.0 109.8 ... 13.58 13.92 12.21
    mask     (lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0

The fitted Principal Component Analysis for the calibration is used to predict the index principal components in that same temporal-spatial space and the predicted PCs are assigned to the best match unit group from the fitted K-means clustering -> based on the index predictor a DWT is assigned to each day.

val_bmus = PCA_k_means_val(p_kma_model,xs_val,df_2019,xs,lop1,lop2,lap1,lap2,delta)
K-means order previously obtained in the calibration period: 

[25 11 39 17  7  4 36 18 43 14 28 21  0  2 19 34 26 40 46 12 37  6 31 24
 29  3 41 10 23 22 33 47 27 13 30 16  9  1 38 44 15 48  8 35 42 32 45  5
 20]


 DWTs for the validation period and their counting: 

(array([ 6,  8, 10, 13, 15, 16, 19, 24, 25, 26, 27, 31, 32, 34, 38, 39, 41,
       43, 45, 49]), array([24, 36, 31,  1, 19, 25, 26, 10, 12, 16,  4, 15,  1,  3, 28, 15, 25,
        5, 50, 20]))


Chronology of the DWTs:

Plot_bmus_chronology(xs_val, val_bmus, year_val);
_images/05_Model_Validation_19_0.png

The resulting classification can be seen in the PCs space of the predictor index data. The obtained centroids (black dots), span the wide variability of the data.

Plot_scatter_kmeans(xds_kma_ord, val_bmus, xds_kma_ord.cenEOFs.values, figsize=(12,10));
_images/05_Model_Validation_21_0.png

Cluster comparison#

Plot_bmus_comparison_validation_calibration(xs, xds_kma, xs_val, val_bmus, 25, 49);
_images/05_Model_Validation_23_0.png

Predictand computation and plotting#

From the DWT the daily expected mean number of TCs in 8x8º cells in the target area is known for each day and thus maps at different time scales can be computed.

Daily mean expected number of TCs

xds_timeline_val, xs_M_val = ds_monthly_probabilities(
    df_2021, val_bmus, xs_val, 
    xds_count_tcs_8, xds_count_tcs_8_964,
)
print(xds_timeline_val)
<xarray.Dataset>
Dimensions:         (lat: 5, lon: 8, time: 366)
Coordinates:
  * time            (time) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-12-31
  * lat             (lat) int64 -30 -22 -14 -6 2
  * lon             (lon) int64 160 168 176 184 192 200 208 216
Data variables:
    bmus            (time) int64 18 23 23 23 23 23 18 ... 15 23 15 15 15 18 15
    mask_tcs        (time) bool False False False False ... False False False
    id_tcs          (time) object nan nan nan nan nan ... nan nan nan nan nan
    counts_tcs      (time, lat, lon) float64 6.0 3.0 3.0 2.0 ... nan nan nan nan
    counts_tcs_964  (time, lat, lon) float64 1.0 1.0 nan nan ... nan nan nan nan
    probs_tcs       (time, lat, lon) float64 0.01942 0.009709 ... nan nan
    probs_tcs_964   (time, lat, lon) float64 0.003236 0.003236 nan ... nan nan

Monthly aggregated mean expected number of TCs

print(xs_M_val)
<xarray.Dataset>
Dimensions:         (lat: 5, lon: 8, time: 12)
Coordinates:
  * time            (time) datetime64[ns] 2020-01-01 2020-02-01 ... 2020-12-01
  * lat             (lat) int64 -30 -22 -14 -6 2
  * lon             (lon) int64 160 168 176 184 192 200 208 216
Data variables:
    mask_tcs        (time, lat, lon) float64 4.0 4.0 4.0 4.0 ... nan nan nan nan
    counts_tcs      (time, lat, lon) float64 110.0 111.0 82.0 ... nan nan nan
    counts_tcs_964  (time, lat, lon) float64 13.0 45.0 18.0 44.0 ... nan nan nan
    probs_tcs       (time, lat, lon) float64 0.388 0.4207 0.3044 ... nan nan nan
    probs_tcs_964   (time, lat, lon) float64 0.04207 0.1776 0.07361 ... nan nan
Plot_validation_year(df_2021, xs_M_val, xds_timeline_val, 35, lo_area, la_area);
_images/05_Model_Validation_30_0.png

Whole period aggregated mean expected number of TCs

Plot_validation_full_season(df_2021, xs_M_val, xds_timeline_val, 35, lo_area, la_area);
_images/05_Model_Validation_32_0.png
  • The model performs very well when estimating the expected TC activity (number and intensity of TCs), not understimating the threat.

  • In some cells adjacents to the cells including TC tracks it overstimates TC activity.