In my last post I covered how you can fit penalized splines using the glum library in Python. Nominally, glum was built to fit Generalized Linear Models, but it was designed to give the user the option to pass in a custom penalty matrix. We took advantage of this capability to penalize a sequence of basis splines and also to fit cyclic splines, which allow the user to model a symmetric effect. In this post I'd like to cover how we can use this method to fit multiple spline terms. My end goal is to develop a framework that can actually be incorporated into the glum library, but that will come in a later post.
The Data
Since we will be including multiple terms, it is probably helpful to go over the data we are using. I am using a dataset that contains the hourly solar power generation in the state of Texas for 2022. ERCOT puts a ton of data on their website, so if you are ever in need of an open dataset and want to use some renewable energy data, you should definitely explore their data portal.
We will be building a simple model just to show off how we can fit multiple terms. Our model will predict the hourly solar power generation as a function of the hour of the day and the day of the year. I will also include a linear term for the total amount of solar installed that is available, to help the model pick up the increase in installed solar throughout the year.
$power \sim BS(HourOfDay) + BS(DayOfYear) + TotalSolar$
This is probably a really bad model of how solar power actually works :) but my only goal here is to build a framework for fitting multiple spline terms using glum, not solve the world's energy crisis. Our column of interest is ERCOT.PVGR.GEN, which shows the total MW of solar generated in that hour, but I'm going to make a more convenient power_gw field for use in this script.
import numpy as np
import pandas as pd
from plotnine import *
from sklearn.preprocessing import SplineTransformer
from glum import GeneralizedLinearRegressor, GeneralizedLinearRegressorCV
## Source: https://www.ercot.com/mp/data-products/data-product-details?id=PG7-126-M
DATA_FILE = '../../data/ERCOT_2022_Hourly_Solar_Output.csv'
solar_df = pd.read_csv(DATA_FILE)
## Note: the time, hour, day, week, and day_of_week columns shown in the preview
## below are derived from the 'Time (Hour-Ending)' timestamp
solar_df['power_gw'] = solar_df['ERCOT.PVGR.GEN'] / 1000
solar_df.head(3).to_markdown()
|   | Time (Hour-Ending) | Date | ERCOT.LOAD | ERCOT.PVGR.GEN | Total Solar Installed, MW | Solar Output, % of Load | Solar Output, % of Installed | Solar 1-hr MW change | Solar 1-hr % change | Daytime Hour | Ramping Daytime Hour | time | hour | day | week | day_of_week | power_gw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2022 01:00:00 | Jan-01 | 38124 | 0 | 9323 | 0 | 0 | nan | nan | False | False | 2022-01-01 01:00:00 | 1 | 1 | 52 | 5 | 0 |
| 1 | 01/01/2022 02:00:00 | Jan-01 | 37123 | 0 | 9323 | 0 | 0 | 0 | 0 | False | False | 2022-01-01 02:00:00 | 2 | 1 | 52 | 5 | 0 |
| 2 | 01/01/2022 03:00:00 | Jan-01 | 35937 | 0 | 9323 | 0 | 0 | 0 | 0 | False | False | 2022-01-01 03:00:00 | 3 | 1 | 52 | 5 | 0 |
Building the splines
Just like before, we need to build our spline terms for each feature using the SplineTransformer class from scikit-learn's preprocessing module (with the default cubic degree, each transformer produces n_knots + 2 basis columns). Then for each spline we need to build a second-order difference penalty matrix. I'm sure there is a better way to do this, but I'm just going to keep track of everything in a dictionary for each term.
## n_knots = 26 so there is a knot every other week :shrug:
spline_info = dict(daily = dict(), hourly = dict())
spline_info['daily'] = dict(bsplines = SplineTransformer(n_knots = 26).fit_transform(solar_df[['day']]))
spline_info['hourly'] = dict(bsplines = SplineTransformer(n_knots = 12).fit_transform(solar_df[['hour']]))
for k, v in spline_info.items():
    spline_info[k]['num_splines'] = v['bsplines'].shape[1]

for k in spline_info.keys():
    print(f'Number of Basis Splines for {k} feature: {spline_info[k]["num_splines"]}')

for k, v in spline_info.items():
    spline_info[k]['diff_matr'] = np.diff(np.eye(v['num_splines']), n = 2, axis = 0)
Number of Basis Splines for daily feature: 28
Number of Basis Splines for hourly feature: 14
Next is our combined penalty matrix. To recap from my last post, the penalty matrix enforces smoothness on the spline coefficients. This acts as a regularizer so that the model doesn't interpolate too much and end up overfitting to the training data. To calculate the penalty matrix we first build the difference matrix, which tracks the (second-order) differences between successive spline coefficients. The penalty matrix for a single spline term is then just the product of this difference matrix's transpose with itself, optionally multiplied by a penalty value, $\lambda$, to control the level of smoothness:
$\mathbf{P} = \lambda D^T D$
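To make that concrete, here is a quick toy example of building the second-order difference matrix and the resulting penalty matrix for a single term (the five basis functions and the $\lambda$ value here are arbitrary, just for illustration):
## toy example: 5 basis splines, arbitrary penalty value
n_toy = 5
lam = 2.0
D_toy = np.diff(np.eye(n_toy), n = 2, axis = 0)  ## shape (3, 5); rows look like [1, -2, 1, 0, 0]
P_toy = lam * D_toy.T @ D_toy                    ## (5, 5) penalty matrix for this single term
print(P_toy.shape)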
Now we have multiple spline terms and a linear term instead of a single spline term. How can we combine the difference matrices that we have for each term into one penalty matrix? We get lucky here: all we need to do is "stack" the individual penalty matrices along the diagonal, surrounded by zero matrices. This takes advantage of how the penalty matrix gets included in the loss function that the model optimizes ($\beta^T P \beta$, where $\beta$ is the coefficient vector): each penalty only interacts with its own corresponding spline coefficients and no other term's coefficients. If $D_h$ is the difference matrix for the hour-of-the-day coefficients, $D_d$ is the difference matrix for the day-of-the-year coefficients, and $\lambda_{i}$ is the penalty for each term (including the non-spline term), then our combined penalty matrix is just:
\[\begin{bmatrix} \lambda_1 & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \lambda_2 D_{h}^T D_{h} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \lambda_3 D_{d}^T D_{d} \end{bmatrix}\]

This allows us to combine any number of spline terms in one model. More terms
will obviously increase the time it takes to fit each model. I would love to test this further, but my hunch is that it actually won't slow down a model fit too much. The reason is that both the model matrix containing the spline values and the penalty matrix will be "mostly sparse": they aren't strictly diagonal matrices, but most sections of each matrix are only non-zero near the diagonal. The glum library was designed to handle sparse and nearly-sparse matrices more efficiently than other libraries, so I'm hoping those improvements will flow through to fitting GAMs, but we will have to test that at a later date.
In thinking through how to do this in code, I believe the best option is to accept a list of penalty matrices and then iteratively fill in a matrix of zeros that is the full size of the combined penalties. This also allows us to include non-spline terms by including a 2d matrix of shape (1, 1) that penalizes the size of the linear coefficient. In my research I found that there is actually an np.block function, but it would force me to compute the zero matrices in the upper and lower triangles first in order to manually create the block matrix. That seems more complicated than just filling in a square matrix with the penalty matrices.
def build_multiterm_penalty(penalty_matr_list):
    ## Need to use the column shapes because the difference matrix removes rows
    num_features_list = list(map(lambda x: x.shape[1], penalty_matr_list))
    num_features = sum(num_features_list)
    ## Pre-create the matrix for efficient memory allocation
    penalty_matrix = np.zeros(shape = [num_features, num_features])
    current_row = 0
    for m in penalty_matr_list:
        size = m.shape[1]
        end_row = current_row + size
        m_square = np.dot(m.T, m)
        penalty_matrix[current_row:end_row, current_row:end_row] = m_square
        current_row = end_row
    return penalty_matrix
## simple test
build_multiterm_penalty([np.eye(2) * 2, np.eye(1) * 3])
array([[4., 0., 0.],
[0., 4., 0.],
[0., 0., 9.]])
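As an aside, scipy's block_diag helper can do the same diagonal stacking if you square each difference matrix yourself first; a quick sanity check, assuming scipy is available:
from scipy.linalg import block_diag
## same result as build_multiterm_penalty: square each matrix, then stack along the diagonal
alt = block_diag(*[np.dot(m.T, m) for m in [np.eye(2) * 2, np.eye(1) * 3]])
print(np.allclose(alt, build_multiterm_penalty([np.eye(2) * 2, np.eye(1) * 3])))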
So this will give us our combined penalty matrix. Now let's calculate the real one.
full_penalty_list = [np.eye(1),
                     spline_info['hourly']['diff_matr'],
                     spline_info['daily']['diff_matr']]
gam_penalty = build_multiterm_penalty(full_penalty_list)
print(gam_penalty.shape)
(43, 43)
The 43 rows and columns line up with our terms: 1 linear coefficient plus 14 hourly and 28 daily basis splines. Our model matrix is a lot easier to build; we can simply stack the linear feature and the spline values we got from each transformer side by side. Here you can see the first few rows and columns:
## build model matrix
model_matrix = np.hstack([
    solar_df[['Total Solar Installed, MW']],
    spline_info['hourly']['bsplines'],
    spline_info['daily']['bsplines']
])
print(model_matrix.shape)
np.round(model_matrix[:3, :10], 2)
(8760, 43)
array([[9.323e+03, 2.000e-02, 4.900e-01, 4.700e-01, 2.000e-02, 0.000e+00,
0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[9.323e+03, 0.000e+00, 1.900e-01, 6.600e-01, 1.500e-01, 0.000e+00,
0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[9.323e+03, 0.000e+00, 3.000e-02, 5.200e-01, 4.400e-01, 1.000e-02,
0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]])
Fitting the Model
Now that we have our penalty matrix and model matrix, all that's left to do is actually fit the model. We can then visualize the first day's worth of predictions to see how the model does. While this doesn't technically show only the effect of the hourly coefficients, we can basically interpret it that way: both the day-of-the-year spline and the linear solar-capacity term add a fixed amount to every hour within a given day, so any within-day differences are due only to the hourly smoothing spline.
gam_model = GeneralizedLinearRegressor(P2 = gam_penalty, alpha = 1, fit_intercept = False).fit(X = model_matrix, y = solar_df['power_gw'])
solar_df['preds_baseline'] = gam_model.predict(model_matrix)
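A minimal plotnine sketch of that first-day view might look something like this (using the hour, power_gw, and preds_baseline columns built above; the exact plot is up to you):
first_day = solar_df[solar_df['day'] == 1]
(ggplot(first_day, aes(x = 'hour'))
 + geom_point(aes(y = 'power_gw'))                     ## actual generation
 + geom_line(aes(y = 'preds_baseline'), color = 'blue') ## model predictions
 + labs(x = 'Hour of Day', y = 'Solar Generation (GW)'))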
The model certainly picks up on the general trend of solar power generation rising during the day before falling in the evening. There are many reasons why this does a poor job of actually modeling what's going on in the real world. One example is that the hourly term is fixed throughout the year, so the model can't pick up on the fact that summer days are longer than winter days. In addition, the model predicts negative values for some hours, which doesn't make any sense in the real world. All of those could be fixed with more realistic modeling choices. One thing we can fix with just our spline penalties is the discontinuity between midnight and 1am; in reality those two predictions should be basically the same. We handled this in the last post using a cyclic penalty, where we penalize the difference between the first and last coefficient. We aren't going to do anything different here; I just want to show how easy it is even with multiple spline terms. We simply replace the prior difference matrix with the new cyclic difference matrix, and the additional penalty gets picked up automatically when we create the m_square matrix inside the build_multiterm_penalty function. The only thing that is different in this code is that I'm going to multiply the new cyclic penalty matrix by an additional penalty value so that the model is forced to respect the new constraint.
def add_cyc_penalty(diff_matr):
    num_rows, num_cols = diff_matr.shape
    ## create an empty row
    cyc_row = np.zeros(num_cols)
    ## \beta @ diff_matr will penalize (\beta_{0} - \beta_{-1})
    cyc_row[0] = 1
    cyc_row[-1] = -1
    ## add the cyclic penalty row to the penalty matrix
    diff_matr_cyc = np.vstack([diff_matr, cyc_row])
    return diff_matr_cyc
cyclic_penalty = np.sqrt(3.5)
hourly_penalty_cyc = add_cyc_penalty(spline_info['hourly']['diff_matr'])
hourly_penalty_cyc = hourly_penalty_cyc * cyclic_penalty
full_penalty_list_cyc = [np.eye(1),
                         hourly_penalty_cyc,
                         spline_info['daily']['diff_matr']]
gam_penalty_cyc = build_multiterm_penalty(full_penalty_list_cyc)
gam_model_cyc = GeneralizedLinearRegressor(P2 = gam_penalty_cyc, alpha = 1, fit_intercept = False).fit(X = model_matrix, y = solar_df['power_gw'])
solar_df['preds_cyc'] = gam_model_cyc.predict(model_matrix)
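One quick way to compare the two fits is to look at each model's peak prediction (this check is my own addition, not output from the original script):
## peak predicted generation (GW) for the baseline and cyclic models
print(solar_df[['preds_baseline', 'preds_cyc']].max())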
As you can see, our hourly coefficients are more symmetric, but also much more muted than the baseline model: the baseline gam_model predicts a max solar output of ~4.1GW while the cyclic gam_model_cyc only predicts ~3.2GW. The reason is that when we multiplied our cyclic penalty matrix (hourly_penalty_cyc) by an additional penalty value (cyclic_penalty), we increased the weight on the cyclic penalty but also increased the weight on the original difference penalty. This makes it harder for the model to justify consecutive spline coefficients with large differences, which makes the overall curve less, well, curvy. We can fix this by rewriting our add_cyc_penalty function to take the additional penalty value as an input and multiplying only the row that corresponds to the cyclic penalty (the last row) by that penalty value.
def add_cyc_penalty(diff_matr, penalty = 1):
    num_rows, num_cols = diff_matr.shape
    ## create an empty row
    cyc_row = np.zeros(num_cols)
    ## \beta @ diff_matr will penalize (\beta_{0} - \beta_{-1})
    cyc_row[0] = 1
    cyc_row[-1] = -1
    ## scale only the cyclic row, then add it to the penalty matrix
    cyc_row = cyc_row * penalty
    diff_matr_cyc = np.vstack([diff_matr, cyc_row])
    return diff_matr_cyc
cyclic_penalty = np.sqrt(10)
## Now our cyclic_penalty is an input instead of an additional step
hourly_penalty_cyc = add_cyc_penalty(spline_info['hourly']['diff_matr'], cyclic_penalty)
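The refit itself isn't shown here, but it mirrors the earlier cyclic fit; a sketch, where the preds_cyc2 column name is my own:
## rebuild the combined penalty with the re-scaled cyclic row and refit
full_penalty_list_cyc = [np.eye(1),
                         hourly_penalty_cyc,
                         spline_info['daily']['diff_matr']]
gam_penalty_cyc = build_multiterm_penalty(full_penalty_list_cyc)
gam_model_cyc = GeneralizedLinearRegressor(P2 = gam_penalty_cyc, alpha = 1, fit_intercept = False).fit(X = model_matrix, y = solar_df['power_gw'])
solar_df['preds_cyc2'] = gam_model_cyc.predict(model_matrix)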
There is still some discontinuity between 11pm and midnight, but the predictions have stayed accurate during the middle of the day while shrinking the gap. In fact, I can't seem to figure out how to close this gap completely. If I increase the penalty value in the updated add_cyc_penalty function, the coefficients really don't change, and if I make it too large then glum will throw errors about how the P2 matrix must be positive semi-definite. I will have to look into this, but I wanted to wrap this post up regardless since the core idea was just including multiple splines, which we have done.
From here I would love to actually look at the internals of the glum library to see if it's feasible to implement this capability directly in the library. For now, hopefully this explains a little more about P-splines and fitting models with glum.