Gradient Boosting Parameters

In this post I’ll walk through how you can use the same principles behind Gradient Boosting Machines (GBM) to predict parameters of models in the same way traditional GBMs predict 1-dimensional targets. This allows us to get the benefits of Gradient Boosting (large feature sets, high dimensional interactions, modeling at scale) to fit things like smoothing splines, survival analysis, and probabilistic models.

It’s not a modeling silver bullet, but I think this approach can be a useful tool to add to your modeling toolkit. Plus walking through this helped me understand gradient boosting at a deeper level than I did before, so hopefully you feel the same.

Gradient Boosting

Gradient Boosting is a Machine Learning algorithm that fits a series of decision trees; each successive tree attempts to improve the predictions from the set of previous trees. It (generally, there are many GBM implementations out there now) does this with two steps:

Before fitting a new tree calculate the gradient of the loss function we are minimizing at each observation.
Fit a new decision tree that predicts the gradient and then add this prediction to the previous predictions.

(*) technically fit the negative gradient, we want the loss to go down

Recently I was reading about Generalized Random Forests (GRF) to learn how they used Random Forests to fit more complicated models than just regression and classification. This post started as a walkthrough of that technique, but I ended up trying something I’ve always had an idea for first; can we use a GBM to fit distributions, not just make individual predictions? To test this I am going to walk through using Gradient Boosting to learn the coefficients for a spline function.

The Data

I’ve downloaded 4 years of the Citi Bike usage data and aggregated by the number of rides in each hour of each day. I’ve also added the high and low temperature for that day as well as the inches of rain. If this were a proper forecasting model you’d want the predicted weather for a day so your model was properly calibrated, but this is fine for having features to learn parameters from. One unique thing about this data is that I have stored the hourly ride counts and the hour of the day in an aggregated polars array. This is so the splits will work on the daily level, since I need to generate spline coefficients for each day and apply them to the 24-hour range. If I was using a traditional GBM I would use the ride_hour as a feature and predict the hourly counts.

dow	woy	year	high_temp	low_temp	precip	ride_date	ride_hour	ride_count
i8	i8	i32	f64	f64	f64	date	array[i64, 24]	array[u32, 24]
5	38	2022	51.98	64.04	0.0	2022-09-23	[0, 1, … 23]	[1458, 782, … 2465]
1	3	2023	30.2	48.02	0.0	2023-01-16	[0, 1, … 23]	[564, 318, … 827]
4	34	2024	62.06	77.0	0.0	2024-08-22	[0, 1, … 23]	[1840, 999, … 4037]
5	26	2025	64.04	73.94	0.0787402	2025-06-27	[0, 1, … 23]	[2791, 1683, … 5395]
3	34	2024	59.0	75.02	0.0	2024-08-21	[0, 1, … 23]	[1744, 903, … 3445]

Let’s use this data to fit a traditional GBM on the hourly data and visualize some results.

hourly_df = hourly_df.join(zero_days_df, how='anti', on='ride_date')

X_hourly = hourly_df[FEATURE_COLS + ['ride_hour']].to_numpy()
y_hourly = hourly_df['ride_count'].to_numpy()
y_hourly_log = np.log1p(y_hourly)

hourly_gbm = GradientBoostingRegressor(min_samples_leaf=4, validation_fraction=0.05)
hourly_gbm.fit(X_hourly, y_hourly_log)

GBM Regression Feature Importances:
ride_hour: 77.99%
low_temp: 6.20%
high_temp: 4.57%
dow: 3.46%
precip: 2.98%
year: 2.65%
woy: 2.15%

We can see this model has a hard time picking up on the different shapes between weekdays and weekends. It also has some sharp corners that maybe we should try and smooth out. But what if we could use the same process of identifying optimal splits for a single hour’s prediction in order to learn an entire coefficient vector? And yes, I’m just going to get ahead of any comments now; I’m not trying to make either model “the best” I’m only trying to work through the model fitting process. We are here for the journey, not the final validation loss value, ok?

Gradient Boosting as Optimization

Traditionally Gradient Boosting Models are fit by learning the direction to move each prediction individually at each iteration. By calculating the loss value for each observation, and then using the gradient of that observation’s prediction as a “pseudo-label” we can learn a model to make predictions at that step. We can use the same procedure to fit a set of parameters and optimize them with the same algorithm. Then when we need to make a prediction we can use the predicted parameter values for that observation.

In our example we will calculate our parameter vectors $\theta$ for an observation $x$ at any iteration $m-1$ as the sum of all the individual predictions from our base learners $f_m(x)$

\[ \hat{\theta}_{m-1} = \sum_{i \in m-1} f_{m-1}(x) \]

The base learners are models learned to predict the gradients of these parameters against our loss function

\[ f_m(x) \sim \nabla L(y, \hat{\theta}_{m-1}) \]

In our example we need one extra step. If $b(\theta, x)$ is an individual model that generates predictions for a set of parameters $\theta$ and an observation $x$ then we calculate the gradients by how well the predictions from $b(\theta, x)$ predict our loss function $L$:

\[ \nabla L(y_i, b(\theta_{m-1}, x_i)) \]

Then our decision tree will use $-g_{\theta}$ as the multi-target outputs and we can fit a decision tree to these targets. Since we have multiple parameter values in a smoothing spline we use a multitarget algorithm to make predictions for each dimension.

And our updates to our fitted parameters are

\[ \hat{\theta}_{m} = \hat{\theta}_{m-1} + \alpha \cdot f_m(x) \]

Let’s ease into this by mimicking what the tree splitting algorithm does to find optimal splits but with our parameters; coefficients for a smoothing spline of the hours of the day.

Optimal Spline Splits

Let’s walk through this without worrying about trees and gradients. The goal here is to pick up on which feature value results in the most unique daily ride shape. We’ll loop through the unique values in a feature as the tree splitting algorithm will do. But for a first pass we will just measure how much variance there is from using the group average to predict the hourly ride counts. So for example we will measure the squared error from the observed counts with a single day of the week’s average, and then do the same for all the other days as one group. The day with the lowest combined Sum of Squared Errors does the best job of splitting the data into coherent days of the week.

shape: (7, 2)
  Day of Week   SSE
  ------------- -----------
  i64           f64
  1             2.1539e11
  2             2.1197e11
  3             2.1273e11
  4             2.1379e11
  5             2.1624e11
  6             2.0468e11
  7             2.0292e11

So Sunday has the most unique shape by this measure. The other days would have had more total error across all observations if we had split on one of them instead. We can do the same thing by the Weeks of the Year.

shape: (5, 2)
  Week of Year   SSE
  -------------- -----------
  i64            f64
  14             1.7068e11
  15             1.7146e11
  13             1.7213e11
  12             1.7354e11
  16             1.7379e11

I think this may be a data problem with some values missing or something. I don’t know why early April would have the least amount of rides? Maybe spring break for the kids? Oh wow, that is the answer haha.

Smoothing Splines

Now we can actually get to fitting our Gradient Boosting Spline Model. If you want an introduction to splines I’ve written a couple different posts on this topic:

For our model we have to do a little preprocessing to be ready for the fitting process

Build our spline basis for the 24 hour period of day
Write a function to generate predictions with our flattened hourly data
Write a function to calculate the gradient of our loss function

N_KNOTS = 12
x_day = hourly_df['ride_hour'].unique().to_numpy().reshape(-1, 1)
spline_hour = SplineTransformer(n_knots=N_KNOTS, include_bias=True).fit(X=x_day)
N_SPLINES = spline_hour.n_features_out_

bs_day = spline_hour.transform(x_day)
print('Example of Basis Activations for different hours')
for i in range(6):
    print(f'Hour {i*4}: {display_vec(bs_day[i*4, :])}')

Example of Basis Activations for different hours
Hour 0: ▂█▂▁▁▁▁▁▁▁▁▁▁▁
Hour 4: ▁▁▃█▂▁▁▁▁▁▁▁▁▁
Hour 8: ▁▁▁▁▃█▂▁▁▁▁▁▁▁
Hour 12: ▁▁▁▁▁▁▄█▁▁▁▁▁▁
Hour 16: ▁▁▁▁▁▁▁▁▅█▁▁▁▁
Hour 20: ▁▁▁▁▁▁▁▁▁▁▆█▁▁

Here is where things start to diverge from traditional models though. I do not want each tree $f_m(x)$ to produce a prediction of $\hat{y}_m$. I want each tree to predict the optimal spline coefficients for that gradient step $\hat{\theta}_t$. So we will need to know the gradient of the loss function for each spline coefficient and each observation. Luckily, jax makes this easy. We can define our loss function once that operates on a single observation. Jax will then calculate the partial derivatives automatically with grad. And we can scale this to an array with the vmap function.

def loss_i(coefs, bs_array, y_array):
    """
    Calculate the scalar loss for the input, spline transformed data
    
    Parameters:
    coefs: [b,] 1d array from numerical procedure
    bs_array: [24, b] spline transformed data
    y_array: [24, 1] target values
    """
    #x_array_ = x_array.reshape(-1, 1)
    y_array_ = y_array.reshape(-1, 1)

    preds = jnp.dot(bs_array, coefs.reshape(-1, 1))
    error = jnp.power(preds - y_array_, 2)
    penalty = jnp.sum(jnp.square(jnp.diff(coefs, 1)))
    # using mean instead of sum to keep the total loss numbers low...I think that's ok
    error_array = jnp.mean(error)
    return error_array + 0.01 * penalty

# vectorize our loss function to work on arrays
loss_array = vmap(loss_i)
# Calculate the gradient of a single observation
# grad only works on scalar output, so we have to vectorize them separately
grad_i = grad(loss_i)
grad_array = vmap(grad_i)

v, g = value_and_grad(loss_i)(coefs_init, BS_daily[0], y_log[0])
print(f'Loss function value for one day: {v:.2f}')
print('Gradients for each spline coefficient for that day:')
print(np.round(g, 2))

Loss function value for one day: 1.28
Gradients for each spline coefficient for that day:
[ 0.01        0.11        0.29        0.28        0.         -0.14
 -0.13       -0.16       -0.19999999 -0.22999999 -0.17       -0.06
 -0.          0.        ]

I’ve initialized some coefficients to predict a flat line, coefs_init. We can calculate our initial gradients and thus pseudo-labels to use in the first tree with our gradient functions. One technical note is the coefficient vector needs to be the same length as the X and Y arrays I’m using for vmap to accept it. Once we start fitting trees this will be the case, but when I created my coefs_init values originally I just used a 1d array for all values. That’s why in the following code cell I repeat it to match the length of our data.

coefs_init_array = np.tile(coefs_init, (daily_df.shape[0], 1))
labels_init = -grad_array(coefs_init_array, BS_daily, y_log)
print(labels_init.shape)

(1432, 14)

Gradient Boosting Splines

We now have everything we need to update our spline coefficients:

coefs_init: an initial set of smoothing splines to produce a flat prediction
grad_array: a function to calculate the gradients of our predictions with respect to the coefficients at each observation
labels_init: The initial gradient values to use as a our pseudo-labels for our first iteration

Let’s fit one decision tree with our features to predict the gradients of the splines for each observation

from sklearn.tree import DecisionTreeRegressor

x_features = daily_df[FEATURE_COLS].to_numpy()

tree_init = DecisionTreeRegressor(max_depth=2).fit(x_features, labels_init)

Tree splits:
low_temp <= 57.5
  dow <= 5.5
  else:
else:
  dow <= 5.5
  else:

First Tree Feature Importances:
low_temp: 64.93%
dow: 35.07%
woy: 0.00%
year: 0.00%
high_temp: 0.00%
precip: 0.00%

We can see how the different predictions look for this leaf nodes of this simple tree:

What I love is we can immediately see the different shapes in the weekends and weekdays. Our classic GBM had to learn the trend at each hour of the day so it took a while until the differences in curves were the largest residual signal left to optimize against. But with our Gradient Boosted Splines our model can learn the full curve in one go.

I also want to point out I’m not clustering the data and then fitting a model to each subset of data to learn the coefficients. These predictions are only coming from predicting the gradients of the loss function on each day from a flat line. The model learns the optimal splits to cluster the data on for optimizing our loss function. We’ve basically created a supervised clustering algorithm.

Gradient Boosted Splines

We now have everything we need to fit our Gradient Boosting models to try and improve our predictions.

def fit_tree_and_update(
        coefs,
        bs_array=BS_daily,
        y_array=y_log,
        x_features=x_features,
        lr=0.1):
    """Fit a single tree and update the predictions"""
    # 1. Calculate Gradients
    labels = grad_array(coefs, bs_array, y_array)
    # 2. Fit a tree
    tree = DecisionTreeRegressor(max_depth=2).fit(x_features, labels)
    # 3. Get predictions of the spline coefficients
    tree_preds = tree.predict(x_features)
    # 4. Update our model fits
    new_coefs = coefs - lr * tree_preds
    return new_coefs, tree, tree_preds

N_TREES = 500

preds_init = predict_array(coefs_init_array, BS_daily).squeeze(1)
loss_init = loss_array(coefs_init_array, BS_daily, y_log).mean()

tree_params = coefs_init_array
spline_trees = []
for i in range(N_TREES):
    result = fit_tree_and_update(tree_params, BS_daily, y_log, x_features)
    tree_params = result[0]
    spline_trees.append(result)

# get the last tree's predictions
coef_fit = spline_trees[-1][0]
preds_fit = predict_array(coef_fit, BS_daily).squeeze(1)
loss_fit = loss_array(coef_fit, BS_daily, y_log).mean()

print(f'Initial loss: {loss_init}')
print(f'Final loss: {loss_fit}')

Initial loss: 1.5850720405578613
Final loss: 0.2547142207622528

So the model (spline_trees) predicts a coefficient vector for each observation. To get predictions for each hour at each observation (an observation is a single day’s hourly totals) we need to apply each coefficient vector against the Spline Basis we created for the 24 hour period initially. predict_array is a function I wrote to generate the actual hourly predictions based off a customized coefficient vector.

[6.8599997 6.47      6.1099997 5.95      6.12      6.67      7.3999996
 8.0199995 8.42      8.57      8.559999  8.55      8.599999  8.69
 8.79      8.9       8.99      9.01      8.92      8.73      8.48
 8.21      7.98      7.8199997]
[7.29 6.66 6.08 5.59 5.62 6.67 7.82 8.59 8.9  8.67 8.48 8.55 8.7  8.81
 8.95 9.05 9.15 9.3  9.17 8.83 8.38 8.04 7.9  7.81]

Let’s compare our fitted Gradient Boosted Spline to our initial GBM. The model is still undershooting the peaks and misses something about the January 1st day (I should probably add a Is_Holiday feature). But we get smooth curves! From Gradient Boosting! It’s amazing that this approach can be accomplished so easily with code by leveraging open source tools like Jax and Scikit-Learn. All you need is a loss function and a way to express individual predictions from a set of parameters and you can fit any model this way.

I’m sure there are tons of valid inference questions you could ask of this approach; I don’t have those answers. But I think that will be it for today’s post. I’m excited to take this concept even further to fit more types of models. If you want to make sure you see any future posts I write you can follow me @statmills on Twitter or BlueSky. If you are interested in the code for this blog post I’ve added everything to my GitHub repo.

Humanity Oath: I solemnly swear that I did not use AI to write the words in this piece.