reworked penalty matrix explanation

mattmills49 · Dec 16, 2023 · f9fdb97 · f9fdb97
1 parent ac12c83
commit f9fdb97
Show file tree

Hide file tree

Showing 3 changed files with 109 additions and 71 deletions.
diff --git a/glum_splines/glum_multi_splines.ipynb b/glum_splines/glum_multi_splines.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "04fe5867",
+   "id": "9b573911",
    "metadata": {},
    "source": [
     "---\n",
@@ -17,7 +17,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6e68ea29",
+   "id": "cf3ded8f",
    "metadata": {},
    "source": [
     "In my last post I covered how you can fit Penalized Splines using the `glum` library in Python. Notionally `glum` was built to fit Generalized Linear Models. However it was designed to give the user the option to pass in a custom penalty matrix. We took advantage of this capability to penalize a sequence of Basis Splines and also fit Cyclic splines which allow the user to model a symmetric effect. In this post I'd like to cover how we can use this method to fit multiple spline terms. My end goal would be to develop a framework to actually incorporate into the `glum` library, but that will be in a later post. \n",
@@ -34,7 +34,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "54858a1d",
+   "id": "57b0ba73",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -52,7 +52,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "0c2092f2",
+   "id": "153daf55",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -88,7 +88,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "e63a72ae",
+   "id": "7efdd965",
    "metadata": {},
    "outputs": [
     {
@@ -117,29 +117,28 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "23de3ae5",
+   "id": "98213bdb",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "'|    | Time (Hour-Ending)   | Date   |   ERCOT.LOAD |   ERCOT.PVGR.GEN |   Total Solar Installed, MW |   Solar Output, % of Load |   Solar Output, % of Installed |   Solar 1-hr MW change |   Solar 1-hr % change | Daytime Hour   | Ramping Daytime Hour   | time                |   hour |   day |   week |   day_of_week |   power_gw |\\n|---:|:---------------------|:-------|-------------:|-----------------:|----------------------------:|--------------------------:|-------------------------------:|-----------------------:|----------------------:|:---------------|:-----------------------|:--------------------|-------:|------:|-------:|--------------:|-----------:|\\n|  0 | 01/01/2022 01:00:00  | Jan-01 |        38124 |                0 |                        9323 |                         0 |                              0 |                    nan |                   nan | False          | False                  | 2022-01-01 01:00:00 |      1 |     1 |     52 |             5 |          0 |\\n|  1 | 01/01/2022 02:00:00  | Jan-01 |        37123 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 02:00:00 |      2 |     1 |     52 |             5 |          0 |\\n|  2 | 01/01/2022 03:00:00  | Jan-01 |        35937 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 03:00:00 |      3 |     1 |     52 |             5 |          0 |\\n|  3 | 01/01/2022 04:00:00  | Jan-01 |        35133 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 04:00:00 |      4 |     1 |     52 |             5 |          0 |\\n|  4 | 01/01/2022 05:00:00  | Jan-01 |        34603 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 05:00:00 |      5 |     1 |     52 |             5 |          0 |'"
+       "'|    | Time (Hour-Ending)   | Date   |   ERCOT.LOAD |   ERCOT.PVGR.GEN |   Total Solar Installed, MW |   Solar Output, % of Load |   Solar Output, % of Installed |   Solar 1-hr MW change |   Solar 1-hr % change | Daytime Hour   | Ramping Daytime Hour   | time                |   hour |   day |   week |   day_of_week |   power_gw |\\n|---:|:---------------------|:-------|-------------:|-----------------:|----------------------------:|--------------------------:|-------------------------------:|-----------------------:|----------------------:|:---------------|:-----------------------|:--------------------|-------:|------:|-------:|--------------:|-----------:|\\n|  0 | 01/01/2022 01:00:00  | Jan-01 |        38124 |                0 |                        9323 |                         0 |                              0 |                    nan |                   nan | False          | False                  | 2022-01-01 01:00:00 |      1 |     1 |     52 |             5 |          0 |\\n|  1 | 01/01/2022 02:00:00  | Jan-01 |        37123 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 02:00:00 |      2 |     1 |     52 |             5 |          0 |\\n|  2 | 01/01/2022 03:00:00  | Jan-01 |        35937 |                0 |                        9323 |                         0 |                              0 |                      0 |                     0 | False          | False                  | 2022-01-01 03:00:00 |      3 |     1 |     52 |             5 |          0 |'"
       ]
      },
-     "execution_count": 34,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "#| output: asis\n",
     "solar_df['power_gw'] = solar_df['ERCOT.PVGR.GEN'] / 1000\n",
-    "solar_df.head().to_markdown()"
+    "solar_df.head(3).to_markdown()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "e057cb6f",
+   "id": "21702b82",
    "metadata": {},
    "source": [
     "#### Building a spline \n",
@@ -150,7 +149,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "871688da",
+   "id": "7c2886ac",
    "metadata": {},
    "outputs": [
     {
@@ -179,28 +178,35 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1f9e7665",
+   "id": "70434bc4",
    "metadata": {},
    "source": [
-    "Next is our combined penalty matrix. When we had one spline term we could just pass in the inner transpose product of the difference matrix (I'm not sure if that's the correct term, but the $D^TD$ matrix is what I'm referring to). Now we have two spline terms and a simple linear term. Since each spline term has its own penalty we can structure the full penalty matrix as a \"sequence\" of individual penalty matrices. So the difference matrices will be on the (large) diagonal and the outer triangles are filled with zeros. This way each penalty only interacts with its own corresponding spline coefficients and no other term's coefficients. If $mathbf{D_hourly}$ is the penalty matrix for the hours of the day coefficients, and $mathbf{D_daily}$ is the penalty matrix for the day of the year coefficient then our combined penalty matrix is just:\n",
+    "Next is our combined penalty matrix. To recap from my last post, the penalty matrix enforces smoothness on the spline coefficients. This acts as a regularizer so that the model doesn't interpolate too much and end up overfitting to the training data. To calculate the penalty matrix we first calculate the difference matrix which tracks the differences between successive spline terms. The penalty matrix for a single spline term is simply the inner transpose product of this difference matrix, which you can also multiply by a penalty value, $\\lambda$, to control the level of smoothness:\n",
+    "\n",
+    "$$\n",
+    "\\mathbf{P} = \\lambda D^T D\n",
+    "$$\n",
+    "\n",
+    " Now we have multiple spline terms and a linear term instead of a single spline term. How can we combine the difference matrics that we have for each term into one penalty matrix? We get lucky and actually all we need to do is \"stack\" our penalty matrices diagonally surrounded by zero matrices. This takes advantage of how the penalty matrix gets included in the loss function that the model optimizes ($\\beta^T P \\beta$ where $\\beta$ is the coefficient vector). This way each penalty only interacts with its own corresponding spline coefficients and no other term's coefficients. If $D_h$ is the difference matrix for the hours of the day coefficients, $D_d$ is the penalty matrix for the day of the year coefficient, and $\\lambda_{i}$ is the penalty for each term (including the non-spline terms), then our combined penalty matrix is just:\n",
     "\n",
     "$$\n",
     "\\begin{bmatrix}\n",
-    "\\mathbf{D_hourly} & \\mathbf{0} \\\\\n",
-    "\\mathbf{0} & \\mathbf{D_{daily}}\n",
+    "\\lambda_1 & \\mathbf{0} & \\mathbf{0} \\\\\n",
+    "\\mathbf{0} & \\lambda_2 D_{h}^T D_{h} & \\mathbf{0} \\\\\n",
+    "\\mathbf{0} & \\mathbf{0} & \\lambda_3 D_{d}^T D_{d} \n",
     "\n",
     "\\end{bmatrix}\n",
     "$$\n",
     "\n",
-    "This allows us to combine any number of individual terms. More terms will obviously increase the time it takes to fit each model. I would love to test this further but my hunch is that it actually won't slow down a model fit too much. The reason is that both the model matrix containing the spline values and the penalty matrix will be \"mostly sparse\". What I mean by that is that they aren't completely diagonal matrices, but most sections of the matrix are only non-zero near the diagonal. The `glum` library was designed to handle sparse and nearly-sparse matrices more efficiently than other libraries. I'm hoping that these improvements will flow through to fitting GAMs, but we will have to test that on a later date. \n",
+    "This allows us to fit any number of spline terms in one model. More terms will obviously increase the time it takes to fit each model. I would love to test this further but my hunch is that it actually won't slow down a model fit too much. The reason is that both the model matrix containing the spline values and the penalty matrix will be \"mostly sparse\". What I mean by that is that they aren't completely diagonal matrices, but most sections of the matrix are only non-zero near the diagonal. The `glum` library was designed to handle sparse and nearly-sparse matrices more efficiently than other libraries. I'm hoping that these improvements will flow through to fitting GAMs, but we will have to test that on a later date. \n",
     "\n",
     "In thinking through how to do this in code I believe the best option is to accept a list of penalty matrices. Then iteratively fill in a matrix of zeros that is the full size of the combined penalties. This also allows us to include non-spline terms by including a 2d matrix of shape (1, 1) that will penalize the size of the linear coefficient. In my research I found that there is actually a `np.block` function, but it would force me to compute the zero matrices in the uppper and lower triangles first to then manually create the block matrix. That seems more complicated than filling in a square matrix with the penalty matrices instead. \n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "1fb2fe56",
+   "id": "b2ab48d5",
    "metadata": {},
    "outputs": [
     {
@@ -211,7 +217,7 @@
        "       [0., 0., 9.]])"
       ]
      },
-     "execution_count": 36,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -238,7 +244,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c1501e22",
+   "id": "b63678b8",
    "metadata": {},
    "source": [
     "So this will give us our combined penalty matrix. Now lets calculate our real one.\n"
@@ -247,7 +253,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "302a3cc1",
+   "id": "2fed2e41",
    "metadata": {},
    "outputs": [
     {
@@ -268,7 +274,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ca1c8fbc",
+   "id": "3e024df1",
    "metadata": {},
    "source": [
     "Our model matrix is a lot easier; we can simply stack the spline values we got from our transformer together. Here you can see the first feature values:\n"
@@ -277,7 +283,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "bd3a26cd",
+   "id": "ef06669a",
    "metadata": {},
    "outputs": [
     {
@@ -298,7 +304,7 @@
        "        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]])"
       ]
      },
-     "execution_count": 38,
+     "execution_count": 8,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -316,7 +322,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f7d39228",
+   "id": "3e4bce36",
    "metadata": {},
    "source": [
     "#### Fitting the Model\n",
@@ -327,14 +333,14 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "14f58e71",
+   "id": "f7bbbb39",
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/Users/mm/Documents/Data Science/Blog Posts/lib/python3.7/site-packages/glum/_solvers.py:52: LinAlgWarning: Ill-conditioned matrix (rcond=1.28887e-18): result may not be accurate.\n"
+      "/Users/mm/Documents/Data Science/Blog Posts/lib/python3.7/site-packages/glum/_solvers.py:52: LinAlgWarning: Ill-conditioned matrix (rcond=1.38434e-18): result may not be accurate.\n"
      ]
     }
    ],
@@ -346,7 +352,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "e0d6b3d6",
+   "id": "c4c7278d",
    "metadata": {},
    "outputs": [
     {
@@ -369,10 +375,10 @@
     {
      "data": {
       "text/plain": [
-       "<ggplot: (324812485)>"
+       "<ggplot: (-9223372036545605169)>"
       ]
      },
-     "execution_count": 40,
+     "execution_count": 10,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -392,7 +398,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "74b66c4c",
+   "id": "a3a06a22",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -403,7 +409,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f237ad56",
+   "id": "0854c432",
    "metadata": {},
    "source": [
     "The model certainly picks up on the general trend of solar power generation rising during the day before falling in the evening. There are *many* reasons why this does a poor job of actually modeling whats going on in the real world. One example is that the hourly term is fixed throughout the year, so the model can't pick up on the fact that summer days are longer than days in the winter. In addition the model seems to be predicting negative numbers for some hours which doesn't make any sense in the real world. All of those could be fixed with more realistic modeling choices. One thing we can fix with just our spline penalties is the fact that moving from midnight to 1am there is a discontinuity, but in actuality the predictions should basically be the same. We did this in our last post using a cyclic penalty where we penalize the difference between the first and last coefficient. We aren't going to do anything different here, but I just want to show how easy it is even with multiple spline terms. We just replace the prior difference matrix with the new cyclic difference matrix and the additional penalty will be picked up automatically when we create our `m_squared` matrix in the `build_multiterm_penalty` function. The only thing that may be different in this code is I'm going to multiply the new cyclic penalty matrix by an additional penalty term so that the model is forced to respect this new constraint.\n"
@@ -412,7 +418,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "4bd274ba",
+   "id": "2ee7f6b4",
    "metadata": {},
    "outputs": [
     {
@@ -451,7 +457,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "3578c77a",
+   "id": "0975e338",
    "metadata": {},
    "outputs": [
     {
@@ -474,10 +480,10 @@
     {
      "data": {
       "text/plain": [
-       "<ggplot: (324297087)>"
+       "<ggplot: (-9223372036545528972)>"
       ]
      },
-     "execution_count": 42,
+     "execution_count": 12,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -509,7 +515,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3f438685",
+   "id": "d230dec5",
    "metadata": {},
    "source": [
     "As you can see our hourly coefficients are more symmetric, but also much more muted than the baseline model; the baseline `gam_model` predicts a max solar output of ~4.1GW while the cyclic `gam_model_cyc` only predicts a value of ~3.2. The reason for this is that when we multiplied our cyclic penalty matrix (`hourly_penalty_cyc`) by an additional penalty value (`cyclic_penalty`) we increased the weight on the cyclic penalty but also increased the weight on the original difference penalty. This makes it harder for the model to justify consecutive spline coefficients with large differences, which makes the overall curve less, well, curvy. We can fix this by rewriting our `add_cyc_penalty` function to take the additional penalty value as an input and multiplying only the row that corresponds to the cyclic penalty (the last row) by our penalty value. \n"
@@ -518,7 +524,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "cbd29d2a",
+   "id": "865f3e56",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -542,14 +548,14 @@
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "bd4105b6",
+   "id": "02ff9474",
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/Users/mm/Documents/Data Science/Blog Posts/lib/python3.7/site-packages/glum/_solvers.py:52: LinAlgWarning: Ill-conditioned matrix (rcond=6.15175e-20): result may not be accurate.\n"
+      "/Users/mm/Documents/Data Science/Blog Posts/lib/python3.7/site-packages/glum/_solvers.py:52: LinAlgWarning: Ill-conditioned matrix (rcond=3.51529e-19): result may not be accurate.\n"
      ]
     }
    ],
@@ -567,7 +573,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "bc1aa493",
+   "id": "cebe3581",
    "metadata": {},
    "outputs": [
     {
@@ -590,10 +596,10 @@
     {
      "data": {
       "text/plain": [
-       "<ggplot: (324888773)>"
+       "<ggplot: (308755432)>"
       ]
      },
-     "execution_count": 45,
+     "execution_count": 15,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -627,7 +633,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eb56402e",
+   "id": "61e332a4",
    "metadata": {},
    "source": [
     "There is still some discontinuity between 11pm and midnight, but our predictions have maintained their more accurate predictions during the middle of the day while shrinking the gap. In fact, I can't seem to figure out how to close this gap. If I increase the penalty value in the updated `add_cyclic_penalty` function the coefficients really don't change. If I make it too large then `glum` will throw errors about how the `P2` matrix must be positive semi-definite. I will have to look into this but wanted to wrap this post up regardless since the core idea was just including multiple splines, which we have done. \n",
@@ -638,7 +644,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "d522114c",
+   "id": "2eb118af",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -669,7 +675,7 @@
   {
    "cell_type": "code",
    "execution_count": 18,
-   "id": "eab7c50b",
+   "id": "17a143d1",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -697,7 +703,7 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "1a4ea5c7",
+   "id": "26cd9091",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -725,6 +731,18 @@
    "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
   }
  },
  "nbformat": 4,