

BRB University: Winning Football, Gary Kubiak Style

Gary Kubiak and the mighty Logistic Regression. Take a look inside the Texans' offensive numbers.

"Just give the ball to Foster again. Remember, that's how we win football games."
"Just give the ball to Foster again. Remember, that's how we win football games."
Chris Humphreys-US PRESSWIRE

Once upon a time, I was a college student at Texas State University, going to class and filling my head with knowledge and the wonders of the world. During this time I learned some stuff, like the difference between NGDP and GDP, that credits go on the right of the balance sheet and debits on the left, Keynesian economic philosophy, and that Tim Donaghy referees the rigged sport of intramural basketball. Like most education, there is a multitude of nonsensical information that goes up to the brain and is quickly flushed like a toilet once the test is over. However, there is still an assortment of knowledge and skills I have gained that I may use in a future job of some sort. One of these is how to use advanced statistics to make predictions and analyze data via the statistical program R.

R is a free open source statistical program filled with data packages and the ability to run regressions and build models. It is a little tricky to get used to because of its lack of a traditional graphical user interface, but it makes up for that with its capabilities.

When I took the class, we used it to look at various situations: would this person buy the house or not, what temperature would have caused a meltdown in the O-ring on the Challenger Space Shuttle, and how many people would attend a football game. Each of these had data with different parameters that need different regressions in order to analyze the data. During the school year, I was at the mercy of the person with a Ph.D. to decide what we would learn or what examples were used to demonstrate the information. Now that I am nothing more than a man with a summer off, growing a cobweb for a beard, with a piece of paper lying in its womb of an envelope on my floor, I can actually take some of this stuff in my head and utilize it to learn about material that piques my interest. Like football.

Gary Kubiak has been the head coach of the Houston Texans since 2006 and has amassed a record of 61-55 (including the playoffs) in 116 games. His career has been a tumultuous ride of grody and underperforming teams until the past two years, when Houston morphed into a Super Bowl contender. Both of these factors make Kubiak interesting to study, and the large sample gives me the ability to actually come up with some type of conclusion. Using the advanced stat techniques I mentioned earlier, I wanted to see what has turned Houston into the team it is today and what factors are most important for his team to win. The first part of this two-part series will focus solely on the offensive side of the ball. The defensive side will be covered whenever I get the time to input 116 games of data.

Logistic regression is a regression where the dependent variable you are trying to explain follows a binomial distribution, an event with one of two outcomes (e.g., live or die, heads or tails, acceptance or rejection, and what I am measuring: win or lose). The independent variables then act as predictor variables that affect and influence the dependent variable. I know most want to dive straight into the results, but I would like to describe the process for those who care to learn about this regression or have the ability to tell me where I went wrong.

First, data entry. The easiest way to do it is to convert an Excel spreadsheet into a CSV (comma-separated values) file and plug away entering the numbers. At first you want a wide variety of variables to play with to see what best explains the dependent variable. When you start building your model, you will get rid of a lot of these variables in order to capture the signal and get rid of the noise. For this model, I looked up pass attempts, pass completions, pass yards, pass touchdowns, rush attempts, rushing yards, rushing touchdowns, times sacked, yards lost on sacks, penalties, penalty yards, punts, field goal attempts, field goals made, turnovers, and first downs.
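If you want to follow along in R, reading the file in is one line. This is only a sketch; the file name and column names below are placeholders for whatever you called them in your own spreadsheet.

# Read the game-by-game spreadsheet (saved as a CSV) into a data frame.
games <- read.csv("kubiak_offense.csv")

nrow(games)   # should be 116, one row per game
str(games)    # check that every column came in as a number
head(games)   # eyeball the first few rows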

Notice that I did not add the score because its gargantuan effect would overshadow the rest of the variables. It is a strange occurrence that the more points a team scores the greater their chances are to win. Crazy, I know. The dependent variable, win or lose, will be described as a 1 for a win and a 0 for a loss. I used Pro Football Reference to gather all of my data because the site is easy to use, lacks flamboyant ads and has most of the info you need. I hate to complain, but I would be frolicking through a field of sunflowers if they added individual big play numbers, red zone scoring/defense, and time of possession.

After all the data is entered, you read your file into R and the work starts to pay off. The best way to see which variables to add is to create a correlation table of all of your variables. You want to see how they correlate with the dependent variable and with each other, because you might run into a problem of multicollinearity. That is a phenomenon that occurs when two variables are so highly correlated with each other that they can linearly predict one another. Subsequently, the coefficient estimates (known as betas) will respond weirdly to small changes in the model. It does not affect the overall predictive ability of the model, though. My model more than likely has some problems with multicollinearity due to the high correlations between variables like pass attempts and pass yards. I decided to leave it, since this model is only looking at the Texans' offense and most of these variables will be discarded when I add the defense variables to it.
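For anyone curious, the correlation table itself is one call to cor(). Again, a rough sketch with placeholder column names; Win is the 1-or-0 dependent variable.

# Correlation matrix of every variable in the data frame.
cor_table <- cor(games)

# Sort the correlations with the win/loss column from strongest to weakest.
sort(cor_table[, "Win"], decreasing = TRUE)

# Predictors that are highly correlated with each other (like pass attempts
# and pass yards) are the ones that can cause multicollinearity.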

The variables with the highest correlations are the first ones you will add to the model.

[Chart: Gary Kubiak offensive correlations with wins]

Correlation is just a number between negative one and one that measures how two variables move together, whether negatively or positively. Everyone knows how Houston's running game drives their offense, but I did not expect the correlations to look like this. This should add some burning coal to those of you who are all aboard the Matt Schaub Hate Train. I would hesitate to attribute it to Schaub's play, since the negative correlations could be attributed to David Carr, Sage Rosenfels, and/or the zone run play action pass the offense is built on. It would be interesting to look at games only Matt Schaub played in; that could be something for me to tackle on a rainy day. Based on the graph, rushing attempts, rushing yards, and rushing touchdowns will be the first independent variables I add to the model.
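Fitting that first model in R uses the glm() function with family = binomial, which is what tells R to run a logistic regression instead of an ordinary one. A sketch with the same placeholder column names as above.

# Model 1: win/loss explained by the three rushing variables.
model1 <- glm(Win ~ Rush.Att + Rush.Yards + Rush.TD,
              data = games, family = binomial)

# The summary prints the coefficients plus the null and residual deviances.
summary(model1)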

From the first model, the most important thing to look at is the hypothesis tests. In a logistic regression, there are two null hypotheses: whether the model fits the data, and whether complicating the model increases the explainability of the dependent variable. Both of these are tested by comparing a P value to a given level of alpha. One rejects or fails to reject the null hypothesis depending on whether the P value is less than or greater than alpha. Alpha is in the eye of the beholder; it is the level of type one error you are willing to accept, and a type one error is the incorrect rejection of a true null hypothesis. So if the P value is greater than alpha, one fails to reject the null hypothesis, and vice versa.

Both of these are measured in R by a simple equation that utilizes the null and residual deviances. The deviance measures how closely the "predicted values from the fitted model match the actual values from the raw data." The residual deviance is that measure of fit once the independent variables are added, while the null deviance is basically the fit of the data without any variables. The gap between the two is also a way to check whether adding variables actually helps the model: the more the gap grows, the more the added variables explain. We will dive more into this later.

Hypothesis Test 1

Null, H0: The model fits the data.

Alternate, Ha: The model does not fit the data.

This is tested by the equation 1-pchisq(residual deviance, degrees of freedom).

1-pchisq(85.42, 112) = .9709. I will test the hypothesis at an alpha level of .1.

.9709 > .1

These numbers are given as an output when you run your model.

So you fail to reject the null hypothesis and the model does fit the data. Now we can continue with our model testing the data since it follows the parameters of general logistic regression.
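In R, the pieces of that test come straight off the fitted model, so you do not have to copy the numbers by hand. A sketch, assuming the model object from the earlier snippet is called model1.

# Hypothesis Test 1: residual deviance against its degrees of freedom.
1 - pchisq(deviance(model1), df.residual(model1))   # roughly .9709 here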

Hypothesis Test 2

H0: Complicating the model does not increase the explainability of the dependent variable.

Ha: Complicating the model does increase the explainability of the dependent variable.

This hypothesis test asks whether adding more independent variables makes the model better. It is tested by the following equation.

1-pchisq(null deviance - residual deviance, 1)

1-pchisq(160.5 - 85.42, 1) = 0

Since the P Value generated is less than the alpha, one would reject the null hypothesis that complicating the model does not increase the explainability of the model. It is a weird way of saying adding more variables will make the model better. The result does not surprise me because of all the variables and components integral to the game of football, but I have never seen a value of zero when testing this hypothesis.
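The same test can be run in R using the null deviance stored on the model object, following the formula above with the single degree of freedom used here.

# Hypothesis Test 2: improvement of the fitted model over the intercept-only model.
1 - pchisq(model1$null.deviance - deviance(model1), df = 1)   # effectively 0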

Now there is one more descriptive statistic I would like to look at, and that's R squared. It is another way to see how well the independent variables explain variation in the dependent variable.

The output in R does not give the R squared in a logistic regression, but it can be calculated fairly easily:

R squared= 1- (residual deviance/null deviance)

=1- (85.42/160.5) = .4677

The model does a decent job, but there is still a large amount of room for improvement.
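That calculation (often called McFadden's pseudo R squared) is a one-liner off the same model object:

# R squared for the logistic model: 1 - (residual deviance / null deviance).
1 - deviance(model1) / model1$null.deviance   # roughly .4677 for Model 1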

For Model 2, I will add pass attempts, pass completions, and passing touchdowns, but not passing yards. I won't add passing yards right away because of the small correlation it has with wins. The model brought me the following output:

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -8.859119    2.516550   -3.520  0.000431 ***
Rush. Att.    0.348100    0.077584    4.487  7.23e-06 ***
Rush. TD      0.514702    0.384286    1.339  0.180449
Rush. Yards  -0.007193    0.009439   -0.762  0.445999
Pass. Att.   -0.008560    0.062067   -0.138  0.890312
Pass. Comp   -0.046664    0.096709   -0.483  0.629435
Pass. TD      0.752947    0.362596    2.077  0.037844 *

The coefficients are the betas used to create the equation that generates the probabilities and actual predictions. This will be discussed more later. It is worth noting the coefficients are not that important by themselves, since the model probably deals with an issue of multicollinearity, which messes with their values some.

Null deviance: 160.500 on 115 degrees of freedom.
Residual deviance: 80.555 on 109 degrees of freedom.

The residual deviance dropped by about 5, and both of the hypothesis tests came out nearly identical to the first model. Since the residual deviance dropped, adding the variables has improved the model. The usual rule of thumb is to add a variable if the residual deviance drops by one or more. Right now the model is messier than I would like it to be. The goal is to make the model as simple as possible while having the greatest predictive power. However, some of these variables will be forced out when I add the defensive variables at a later date. Now, even though passing yards do not have a high correlation, let's see how much they affect the model.

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -9.094783    2.637749   -3.448  0.000565 ***
Rush. Att.    0.351263    0.079222    4.434  9.25e-06 ***
Rush. TD      0.426246    0.395646    1.077  0.281328
Rush. Yards  -0.004787    0.009735   -0.492  0.622897
Pass. Att.   -0.076473    0.073278   -1.044  0.296668
Pass. Comp   -0.084817    0.102063   -0.831  0.405961
Pass. TD      0.449488    0.401310    1.120  0.262692
Pass. Yards   0.013360    0.006874    1.944  0.051950

Null deviance: 160.500 on 115 degrees of freedom.
Residual deviance: 76.593 on 108 degrees of freedom.

R Squared = 1 - (76.593/160.5) = .5228.

Adding the passing yards dropped the residual deviance by 4, which was unexpected since the correlation was minuscule. My assumption is that since there can be upwards of 300 passing yards affecting a variable that is just a 1 or a 0, the effect of one extra passing yard is much smaller than one extra rushing yard. Now that I have a strong foundation built, I will add variables and see the effect they have on the residual deviance.

After adding each of the remaining unused variables one at a time, I came up with the following table. A quick reminder: the residual deviance from the previous model was 76.593.

Independent Variable    Residual Deviance After    Change
FG Attempts             75.557                     1.036
Turnovers               76.519                     .074
Times Sacked            76.499                     .094
Times Punted            76.274                     .319
FG Made                 76.591                     .002
# of Penalties          75.913                     .680
Penalty Yards           76.578                     .015

I did not add sack yards lost because it blew up my computer and gave me a warning message stating, "The algorithm did not converge." I looked into what happened, but I still have been unable to find a solution to the problem. After looking at the table, the only variable I will add is FG Attempts. It is interesting that field goal attempts explain the ability to win better than FGs made during the Kubiak years. Getting into field goal range and gaining the yards to get into FG position is more important to winning than actually making the field goals. Reason #1,359,987 why scoring touchdowns is better than kicking field goals, and it pretty much sums up why Houston's red zone woes killed them at the end of the season.
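A quick way to build a table like the one above is to loop over the leftover variables, add each one to the current model with update(), and record the drop in residual deviance. A sketch with placeholder column names, where model3 is the rushing-plus-passing model from earlier.

# Model 3: the rushing and passing variables (residual deviance 76.593 above).
model3 <- glm(Win ~ Rush.Att + Rush.Yards + Rush.TD +
                    Pass.Att + Pass.Comp + Pass.TD + Pass.Yards,
              data = games, family = binomial)

# Candidate variables still sitting unused in the spreadsheet.
candidates <- c("FG.Att", "Turnovers", "Sacked", "Punts",
                "FG.Made", "Penalties", "Pen.Yards")

# Add each candidate one at a time and report how much the residual deviance drops.
sapply(candidates, function(v) {
  bigger <- update(model3, as.formula(paste(". ~ . +", v)))
  deviance(model3) - deviance(bigger)
})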

Here is the final piece of advanced stat info left for me to share. Below are the R outputs, hypothesis test results, and R squared for the finished model. After that, we will get into actual predictions and see if the model is any good.

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -8.782721    2.688337   -3.267  0.00109 **
Rush.Att.     0.331816    0.080951    4.099  4.15e-05 ***
Rush.TD       0.642550    0.458751    1.401  0.16132
Rush.Yards   -0.007670    0.010514   -0.729  0.46572
Pass.Att.    -0.075811    0.074178   -1.022  0.30678
Pass.Comp.   -0.073717    0.104642   -0.704  0.48114
Pass.TD       0.652558    0.460910    1.416  0.15683
Pass.Yards    0.010104    0.007602    1.329  0.18381
FG.Att.       0.299164    0.302092    0.990  0.32202

Null deviance: 160.500 on 115 degrees of freedom.
Residual deviance: 75.557 on 107 degrees of freedom.

Hypothesis Test 1
1-pchisq(75.557, 107) = .9922
Fail to reject the null hypothesis; the model fits the data well.

Hypothesis Test 2
1-pchisq(160.5 - 75.557, 1) = 0
Reject the null hypothesis that complicating the model does not increase the explainability of the dependent variable. Adding defense and special teams numbers is a must to improve its accuracy.

R Squared
1 - (75.557/160.5) = .5292

Our final regression equation is built by taking the coefficients of all of our variables and multiplying them by X (i.e., the values that actually occurred in a game).

The basic outline is Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + εi

-8.7827 + .3318*X1 + .6425*X2 - .00767*X3 - .0758*X4 - .0737*X5 + .6525*X6 + .0101*X7 + .2992*X8

The above is our regression equation, but in order to get a probability, you have to transform it because probability is a number between zero and one.

Probability = 1 / (1 + e^(-(regression equation))), where e is the base of the natural logarithm.

This can be done on the computer or by hand. The other way to find the probability is in R, by running the code predict(model name, type="response"), which is far easier. However, it is important to know the process and why you have to transform the equation. Now that the nasty stuff is over, we finally have some results in the form of the probability of Houston winning a football game while Gary Kubiak is the head coach.
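Here is roughly what that looks like in R for the finished model, again with placeholder column names. plogis() is R's built-in version of the 1/(1 + e^-x) transformation, so the long way and the short way should agree.

# The finished model: rushing, passing, and field goal attempts.
final_model <- glm(Win ~ Rush.Att + Rush.Yards + Rush.TD +
                         Pass.Att + Pass.Comp + Pass.TD + Pass.Yards + FG.Att,
                   data = games, family = binomial)

# The easy way: predicted win probability for every game in the data.
win_prob <- predict(final_model, type = "response")

# The long way: grab the regression equation (the linear predictor), then
# transform it into a probability.
linear_pred <- predict(final_model, type = "link")
all.equal(as.numeric(plogis(linear_pred)), as.numeric(win_prob))   # TRUE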

Click the link here.

If you want to compare games through Pro Football Reference, remember that bye weeks are added on the site. For example, Game #10 is on Week 9 in 2006 on the site because of the bye week. It is best to cross-reference scores to find what game matches each entry on the spreadsheet.

What I did was create a spreadsheet and insert the probability of winning, whether the Texans won or not (win=1, lose=0), the score, whether the model was right (P>50% win, P<50% loss), and tallied the times the model was wrong. After everything was over, the model went 101-15, which is equal to being correct 87.06% of the time. The model is surprisingly stronger than I thought it would be, considering defensive and special teams variables were missing, the issue of multicollinearity, and it being the first time I took on a task like this by myself.
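Continuing the sketch above, the same tally can be done in a couple of lines instead of a spreadsheet:

# Call a game a predicted win when the probability is over 50%.
predicted_win <- ifelse(win_prob > 0.5, 1, 0)

# How the predictions line up with what actually happened.
table(Predicted = predicted_win, Actual = games$Win)

# Overall accuracy; the article's data works out to about .8706.
mean(predicted_win == games$Win)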

What can be learned from the model, other than the winning probabilities, is how much the run game drives the rest of the offense and how vital it is to winning. It is one of those things you can see and feel while watching the game, but you don't really know how much of an impact it actually has until you see something like this.

There are also some interesting tidbits of information I found throughout this process.

The luckiest win Houston had under Gary Kubiak, according to the model, was in 2007, when Houston beat the Tampa Bay Buccaneers 28-14 despite having just a 1.17% chance to win based on their offensive production. Even though the game was not that close, the low probability occurred because of Houston's lack of rushing yards and the model not considering kickoff return yards and return touchdowns. Houston ran for 71 yards with 0 rushing touchdowns, and the great André Davis ran a kickoff back 97 yards for a score. Oh, and I forgot to mention Sage Rosenfels started, and that automatically warrants a 1% chance to win.

The unluckiest loss under Kubiak was a 23-29 loss to San Diego in Week 9 of the 2010 season. 2010, of course, is also known as the year we had to watch the worst defense in the entire history of the NFL. Even though Schaub threw for 267 yards and Arian Foster ran for 127 yards/2 touchdowns, Houston managed to lose, thanks to a 295 yard/4 touchdown performance by Philip Rivers. The model suggests Houston had a 91.15% chance of winning the game, but defensive measures are not accounted for yet. Man, that was an awful defense.

Based on the offensive model, Houston had a 9.89% chance of beating New England in the playoffs last year and a 25.2% chance to beat the Patriots in the regular season. Not surprisingly, based on the offensive numbers, in the Minnesota game the Texans had a .57% chance to win; that might have been the worst offensive performance in the Gary Kubiak era.

In a 2006 Week 12 loss to the Jets, the model gave Houston a .11% chance of winning in a 26-11 loss. This probability occurred even though Carr went 39-54 and threw for 323 yards. If you read this stat and nothing else, you would think the Texans crushed the Jets with the arm of David Carr. The truth is Houston ran the ball 14 times for 25 yards, and the probability of Houston winning is linked directly to the rushing game.

Finally, here's a nice comparison of Houston's run game to show the greatness of Arian Foster and the improvements on the offensive line. Arian Foster became the starter in Game 63 of the Kubiak era.

Before Arian Foster- 62 games, 1,618 attempts, 6,352 yards, 53 TD, 3.92 YPC.

After Arian Foster - 54 games, 1,661 attempts, 7,451 yards, 66 TD, 4.48 YPC.

Arian Foster is officially the greatest player on Houston's offense. He's also Gary Kubiak's savior.

The real fun will come in the future, when the model can be used to predict games by plugging the numbers into the transformed regression equation that I made. The model is just a representation of reality in a bubble and needs future games to see if it has any worth. For example, Houston is playing the Chargers in Week One. If they are projected to have so many rushing yards, passing touchdowns, etc., you would plug the numbers into the equation and get a probability. These projections will have to come from ESPN or someone else who has better capabilities and is much smarter than I am (Football Outsiders?), since just plugging in season averages does not adjust for opponents very well.
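Mechanically, that step is just predict() with a newdata argument. The numbers below are made-up placeholders, not real projections for the Chargers game.

# Projected stat line for a hypothetical upcoming game.
new_game <- data.frame(Rush.Att = 30, Rush.Yards = 115, Rush.TD = 1,
                       Pass.Att = 32, Pass.Comp = 21, Pass.TD = 2,
                       Pass.Yards = 250, FG.Att = 2)

# Probability of winning given those projections, from the finished model.
predict(final_model, newdata = new_game, type = "response")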

As I said earlier, I will fine-tune the model by adding defense and special teams variables. I would love to add stats like big plays, time of possession, and average starting field position, but there is not a good set of historical data for these numbers, and I would have to create 116 games' worth of data by hand. That is something even I am not crazy enough to take on unless I had a throng of people at my disposal. I would also like to figure out a way to adjust for opponents, but that is something I will have to read up on in the future.

If you found the work above interesting, R is free to download, and there are many online sources and books to teach yourself how to create something like this. It is a little difficult to get started since it lacks a real graphical interface like Excel. I would describe R as rummaging through the cupboards to get something to eat: you know there is food back there, but you have to put the code in to find the exact bag of pretzels you want. With some time and effort, anyone can grasp it.

If you have any questions or need any clarifications, let me know in the comments.