17  Generalized Linear Models

The basic idea behind Generalized Linear Models (not to be confused with General Linear Models) is to specify a link function that transforms the response space into a modeling space where we can perform our usual linear regression, and to capture the dependence of the variance on the mean through a variance function. The parameters of the model are expressed on the scale of the modeling space, but we can always transform them back into our original response space using the inverse link function, also called the mean function.

Let’s see what this looks like. A GLM specifies three ingredients. The first is a linear combination of coefficients and predictors (just like the one we have been using throughout our exploration of regression so far):

\[ \eta = b_0 + b_1x_1 + b_2x_2 + ... \]

The second ingredient is the link function:

\[ g(\mu) = \eta, \quad \textrm{where } \mu = E(y|x_i) \]

Note that the link function is applied to the left-hand side of this expression, i.e., the portion pertaining to our outcome variable \(y\). Often, people prefer to think about their outcome variable in its natural scale. Thus, we can rearrange so that the left-hand side is left untransformed:

\[ \mu = E(y|x_i) = g^{-1}(\eta) \]

So, to summarize, \(g()\) is the link function and is applied to the outcome variable, \(y\), whereas \(g^{-1}\) is the inverse link function or mean function and is applied to the linear combination of coefficients and predictors (e.g., \(\eta\)).

The last ingredient is the distribution for modeling the variance, \(Var(y|x)\). There are many choices here. The two most commonly used in psychology are the binomial distribution (logistic regression), used for binary data (Bernoulli trials), and the Poisson distribution (Poisson regression), used for count data, where the number of trials is not well-defined.
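To make these three ingredients concrete, here is a minimal sketch in R for the logistic case, using the built-in plogis() and qlogis() functions as the inverse link and link; the coefficient and predictor values are made up purely for illustration.

## hypothetical coefficients and predictor value, for illustration only
b0 <- -1.5
b1 <- 0.8
x1 <- 2

## ingredient 1: the linear predictor, eta (modeling space)
eta <- b0 + b1 * x1

## ingredient 2 (inverse link / mean function): back to the response space
mu <- plogis(eta)   # equivalent to 1 / (1 + exp(-eta))

## the link function maps the mean back onto the modeling space
qlogis(mu)          # recovers eta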

17.1 Logistic regression

17.1.1 Terminology

Bernoulli trial: An event with a binary outcome, with one outcome considered ‘success’
proportion: The ratio of successes to the total number of Bernoulli trials
odds: The ratio of p(success) to p(failure)
log odds: The (natural) log of the odds

In logistic regression, we are modeling the relationship between the response and a set of predictors in log odds space.

Logistic regression is used when the individual outcomes are Bernoulli trials, or events with binary outcomes. Typically one of the two outcomes is referred to as ‘success’ and is coded as a 1; the other is referred to as ‘failure’ and is coded as 0. Note that the terms ‘success’ and ‘failure’ are completely arbitrary, and should not be taken to imply that the more desirable category should always be coded as 1. For instance, when flipping a coin we could equivalently choose ‘heads’ as success and ‘tails’ as failure or vice versa.

Often the outcome of a sequence of Bernoulli trials is communicated as a proportion: the ratio of successes to the total number of trials. For instance, if we flip a coin 100 times and get 47 heads, we would have a proportion of 47/100 or .47, which would also be our estimate of the probability of the event. For events coded as 1s and 0s, a shortcut way of getting the proportion is to use the mean() function.
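For example, with a made-up sequence of ten coin flips coded as 1s and 0s:

flips <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)   # 1 = heads ('success'), 0 = tails
sum(flips) / length(flips)                 # proportion of successes: 0.6
mean(flips)                                # the shortcut: also 0.6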

We can also talk about the odds of success: for example, the odds of heads versus tails are one to one, or 1:1. If it rains on 170 days of the year in Glasgow, the odds of it raining on a given day would be 170:195, where the denominator is the number of days it did not rain (365 - 170 = 195). Expressed as a decimal number, the ratio 170/195 is about 0.87, and is known as the natural odds. Natural odds range from 0 to \(+\infty\). Given \(k\) successes on \(n\) trials, we can represent the natural odds as \(\frac{k}{n - k}\). Or, given a probability \(p\), we can represent the odds as \(\frac{p}{1-p}\).

The natural log of the odds, or logit, is the scale on which logistic regression is performed. Recall that the logarithm of some value \(k\) gives the exponent that would yield \(k\) for a given base. For instance, \(\log_2 16\) (the log of 16 to the base 2) is 4, because \(2^4 = 16\). In logistic regression, the base that is typically used is \(e\) (also known as Euler’s number). To get the log odds from the odds of, say, rain in Glasgow, we would use log(170/195), which yields -0.1372011; to get the natural odds back from the log odds, we would use the inverse, exp(-0.137), which returns about 0.872.
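Using the Glasgow rainfall figures from above, the conversions look like this in R:

days_rain <- 170
days_dry  <- 365 - days_rain       # 195
days_rain / days_dry               # natural odds, about 0.87
log(days_rain / days_dry)          # log odds, about -0.137
exp(log(days_rain / days_dry))     # back to natural odds, about 0.87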

17.1.2 Properties of log odds

\[ \textrm{log odds} = \log \left(\frac{p}{1-p}\right) \]

Log odds has some nice properties for linear modeling.

First, it is symmetric around zero, and zero log odds corresponds to maximum uncertainty, i.e., a probability of .5. Positive log odds means that success is more likely than failure (Pr(success) > .5), and negative log odds means that failure is more likely than success (Pr(success) < .5). A log odds of 2 means that success is more likely than failure by the same amount that -2 means that failure is more likely than success. The scale is unbounded; it goes from \(-\infty\) to \(+\infty\).
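We can see this symmetry by converting log odds back to probabilities with the inverse link; plogis() is base R’s inverse logit.

plogis(0)     # log odds  0 -> probability .50 (maximum uncertainty)
plogis(2)     # log odds  2 -> probability ~.88 (success more likely)
plogis(-2)    # log odds -2 -> probability ~.12 (failure more likely)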

17.1.4 Estimating logistic regression models in R

For single-level data, you use the glm() function. It works much like the lm() function you are already familiar with; the main difference is that you specify a family argument, which determines the link and variance functions. For logistic regression, you use family = binomial(link = "logit"). Because the logit link is the default for the binomial family, typing family = binomial is sufficient.

glm(DV ~ IV1 + IV2 + ..., data = my_data, family = binomial)
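As a minimal, self-contained example, we could fit a logistic regression to simulated data (the data below are generated on the spot and are not from any of the book’s data sets):

set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

mod <- glm(y ~ x, data = dat, family = binomial)
summary(mod)                            # coefficients are on the log odds scale
head(predict(mod, type = "response"))   # fitted probabilities (inverse link applied)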

17.2 Activities

  • If you haven’t already, load the nba data set we downloaded back in Section 4.2, passing c("Undrafted") as the na argument when reading it in. (A sketch of one possible workflow appears after this list.)
  • Determine the mean height of players in the data set.
  • Create a new column called player_tall which has ones in rows where the corresponding player is taller than the mean height and zeros in rows where the corresponding player is shorter than (or exactly as tall as) the mean height.
  • Conduct a logistic regression with pts as a predictor and player_tall as the outcome/target variable.
  • Plot the model’s predictions (i.e., the predicted probability that player_tall equals 1) for a range of values of pts (e.g., from 0 to 50).
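One possible workflow for these activities is sketched below. The file name nba.csv and the height column player_height are assumptions (check your own copy of the data with names(nba)); only player_tall and pts are named in the activities themselves.

library(tidyverse)

## file and column names below are assumptions; adjust to match your data
nba <- read_csv("nba.csv", na = c("Undrafted"))

mean_height <- mean(nba$player_height)

nba <- nba %>%
  mutate(player_tall = if_else(player_height > mean_height, 1L, 0L))

mod <- glm(player_tall ~ pts, data = nba, family = binomial)

## predicted p(player_tall == 1) for pts from 0 to 50
newdat <- tibble(pts = 0:50)
newdat$p_tall <- predict(mod, newdata = newdat, type = "response")

ggplot(newdat, aes(pts, p_tall)) + geom_line()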