Curated Data Science by Rahul

Fitting Non-linear Functions and Understanding Polynomial Regression in R

I recently watched a video on fitting non-linear functions in R, with a focus on polynomial regression. You can find it here. The speaker works through the implementation step by step, using R Markdown to keep the code and commentary together in a clearly documented piece of work.

Fitting Polynomial Regression

The session begins with polynomial regression on the Wage dataset. The primary objective is to model wage as a function of age, fitting a fourth-degree polynomial. Ordinarily you would have to construct a design matrix containing the terms X, X^2, X^3, and X^4 by hand, but R's formula syntax simplifies this through the poly() function. For example, the model can be specified as:

library(ISLR)  # assuming the Wage dataset that ships with the ISLR package
fit <- lm(wage ~ poly(age, 4), data = Wage)

This automatically generates the polynomial basis. The model summary reveals significant coefficients for degrees 1 through 3, while the quartic term is only marginally insignificant.
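To read those significance levels off directly, you can inspect the coefficient table (a minimal check, reusing the fit object above):

coef(summary(fit))  # estimate, std. error, t-value, and p-value for each degree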

Orthogonal Polynomials

A critical advantage of using the poly function is that, by default, it produces orthogonal polynomials. This orthogonality allows each coefficient to be tested independently, unlike raw polynomial terms (age, age^2, ...), whose strong correlation complicates interpretation. A typical output could show coefficients such as:

poly(age, 4)1: significant
poly(age, 4)2: significant
poly(age, 4)3: significant
poly(age, 4)4: not significant

This indicates that a cubic polynomial sufficiently represents the relationship, with the quartic term contributing little.
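A quick way to verify that orthogonalization changes only the basis, not the fit, is to compare against raw polynomials (a short sketch, reusing the objects above):

# raw = TRUE uses plain age, age^2, ... instead of the orthogonal basis
fit_raw <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(fit) - fitted(fit_raw)))  # effectively zero: identical fitted values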

Visualizing the Fit

Visualizing the fit involves plotting the predicted values along with confidence intervals. The speaker builds a grid of ages to predict over:

# grid covering the observed age range, plus predictions with standard errors
age_grid <- seq(min(Wage$age), max(Wage$age), by = 1)
preds <- predict(fit, newdata = data.frame(age = age_grid), se.fit = TRUE)

From this, upper and lower standard-error bands can be computed and drawn on the plot. Plotting the polynomial fit over the raw data then shows visually how well the model fits, with the bands representing uncertainty.
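A minimal version of that plot, assuming the objects defined above and using ±2 standard errors as an approximate 95% band, might look like:

se_bands <- cbind(preds$fit + 2 * preds$se.fit,
                  preds$fit - 2 * preds$se.fit)
plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "age", ylab = "wage")
lines(age_grid, preds$fit, lwd = 2, col = "blue")    # fitted curve
matlines(age_grid, se_bands, lty = 2, col = "blue")  # uncertainty band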

Nested Models and ANOVA

The video proceeds to address how to compare different polynomial fits (e.g., cubic vs. quartic) using ANOVA. A nested model check is performed via:

anova(fit1, fit2)

Here fit1 is a model with age and education, while fit2 adds age^2 as a term. The output lists F-statistics and corresponding p-values, which helps determine which variables (or polynomial degrees) are significant.
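Concretely, the comparison described above could be set up as follows (a sketch assuming the Wage data; anova() requires the models to be nested):

fit1 <- lm(wage ~ education + age, data = Wage)
fit2 <- lm(wage ~ education + poly(age, 2), data = Wage)
anova(fit1, fit2)  # F-test: does the quadratic age term improve the fit?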

For instance, if a cubic term is included but does not fit significantly better than a quadratic alone, the model can be simplified, improving interpretability without losing predictive power.
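The same idea extends to choosing the polynomial degree itself. A common pattern, sketched here rather than taken verbatim from the video, is to fit a sequence of degrees and compare them sequentially:

fit_1 <- lm(wage ~ age, data = Wage)
fit_2 <- lm(wage ~ poly(age, 2), data = Wage)
fit_3 <- lm(wage ~ poly(age, 3), data = Wage)
fit_4 <- lm(wage ~ poly(age, 4), data = Wage)
anova(fit_1, fit_2, fit_3, fit_4)  # each row tests degree d against degree d - 1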

Generalized Linear Models (GLM)

The discussion also transitions to fitting logistic regressions for binary outcomes, such as flagging high earners (wage above 250, i.e. $250,000, since wage is recorded in thousands of dollars). The 0-1 encoding is achieved through:

binary_wage <- ifelse(Wage$wage > 250, 1, 0)

Then the GLM is fitted with:

fit <- glm(binary_wage ~ poly(age, 3), data = Wage, family = binomial(link = 'logit'))

Standard errors and predicted probabilities are extracted in the same way as before, although predictions on the link scale must be transformed back to the probability scale via the logistic function:

probabilities <- exp(preds$fit) / (1 + exp(preds$fit))  # inverse logit; equivalent to plogis(preds$fit)
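Putting the pieces together, a sketch of the prediction step (assuming the age_grid from earlier and the glm fit above) looks like this; mapping the ±2 standard-error band through the inverse logit keeps it within [0, 1]:

preds <- predict(fit, newdata = data.frame(age = age_grid), se.fit = TRUE)  # link (logit) scale
probabilities <- plogis(preds$fit)             # inverse logit
upper <- plogis(preds$fit + 2 * preds$se.fit)  # upper band, probability scale
lower <- plogis(preds$fit - 2 * preds$se.fit)  # lower band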