Session 3
Correlation & Regression Models
Session Contents
  • Correlation Analysis
  • Regression Models
[Figure: example scatter plots showing positive correlation, negative correlation, and no correlation]
Pearson & Spearman Correlations in R

To compute correlations in R, use the cor() function. Pearson's correlation measures the strength of a linear relationship; Spearman's is rank-based and measures monotonic relationships.

# Simulate example data
set.seed(123)  # fix the random seed so results are reproducible
data <- data.frame(
    var1 = rnorm(100),  # Continuous variable
    var2 = rnorm(100)   # Another continuous variable
)

# Pearson correlation (linear relationship)
cor(data$var1, data$var2, method = "pearson")

# Spearman correlation (monotonic relationship)
cor(data$var1, data$var2, method = "spearman")
Interpreting Correlation Statistics in R

When computing correlations in R, three key statistics help interpret the results; cor.test() reports all of them, as shown in the sketch below:

  • Correlation coefficient (r / ρ): Measures the strength and direction of the relationship.
  • p-value: Tests whether the observed correlation differs significantly from zero.
  • Confidence interval: A range of plausible values for the true correlation (typically 95%).
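
A minimal sketch of how to obtain all three statistics at once, reusing the simulated var1 and var2 from above (cor.test() defaults to Pearson; pass method = "spearman" for the rank-based version, which does not return a confidence interval):

# Correlation test: coefficient, p-value, and 95% CI in one call
test <- cor.test(data$var1, data$var2, method = "pearson")
test$estimate   # correlation coefficient r
test$p.value    # p-value for H0: true correlation is 0
test$conf.int   # 95% confidence interval (Pearson only)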
Regression Models
Linear regression
Logistic regression
Simple & Adjusted Linear Regression in R

A simple linear regression includes one predictor, while an adjusted model controls for additional variables.

  • Simple Model: Outcome (Y) predicted by a single predictor (X).
  • Adjusted Model: Additional covariates (Z1, Z2) are included to account for confounding.
# Simulate example dataset
set.seed(123)  # for reproducible simulated data
data <- data.frame(
    Y  = rnorm(100, mean = 50, sd = 10),  # Outcome variable
    X  = rnorm(100, mean = 20, sd = 5),   # Main predictor
    Z1 = rnorm(100, mean = 10, sd = 3),   # Adjusting variable 1
    Z2 = rnorm(100, mean = 30, sd = 7)    # Adjusting variable 2
)

# Simple linear regression
model_simple <- lm(Y ~ X, data = data)
summary(model_simple)

# Adjusted linear regression
model_adjusted <- lm(Y ~ X + Z1 + Z2, data = data)
summary(model_adjusted)
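
To see what adjustment does, the estimate for X can be compared across the two models; a minimal sketch, assuming the model objects fitted above:

# Compare the coefficient of X before and after adjusting for Z1 and Z2
c(simple   = coef(model_simple)["X"],
  adjusted = coef(model_adjusted)["X"])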
Understanding Linear Regression Output

When running summary(lm()) in R, key statistics help interpret the model:

  • Estimate (β): Coefficient that quantifies the effect of the predictor.
  • Std. Error: Variability of the coefficient estimate.
  • p-value: Probability of seeing an estimate at least this extreme if the true coefficient were 0; small values indicate a statistically significant predictor.
  • R² (R-squared): Proportion of variance in Y explained by the model.
# Run a simple linear regression
model <- lm(Y ~ X, data = data)
summary(model)
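
These statistics can also be extracted programmatically; a minimal sketch, assuming the model object fitted above:

# Coefficient table: Estimate, Std. Error, t value, and p-value
coef(summary(model))

# 95% confidence intervals for the coefficients
confint(model)

# R-squared: proportion of variance in Y explained by the model
summary(model)$r.squared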
Logistic Regression in R

Logistic regression models the probability of a binary outcome using a predictor variable.

  • Outcome (Y): Binary variable (e.g., 1 = Success, 0 = Failure).
  • Predictor (X): Continuous or categorical variable affecting Y.
# Simulate example dataset
set.seed(123)  # for reproducible simulated data
data <- data.frame(
    Y = rbinom(100, 1, prob = 0.5),    # Binary outcome (0 or 1)
    X = rnorm(100, mean = 20, sd = 5)  # Predictor variable
)

# Logistic regression model
model_logistic <- glm(Y ~ X, data = data, family = binomial)
summary(model_logistic)
Understanding Logistic Regression Output

Running summary() on a fitted glm() model provides key statistics to interpret a logistic regression:

  • Estimate (β): Log-odds change for each unit increase in the predictor.
  • Std. Error: Variability of the coefficient estimate.
  • p-value: Tests whether the coefficient differs significantly from 0.
# Run a logistic regression
model_logistic <- glm(Y ~ X, data = data, family = binomial)
summary(model_logistic)
Differences Between Linear and Logistic Regression
Characteristic       | Linear Regression                  | Logistic Regression
Type of Y variable   | Continuous (e.g., blood pressure)  | Binary (e.g., disease: Yes/No)
Interpretation of β  | Change in Y per unit increase in X | Change in log-odds of Y per unit increase in X
Predictions          | Continuous values                  | Probabilities (between 0 and 1)
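
The last row can be verified directly in R; a minimal sketch, assuming the model_simple and model_logistic objects fitted earlier:

# Linear regression: fitted values on the scale of Y (continuous)
head(predict(model_simple))

# Logistic regression: predicted probabilities between 0 and 1
head(predict(model_logistic, type = "response"))

# Without type = "response", predict() returns the log-odds
head(predict(model_logistic))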
Interpretation of a Logistic Regression Model: Odds Ratios

Logistic regression coefficients are on the log-odds scale, so they are easier to report as odds ratios: exponentiating β gives the OR, and a 95% Wald confidence interval follows from exp(β ± 1.96 × SE).
# Fit logistic regression model
model <- glm(Y ~ X, data = data, family = binomial)

# Extract coefficients and standard errors
coef_table <- summary(model)$coefficients
beta <- coef_table["X", "Estimate"]  # Logistic regression coefficient
se <- coef_table["X", "Std. Error"]  # Standard error

# Convert to OR and 95% CI
OR <- exp(beta)
CI_lower <- exp(beta - 1.96 * se)
CI_upper <- exp(beta + 1.96 * se)

# Print results
c(OR = OR, CI_lower = CI_lower, CI_upper = CI_upper)
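
As a cross-check, confint.default() computes the same Wald interval (estimate ± qnorm(0.975) × SE) directly, which can then be exponentiated:

# Wald 95% CI for X via confint.default(), converted to the OR scale
exp(confint.default(model, "X"))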