
A Gentle Introduction to Bayesian Linear Regression


Understanding the Foundations of Bayesian Linear Regression

In traditional linear regression, as seen from a frequentist perspective, we seek to identify a single set of parameters that best fit the model. However, given the complexity of real-world data, relying solely on fixed parameters may fall short. What if we allowed for greater flexibility in our model parameters? This concept can be realized through the application of Bayes' theorem in linear regression. In this post, I will explore the mathematical underpinnings of Bayesian linear regression, complemented by visualizations and Python examples.

Table of Contents

  1. Overview of Bayesian Linear Regression
  2. Mathematical Foundations of Bayesian Linear Regression
  3. Implementing Bayesian Linear Regression with PyMC

1. Overview of Bayesian Linear Regression

Bayesian linear regression represents a variation of linear regression that utilizes Bayes' theorem. The key distinction between Bayesian and frequentist approaches is that the former estimates the distribution of parameters, while the latter provides fixed parameter estimates. To illustrate this concept, let’s examine a simple linear regression model.

(Figure: parameters of a simple linear regression estimated by frequentist and Bayesian methods.)

As depicted, frequentist linear regression yields fixed parameters: given the data, there is a single exact solution. Conversely, Bayesian linear regression places a probability distribution over each parameter, often a Gaussian for all parameters and predictions. This approach embraces multiple plausible explanations of how the data may have been generated. The advantages of Bayesian linear regression include:

  • Similar to ensemble methods, predictions can be inferred by averaging over numerous sampled parameters.
  • The uncertainty of each prediction can be quantified.
  • Focus can be directed towards uncertain areas, guiding data collection efforts.
  • More reliable estimations can be achieved, particularly with limited datasets.

However, there are notable drawbacks:

  • It often requires substantial computational resources.
  • Incorrect selection of probability distributions (prior and likelihood) may lead to erroneous conclusions.

In scenarios with ample data, other methodologies like deep learning may be employed. Yet, in fields such as manufacturing and healthcare, data may be scarce, positioning Bayesian analysis as a valuable alternative.

With an understanding of the benefits of Bayesian linear regression, let's delve into its mathematical foundation.

2. Mathematical Foundations of Bayesian Linear Regression

Before exploring Bayesian linear regression, it is beneficial to briefly review frequentist linear regression. The equation for multiple linear regression is presented below.
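
In matrix notation, with target vector y, design matrix X, coefficient vector \beta, and residual (noise) term \varepsilon:

y = X\beta + \varepsilon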

The parameters can be derived by minimizing the sum of squared errors:
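
\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 = (X^\top X)^{-1} X^\top y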

It is crucial to note that we assume X has full column rank, so that X^\top X is invertible. The least squares approach minimizes the discrepancy between observed and predicted values, and the resulting closed-form equation yields a single fixed parameter vector for frequentist linear regression. For further insights, refer to my previous blog post [2].

How does one interpret linear regression from a Bayesian perspective? Here’s Bayes' theorem:
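
For parameters \theta and observed data D:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}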

When applying Bayes' theorem to linear regression, the equation can be expressed as follows:
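
Treating the coefficients \beta as the unknowns and the data (X, y) as the evidence:

p(\beta \mid y, X) = \frac{p(y \mid X, \beta)\, p(\beta)}{p(y \mid X)} \propto p(y \mid X, \beta)\, p(\beta)

The denominator p(y \mid X) does not depend on \beta and serves only as a normalization constant.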

Here, the likelihood denotes the probability density of the residuals given the parameters, the prior is the probability density of the parameters, and the posterior is the probability density of the parameters given the residuals. Evaluating this formula requires specifying a probability distribution for each term in advance. Although any distribution can be chosen, a tractable posterior is not guaranteed, which is why Gaussian distributions are commonly assumed.

For the prior, a Gaussian distribution is typically adopted as a weakly informative choice. The likelihood is also modeled as a Gaussian centered at zero over the residuals, which encourages small residuals. Now, let's derive the posterior distribution under these assumptions, applying a logarithmic transformation for simplicity.
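
Concretely, assume a prior \beta \sim \mathcal{N}(0, \tau^2 I) and a likelihood y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I); the scales \tau and \sigma, and the names \mu_N and \Sigma_N below, are notation introduced here for the derivation. Taking logarithms and completing the square in \beta:

\log p(\beta \mid y, X) = -\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{1}{2\tau^2}\|\beta\|^2 + \text{const}
                        = -\frac{1}{2}(\beta - \mu_N)^\top \Sigma_N^{-1} (\beta - \mu_N) + \text{const},

where \Sigma_N = \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I\right)^{-1} and \mu_N = \frac{1}{\sigma^2} \Sigma_N X^\top y.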

The resulting expression represents a multivariate Gaussian distribution, with the normalization term incorporated into the constant.

This derivation indicates that a Gaussian prior results in a Gaussian posterior, establishing the Gaussian distribution as a conjugate prior for linear regression. Conjugate priors are advantageous because they provide closed-form solutions for the posterior. For more details, refer to this post [3]. As new data is acquired, the posterior can be updated and used for inference by sampling parameters and substituting them into the linear regression equation.

Although a Gaussian prior makes the posterior tractable here, in the general case (for example, with non-conjugate priors or likelihoods) a closed-form posterior is often unavailable. In such cases, the MCMC (Markov chain Monte Carlo) method can be used to sample from an approximation of the posterior. Several libraries address this general scenario, with PyMC being a prominent choice.

Now that we have established the mathematical principles behind Bayesian linear regression, the next section will cover how to implement it using PyMC.

3. Implementing Bayesian Linear Regression with PyMC

In this segment, we will leverage PyMC to perform Bayesian linear regression. You can follow this guide [4] to install PyMC via Conda.

We will utilize the student performance dataset [5]. Here are the initial five entries of this dataset.
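
A minimal sketch for loading and inspecting the data; the file name Student_Performance.csv is an assumption about the Kaggle download:

import pandas as pd

# load the student performance dataset (file name assumed from the Kaggle download)
df = pd.read_csv("Student_Performance.csv")
print(df.head())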

This dataset comprises five independent variables (“Hours Studied”, “Previous Scores”, “Extracurricular Activities”, “Sleep Hours”, and “Sample Question Papers Practiced”) and one dependent variable (“Performance Index”). The following graph illustrates the scatter matrix.
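
A sketch for reproducing such a scatter matrix; the Yes/No encoding of “Extracurricular Activities” is an assumption about how the raw file stores that column:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("Student_Performance.csv")
# encode the Yes/No column numerically so it appears in the scatter matrix
df["Extracurricular Activities"] = df["Extracurricular Activities"].map({"Yes": 1, "No": 0})
pd.plotting.scatter_matrix(df, figsize=(10, 10))
plt.show()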

Given that the performance index exhibits a Gaussian-like distribution, employing a Bayesian linear regression model is a sensible choice. We will assume that both the parameters and dependent variable follow a Gaussian distribution.

Recall that we previously assumed the residuals follow a Gaussian distribution with a mean of zero. In this context, we also assume the dependent variable follows a Gaussian distribution with its mean equal to the predicted value from linear regression. Although this might seem different, both setups are equivalent.
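
In symbols, the two formulations describe the same model:

y = X\beta + \varepsilon, \; \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \quad\Longleftrightarrow\quad y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)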

Using PyMC, you can implement Bayesian linear regression with just a few lines of code.

import pymc as pm
import pytensor.tensor as at  # on PyMC 4.x, use `import aesara.tensor as at` instead

# `columns`, `X_train`, and `y_train` come from the dataset preparation above:
# `columns` is the list of predictor names, `X_train` the predictor matrix,
# and `y_train` the observed performance index.
test_score_bayesian = pm.Model(coords={"predictors": columns})

with test_score_bayesian:
    # prior on the standard deviation of the observation noise
    sigma = pm.HalfNormal("sigma", 25)

    # priors on the regression coefficients and the intercept
    beta = pm.Normal("beta", 0, 10, dims="predictors")
    beta0 = pm.Normal("beta0", 0, 10)

    # expected value of the performance index under the linear model
    mu = beta0 + at.dot(X_train, beta)

    # likelihood: observations are Gaussian around the regression line
    y_hat = pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y_train)

Isn’t it quite straightforward? You can also sample from your implemented model and visualize the distributions of the variables.
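
A minimal sketch of that step, continuing from the model above; the draw counts and the use of ArviZ for plotting are my choices, not the author's:

import arviz as az

with test_score_bayesian:
    # draw posterior samples with PyMC's default NUTS sampler
    idata = pm.sample(draws=1000, tune=1000, random_seed=42)

# visualize the marginal posterior distribution and sampling trace of each variable
az.plot_trace(idata)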

Here are the coefficients derived from both the Bayesian linear regression model and the frequentist approach.
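
One way to produce such a comparison (a sketch assuming the idata object from the sampling step and scikit-learn for the frequentist fit):

import arviz as az
from sklearn.linear_model import LinearRegression

# posterior means and standard deviations of the Bayesian coefficients
print(az.summary(idata, var_names=["beta0", "beta"])[["mean", "sd"]])

# ordinary least squares coefficients for comparison
ols = LinearRegression().fit(X_train, y_train)
print(ols.intercept_, ols.coef_)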

As illustrated, the results from both methods are remarkably similar. This is expected: the wide, weakly informative Gaussian prior contributes little, so the posterior is dominated by the likelihood and its mean lands close to the least squares estimate. When additional prior information is available, you can encode it in the prior distribution and potentially obtain more accurate results.

Lastly, I will share the gist link to enable you to reproduce the results.

In this post, we examined both the mathematical foundations and practical execution of Bayesian linear regression. By utilizing PyMC, creating a Bayesian linear regression model becomes a straightforward process. If I overlooked any details, please feel free to reach out. Thank you for taking the time to read this article!

References

[1] CSC Lecture 19: Bayesian Linear Regression, University of Toronto
[2] Shizuya, Y., A Gentle Introduction to Multiple Linear Regression, Intuition
[3] Ms. Aerin, Understanding Conjugate Priors, Towards Data Science
[4] PyMC installation guide, https://www.pymc.io/projects/docs/en/stable/installation.html
[5] Student Performance Dataset, Kaggle