# Innovative Insights into Predicting Protein-Protein Interactions

The human interactome, which encompasses all protein-protein interactions (PPIs), is estimated to contain around 600,000 interactions. With such a vast array of potential PPIs, forecasting how mutations that cause diseases influence the interactome may seem daunting. However, this task is not as insurmountable as one might think, particularly when a University of Waterloo co-op student is provided access to a powerful GPU cluster, top-notch mentorship, and the freedom to explore various approaches.

By utilizing the machine learning framework XGBoost, the advanced deep learning software AlphaFold-Multimer (AF-M), and executing over 47,000 SLURM jobs, I developed a multi-classifier model that predicts the consequences of missense mutations on PPIs, achieving an AUC of 91%.

In this article, I will detail the following:

  • Background: The research question and the rationale behind it.
  • Methodology & Approach: How the problem was framed and what we set out to learn.
  • Data Acquisition & Processing: The methods and reasons for acquiring and preprocessing protein data.
  • Machine Learning Model: The model selection process and its implementation for protein data.
  • Results, Model Accuracy, & Feature Importances: An analysis of the model's outputs and significant features, categorized accordingly.
  • Case Studies on ASD-Related Proteins: A focused examination of two specific results related to autism spectrum disorder proteins.
  • Final Thoughts and Lessons: Conclusions, important insights, and future directions for the project.

If you are engaged in bioinformatics or molecular biology and wish to incorporate machine learning into your research, this article is tailored for you.

Let’s dive in!

# Background

In this section, I will discuss:

  • The research question at hand.
  • The reasons for selecting this question.

A well-known fact: protein-protein interactions are critical. My bias may stem from eight months of study at The Center for Applied Genomics at SickKids, but here's why I believe this is the case:

  • PPIs play a role in numerous metabolic processes.
  • Mutations that alter PPIs are frequently linked to diseases (Cheng et al., 2021).

Understanding how and why a specific mutation influences PPIs can lead to significant advancements:

  • A deeper understanding of disease mechanisms.
  • Improved drug design for various diseases.

The release of AlphaFold-Multimer (AF-M) by DeepMind, a version of AlphaFold specifically trained on protein complexes, presented both an opportunity and a question:

Can AF-M effectively capture the influence of missense variants on PPIs?

You may be curious — why focus on missense variants (single residue changes) rather than other mutation types?

The reasoning is twofold:

  1. AF-M is unable to predict macro-level structural changes resulting from point mutations. For example, providing a sequence with a known structure-disrupting mutation wouldn’t yield an unfolded structure.
  2. However, in silico alterations at the PPI interface can lead to observable changes, measurable through programs like ChimeraX.

In essence, it’s a research inquiry that capitalizes on current technological advancements.

Next, I will outline the methodology employed to address this question.

# Methodology & Approach

In this section, I will cover:

  • The central concept behind framing this problem.
  • A general overview of what we aim to learn.

3D structures — like those produced by AF-M — possess unique, observable properties unavailable elsewhere. These primarily include structural features such as:

  • Protein-protein interface area
  • Shape complementarity
  • Docking score

This valuable data can only be derived from PDB structures. Consequently, our approach was as follows:

  1. Begin with a wildtype protein complex (a standard homo- or hetero-dimer).
  2. Create a similar complex, incorporating a missense variant in one of the proteins — a "missense complex."

By identifying differences in the structural attributes of these complexes and associating them with the specific missense variant, we can establish a relationship:

Impact on PPI ~ f(structural features)

In other words, we can assess the impact on PPI as a function of structural features.

The illustration above depicts a hypothetical scenario where two identical complexes, differing only by a missense variant on the right, show significant variations across hypothetical features A through D.

Another opportunity arises: What if I enhanced that structural data with variant annotations? Examples include:

  • Predictions of pathogenicity (CADD, REVEL, AlphaMissense)
  • Pathogenicity annotations from ClinVar
  • Allele frequency data

This would yield additional features to analyze, thereby developing a more holistic function:

Impact on PPI ~ f(x₁, x₂, …, xₙ)

During data collection, I successfully incorporated 40+ additional non-structural features alongside the structural ones. This enriched dataset served as excellent training material for an XGBoost classifier model designed to predict the Impact on PPI across four categories:

  1. Increase in interaction strength. (Binding between the proteins improves)
  2. Decrease in interaction strength. (Binding between the proteins weakens)
  3. Disruption of interaction. (The proteins no longer interact)
  4. No effect on interaction. (The variant does not impact the interaction quality)

You may wonder — how did I acquire all this data linking missense variants to PPI effects? And why did I select XGBoost as my classifier model?

Let me clarify.

# Data Acquisition & Processing

In this section, I will cover:

  • The source of my data (IntAct mutations database).
  • How I preprocessed and engineered features from that data.

IntAct Mutations Database

I sourced my training data from the IntAct Mutations database (licensed under CC0). This extensive annotated repository contains thousands of missense variants associated with binary protein interactions.

Each variant is labeled with its effect on PPI, detailed in the 'Feature type' column: increase, decrease, no effect, etc.

However, it's important to note that this dataset lacks structural feature information or other variant annotations. That's where I (with AF-M’s assistance) stepped in.

The database comprises approximately 12,000 eligible data points. To optimize computational efficiency and balance data classes, I randomly subsampled variants until I had around 1,000 for each category.

> Curious about random subsampling?
>
> Random subsampling involves selectively pulling data points from a given category until a sufficient number is obtained. This method helps ensure your classes are equally weighted in the training dataset, preventing any bias in the classifier towards a single category.
>
> Alternatively, one could assign different weights to each class, but random subsampling minimizes the number of protein complexes that need to be folded — which conserves computational resources.
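
For concreteness, here is a minimal pandas sketch of that balancing step. The file name and the 'Feature type' column are assumptions based on the description above, not the project's actual code:

```python
import pandas as pd

# Load the IntAct mutations table; path and column name are illustrative.
intact = pd.read_csv("intact_mutations.tsv", sep="\t")

# Randomly subsample up to ~1,000 variants per PPI-effect class.
n_per_class = 1000
balanced = (
    intact.groupby("Feature type", group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), n_per_class), random_state=0))
)
print(balanced["Feature type"].value_counts())
```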

I then crafted a script to generate FASTA files for both the wildtype and missense complexes before inputting them into AF-M to create PDB structures. Due to time constraints, the final dataset consisted of around 250 structures per class.

> Wondering how I transitioned from UniProt IDs to PDB structures?
>
> This process was accomplished through a straightforward Bash script that extracted UniProt IDs from the IntAct Mutations file and retrieved the corresponding FASTAs via a curl request. A wildtype FASTA was generated, and a copy was then ‘mutated’ to create a missense version, both serving as inputs for AF-M.
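
The original script was written in Bash with curl; the sketch below shows the same two steps in Python. The UniProt REST URL pattern is real, but the toy sequence and the 'A4G' variant are purely illustrative:

```python
import requests

def fetch_sequence(uniprot_id: str) -> str:
    """Fetch a protein sequence from UniProt's public REST endpoint."""
    url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
    fasta = requests.get(url, timeout=30).text
    return "".join(fasta.splitlines()[1:])   # drop the '>' header line

def apply_missense(seq: str, variant: str) -> str:
    """Apply a variant like 'Q234R': wildtype residue, 1-based position, mutant."""
    wt, pos, mut = variant[0], int(variant[1:-1]), variant[-1]
    assert seq[pos - 1] == wt, "wildtype residue mismatch"
    return seq[:pos - 1] + mut + seq[pos:]

# Toy demonstration on a made-up sequence and variant:
print(apply_missense("MKTAYIAKQR", "A4G"))   # -> MKTGYIAKQR
```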

This procedure resulted in two PDB versions for each protein complex: a wildtype and a missense variant (as shown above). When multiplied across approximately a thousand binary interactions and 47,000 SLURM jobs, we accumulated roughly 20TB of data to analyze.

Feature Extraction & Engineering

Next, I needed to extract both structural and variant feature information:

The AF-M pipeline I utilized, AlphaPulldown, simulates a pulldown assay by parallelizing the 'folding' of multiple protein complexes. It also includes a feature extraction process that generates several critical structural metrics:

  • Interface area (Å²)
  • Number of interface residues
  • Shape complementarity
  • Percentage of polar/hydrophobic/charged residues at the interface
  • Docking score
  • Protein-Interface score

Additionally, I incorporated several features of my own, utilizing annotations from the Variant Effect Predictor by Ensembl, which is available under the Apache 2.0 license.

Pathogenicity Predictions

Two notable examples of pathogenicity predictions included:

  • AlphaMissense annotations. These provide pathogenicity predictions for each possible missense mutation within the human proteome.
  • REVEL pathogenicity predictions. REVEL generates an average score based on various other pathogenicity prediction tools.

Why (and how) were these included?

  • If pathogenic missense variants are prevalent at PPI interfaces, then the AlphaMissense and REVEL scores are likely strong indicators of PPI disruption.
  • I derived these values by converting the IntAct mutations into a VEP-compatible format.

gnomAD Frequencies

The Genome Aggregation Database (gnomAD) from the Broad Institute offers population allele frequencies for different groups and variants.

Why (and how) were frequencies included?

  • Variant frequency data can assist in determining whether common or rare variants are more prevalent in PPI-disrupting pathologies.
  • I obtained these values by converting the IntAct mutations into a VEP-friendly format.

Relative Ratios

I also constructed simple ratio features, such as the interface area divided by the total surface area, along with difference features that compare each metric to its wildtype value.

Why include relative ratios?

  • Tree-based models like XGBoost split on one feature at a time, so they cannot easily learn a ratio of two features on their own; supplying the ratios explicitly helps.
  • Intuitively, a declining ratio of interface area to total surface area suggests a weakened interaction, to give just one example (a minimal sketch follows below).
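
To make these engineered features concrete, here is a minimal sketch with toy values; the column names (interface_area, total_surface_area) are illustrative, not AlphaPulldown's actual output names:

```python
import pandas as pd

# Toy per-complex structural metrics (Å² values are made up).
df = pd.DataFrame(
    {"interface_area": [850.0, 610.0], "total_surface_area": [12400.0, 12450.0]},
    index=["wildtype", "missense"],
)

# Ratio feature: fraction of the surface committed to the interface.
df["interface_fraction"] = df["interface_area"] / df["total_surface_area"]

# Difference-from-wildtype feature for the missense complex.
interface_area_diff_from_wt = (
    df.loc["missense", "interface_area"] - df.loc["wildtype", "interface_area"]
)
print(interface_area_diff_from_wt)   # -240.0: the interface shrank
```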

Free Energy

Using EvoEF2 (licensed under MIT), I gathered thermodynamic data on protein complexes, comparing the change in Gibbs free energy (ΔG) between the folded and unfolded states for wildtype and mutant variants.

Why consider free energy?

  • We anticipate higher free energy predictions from complexes with PPI-disrupting mutations, indicating a more unstable interaction (a toy calculation of this feature follows below).
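
As a toy illustration of the resulting feature (the Total_free_energy_diff_from_wt value discussed later), with made-up energies:

```python
# Toy EvoEF2 totals (kcal/mol); real values come from the EvoEF2 output.
dg_wildtype = -35.2   # folded-vs-unfolded free energy, wildtype complex
dg_missense = -28.9   # same metric for the missense complex

# The engineered difference feature used by the model:
total_free_energy_diff_from_wt = dg_missense - dg_wildtype
print(round(total_free_energy_diff_from_wt, 1))   # 6.3 -> missense complex is less stable
```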

Ultimately, the dataset I compiled resembled the following:

However, it's worth noting that bioinformatics tools can be fallible. Metrics for certain variants are frequently incomplete or missing. For example, not every variant in IntAct has an associated gnomAD frequency.

> Why are missing values problematic for machine learning?
>
> It’s intuitive to understand that missing values complicate any machine learning endeavor. With an incomplete dataset, trusting your model's output can become challenging. Fortunately, there are strategies to address these concerns.

Dealing with missing values can be frustrating in machine learning, necessitating a model capable of handling these issues. Fortunately, XGBoost proved to be a suitable choice — let me explain how.

# The Machine Learning Model

In this section, I will detail:

  • The choice of XGBoost.
  • Hyperparameter tuning and cross-validation.

XGBoost

XGBoost is a gradient-boosted decision tree model.

Decision trees are structures that segment data at threshold-based nodes. Here’s a simple illustration of whether to play outside:

Connecting multiple trees in sequence allows each to correct the errors of the preceding one, thus forming a more robust model, hence the term gradient-boosted.
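
To make that idea concrete, here is a minimal residual-fitting sketch of gradient boosting on toy regression data. It is a conceptual illustration, not the project's model:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from a trivial model
ensemble = []

for _ in range(50):
    residual = y - prediction                          # what we still get wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)      # nudge toward the target
    ensemble.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```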

XGBoost is a lightweight, fast-training model with several advantages for this specific task:

  • No need for normalized data. This eases the workload for me and makes interpreting how the model reached its conclusions much simpler.
  • Handling of multiple data types. This includes categorical and nominal data; however, I utilized encoded dummy variables for the categorical data.
  • Supervised model capability. While AI is making remarkable advancements in science, it can be easy to lose sight of the biological objectives. Supervised models allow us to delve deeper into the biological mechanisms behind PPI outcomes.
  • Robust handling of missing data. As previously mentioned, not all variants come with complete annotations. XGBoost has mechanisms to ensure that missing data does not halt the training process.

> How does XGBoost manage missing data?
>
> The model treats missing values as a distinct kind of value. At a splitting node, the algorithm determines which direction (left/right) yields the highest gain (optimal reduction of the loss function) and encodes that choice into its training. When encountering a similar missing value in the test set, it will replicate that decision.
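
A minimal demonstration of this behavior, using toy data with NaNs and no imputation step:

```python
import numpy as np
from xgboost import XGBClassifier

# Four toy samples with NaNs scattered through the features.
X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 0.7],
              [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

# No imputation step: XGBoost learns a default branch for missing values.
clf = XGBClassifier(n_estimators=10).fit(X, y)
print(clf.predict(np.array([[3.0, np.nan]])))
```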

Given the dataset's modest size and XGBoost's rapid training, I decided to include a hyperparameter tuning phase utilizing k-fold cross-validation.

Hyperparameter Tuning & K-Fold Cross-Validation

Every machine learning model has internal parameters: the values learned during training (for decision trees, the split thresholds; for neural networks, the weights and biases).

However, ML models also possess hyperparameters, which are higher-level structural settings for the overall model. These parameters are established prior to training and can greatly influence accuracy outcomes.

Here are the hyperparameters I focused on:

  • Max_depth: Maximum depth of a tree
  • Learning_rate: Contribution of each tree to the final result
  • N_estimators: Number of trees
  • Subsample: Fraction of the training set sampled for each tree

We determine the optimal hyperparameters by testing various combinations and selecting the one yielding the best results — a process known as tuning.

To further enhance the model training, I implemented k-fold cross-validation. What does this entail?

Let’s break it down:

  • In machine learning, we typically divide the dataset into training and test sets. A single fixed split, however, risks overfitting to that particular split, making the model less effective on unseen data.
  • To mitigate this, we can divide our dataset into several segments (k). For each of the k models, one segment serves as the test set while the remaining segments form the training data.
  • This process is repeated for all k segments, shuffling the data to prevent overfitting to any specific training/testing split.

This method is applied to every hyperparameter combination, and the combination with the best cross-validated performance is selected (a minimal sketch follows below).
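
Here is a minimal sketch of that tuning loop with scikit-learn's GridSearchCV and stratified k-fold splits; the grid values and the synthetic stand-in data are illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Stand-in data with four balanced classes, mirroring the setup above.
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8, random_state=0)

# Illustrative grid over the four hyperparameters described above.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 300],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    XGBClassifier(objective="multi:softprob", eval_metric="mlogloss"),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # k = 5
)
search.fit(X, y)
print(search.best_params_)
```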

Having trained and tested the model, how effective is it? What insights can it provide about biology? Let’s explore:

# Results, Model Accuracy & Feature Importances

In this section, I will address:

  • The model's accuracy and the significance of those metrics.
  • Class-by-class identification of the strongest predictors.

Confusion Matrix & ROC Curve

To evaluate a multi-classifier's quality, a confusion matrix is typically employed:

The confusion matrix visualizes the test set. The x-axis displays the model's predictions for each data point (a protein complex), while the y-axis represents the actual values.

The lighter the color along the diagonal, the more accurate the model.

  • For example, the model accurately predicted 47 out of 54 class 3 protein complexes (as seen in the bottom row).
  • Class 2 also exhibited a similar accuracy, with 39 out of 46 complexes correctly classified.
  • In the upper left corner, the model struggled to differentiate between classes 0 and 1 (mutations that decrease and disrupt interactions, respectively). This is understandable, given their similar effects.

The confusion matrix is one method for assessing model accuracy. We can also utilize an ROC curve:

A receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR):

  • True Positive Rate (TPR): The ratio of true positives to actual positives [TP/(TP+FN)].
  • False Positive Rate (FPR): The ratio of false positives to actual negatives [FP/(FP+TN)].

The points along the curve represent different threshold settings. The threshold is the cutoff point distinguishing between positive and negative cases.

In a multi-class ROC curve, the positive case is a given class, while the negative case encompasses all others (one-vs-all). The diagonal (dashed line) indicates random chance, where TPR equals FPR across all thresholds. The further we are from this line, the better the model performs.

  • At the bottom left of the graph, a high threshold may yield few false positives (low FPR) but also many missed true positives (low TPR).
  • As you move right, the threshold decreases, permitting more true positives (high TPR) but also increasing false positives (high FPR).

The ideal classifier exhibits a bowed-out shape, maintaining high TPR at low FPR rates, even with high thresholds. This is evident for most of our curves.

  • TPR (micro) = Σc TPc / Σc (TPc + FNc)
  • FPR (micro) = Σc FPc / Σc (FPc + TNc)

Micro-averaging the curves involves summing the TPs, FNs, and TNs across each class. This provides a comprehensive view of the model's performance. To standardize our comparison, we calculate the area under the curve (AUC). Closer to 1 indicates greater accuracy. As shown in the figure, we achieved a micro-average AUC of 0.91.
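
For reference, a minimal sketch of computing the micro-averaged curve and AUC with scikit-learn, on synthetic stand-in data rather than the project's features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from xgboost import XGBClassifier

# Stand-in four-class data; the real model used the enriched PPI features.
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = XGBClassifier(eval_metric="mlogloss").fit(X_tr, y_tr)

# Micro-averaging: pool every (class, sample) decision into one curve.
y_bin = label_binarize(y_te, classes=[0, 1, 2, 3])   # shape (n, 4)
y_score = clf.predict_proba(X_te)                    # one column per class
fpr, tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
print("micro-average AUC:", auc(fpr, tpr))
```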

The curves for classes 0 and 1 align with our observations from the confusion matrix. Overall, our model demonstrates reasonable accuracy. However, the intriguing question remains — what biological insights can it provide?

SHAP Values

Utilizing SHAP values allows us to assess the contribution of features to each prediction.

> What are SHAP values?
>
> Shapley Additive Explanations measure the marginal contributions of each feature to a class prediction. In essence, they quantify feature importance. SHAP values can be applied across any supervised machine learning task.
>
> They are derived by:
>
> 1. Taking the average prediction for a particular class.
> 2. Iteratively incorporating each feature (in all possible orders) and evaluating the output.
> 3. Calculating the marginal impact of a feature on the model output's deviation from the average.

SHAP values are valuable in biological contexts, highlighting which structural features warrant further investigation or which pathogenicity predictions hold the most promise.
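
A minimal sketch of generating these plots with the shap library, again on synthetic stand-in data; in the project, X would be the enriched feature table:

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Stand-in data; in the project, X is the enriched feature table.
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8, random_state=0)
clf = XGBClassifier(eval_metric="mlogloss").fit(X, y)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)   # multiclass: one array per class
                                         # (a 3-D array in newer shap versions)

# Global importance bar chart, then a beeswarm for one class
# (class 0, "decrease in interaction strength", shown here).
shap.summary_plot(shap_values, X, plot_type="bar")
shap.summary_plot(shap_values[0], X)
```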

Here are the top ten feature importances for the entire model:

From my analysis, several intriguing results emerged:

  • Notably, the influence of the resulting amino acid from the missense mutation being cysteine (C) was both surprising and difficult to explain.
  • Conversely, the significance of the AlphaMissense pathogenicity score was expected — both AlphaPulldown and AlphaMissense rely on the MSAs from AlphaFold.
  • Interestingly, the magnitude of the difference in the percentage of charged residues at the interface between wildtype and missense complexes also plays a significant role.

In reality, seeking one or two globally predictive features for PPI effects is often a futile endeavor. A more fruitful approach involves examining feature importance on a class-by-class basis using SHAP values.

Class 0: Decrease in Interaction Strength

On the left, I present a feature importance plot illustrating the average impact of each feature on the model's output. This graph is fairly straightforward to interpret; larger bars indicate a more substantial influence.

On the right, we have a beeswarm plot. These plots help visualize two factors:

  • Impact on outcome likelihood: Dots to the right of the center line signify that those feature values positively affect class prediction likelihood. Conversely, dots to the left indicate a negative effect.
  • Feature value: The dot colors are also important. Blue dots denote a low feature value, while red dots indicate a high feature value.

By combining these two factors, we can deepen our understanding of feature effects. For example, consider the feature Resulting_sequence_A. (For context, this is a dummy, one-hot encoded variable, so its value is either 0 or 1.)

  • Dots on the right side are red: when Resulting_sequence_A is high, it increases the likelihood of predicting class 0.
  • Dots on the left are blue: when Resulting_sequence_A is low, it decreases the likelihood of predicting class 0.

In layman’s terms, it appears that alanine as a missense residue strongly diminishes interaction strength. This aligns intuitively, as alanine is a small amino acid with merely a methyl group as its side chain.

This restricts the types of interactions it can form. Thus, if alanine replaces a cysteine that formed a crucial disulfide bond, or displaces a residue involved in a charged interaction, it is logical that this would heighten the likelihood of reduced interaction strength.

> Integrating experimental and computational biology for a better model
>
> While there is a biological basis for this finding, it’s worth noting that researchers frequently conduct alanine-scanning studies, sequentially substituting residues in a protein sequence with alanine to ascertain their importance in folding, interaction, etc.
>
> It’s conceivable that results from these studies are overrepresented in the IntAct database, leading the model to perceive this feature as more critical than it might actually be.
>
> Interestingly, my background in experimental biology was essential in understanding such potential issues. Without it, I may not have recognized this as a possible concern.

Surprisingly, the shape complementarity (sc) feature yielded unexpected results. Sc measures how well two proteins fit together; one would anticipate that a low sc would increase the likelihood of reduced interaction strength. However, the opposite was observed: higher sc values increased the predicted likelihood of a decrease in interaction strength, and vice versa.

In the following sections, I will highlight a few particularly interesting features:

Class 1: Disruption of Interaction

  • SIFT is a normalized variant effect prediction score ranging from 0 to 1, where low scores (below 0.05) are predicted to affect protein function. The beeswarm plot reflects expected results, reinforcing SIFT’s effectiveness as a tool.
  • Total_free_energy_diff_from_wt is an engineered feature measuring the difference in EvoEF2 free energy between a missense complex and its wildtype counterpart. A substantial difference (high value) suggests a more unstable missense complex, which is evidently reflected in the plot.

Class 2: Increase in Interaction Strength

  • Resulting_sequence_C: Surprisingly, missense mutations leading to a cysteine residue have a notably strong, positive effect on the likelihood of increased interaction strength. Cysteine can form disulfide bridges across dimers, which are among the most robust connections. While this can enhance interactions, the reason behind its significant impact remains unclear.
  • iptm and iptm_ptm: These metrics gauge AlphaFold's confidence in the models. It appears that if two proteins exhibit a strong fit in a complex (indicated by a high iptm_ptm), they tend to interact well. Conversely, if AlphaFold's confidence in the structure is low, it may indicate that these proteins are not naturally compatible in a complex, thus diminishing the likelihood of a class 2 prediction. Nevertheless, it’s also possible that the 3D model is inaccurate and should be disregarded — I cannot draw definitive conclusions here.

Class 3: No Effect on Interaction

  • AM_PATHOGENICITY: This score from AlphaMissense is based on conservation scores for each residue in the human proteome. It appears to be a relatively strong predictor of non-pathogenic mutations, with low pathogenicity scores correlating with a higher likelihood of class 3 predictions.
  • pDockQ: This docking metric assesses how well two proteins fit together in 3D space, similar to sc. As anticipated, a favorable docking score seems to predict a non-perturbational variant effectively.
  • gnomAD_exomes_AF: The variant's exome allele frequency from gnomAD suggests that low-frequency variants are linked to unperturbed protein complexes. However, we would need to examine the frequency range to determine if we are dealing with rare or common variants (these classifications require specific frequency thresholds).

Overall, the results are intriguing. Now, let’s apply the model to novel protein complexes to see if we can glean any new insights.

# Case Studies: ASD-Related Protein Complexes

In this section, I will:

  • Examine two specific examples of protein complexes evaluated using the model.
  • Provide literature-based potential explanations for both.

This work is engaging, but the true value of machine learning in molecular biology lies in uncovering novel insights.

To that end, I applied the model to analyze several coding missense variants from the MSSNG database, a repository that compiles whole genome sequencing and annotation data aimed at studying autism spectrum disorder (ASD). (The Scherer Lab specializes in ASD genomics.)

Given MSSNG's extensive dataset, I honed my analysis to a subset meeting these criteria:

  • Coding missense variants (to focus on affected proteins).
  • Ambiguous pathogenicity predictions from REVEL and AlphaMissense (so the model could provide new insights).
  • Known direct interactions according to the IntAct database (the model only trains on this interaction type).
  • A PAE cutoff of 15 Å when folded using AlphaPulldown (to ensure high confidence in the relative positions of protein domains).

Here are two notable variants from this subset:

  • Androgen receptor & tyrosine protein kinase (AR:YES1.p.Q234R)
  • Transcriptional coactivator & zinc finger protein (YAP1:GLIS3.p.G448D)

The model predicts Class 1 (disrupted interaction) for both variants. Let's delve deeper:

Androgen Receptor & Tyrosine Protein Kinase (AR:YES1.p.Q234R)

Here are the facts:

  • AR (the androgen receptor) is present in the heart, skeletal muscle, and prostate (UniProt).
  • Tyrosine phosphorylation of AR has been associated with tumor growth (Guo et al., 2006).
  • Some studies suggest an inverse comorbidity between ASD and prostate cancer (Fores-Martos et al., 2019).

One potential narrative could be:

  • A disrupted interaction between AR and YES1 reduces tyrosine phosphorylation, thereby lowering cancer risk.
  • Inverse comorbidities between ASD and prostate cancer may explain this variant's association with autism.

Unfortunately, several issues arise:

  • YES1 is expressed only at low levels in the prostate.
  • Furthermore, it is a non-receptor tyrosine kinase.
  • Additionally, there is counter-evidence regarding the ASD-prostate cancer comorbidity.

The bottom line is that conclusive results from the model are lacking. For your interest, here are the SHAP values that contributed to this prediction, highlighting the notably high difference in interface solvation energy from the wildtype.

Transcriptional Coactivator & Zinc Finger Protein (YAP1:GLIS3.p.G448D)

Again, here are the facts:

  • YAP1 can coactivate or repress transcription and is found in the liver and prostate (also at low levels in the brain).
  • GLIS3 acts as a transcriptional activator/repressor located in the kidneys, brain, and liver.
  • Numerous GLIS3 binding sites are linked to neuropathologies affecting nervous system development and synaptic transmission (Calderari et al., 2018).
  • GLIS3 may also play a role in regulating autophagy (Calderari et al., 2018).

One possible narrative could be:

  • A disrupted interaction between these two proteins might accelerate the progression of neuropathologies related to autism.

However, pinpointing a specific mechanism from this research proves challenging.

Once again, here are its SHAP values:

Having trained a robust model, interpreted its results, and applied it to learn more about biological processes, what are the key takeaways? And what lies ahead?

# Final Thoughts and Lessons

Future Directions

I dedicated approximately four months at TCAG to this project, much of which involved data collection. (This followed an additional four months acclimating to AF-M.)

Given more time, here are steps I would take to enhance the project:

  • Data Gathering: With a limited dataset, we can only place so much trust in the model's results. Acquiring more PDBs, particularly experimental ones, is essential for model improvement.
  • Feature Engineering: A more in-depth molecular biology approach to feature engineering may boost model accuracy.
  • Pathogenicity Prediction Focus: Currently, the model predicts the impact on PPI (using the Feature type from the IntAct mutations database). An alternative route would be utilizing the enriched data for pathogenicity prediction, similar to REVEL.
  • Testing on Unexplained Variants: At present, I could only apply the model to a handful of variants of uncertain significance. Analyzing more PPIs could uncover additional research avenues.

However, I take pride in what I achieved and learned during this timeframe.

Conclusions & Key Takeaways

I doubt many students are afforded such freedom on an exciting research question, along with access to world-class mentors and one of Canada's largest compute clusters.

But I was.

I am immensely grateful to all involved (including Dr. Brett Trost, Dr. Richard Wintle, and Dr. Steve Scherer, to name a few).

Working in the Scherer Lab demonstrated that research doesn't have to be static, slow, and tedious. With the right team and vision, it can be swift, dynamic, and groundbreaking.

On a broader scale, I have learned the pivotal role AI/ML will play in the future of computational biology.

  • My initial plan for this project involved constructing a graph structure to observe and track structural differences between wildtype and missense complexes.
  • Upon realizing the statistical magnitude needed to discern the differences, it became clear that ML could provide a solution.

What are your thoughts? I welcome your feedback, questions, comments, and critiques.