Recent advances in deep generative methods have allowed antibody sequence and structure co-design. This study addresses the challenge of tailoring the highly variable complementarity-determining regions (CDRs) in antibodies to fulfill developability requirements. We introduce a guidance approach that integrates property information into the antibody design process using diffusion probabilistic models. This approach allows us to simultaneously design CDRs conditioned on antigen structures while considering critical properties like solubility and folding stability. Our property-guided diffusion model offers versatility by accommodating diverse property constraints, presenting a promising avenue for computational antibody design in therapeutic applications.
@article{villegas2024abdevelop,title={Guiding diffusion models for antibody sequence and structure co-design with developability properties},author={Villegas-Morcillo, Amelia and Weber, Jana M. and Reinders, Marcel J.T.},journal={PRX Life},volume={2},number={3},pages={033012},year={2024},doi={10.1103/PRXLife.2.033012}}
2023
NeurIPS 23
Why Did This Model Forecast This Future? Information-Theoretic Saliency for Counterfactual Explanations of Probabilistic Regression Models
Chirag Raman, Alec Nonnemaker, Amelia Villegas-Morcillo, Hayley Hung, and Marco Loog
In Advances in Neural Information Processing Systems 2023
We propose a post hoc saliency-based explanation framework for counterfactual reasoning in probabilistic multivariate time-series forecasting (regression) settings. Building upon Miller’s framework of explanations derived from research in multiple social science disciplines, we establish a conceptual link between counterfactual reasoning and saliency-based explanation techniques. To address the lack of a principled notion of saliency, we leverage a unifying definition of information-theoretic saliency grounded in preattentive human visual cognition and extend it to forecasting settings. Specifically, we obtain a closed-form expression for commonly used density functions to identify which observed timesteps appear salient to an underlying model in making its probabilistic forecasts. We empirically validate our framework in a principled manner using synthetic data to establish ground-truth saliency that is unavailable for real-world data. Finally, using real-world data and forecasting models, we demonstrate how our framework can assist domain experts in forming new data-driven hypotheses about the causal relationships between features in the wild.
@inproceedings{raman2023saliency,title={Why Did This Model Forecast This Future? Information-Theoretic Saliency for Counterfactual Explanations of Probabilistic Regression Models},author={Raman, Chirag and Nonnemaker, Alec and Villegas-Morcillo, Amelia and Hung, Hayley and Loog, Marco},booktitle={Advances in Neural Information Processing Systems},volume={36},year={2023}}
Bioinformatics
ManyFold: An efficient and flexible library for training and validating protein folding models
Amelia Villegas-Morcillo, Louis Robinson, Arthur Flajolet, and Thomas D. Barrett
ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSA) and protein language model (pLM) embeddings as inputs, (ii) allows inference of existing models (AlphaFold, OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch, and (iv) is written in Jax to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch to obtain reasonable results with reduced computational overheads in comparison to AlphaFold. The source code for ManyFold, the validation dataset, and a small sample of training data are available at https://github.com/instadeepai/manyfold.
@article{villegas2022manyfold,title={ManyFold: An efficient and flexible library for training and validating protein folding models},author={Villegas-Morcillo, Amelia and Robinson, Louis and Flajolet, Arthur and Barrett, Thomas D.},journal={Bioinformatics},volume={39},number={1},pages={1--3},year={2023},doi={10.1093/bioinformatics/btac773}}
In recent years, machine learning approaches for de novo protein structure prediction have made significant progress, culminating in AlphaFold which approaches experimental accuracies in certain settings and heralds the possibility of rapid in silico protein modelling and design. However, such applications can be challenging in practice due to the significant compute required for training and inference of such models, and their strong reliance on the evolutionary information contained in multiple sequence alignments (MSAs), which may not be available for certain targets of interest. Here, we first present a streamlined AlphaFold architecture and training pipeline that still provides good performance with significantly reduced computational burden. Aligned with recent approaches such as OmegaFold and ESMFold, our model is initially trained to predict structure from sequences alone by leveraging embeddings from the pretrained ESM-2 protein language model (pLM). We then compare this approach to an equivalent model trained on MSA-profile information only, and find that the latter still provides a performance boost – suggesting that even state-of-the-art pLMs cannot yet easily replace the evolutionary information of homologous sequences. Finally, we train a model that can make predictions from either the combination, or only one, of pLM and MSA inputs. Ultimately, we obtain accuracies in any of these three input modes similar to models trained uniquely in that setting, whilst also demonstrating that these modalities are complementary, each regularly outperforming the other.
@inproceedings{villegas2022somanyfolds,title={So ManyFolds, So Little Time: Efficient Protein Structure Prediction With pLMs and MSAs},author={Barrett, Thomas D. and Villegas-Morcillo, Amelia and Robinson, Louis and Gaujac, Benoit and Adméte, David and Saquand, Elia and Beguir, Karim and Flajolet, Arthur},booktitle={AI4Science and MLSB workshops at NeurIPS},year={2022}}
BriefBio
An analysis of protein language model embeddings for fold prediction
Amelia Villegas-Morcillo, Angel M. Gomez, and Victoria Sanchez
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSAs) as the input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
@article{villegas2022foldembed,title={An analysis of protein language model embeddings for fold prediction},author={Villegas-Morcillo, Amelia and Gomez, Angel M. and Sanchez, Victoria},journal={Briefings in Bioinformatics},volume={23},number={3},pages={1--14},year={2022},doi={10.1093/bib/bbac142}}
Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists a performance gap between the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap at the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low.
@article{villegas2021foldhsphere,title={FoldHSphere: deep hyperspherical embeddings for protein fold recognition},author={Villegas-Morcillo, Amelia and Sanchez, Victoria and Gomez, Angel M.},journal={BMC Bioinformatics},volume={22},number={1},pages={1--21},year={2021},doi={10.1186/s12859-021-04419-7}}
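The FoldHSphere abstract above describes training embeddings to cluster around fixed hyperspherical fold prototypes with a large margin cosine loss. The following is a minimal pure-Python sketch of that kind of loss for a single embedding, not the authors' implementation: the function names, the `scale` and `margin` defaults, and the two-dimensional toy prototypes are all illustrative assumptions.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def lmcl_loss(embedding, prototypes, label, scale=30.0, margin=0.35):
    """Large margin cosine loss of one embedding against fixed class prototypes.

    scale and margin are illustrative values (assumptions), not taken
    from the paper.
    """
    logits = []
    for j, proto in enumerate(prototypes):
        cos = cosine(embedding, proto)
        if j == label:
            cos -= margin  # the margin is subtracted only for the true class
        logits.append(scale * cos)
    # Numerically stable log-sum-exp over all classes.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]


# Toy example: two prototypes that are maximally separated on the unit circle.
prototypes = [[1.0, 0.0], [-1.0, 0.0]]
embedding = [0.1, 1.0]
loss_true_class_0 = lmcl_loss(embedding, prototypes, label=0)
loss_true_class_1 = lmcl_loss(embedding, prototypes, label=1)
```

Because the margin is subtracted from the true-class cosine before the softmax, an embedding has to sit closer to its own prototype than plain softmax would require to reach the same loss.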
Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function.
@article{villegas2021function,title={Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function},author={Villegas-Morcillo*, Amelia and Makrodimitris*, Stavros and van-Ham, Roeland C.H.J. and Gomez, Angel M. and Sanchez, Victoria and Reinders, Marcel J.T.},journal={Bioinformatics},volume={37},number={2},pages={162--170},year={2021},doi={10.1093/bioinformatics/btaa701}}
IEEE/ACM TCBB
Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks
Amelia Villegas-Morcillo, Angel M. Gomez, Juan A. Morales-Cordovilla, and Victoria Sanchez
IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several state-of-the-art template-based and deep learning-based methods. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/ amelia/CNN-GRU-RF+/.
@article{villegas2021foldrecog,title={Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks},author={Villegas-Morcillo, Amelia and Gomez, Angel M. and Morales-Cordovilla, Juan A. and Sanchez, Victoria},journal={IEEE/ACM Transactions on Computational Biology and Bioinformatics},volume={18},number={6},pages={2848--2854},year={2021},doi={10.1109/TCBB.2020.3012732}}
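The pairwise fold recognition step mentioned in the abstract above (transferring the fold type of the most similar template) can be sketched as a nearest-neighbour lookup over fixed-size embeddings. This is a minimal illustration under stated assumptions: cosine similarity as the similarity measure, the function names, and the toy fold labels and vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two fixed-size embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_fold(query_embedding, templates):
    """Return the fold label of the template most similar to the query.

    templates is a list of (fold_label, embedding) pairs.
    """
    best_label, _ = max(
        templates, key=lambda t: cosine_similarity(query_embedding, t[1])
    )
    return best_label


# Toy template library with hypothetical fold labels and 3-D embeddings.
templates = [
    ("fold_A", [1.0, 0.0, 0.0]),
    ("fold_B", [0.0, 1.0, 0.0]),
]
predicted = predict_fold([0.9, 0.2, 0.0], templates)
```

In this scheme the quality of the prediction depends entirely on how well the embedding space separates folds, which is why the abstract focuses on learning fold-related embeddings.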
Protein-protein interactions (PPIs) are key to many important life processes, such as cancer replication or DNA transcription. While in vivo or in vitro methods for PPI screening exist, they are expensive, and computational approaches have been proposed to address PPI prediction. Previous computational methods rely on hand-crafted features to capture the underlying information of the protein data. In this work, we present a deep neural network architecture leveraging embedding techniques and recurrent neural networks to extract features and predict interaction between protein pairs. The results achieved are similar to those obtained by other state-of-the-art computational approaches to the problem but without any feature engineering involved, directly using the raw amino acid sequences.
@inproceedings{gonzalez2018ppi,title={End-to-end prediction of protein-protein interaction based on embedding and recurrent neural networks},author={Gonzalez-Lopez, Francisco and Morales-Cordovilla, Juan A. and Villegas-Morcillo, Amelia and Gomez, Angel M. and Sanchez, Victoria},booktitle={IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},pages={2344--2350},year={2018},doi={10.1109/BIBM.2018.8621328}}
A protein contact map is a simplified matrix representation of the protein structure, where the spatial proximity of two amino acid residues is reflected. Although the accurate prediction of protein inter-residue contacts from the amino acid sequence is an open problem, considerable progress has been made in recent years. This progress has been driven by the development of contact predictors that identify the coevolutionary events occurring in a protein multiple sequence alignment (MSA). However, it has been shown that these methods introduce Gaussian noise in the estimated contact map, making its reduction necessary. In this paper, we propose the use of two different Gaussian denoising approximations in order to enhance the protein contact estimation. These approaches are based on (i) sparse representations over learned dictionaries, and (ii) deep residual convolutional neural networks. The results highlight that the residual learning strategy allows a better reconstruction of the contact map, thus improving contact predictions.
@inproceedings{villegas2018contmap,title={Improved Protein Residue-Residue Contact Prediction Using Image Denoising Methods},author={Villegas-Morcillo, Amelia and Morales-Cordovilla, Juan A. and Gomez, Angel M. and Sanchez, Victoria},booktitle={European Signal Processing Conference (EUSIPCO)},pages={1167--1171},year={2018},doi={10.23919/EUSIPCO.2018.8553519}}
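The contact map described in the abstract above is a matrix that records which residue pairs are spatially close. A minimal sketch of building a binary contact map from 3D coordinates follows. It assumes one representative point per residue and an 8 Å distance cutoff, which is a common convention but is not stated in the abstract; the function name and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map from one 3D point per residue.

    Entry [i][j] is 1 when residues i and j lie closer than threshold
    (in angstroms). The 8.0 default is an assumed convention.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy chain of three residues placed along the x-axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be handled as a single-channel image, which is the view that image denoising methods rely on.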