ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSA) and protein language model (pLM) embedding as inputs, (ii) allows inference of existing models (AlphaFold, OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch and (iv) is written in Jax to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch to obtain reasonable results with reduced computational overheads in comparison to AlphaFold. The source code for ManyFold, the validation dataset, and a small sample of training data are available at https://github.com/instadeepai/manyfold.
In recent years, machine learning approaches for de novo protein structure prediction have made significant progress, culminating in AlphaFold which approaches experimental accuracies in certain settings and heralds the possibility of rapid in silico protein modelling and design. However, such applications can be challenging in practice due to the significant compute required for training and inference of such models, and their strong reliance on the evolutionary information contained in multiple sequence alignments (MSAs), which may not be available for certain targets of interest. Here, we first present a streamlined AlphaFold architecture and training pipeline that still provides good performance with significantly reduced computational burden. Aligned with recent approaches such as OmegaFold and ESMFold, our model is initially trained to predict structure from sequences alone by leveraging embeddings from the pretrained ESM-2 protein language model (pLM). We then compare this approach to an equivalent model trained on MSA-profile information only, and find that the latter still provides a performance boost – suggesting that even state-of-the-art pLMs cannot yet easily replace the evolutionary information of homologous sequences. Finally, we train a model that can make predictions from either the combination, or only one, of pLM and MSA inputs. Ultimately, we obtain accuracies in any of these three input modes similar to models trained uniquely in that setting, whilst also demonstrating that these modalities are complimentary, each regularly outperforming the other.
BriefBio
An analysis of protein language model embeddings for fold prediction
Amelia Villegas-Morcillo, Angel M. Gomez, and Victoria Sanchez
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are supervisedly trained with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists aperformance gap at the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap at the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low.
Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function.
IEEE/ACM TCBB
Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks
Amelia Villegas-Morcillo, Angel M. Gomez, Juan A. Morales-Cordovilla, and Victoria Sanchez
IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated to the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, specially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/ amelia/CNN-GRU-RF+/.
Protein-Protein interactions (PPIs) are key to many important life processes, such as cancer replication or DNA transcription. While in vivo or in vitro methods for PPI screening exist, they are expensive and computational approaches have been proposed to address PPI prediction. Previous computational methods rely on hand crafted features to capture the underlying information of the protein data. In this work we present a deep neural network architecture leveraging embedding techniques and recurrent neural networks to extract features and predict interaction between protein pairs. The results achieved are similar to those obtained by other state-of-the-art computational approaches to the problem but without any feature engineering involved, directly using the raw amino acid sequences.
A protein contact map is a simplified matrix representation of the protein structure, where the spatial proximity of two amino acid residues is reflected. Although the accurate prediction of protein inter-residue contacts from the amino acid sequence is an open problem, considerable progress has been made in recent years. This progress has been driven by the development of contact predictors that identify the coevolutionary events occurring in a protein multiple sequence alignment (MSA). However, it has been shown that these methods introduce Gaussian noise in the estimated contact map, making its reduction necessary. In this paper, we propose the use of two different Gaussian denoising approximations in order to enhance the protein contact estimation. These approaches are based on (i) sparse representations over learned dictionaries, and (ii) deep residual convolutional neural networks. The results highlight that the residual learning strategy allows a better reconstruction of the contact map, thus improving contact predictions.