ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning


Computational biology and bioinformatics provide vast data gold mines of protein sequences, which are well suited to Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach new prediction frontiers at low inference cost. In this study, we trained two auto-regressive models (Transformer-XL and XLNet) and four auto-encoder models (BERT, Albert, Electra, and T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings learned from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as the exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (three-state accuracy Q3=81%-87%); and (2) per-protein (pooling) prediction of protein sub-cellular location (ten-state accuracy Q10=81%) and of membrane versus water-soluble proteins (two-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) outperformed the state of the art for the first time without using multiple sequence alignments (MSAs) or evolutionary information, thereby avoiding costly database searches. Taken together, the results suggest that pLMs learned some of the grammar of the language of life.
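The distinction between per-residue and per-protein tasks, and the Q-style accuracy figures quoted above, can be illustrated with a minimal sketch. It is not the authors' code: the function names `mean_pool` and `q_accuracy` are hypothetical, the embedding values are made up, and averaging over residues is an assumed pooling strategy, since the abstract says only "pooling".

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (L residues, d dimensions) into one per-protein vector.

    Assumption: mean pooling; the abstract does not name the pooling method.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / length for i in range(dim)]


def q_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state.

    With three states (e.g. H, E, C) this is Q3; the same formula applied to
    whole-protein labels over a test set gives Q2 or Q10.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy protein: 4 residues, 3-dimensional embeddings (illustrative numbers only).
toy_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
protein_vector = mean_pool(toy_embeddings)  # single vector for the whole protein

# Toy secondary-structure strings: H = helix, E = strand, C = other.
q3 = q_accuracy("HHEC", "HHCC")  # 3 of 4 residues agree
```

A per-residue predictor would consume each row of `toy_embeddings` separately, whereas a per-protein predictor (sub-cellular location, membrane versus soluble) would consume only `protein_vector`.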
