ProtTrans: Self-Supervised Learning as a Pathway to Understanding the Language of Life

PROJECT TITLE:

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

ABSTRACT:

Computational biology and bioinformatics provide vast data gold mines from protein sequences, which are ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach new prediction frontiers at low inference cost. In this study, we trained two auto-regressive models (Transformer-XL and XLNet) and four auto-encoder models (BERT, Albert, Electra, and T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings derived from unlabeled data captured some of the biophysical features of protein sequences. We validated the advantage of using the embeddings as the exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (three-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy Q10=81%) and of membrane versus water-soluble proteins (two-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) outperformed the state-of-the-art for the first time without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing costly database searches. Taken together, the results suggest that pLMs learned some of the grammar of the language of life.
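
The two prediction settings named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "pooling" means averaging the per-residue embedding vectors (the abstract does not say which pooling is used), and it computes Q3-style accuracy as the fraction of residues whose predicted state matches the observed one. The function names `mean_pool` and `q_accuracy`, the toy vectors, and the state strings are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length per-protein vector.

    Assumption: mean pooling; the abstract only says "pooling".
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def q_accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted state equals the observed one.

    With three states per residue (e.g. H, E, C) this is Q3; with one label
    per protein, the same ratio over a set of proteins gives Q2 or Q10.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy protein of three residues with 2-dimensional embeddings (made-up values).
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protein_vector = mean_pool(toy_embeddings)

# Toy secondary-structure strings (made-up): 4 of 5 residues agree.
q3 = q_accuracy("HHHEC", "HHHCC")
```

Mean pooling gives a vector whose length does not depend on the protein length, which is what lets a single per-protein classifier handle sequences of any size.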
