Protein-Language-Models

At the intersection of the rapidly growing biological data landscape and advancements in Natural Language Processing (NLP), protein language models (PLMs) have emerged as a transformative force in modern research. These models have achieved remarkable progress, highlighting the need for timely and comprehensive overviews. However, much of the existing literature focuses narrowly on specific domains, often missing a broader analysis of PLMs. This study provides a systematic review of PLMs from a macro perspective, covering key historical milestones and current mainstream trends. We focus on the models themselves and their evaluation metrics, exploring aspects such as model architectures, positional encoding, scaling laws, and datasets. In the evaluation section, we discuss benchmarks and downstream applications. To further support ongoing research, we introduce relevant mainstream tools. Lastly, we critically examine the key challenges and limitations in this rapidly evolving field.

News

Overview

The figure below provides an overview of our article.

[Figure: Protein-Language-Models overview]

Contents

Models

We categorize protein language models into two groups: non-Transformer-based models and Transformer-based models. Transformer-based models are further divided into encoder-only, decoder-only, and encoder-decoder models. The tables below give, for each model, the paper link, release date, parameter count, base model (where applicable), pretraining dataset, and whether the model is open source, with a link to the code where one is available. Models are sorted alphabetically by name.

Non-transformer-based models

| Model | Time | Params | Base model | Pretraining Dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | ✓ |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | ✓ |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | ✓ |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | ✗ |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | ✗ |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | ✓ |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | ✗ |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | ✗ |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | ✓ |
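
As a sketch of how these early embedding approaches work, the snippet below trains a ProtVec-style skip-gram model on overlapping 3-mers with gensim. The sequences and hyperparameters are illustrative placeholders, not the published ProtVec settings (ProtVec uses non-overlapping 3-mers in three reading frames).

```python
# Minimal ProtVec-style sketch: split proteins into 3-mer "words" and train
# a skip-gram model on them. Requires gensim; corpus and hyperparameters are
# placeholders, not the published setup.
from gensim.models import Word2Vec

def to_3mers(seq: str) -> list[str]:
    """Split a protein sequence into overlapping 3-mers."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI",
]
corpus = [to_3mers(s) for s in sequences]

# sg=1 selects the skip-gram objective.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["MKT"][:5])  # embedding of one 3-mer
```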

Transformer-based models

Encoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | ✓ |
| AbLang2 | 2024.02 | - | OAS | ✓ |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | ✓ |
| AminoBert | 2022.10 | - | UniRef90, PDB, MGnify | ✗ |
| AntiBERTa | 2022.07 | 86M | OAS | ✓ |
| AntiBERTy | 2021.12 | 26M | OAS | ✓ |
| BALM | 2024.05 | - | OAS | ✓ |
| DistilProtBert | 2022.09 | 230M | UniRef50 | ✓ |
| ESM-1b | 2020.02 | 650M | UniRef50 | ✓ |
| ESM-1v | 2021.02 | 650M | UniRef90 | ✓ |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | ✓ |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | ✓ |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | ✓ |
| ESM-C | 2024.12 | 300M/600M/6B | UniRef, MGnify, JGI | ✗ |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | ✓ |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | ✓ |
| IgBert | 2024.12 | 420M | OAS | ✗ |
| LM-GVP | 2022.04 | - | - | ✓ |
| OntoProtein | 2022.06 | - | ProteinKG25 | ✓ |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | ✗ |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | ✗ |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | ✓ |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | ✓ |
| ProteinBERT | 2022.03 | 16M | UniRef90 | ✓ |
| ProteinLM | 2021.12 | 200M/3B | Pfam | ✓ |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | ✓ |
| ProtTrans | 2021.07 | - | UniRef, BFD | ✓ |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | ✓ |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | ✓ |
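
For readers who want to try an encoder-only model directly, the sketch below extracts per-residue embeddings from the smallest publicly released ESM-2 checkpoint via Hugging Face transformers; the example sequence is arbitrary.

```python
# Minimal sketch: per-residue embeddings from an encoder-only PLM (ESM-2).
# Requires torch and transformers; the checkpoint is the released 8M-parameter
# ESM-2 model.
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Drop the special start/end tokens to keep one vector per residue.
residue_embeddings = outputs.last_hidden_state[0, 1:-1]
print(residue_embeddings.shape)  # (len(sequence), hidden_size)
```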

Decoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | ✗ |
| IgLM | 2022.12 | 13M | - | ✓ |
| PoET | 2023.11 | 57M-604M | - | ✗ |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | ✓ |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | ✓ |
| ProLLaMA | 2024.02 | - | UniRef50 | ✓ |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | ✓ |
| RITA | 2022.05 | 1.2B | UniRef100 | ✗ |
| ZymCTRL | 2022.01 | 738M | BRENDA | ✓ |
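
The sketch below shows the typical decoder-only use case, autoregressive de novo sequence generation, using the released ProtGPT2 checkpoint; the sampling parameters follow the model card's suggestions and can be tuned.

```python
# Minimal sketch: autoregressive protein generation with a decoder-only PLM
# (ProtGPT2) via Hugging Face transformers. Sampling parameters are
# illustrative defaults from the model card.
from transformers import pipeline

generator = pipeline("text-generation", model="nferruz/ProtGPT2")
samples = generator(
    "<|endoftext|>",        # ProtGPT2's token for starting a new sequence
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=2,
)
for s in samples:
    print(s["generated_text"])
```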

Encoder-decoder models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | ✓ |
| IgT5 | 2024.12 | 3B | OAS | ✗ |
| LM-Design | 2023.02 | 664M | - | ✗ |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | ✓ |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | ✓ |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | ✓ |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | ✓ |
| pAbT5 | 2023.10 | - | OAS | ✗ |
| Sapiens | 2022.02 | 0.6M | OAS | ✓ |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | ✗ |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | ✗ |
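
For encoder-decoder models, embeddings are usually taken from the encoder half alone. The sketch below does this with the released ProtT5-XL checkpoint; note that ProtT5 expects space-separated residues with rare amino acids mapped to X.

```python
# Minimal sketch: embeddings from the encoder half of an encoder-decoder PLM
# (ProtT5). Requires torch, transformers, and sentencepiece; the example
# sequence is arbitrary.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Map rare amino acids to X and insert spaces, as ProtT5's tokenizer expects.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state
print(embeddings.shape)  # (1, sequence length + 1, hidden_size)
```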

Datasets

Protein datasets fall into two categories, depending on whether they include annotations: pre-training datasets and benchmarks. Pre-training datasets lack labels and are typically used for self-supervised pre-training, whereas benchmarks contain labeled data and are used for supervised fine-tuning or model evaluation. The tables below give the relevant papers and links for the pre-training datasets and benchmarks used by popular protein language models, sorted alphabetically by name. Pre-training datasets are divided into sequence datasets and structural datasets; benchmarks are divided into structural, functional, and other benchmarks.
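
As a sketch of how such unlabeled pre-training data is typically consumed, the snippet below streams sequences from a FASTA file and applies BERT-style random masking for a masked-language-modeling objective; the file path, mask token, and masking rate are placeholders.

```python
# Minimal sketch: stream sequences from a FASTA file (e.g. a UniRef50 dump)
# and build masked-language-modeling training pairs. Requires biopython;
# the path, mask token, and rate are placeholders.
import random
from Bio import SeqIO

MASK, MASK_RATE = "<mask>", 0.15

def mask_sequence(seq: str) -> tuple[list[str], list[str]]:
    """Return (masked tokens, original tokens) for one MLM training pair."""
    tokens = list(seq)
    masked = [MASK if random.random() < MASK_RATE else t for t in tokens]
    return masked, tokens

for record in SeqIO.parse("uniref50.fasta", "fasta"):  # placeholder path
    masked, target = mask_sequence(str(record.seq))
    # ...feed (masked, target) pairs into the model's training loop
    break
```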

Pre-training datasets

Sequence datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD[1,2,3] | 2021.07 | 2.5B | ✓ |
| BRENDA | 2002.01 | - | ✓ |
| MGnify | 2022.12 | - | ✓ |
| Pfam | 2023.09 | 47M | ✓ |
| UniClust30 | 2016.11 | - | ✓ |
| UniParc | 2023.11 | 632M | ✓ |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | ✓ |
| UniProtKB/TrEMBL | 2023.11 | 251M | ✓ |
| UniRef50[1,2] | 2023.11 | 53M | ✓ |
| UniRef90[1,2] | 2023.11 | 150M | ✓ |
| UniRef100[1,2] | 2023.11 | 314M | ✓ |

Structural datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB[1,2] | 2021.11 | 200M | ✓ |
| PDB | 2023.12 | 214K | ✓ |
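
As a sketch of working with such structural data, the snippet below parses a PDB file with Biopython and extracts C-alpha coordinates, the typical input for structure-aware models; the file path is a placeholder.

```python
# Minimal sketch: read a structure from a PDB file and collect C-alpha
# coordinates. Requires biopython; the file path is a placeholder.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("example", "example.pdb")  # placeholder path

ca_coords = [
    residue["CA"].coord
    for model in structure
    for chain in model
    for residue in chain
    if "CA" in residue  # skip residues without a C-alpha atom
]
print(f"{len(ca_coords)} residues with C-alpha coordinates")
```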

Benchmarks

Structural benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | ✓ |
| CASP | - | - | ✓ |
| CATH | 2023.02 | 151M | ✓ |
| SCOP | 2023.01 | 914K | ✓ |

Functional benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | ✓ |
| EC | 2023.11 | 2.6M | ✓ |
| FLIP | 2022.01 | 320K | ✓ |
| GO | 2023.11 | 1.5M | ✓ |

Other benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | ✓ |
| ProteinGym | 2022.12 | 300K | ✓ |
| TAPE | 2021.09 | 120K | ✓ |

Tools

The tables below link to commonly used protein tools, sorted alphabetically by name, for readers to use. The tools are divided into sequence tools, structural tools, and other tools.

Sequence tools

| Tool | Link |
|---|---|
| BLAST | ✓ |
| HHblits & HHfilter | ✓ |
| MMseqs2 | ✓ |
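
As a usage sketch, the snippet below runs MMseqs2's easy-search workflow from Python to search a query set against a target set; it assumes an mmseqs binary on PATH, and the file names are placeholders.

```python
# Minimal sketch: homology search with MMseqs2's easy-search workflow.
# Assumes the mmseqs binary is installed and on PATH; file names are
# placeholders.
import subprocess

subprocess.run(
    ["mmseqs", "easy-search", "query.fasta", "targets.fasta", "hits.m8", "tmp"],
    check=True,
)

# hits.m8 is BLAST-style tab-separated output: query, target, identity, ...
with open("hits.m8") as f:
    for line in f:
        query, target, identity = line.split("\t")[:3]
        print(query, target, identity)
```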

Structural tools

| Tool | Link |
|---|---|
| Foldseek | ✓ |
| PyMOL | ✓ |
| TM-align | ✓ |
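
As a usage sketch, the snippet below scores the structural similarity of two models with TM-align and parses the TM-score from its output; it assumes a TMalign binary on PATH, and the file names are placeholders.

```python
# Minimal sketch: structural similarity with TM-align, parsing the TM-score
# from stdout. Assumes a TMalign binary on PATH; file names are placeholders.
import re
import subprocess

result = subprocess.run(
    ["TMalign", "model1.pdb", "model2.pdb"],
    capture_output=True, text=True, check=True,
)

# TM-align prints lines such as "TM-score= 0.87 (normalized by length ...)".
scores = re.findall(r"TM-score=\s*([\d.]+)", result.stdout)
print("TM-scores:", scores)
```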

Other tools

| Tool | Link |
|---|---|
| t-SNE | ✓ |
| UMAP | ✓ |
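
As a usage sketch, the snippet below projects per-protein PLM embeddings to two dimensions with t-SNE for visualization; the embeddings are random placeholders standing in for real model outputs.

```python
# Minimal sketch: 2-D visualization of protein embeddings with t-SNE.
# Requires scikit-learn and matplotlib; the embeddings are random
# placeholders standing in for real PLM outputs.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(200, 1280)  # placeholder: 200 proteins, dim 1280

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("t-SNE of protein embeddings")
plt.show()
```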
