At the intersection of the rapidly growing biological data landscape and advances in Natural Language Processing (NLP), protein language models (PLMs) have emerged as a transformative force in modern research. These models have achieved remarkable progress, highlighting the need for timely and comprehensive overviews. However, much of the existing literature focuses narrowly on specific domains and lacks a broader, systematic analysis of PLMs. This study provides a systematic review of PLMs from a macro perspective, covering key historical milestones and current mainstream trends. We focus on the models themselves and their evaluation metrics, exploring aspects such as model architectures, positional encoding, scaling laws, and datasets. In the evaluation section, we discuss benchmarks and downstream applications. To further support ongoing research, we also introduce relevant mainstream tools. Lastly, we critically examine the key challenges and limitations of this rapidly evolving field.
- [2025/02] Our paper has been submitted to a preprint server.
Below is an overview of our article.
We categorize protein language models into two groups: non-Transformer-based models and Transformer-based models. The Transformer-based models are further divided into encoder-only, decoder-only, and encoder-decoder models. For each model, the tables below provide its release time, parameter count, base model (where applicable), pretraining dataset, and whether its code is open source. Models are sorted alphabetically by name.
**Non-Transformer-based models**

| Model | Time | Params | Base model | Pretraining Dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | ✓ |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | ✓ |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | ✓ |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | ✗ |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | ✗ |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | ✓ |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | ✗ |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | ✗ |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | ✓ |
**Transformer-based models: encoder-only**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | ✓ |
| AbLang2 | 2024.02 | - | OAS | ✓ |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | ✓ |
| AminoBert | 2022.10 | - | UniRef90, PDB, MGnify | ✗ |
| AntiBERTa | 2022.07 | 86M | OAS | ✓ |
| AntiBERTy | 2021.12 | 26M | OAS | ✓ |
| BALM | 2024.05 | - | OAS | ✓ |
| DistilProtBert | 2022.09 | 230M | UniRef50 | ✓ |
| ESM-1b | 2020.02 | 650M | UniRef50 | ✓ |
| ESM-1v | 2021.02 | 650M | UniRef90 | ✓ |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | ✓ |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | ✓ |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | ✓ |
| ESM-C | 2024.12 | 300M, 600M, 6B | UniRef, MGnify, JGI | ✗ |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | ✓ |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | ✓ |
| IgBert | 2024.12 | 420M | OAS | ✗ |
| LM-GVP | 2022.04 | - | - | ✓ |
| OntoProtein | 2022.06 | - | ProteinKG25 | ✓ |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | ✗ |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | ✗ |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | ✓ |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | ✓ |
| ProteinBERT | 2022.03 | 16M | UniRef90 | ✓ |
| ProteinLM | 2021.12 | 200M/3B | Pfam | ✓ |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | ✓ |
| ProtTrans | 2021.07 | - | UniRef, BFD | ✓ |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | ✓ |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | ✓ |
**Transformer-based models: decoder-only**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | ✗ |
| IgLM | 2022.12 | 13M | - | ✓ |
| PoET | 2023.11 | 57M-604M | - | ✗ |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | ✓ |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | ✓ |
| ProLLaMA | 2024.02 | - | UniRef50 | ✓ |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | ✓ |
| RITA | 2022.05 | 1.2B | UniRef100 | ✗ |
| ZymCTRL | 2022.01 | 738M | BRENDA | ✓ |
**Transformer-based models: encoder-decoder**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | ✓ |
| IgT5 | 2024.12 | 3B | OAS | ✗ |
| LM-Design | 2023.02 | 664M | - | ✗ |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | ✓ |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | ✓ |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | ✓ |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | ✓ |
| pAbT5 | 2023.10 | - | OAS | ✗ |
| Sapiens | 2022.02 | 0.6M | OAS | ✓ |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | ✗ |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | ✗ |
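The encoder-only and decoder-only families above are distinguished mainly by their pretraining objectives: encoder-only models (e.g. ESM-2) are trained with BERT-style masked-residue recovery, while decoder-only models (e.g. ProGen2) are trained autoregressively. The sketch below illustrates both objectives on a toy sequence; it is a minimal plain-Python illustration, not code from any of the listed models.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def mlm_corrupt(seq, mask_rate=0.15, mask_token="<mask>", rng=None):
    """BERT-style corruption used by encoder-only PLMs: hide a fraction
    of residues; the model must recover them from bidirectional context."""
    rng = rng or random.Random(0)
    tokens, targets = [], []
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(mask_token)
            targets.append((i, aa))  # (position, true residue) to predict
        else:
            tokens.append(aa)
    return tokens, targets

def causal_pairs(seq):
    """Autoregressive objective used by decoder-only PLMs: predict each
    residue from its left context only."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

seq = "MKTAYIAKQR"  # toy sequence for illustration
tokens, targets = mlm_corrupt(seq, rng=random.Random(42))
pairs = causal_pairs(seq)
```

Encoder-decoder models such as ProtT5 combine the two ideas: an encoder reads the corrupted input and an autoregressive decoder reconstructs the hidden residues.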
Protein datasets fall into two categories depending on whether they include annotations: pre-training datasets and benchmarks. Pre-training datasets, which lack labels, are typically used for self-supervised pre-training, whereas benchmarks, which contain labeled data, are used for supervised fine-tuning or model evaluation. The tables below list the relevant information for the pre-training datasets and benchmarks of popular protein language models, sorted alphabetically by name. The pre-training datasets are divided into sequence datasets and structural datasets, and the benchmarks into structural, functional, and other benchmarks.
**Pre-training datasets: sequence**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD[1,2,3] | 2021.07 | 2.5B | ✓ |
| BRENDA | 2002.01 | - | ✓ |
| MGnify | 2022.12 | - | ✓ |
| Pfam | 2023.09 | 47M | ✓ |
| UniClust30 | 2016.11 | - | ✓ |
| UniParc | 2023.11 | 632M | ✓ |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | ✓ |
| UniProtKB/TrEMBL | 2023.11 | 251M | ✓ |
| UniRef50[1,2] | 2023.11 | 53M | ✓ |
| UniRef90[1,2] | 2023.11 | 150M | ✓ |
| UniRef100[1,2] | 2023.11 | 314M | ✓ |
**Pre-training datasets: structural**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB[1,2] | 2021.11 | 200M | ✓ |
| PDB | 2023.12 | 214K | ✓ |
**Benchmarks: structural**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | ✓ |
| CASP | - | - | ✓ |
| CATH | 2023.02 | 151M | ✓ |
| SCOP | 2023.01 | 914K | ✓ |
**Benchmarks: functional**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | ✓ |
| EC | 2023.11 | 2.6M | ✓ |
| FLIP | 2022.01 | 320K | ✓ |
| GO | 2023.11 | 1.5M | ✓ |
**Benchmarks: other**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | ✓ |
| ProteinGym | 2022.12 | 300K | ✓ |
| TAPE | 2021.09 | 120K | ✓ |
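Several of the benchmarks above (e.g. ProteinGym) score models by the Spearman rank correlation between model scores and experimentally measured values. Below is a minimal, dependency-free sketch of that metric; in practice one would use `scipy.stats.spearmanr`.

```python
def rank(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, true):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rp, rt = rank(pred), rank(true)
    n = len(pred)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    vp = sum((a - mp) ** 2 for a in rp) ** 0.5
    vt = sum((b - mt) ** 2 for b in rt) ** 0.5
    return cov / (vp * vt)
```

Because only ranks matter, any monotone transformation of the model's scores leaves the metric unchanged, which is why it is a natural fit for zero-shot fitness prediction.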
The tables below list commonly used protein tools for readers, sorted alphabetically by name. The tools are divided into sequence tools, structural tools, and other tools.
**Sequence tools**

| Tool | Link |
|---|---|
| BLAST | ✓ |
| HHblits & HHfilter | ✓ |
| MMseqs2 | ✓ |
**Structural tools**

| Tool | Link |
|---|---|
| Foldseek | ✓ |
| PyMOL | ✓ |
| TM-align | ✓ |
**Other tools**

| Tool | Link |
|---|---|
| t-SNE | ✓ |
| UMAP | ✓ |
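t-SNE and UMAP are typically applied to fixed-length per-protein embeddings, e.g. mean-pooled hidden states from one of the models above. As a lightweight, hypothetical stand-in for such embeddings, the sketch below converts sequences into 20-dimensional amino-acid composition vectors that could then be projected to 2D; the function name is illustrative, not from any library.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_vector(seq):
    """20-dim residue-frequency vector: a toy stand-in for a PLM embedding
    that t-SNE/UMAP would project to 2D for visualization."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

vectors = [composition_vector(s) for s in ["MKTAYIAKQR", "GGGGSGGGGS"]]
# These vectors would then be passed to e.g. sklearn.manifold.TSNE or umap.UMAP.
```

Real PLM embeddings (hundreds to thousands of dimensions) are used the same way: one vector per protein, stacked into a matrix and handed to the projection tool.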
