At the intersection of the rapidly growing biological data landscape and advances in Natural Language Processing (NLP), protein language models (PLMs) have emerged as a transformative force in modern research. These models have achieved remarkable progress, highlighting the need for timely and comprehensive overviews. However, much of the existing literature focuses narrowly on specific domains and lacks a broader, systematic analysis of PLMs. This study provides a systematic review of PLMs from a macro perspective, covering key historical milestones and current mainstream trends. We focus on the models themselves and their evaluation metrics, exploring aspects such as model architectures, positional encoding, scaling laws, and datasets. In the evaluation section, we discuss benchmarks and downstream applications. To further support ongoing research, we also introduce relevant mainstream tools. Lastly, we critically examine the key challenges and limitations of this rapidly evolving field.
- [2025/02] Our paper has been submitted to a preprint server.
Below is an overview of our article.
We categorize protein language models into two groups: non-Transformer-based models and Transformer-based models. The Transformer-based models are further divided into encoder-only, decoder-only, and encoder-decoder models. For each model, the tables below provide its release time, parameter count, base model (where applicable), pretraining dataset, and whether its code is open source. Models are sorted alphabetically by name.
**Non-Transformer-based models**

| Model | Time | Params | Base model | Pretraining Dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | ✓ |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | ✓ |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | ✓ |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | ✗ |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | ✗ |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | ✓ |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | ✗ |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | ✗ |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | ✓ |
**Transformer-based models: encoder-only**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | ✓ |
| AbLang2 | 2024.02 | - | OAS | ✓ |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | ✓ |
| AminoBert | 2022.10 | - | UniRef90, PDB, MGnify | ✗ |
| AntiBERTa | 2022.07 | 86M | OAS | ✓ |
| AntiBERTy | 2021.12 | 26M | OAS | ✓ |
| BALM | 2024.05 | - | OAS | ✓ |
| DistilProtBert | 2022.09 | 230M | UniRef50 | ✓ |
| ESM-1b | 2020.02 | 650M | UniRef50 | ✓ |
| ESM-1v | 2021.02 | 650M | UniRef90 | ✓ |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | ✓ |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | ✓ |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | ✓ |
| ESM-C | 2024.12 | 300M, 600M, 6B | UniRef, MGnify, JGI | ✗ |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | ✓ |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | ✓ |
| IgBert | 2024.12 | 420M | OAS | ✗ |
| LM-GVP | 2022.04 | - | - | ✓ |
| OntoProtein | 2022.06 | - | ProteinKG25 | ✓ |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | ✗ |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | ✗ |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | ✓ |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | ✓ |
| ProteinBERT | 2022.03 | 16M | UniRef90 | ✓ |
| ProteinLM | 2021.12 | 200M/3B | Pfam | ✓ |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | ✓ |
| ProtTrans | 2021.07 | - | UniRef, BFD | ✓ |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | ✓ |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | ✓ |
**Transformer-based models: decoder-only**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | ✗ |
| IgLM | 2022.12 | 13M | - | ✓ |
| PoET | 2023.11 | 57M-604M | - | ✗ |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | ✓ |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | ✓ |
| ProLLaMA | 2024.02 | - | UniRef50 | ✓ |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | ✓ |
| RITA | 2022.05 | 1.2B | UniRef100 | ✗ |
| ZymCTRL | 2022.01 | 738M | BRENDA | ✓ |
**Transformer-based models: encoder-decoder**

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | ✓ |
| IgT5 | 2024.12 | 3B | OAS | ✗ |
| LM-Design | 2023.02 | 664M | - | ✗ |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | ✓ |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | ✓ |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | ✓ |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | ✓ |
| pAbT5 | 2023.10 | - | OAS | ✗ |
| Sapiens | 2022.02 | 0.6M | OAS | ✓ |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | ✗ |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | ✗ |
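The encoder-only and decoder-only families above are distinguished mainly by their pretraining objectives: encoder-only models (e.g. ESM-2) are trained with BERT-style masked-residue recovery, while decoder-only models (e.g. ProGen2) are trained autoregressively. The sketch below illustrates both objectives on a toy sequence; it is a minimal plain-Python illustration, not code from any of the listed models.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def mlm_corrupt(seq, mask_rate=0.15, mask_token="<mask>", rng=None):
    """BERT-style corruption used by encoder-only PLMs: hide a fraction
    of residues; the model must recover them from bidirectional context."""
    rng = rng or random.Random(0)
    tokens, targets = [], []
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(mask_token)
            targets.append((i, aa))  # (position, true residue) to predict
        else:
            tokens.append(aa)
    return tokens, targets

def causal_pairs(seq):
    """Autoregressive objective used by decoder-only PLMs: predict each
    residue from its left context only."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

seq = "MKTAYIAKQR"  # toy sequence for illustration
tokens, targets = mlm_corrupt(seq, rng=random.Random(42))
pairs = causal_pairs(seq)
```

Encoder-decoder models such as ProtT5 combine the two ideas: an encoder reads the corrupted input and an autoregressive decoder reconstructs the hidden residues.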
Protein datasets fall into two categories depending on whether they include annotations: pre-training datasets and benchmarks. Pre-training datasets, which lack labels, are typically used for self-supervised pre-training, whereas benchmarks, which contain labeled data, are used for supervised fine-tuning or model evaluation. The tables below list the relevant information for the pre-training datasets and benchmarks of popular protein language models, sorted alphabetically by name. The pre-training datasets are divided into sequence datasets and structural datasets, and the benchmarks into structural, functional, and other benchmarks.
**Pre-training datasets: sequence**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD[1,2,3] | 2021.07 | 2.5B | ✓ |
| BRENDA | 2002.01 | - | ✓ |
| MGnify | 2022.12 | - | ✓ |
| Pfam | 2023.09 | 47M | ✓ |
| UniClust30 | 2016.11 | - | ✓ |
| UniParc | 2023.11 | 632M | ✓ |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | ✓ |
| UniProtKB/TrEMBL | 2023.11 | 251M | ✓ |
| UniRef50[1,2] | 2023.11 | 53M | ✓ |
| UniRef90[1,2] | 2023.11 | 150M | ✓ |
| UniRef100[1,2] | 2023.11 | 314M | ✓ |
**Pre-training datasets: structural**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB[1,2] | 2021.11 | 200M | ✓ |
| PDB | 2023.12 | 214K | ✓ |
**Benchmarks: structural**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | ✓ |
| CASP | - | - | ✓ |
| CATH | 2023.02 | 151M | ✓ |
| SCOP | 2023.01 | 914K | ✓ |
**Benchmarks: functional**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | ✓ |
| EC | 2023.11 | 2.6M | ✓ |
| FLIP | 2022.01 | 320K | ✓ |
| GO | 2023.11 | 1.5M | ✓ |
**Benchmarks: other**

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | ✓ |
| ProteinGym | 2022.12 | 300K | ✓ |
| TAPE | 2021.09 | 120K | ✓ |
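Several of the benchmarks above (e.g. ProteinGym) score models by the Spearman rank correlation between model scores and experimentally measured values. Below is a minimal, dependency-free sketch of that metric; in practice one would use `scipy.stats.spearmanr`.

```python
def rank(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, true):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rp, rt = rank(pred), rank(true)
    n = len(pred)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    vp = sum((a - mp) ** 2 for a in rp) ** 0.5
    vt = sum((b - mt) ** 2 for b in rt) ** 0.5
    return cov / (vp * vt)
```

Because only ranks matter, any monotone transformation of the model's scores leaves the metric unchanged, which is why it is a natural fit for zero-shot fitness prediction.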
The tables below list commonly used protein tools for readers, sorted alphabetically by name. The tools are divided into sequence tools, structural tools, and other tools.
**Sequence tools**

| Tool | Link |
|---|---|
| BLAST | ✓ |
| HHblits & HHfilter | ✓ |
| MMseqs2 | ✓ |
**Structural tools**

| Tool | Link |
|---|---|
| Foldseek | ✓ |
| PyMOL | ✓ |
| TM-align | ✓ |
**Other tools**

| Tool | Link |
|---|---|
| t-SNE | ✓ |
| UMAP | ✓ |
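t-SNE and UMAP are typically applied to fixed-length per-protein embeddings, e.g. mean-pooled hidden states from one of the models above. As a lightweight, hypothetical stand-in for such embeddings, the sketch below converts sequences into 20-dimensional amino-acid composition vectors that could then be projected to 2D; the function name is illustrative, not from any library.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_vector(seq):
    """20-dim residue-frequency vector: a toy stand-in for a PLM embedding
    that t-SNE/UMAP would project to 2D for visualization."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

vectors = [composition_vector(s) for s in ["MKTAYIAKQR", "GGGGSGGGGS"]]
# These vectors would then be passed to e.g. sklearn.manifold.TSNE or umap.UMAP.
```

Real PLM embeddings (hundreds to thousands of dimensions) are used the same way: one vector per protein, stacked into a matrix and handed to the projection tool.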
