Code for the paper: Pre-training Genomic Language Model with Human Variants for Better Understanding Functional Genomics
We use separate environments for running UKBioBERT and UKBioFormer to avoid package conflicts. In addition, pre-training or inference with UKBioBERT requires a different environment depending on whether the device has an H100 GPU.
For UKBioBERT, please create the environment as follows (without H100):
conda env create -f ukbiobert_pretrain.yml
For UKBioBERT, please create the environment as follows (with H100):
conda env create -f ukbiobert_pretrain_h100.yml
For UKBioFormer and fine-tuning, please create the environment as follows:
conda env create -f ukbioformer.yml
If you encounter errors during installation, please check the error messages, comment out the packages causing the errors, and try again using conda's update function.
Our experiments require access to several databases, including UK Biobank, GTEx, and ROSMAP. Since we are not allowed to share the data, please apply for access directly.
To process the datasets used for pre-training, please refer to the contents of the folder pretraining. The idea is to map variants onto reference sequences to generate pre-training samples.
To process the datasets used for inference and fine-tuning, please refer to the code in this repo, which is comprehensive and has been tested on our side.
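To illustrate the idea of mapping variants onto reference sequences, here is a minimal, hypothetical sketch; the function name, coordinate convention, and data format are illustrative assumptions, not the repo's actual API (see the pretraining folder for the real pipeline):

```python
# Hypothetical sketch: apply a single-nucleotide (or small indel) variant to a
# reference sequence to produce a variant-carrying pre-training sample.
# Assumes 0-based positions; the real pipeline may differ.

def apply_variant(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the variant (ref -> alt) applied at position pos."""
    # Sanity check: the stated reference allele must match the sequence.
    assert ref_seq[pos:pos + len(ref)] == ref, "reference allele mismatch"
    return ref_seq[:pos] + alt + ref_seq[pos + len(ref):]

reference = "ACGTACGTAC"
variant_sample = apply_variant(reference, 4, "A", "G")
print(variant_sample)  # ACGTGCGTAC
```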
Please refer to the code in the folder pretraining for details. We recommend at least one H100 or A100 (80GB) GPU for pre-training.
We cannot share the pre-trained models directly due to UK Biobank restrictions; the model weights have been returned to UK Biobank for access.
Please refer to the code in the folders improve_cellline, ukbiobert_application, and ukbioformer_application for details. We recommend at least one A40 (48GB) GPU for fine-tuning.
Since the genetic data we used are protected, we can only provide a demo file in the demo folder showing how to run inference with dummy data. Sorry for the inconvenience.
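For intuition, the sketch below generates a dummy DNA sequence and tokenizes it into overlapping k-mers, one common tokenization for genomic language models; the tokenizer and k value here are illustrative assumptions, and the actual demo in the demo folder uses the real model and tokenizer:

```python
# Hypothetical sketch of preparing dummy input for inference; not the repo's
# actual demo code.
import random

random.seed(0)
dummy_seq = "".join(random.choice("ACGT") for _ in range(128))

def kmer_tokenize(seq: str, k: int = 6):
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize(dummy_seq)
print(len(tokens))  # 123 tokens for a 128-bp sequence with k=6
```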
We thank the teams behind Huggingface, grelu, enformer_pytorch, borzoi_pytorch, and performer for their excellent code.
Please contact Tianyu Liu if you have any questions (email: tianyu.liu@yale.edu).
@article{liu2025pre,
title={Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics},
author={Liu, Tianyu and Zhang, Xiangyu and Lin, Jiecong and Pinello, Luca and Ying, Rex and Zhao, Hongyu},
journal={npj Artificial Intelligence (in press)},
pages={2025--02},
year={2026},
publisher={Springer Nature}
}