Code for the paper: Pre-training Genomic Language Model with Human Variants for Better Understanding Functional Genomics
We use separate environments for running UKBioBERT and UKBioFormer to avoid package conflicts. In addition, pre-training or inference with UKBioBERT requires a different environment depending on whether the device has an H100 GPU.
For UKBioBERT, please create the environment as follows (without H100):
conda env create -f ukbiobert_pretrain.yml
For UKBioBERT, please create the environment as follows (with H100):
conda env create -f ukbiobert_pretrain_h100.yml
For UKBioFormer and fine-tuning, please create the environment as follows:
conda env create -f ukbioformer.yml
If you encounter errors during installation, please check the error messages, comment out the packages causing the errors, and try again using conda's update function.
Our experiments require access to several databases, including UK Biobank, GTEx, and ROSMAP. Since we are not allowed to share the data, please apply for access directly.
To process the datasets used for pre-training, please refer to the contents of the folder pretraining. The idea is to map variants onto reference sequences to generate pre-training samples.
To process the datasets used for inference and fine-tuning, please refer to the code in this repo, which is comprehensive and has been tested on our side.
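To illustrate the idea of mapping variants onto reference sequences, here is a minimal, hypothetical sketch; the function name, coordinate convention, and data format are illustrative assumptions, not the repo's actual API (see the pretraining folder for the real pipeline):

```python
# Hypothetical sketch: apply a single-nucleotide (or small indel) variant to a
# reference sequence to produce a variant-carrying pre-training sample.
# Assumes 0-based positions; the real pipeline may differ.

def apply_variant(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the variant (ref -> alt) applied at position pos."""
    # Sanity check: the stated reference allele must match the sequence.
    assert ref_seq[pos:pos + len(ref)] == ref, "reference allele mismatch"
    return ref_seq[:pos] + alt + ref_seq[pos + len(ref):]

reference = "ACGTACGTAC"
variant_sample = apply_variant(reference, 4, "A", "G")
print(variant_sample)  # ACGTGCGTAC
```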
Please refer to the code in the folder pretraining for details. We recommend at least one H100 or A100 (80GB) GPU for pre-training.
We cannot share the pre-trained models directly due to UK Biobank restrictions; the model weights have been returned to UK Biobank for access.
Please refer to the code in the folders improve_cellline, ukbiobert_application, and ukbioformer_application for details. We recommend at least one A40 (48GB) GPU for fine-tuning.
Since the genetic data we used are protected, we can only provide a demo file in the demo folder showing how to run inference with dummy data. Sorry for the inconvenience.
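For intuition, the sketch below generates a dummy DNA sequence and tokenizes it into overlapping k-mers, one common tokenization for genomic language models; the tokenizer and k value here are illustrative assumptions, and the actual demo in the demo folder uses the real model and tokenizer:

```python
# Hypothetical sketch of preparing dummy input for inference; not the repo's
# actual demo code.
import random

random.seed(0)
dummy_seq = "".join(random.choice("ACGT") for _ in range(128))

def kmer_tokenize(seq: str, k: int = 6):
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize(dummy_seq)
print(len(tokens))  # 123 tokens for a 128-bp sequence with k=6
```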
We thank the teams behind Huggingface, grelu, enformer_pytorch, borzoi_pytorch, and performer for their excellent code.
Please contact Tianyu Liu if you have any questions (email: tianyu.liu@yale.edu).
@article{liu2025pre,
title={Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics},
author={Liu, Tianyu and Zhang, Xiangyu and Lin, Jiecong and Pinello, Luca and Ying, Rex and Zhao, Hongyu},
journal={npj Artificial Intelligence (in press)},
pages={2025--02},
year={2026},
publisher={Springer Nature}
}