
HelloWorldLTY/UKBioLM


Code for the paper: Pre-training Genomic Language Model with Human Variants for Better Understanding Functional Genomics

Installation

We use separate environments for UKBioBERT and UKBioFormer to avoid package conflicts. In addition, pre-training or inference with UKBioBERT requires different environments depending on whether your device has an H100 GPU.

For UKBioBERT (without an H100), create the environment with:

conda env create -f ukbiobert_pretrain.yml

For UKBioBERT (with an H100), create the environment with:

conda env create -f ukbiobert_pretrain_h100.yml

For UKBioFormer and fine-tuning, create the environment with:

conda env create -f ukbioformer.yml

If you encounter errors during installation, check the error messages, comment out the packages causing them in the corresponding yml file, and try again with conda's update function (conda env update).

Dataset Preparation

Our experiments require access to several databases, including UK Biobank, GTEx, and ROSMAP. Since we are not allowed to share these data, please apply for access directly.

To process the datasets used for pre-training, please refer to the content of the pretraining folder. The idea is to map variants onto reference sequences to generate pre-training samples.
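The variant-mapping idea can be sketched as follows. This is a minimal, hypothetical illustration of substituting a single-nucleotide variant into a reference sequence, not the repository's actual pipeline; the function name, coordinate convention, and toy sequence are all assumptions.

```python
# Hypothetical sketch: apply a single-nucleotide variant (SNV) to a
# reference sequence to produce an individual-specific pre-training sample.
# Field names and the 0-based coordinate convention are illustrative only.

def apply_snv(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Substitute alt_allele at 0-based position pos, checking the reference allele."""
    assert reference[pos:pos + len(ref_allele)] == ref_allele, "reference allele mismatch"
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]

# Example: a toy 16-bp reference with a C>T variant at position 5.
reference = "ACGTACGTACGTACGT"
sample = apply_snv(reference, 5, "C", "T")
print(sample)  # ACGTATGTACGTACGT
```

In practice one would stream variants per individual from the genotype files and apply them window by window, but the core substitution step looks like the above.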

To process the datasets used for inference and fine-tuning, please refer to the code in this repo, which is comprehensive and has been tested on our side.

Pre-training

Please refer to the code in the pretraining folder for details. We recommend at least one H100 or A100 (80GB) GPU for pre-training.

We cannot directly share the pre-trained models due to UK Biobank restrictions; the model weights have been returned to UK Biobank for access.

Applications

Please refer to the code in the folders improve_cellline, ukbiobert_application, and ukbioformer_application for details. We recommend at least one A40 (48GB) GPU for fine-tuning.

Since the genetic data we used are protected, we can only provide a demo file in the demo folder showing how to run inference with dummy data. Sorry for the inconvenience.
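As a rough idea of what dummy-data inference input looks like, the sketch below converts a dummy DNA sequence into overlapping k-mer tokens, a common input unit for BERT-style genomic language models. The k-mer size and the tokenization scheme are assumptions for illustration; see the demo folder for the actual procedure.

```python
# Hypothetical sketch, not the repo's demo: prepare a dummy DNA sequence
# as overlapping k-mer tokens for a BERT-style genomic model.
# k = 6 is an illustrative choice.

def kmer_tokenize(sequence: str, k: int = 6) -> list:
    """Split a sequence into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

dummy = "ACGTACGTAC"  # dummy sequence, no real genetic data needed
tokens = kmer_tokenize(dummy, k=6)
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

The resulting token list would then be mapped to vocabulary IDs and fed through the model's forward pass.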

Acknowledgement

We thank the teams behind Hugging Face, grelu, enformer_pytorch, borzoi_pytorch, and performer for their great code.

Contact

Please contact Tianyu Liu if you have any questions (email: tianyu.liu@yale.edu).

Citation

@article{liu2025pre,
  title={Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics},
  author={Liu, Tianyu and Zhang, Xiangyu and Lin, Jiecong and Pinello, Luca and Ying, Rex and Zhao, Hongyu},
  journal={NPJ Artificial Intelligence (in press)},
  pages={2025--02},
  year={2026},
  publisher={Springer Nature}
}

About

[NPJ AI] A foundation model for individual genome modelling
