Warning
This repository has been archived and is no longer maintained.
The code is provided for historical reference and may contain unpatched or unknown vulnerabilities.
It should not be used in production systems.
Pipeliner Index Maker
Given
- a reference fasta file (ref.fa) and
- a GTF style annotation file (genes.gtf),
these set of scripts orchestrate the creation of required files to run RNASeq CCBR pipeliner on Biowulf.
Disclaimer
If you have 2 GTFs, eg. viral + host hybrid genomes, then you need to create one FASTA and one GTF file for the hybrid genome prior to running PIM.
Creating Hyrbid Fasta File
Once you have two fasta files (eg. viral fasta and host fasta), then we can use fasta_formatter like this on biowulf:
% cat host.fa virus.fa | fasta_formatter -w80 -o host_virus.faCreating GTF File
Most times you do have a well-formated GTF file for the host from Ensembl or Gencode, but that is not the case for the virus. The thing to ensure is that your GTF (even if it is hand curated) includes:
- a
genefeature - each
genefeature has atleast onetranscriptfeature - and each
transcriptfeature has atleast oneexonfeature. If not then the GTF file needs to be manipulated until these conditions are satisfied.
Here is an example feature from a hand-curated Biotyn_probe GTF file:
Biot1 BiotynProbe gene 1 21 0.000000 + . gene_id "Biot1"; gene_name "Biot1"; gene_biotype "biotynlated_probe_control";
Biot1 BiotynProbe transcript 1 21 0.000000 + . gene_id "Biot1"; gene_name "Biot1"; gene_biotype "biotynlated_probe_control"; transcript_id "Biot1"; transcript_name "Biot1"; transcript_type "biotynlated_probe_control";
Biot1 BiotynProbe exon 1 21 0.000000 + . gene_id "Biot1"; transcript_id "Biot1"; transcript_type "biotynlated_probe_control";In this tab-delimited example,
- first line:
genefeature with 2 required attributes in column9, namely,gene_id,gene_nameand an optional attributegene_biotype - second line:
transcriptfor the abovegenerepeating the attributes with few more attributes, namely,transcript_id(required),transcript_name(required) andtranscript_type(optional) - third line:
exonfor the above transcript withgene_idandtranscript_idattributes required.
Once you have the host and virus GTFs in acceptable formats, you can simply concatenate them to use as input for PIM, like so:
% cat host.gtf virus.gtf > host_virus.gtfNOTE
It is important to ensure that the sequence ids in the fasta match with the sequence ids in the gtf, else PIM will produce unexpected results.
% grep "^>" host_virus.fa | awk '{print $1}' | sed "s/>//g" | sort > fasta_sequence_ids.txt % cut -f1 host_virus.gtf | sort | uniq > gtf_sequence_ids.txt % diff fasta_sequence_ids.txt gtf_sequence_ids.txtIdeally, the above
diffcommand should produce no result, indicating that the sequence_ids are identical.
Once you have a fasta and a GTF file, here are the steps to create an index folder:
- check out PIM
% git clone https://github.com/CCBR/PIM.gitThese files are already checked out at /data/CCBR_Pipeliner/db/PipeDB/PIM on Biowulf
-
appropriately edit the
config.yaml. See details of YAML file below. -
submit jobs to slurm using
IndexMaker. See details ofIndexMakerbelow. -
create a new genome specific JSON file to be added to the CCBR Pipliner
This file has all the required inputs to run the PIM
| Variable | Comment |
|---|---|
| GENOME | Name of the genome, eg. "mm10" |
| REFFA | Absolute path to reference fasta file |
| GTFFILE | Absolute path to annotation GTF file |
| OUTDIR | Absoulte path to where the results (indices) will be saved |
| SCRIPTSDIR | Absolute path to where the scripts in this repo have been checked out |
| READLENGTHS | List of STAR readlengths to generate indices for |
Here is an example config.yaml file:
GENOME: "mm10"
REFFA: "/data/CCBR_Pipeliner/db/PipeDB/Indices/mm10_basic/indexes/mm10.fa"
GTFFILE: "/data/CCBR_Pipeliner/db/PipeDB/Indices/mm10_basic/genes.gtf"
OUTDIR: "/scratch/indexmaker/test"
SCRIPTSDIR: "/scratch/indexmaker/Scripts"
READLENGTHS:
- 50
- 100
- Please also see an example configuration file in 'examples/' directory.
After editing YAML file, you can dryrun using
% bash IndexMaker --use-config -n- Note: This assumes you have a
config.yamlfile in the root of PIM's directory:/path/to/repo/of/PIM/config.yaml.
If everything checks out then run:
% bash IndexMaker --use-configThis will submit jobs to SLURM job scheduler to create the reference files. To view cluster resources requested, please read cluster.json.