Add RNA-seq Variant Discovery (RNAvar) workflow #1188
base: main
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| version: 1.2 | ||
| workflows: | ||
| - name: rna-variant-discovery | ||
| subclass: Galaxy | ||
| publish: true | ||
| primaryDescriptorPath: /RNAvar.ga | ||
| testParameterFiles: | ||
| - /RNAvar-tests.yml | ||
| authors: | ||
| - name: "Emre Gülben" | ||
| orcid: "https://orcid.org/0009-0009-9085-1055" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # Changelog | ||
|
|
||
| ## [0.1] - 2026-03-27 | ||
|
|
||
| - First release. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # RNA-seq Variant Discovery (RNAvar) | ||
|
|
||
| This workflow is an implementation of a standard RNA-seq variant discovery pipeline. It handles the transition from raw reads to annotated variants using industry-standard tools. | ||
|
|
||
| ## Workflow Logic | ||
| 1. **Alignment:** **STAR** in 2-pass mode for high-accuracy splice-aware mapping. | ||
| 2. **Preprocessing:** **MarkDuplicates** and **GATK4 SplitNCigarReads** to prepare the BAM for variant calling by splitting reads into exon segments. | ||
| 3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores. | ||
| 4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases). | ||
| 5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives. | ||
|
Member
These filters are user-configurable in the nf-core pipeline so should become workflow parameters, too.
Author
Done: 5146320 ✅ |
||
| 6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database. | ||
|
Member
It makes limited sense to hard-code the SnpEff database and, thereby, restrict analysis to just one organism and genome version. The database choice should instead be turned into a workflow parameter. I would populate the user choices from the "Locally installed SnpEff databases" instead of letting the user download databases on demand. Please let us know if you need help with this.
Author
Done here: 5146320 |
||
|
|
||
|
Member
After presenting the current workflow logic, it would be good to have a "Limitations" section, where you discuss which features of the nf-core pipeline are currently not yet implemented. Things that I can immediately think of here are
Basically, compare https://nf-co.re/rnavar/1.2.3/docs/usage/ to what's implemented here.
Author
Added the limitations section: 5146320 |
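To illustrate the review request on step 5 (turning the hard-filter thresholds into parameters), here is a minimal Python sketch of how a filter expression could be assembled from configurable values. `HardFilterParams`, `exclude_expression`, and `filter_command` are hypothetical names for illustration, not part of the workflow. Combining the two conditions with `||` into a single exclusion expression is also an assumption, since the README does not say how the workflow invokes `bcftools filter`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HardFilterParams:
    """Hypothetical container for user-configurable thresholds.

    Defaults mirror the values quoted in the README (FS > 30.0, QD < 2.0).
    """
    max_fs: float = 30.0  # records with FS above this value are excluded
    min_qd: float = 2.0   # records with QD below this value are excluded


def exclude_expression(params: HardFilterParams) -> str:
    """Build a bcftools exclusion expression from the thresholds.

    Assumption: a record is dropped if it fails either condition.
    """
    return f"FS > {params.max_fs} || QD < {params.min_qd}"


def filter_command(params: HardFilterParams, in_vcf: str, out_vcf: str) -> List[str]:
    """Return an argument list for `bcftools filter` (not executed here)."""
    return [
        "bcftools", "filter",
        "-e", exclude_expression(params),  # -e/--exclude: drop matching records
        "-o", out_vcf,                     # -o/--output: output file
        in_vcf,
    ]


if __name__ == "__main__":
    # Example: loosen the FS threshold while keeping the default QD cutoff.
    print(filter_command(HardFilterParams(max_fs=60.0), "calls.vcf", "filtered.vcf"))
```

Keeping the thresholds in one structure means the defaults stay visible in a single place while users can override either value independently.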
||
| ## Inputs | ||
| * **Forward Reads (R1)**: Fastq sequencing reads (forward). Supports gzipped (`.gz`) files. | ||
| * **Reverse Reads (R2)**: Fastq sequencing reads (reverse). Supports gzipped (`.gz`) files. | ||
| * **Reference Genome**: FASTA file of the reference genome (e.g., hg38). | ||
| * **Genome Annotation**: GFF3 file containing gene models for splice-aware alignment. | ||
| * **Known Variants (dbSNP)**: VCF file of known polymorphisms for base quality score recalibration. | ||
| * **Known Indels**: VCF file of known insertions and deletions for base quality score recalibration. | ||
|
|
||
| ## Outputs | ||
| * **Final Annotated Variants**: VCF file containing discovered variants with functional annotations (impact, effect, gene names) from SnpEff. | ||
| * **MultiQC Quality Report**: An HTML report aggregating quality metrics from alignment and preprocessing steps. | ||
|
|
||
| ## Testing and Parity | ||
| The workflow is validated against a targeted subset of human RNA-seq data on chromosome 22. It successfully identifies the 54 high-confidence variants required to match the output of the original discovery engine. | ||
|
Member
Be more precise here and state the origin of that test data. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| - doc: Test outline for RNAvar | ||
| job: | ||
| "Forward Reads (R1)": | ||
| class: File | ||
| path: test-data/test_rnaseq_1.fastq.gz | ||
| filetype: fastqsanger.gz | ||
| "Reverse Reads (R2)": | ||
| class: File | ||
| path: test-data/test_rnaseq_2.fastq.gz | ||
| filetype: fastqsanger.gz | ||
| "Reference Genome": | ||
| class: File | ||
| path: test-data/genome.fasta | ||
| filetype: fasta | ||
| "Genome Annotation": | ||
| class: File | ||
| path: test-data/genome.gff3 | ||
| filetype: gff3 | ||
| "Known Variants (dbSNP)": | ||
| class: File | ||
| path: test-data/dbsnp_146.hg38.vcf | ||
| filetype: vcf | ||
| "Known Indels": | ||
| class: File | ||
| path: test-data/mills_and_1000G.indels.vcf | ||
| filetype: vcf | ||
| outputs: | ||
| "Final Annotated Variants": | ||
| asserts: | ||
| # Expected variants: 54 | VCF Header lines: 36 | Total: 90. | ||
| - has_n_lines: | ||
| n: 90 | ||
| # Verify the content is scientifically correct by checking for | ||
| # the presence of the target chromosome string. | ||
| - has_text: | ||
| text: "chr22" | ||
| # Ensure SnpEff successfully added functional annotations | ||
| - has_text: | ||
| text: "ANN=" | ||
| "MultiQC Quality Report": | ||
| asserts: | ||
| - has_text: | ||
| text: "MultiQC" |
Here, the README should clearly state that the aim is to provide a faithful reimplementation of https://nf-co.re/rnavar and that currently there are still limitations.
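The `has_n_lines` assertion in `RNAvar-tests.yml` expects 90 lines in total, which its comment explains as 54 variant records plus 36 header lines. The two numbers are independent, and the header count may change if a tool version writes different metadata lines. The sketch below counts them separately so that the variant count can be checked on its own. `count_vcf_lines` is a hypothetical helper, not part of the workflow or of Galaxy's test framework.

```python
from typing import Iterable, Tuple


def count_vcf_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (header_lines, record_lines) for an uncompressed VCF.

    Header lines start with '#', which covers both the '##' metadata
    lines and the single '#CHROM' column line. Blank lines are ignored.
    """
    header = 0
    records = 0
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            header += 1
        else:
            records += 1
    return header, records


if __name__ == "__main__":
    import sys

    # Usage: python count_vcf_lines.py final_annotated_variants.vcf
    with open(sys.argv[1], encoding="utf-8") as handle:
        n_header, n_records = count_vcf_lines(handle)
    print(f"header={n_header} records={n_records} total={n_header + n_records}")
```

For the test output described above, this would be expected to report 36 header lines and 54 records if the comment in the test file is accurate.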