Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions workflows/variant-calling/rna-variant-discovery/.dockstore.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
version: 1.2
workflows:
- name: rna-variant-discovery
subclass: Galaxy
publish: true
primaryDescriptorPath: /RNAvar.ga
testParameterFiles:
- /RNAvar-tests.yml
authors:
- name: "Emre Gülben"
orcid: "https://orcid.org/0009-0009-9085-1055"
5 changes: 5 additions & 0 deletions workflows/variant-calling/rna-variant-discovery/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Changelog

## [0.1] - 2026-03-27

- First release.
26 changes: 26 additions & 0 deletions workflows/variant-calling/rna-variant-discovery/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# RNA-seq Variant Discovery (RNAvar)

This workflow is an implementation of a standard RNA-seq variant discovery pipeline. It handles the transition from raw reads to annotated variants using industry-standard tools.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the README should clearly state that the aim is to provide a faithful reimplementation of https://nf-co.re/rnavar and that currently there are still limitations.

## Workflow Logic
1. **Alignment:** **STAR** in 2-pass mode for high-accuracy splice-aware mapping.
2. **Preprocessing:** **MarkDuplicates** and **GATK4 SplitNCigarReads** to prepare the BAM for variant calling by splitting reads into exon segments.
3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These filters are user-configurable in the nf-core pipeline so should become workflow parameters, too.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 5146320

6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes limited sense to hard-code the SnpEff database and, thereby, restrict analysis to just one organism and genome version. The database choice should instead be turned into a workflow parameter.

I would populate the user choices from the "Locally installed SnpEff databases" instead of letting the user download databases on demand.

Please let us know if you need help with this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here: 5146320


Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After presenting the current workflow logic, it wouldbe good to have a "Limitations" section, where you discuss which features of the nf-core ppeline are currently not yet implemented.

Things hat I can immediately think of here are

  • Sample Sheet support
  • support for SnpEff alternatives for variant annotation
  • optional UMI-Tools-based preprocessing
  • no parallelization of the GATK4 SplitNCigarReads step (is this something that should be implemented at the tool level similar to what's done in the freebayes wrapper?
  • base recalibration and variant filtering are currently non-optional when they are optional in th nf-core pipeline

Basically, compare https://nf-co.re/rnavar/1.2.3/docs/usage/ to what's implemented here.
Of course, if you still want to fix any of these limitations, you're welcome to do so.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the limitations section: 5146320

## Inputs
* **Forward Reads (R1)**: Fastq sequencing reads (forward). Supports gzipped (`.gz`) files.
* **Reverse Reads (R2)**: Fastq sequencing reads (reverse). Supports gzipped (`.gz`) files.
* **Reference Genome**: FASTA file of the reference genome (e.g., hg38).
* **Genome Annotation**: GFF3 file containing gene models for splice-aware alignment.
* **Known Variants (dbSNP)**: VCF file of known polymorphisms for base quality score recalibration.
* **Known Indels**: VCF file of known insertions and deletions for base quality score recalibration.

## Outputs
* **Final Annotated Variants**: VCF file containing discovered variants with functional annotations (impact, effect, gene names) from SnpEff.
* **MultiQC Quality Report**: An HTML report aggregating quality metrics from alignment and preprocessing steps.

## Testing and Parity
The workflow is validated against a targeted subset of human RNA-seq data on chromosome 22. It successfully identifies the 54 high-confidence variants required to match the output of the original discovery engine.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be more precise here and state the origin of that test data.

43 changes: 43 additions & 0 deletions workflows/variant-calling/rna-variant-discovery/RNAvar-tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
- doc: Test outline for RNAvar
job:
"Forward Reads (R1)":
class: File
path: test-data/test_rnaseq_1.fastq.gz
filetype: fastqsanger.gz
"Reverse Reads (R2)":
class: File
path: test-data/test_rnaseq_2.fastq.gz
filetype: fastqsanger.gz
"Reference Genome":
class: File
path: test-data/genome.fasta
filetype: fasta
"Genome Annotation":
class: File
path: test-data/genome.gff3
filetype: gff3
"Known Variants (dbSNP)":
class: File
path: test-data/dbsnp_146.hg38.vcf
filetype: vcf
"Known Indels":
class: File
path: test-data/mills_and_1000G.indels.vcf
filetype: vcf
outputs:
"Final Annotated Variants":
asserts:
# Expected variants: 54 | VCF Header lines: 36 | Total: 90.
- has_n_lines:
n: 90
# Verify the content is scientifically correct by checking for
# the presence of the target chromosome string.
- has_text:
text: "chr22"
# Ensure SnpEff successfully added functional annotations
- has_text:
text: "ANN="
"MultiQC Quality Report":
asserts:
- has_text:
text: "MultiQC"
Loading