Add RNA-seq Variant Discovery (RNAvar) workflow #1188
emregulben wants to merge 3 commits into galaxyproject:main
Conversation
Pull request overview
This PR adds a new Galaxy workflow under workflows/variant-calling/rna-variant-discovery/ implementing an RNA-seq variant discovery pipeline (STAR → GATK preprocessing/calling → filtering → SnpEff annotation) along with Dockstore metadata, documentation, and a Planemo test suite with bundled test data.
Changes:
- Added the `RNAvar.ga` Galaxy workflow for RNA-seq variant calling and annotation, with MultiQC reporting.
- Added a Planemo test definition (`RNAvar-tests.yml`) and corresponding small test datasets.
- Added workflow packaging/documentation files (`.dockstore.yml`, `README.md`, `CHANGELOG.md`).
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| workflows/variant-calling/rna-variant-discovery/RNAvar.ga | New Galaxy workflow definition for RNA-seq variant discovery and reporting |
| workflows/variant-calling/rna-variant-discovery/RNAvar-tests.yml | Planemo test job + output assertions for key workflow outputs |
| workflows/variant-calling/rna-variant-discovery/.dockstore.yml | Dockstore registration metadata pointing to the workflow and tests |
| workflows/variant-calling/rna-variant-discovery/README.md | High-level explanation of workflow steps and expected behavior |
| workflows/variant-calling/rna-variant-discovery/CHANGELOG.md | Initial release entry |
| workflows/variant-calling/rna-variant-discovery/test-data/test_rnaseq_1.fastq.gz | Test input R1 reads for Planemo validation |
| workflows/variant-calling/rna-variant-discovery/test-data/test_rnaseq_2.fastq.gz | Test input R2 reads for Planemo validation |
| workflows/variant-calling/rna-variant-discovery/test-data/genome.fasta | Small reference FASTA used by tests |
| workflows/variant-calling/rna-variant-discovery/test-data/genome.gff3 | Small annotation file used by tests |
| workflows/variant-calling/rna-variant-discovery/test-data/dbsnp_146.hg38.vcf | Known variants VCF used for BQSR / known-sites |
| workflows/variant-calling/rna-variant-discovery/test-data/mills_and_1000G.indels.vcf | Known indels VCF used for BQSR / known-sites |
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"annotations\": null, \"chr\": null, \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"csvStats\": true, \"filter\": {\"specificEffects\": \"no\", \"__current_case__\": 0}, \"filterOut\": null, \"generate_gene_stats\": false, \"generate_stats\": true, \"input\": {\"__class__\": \"ConnectedValue\"}, \"inputFormat\": \"vcf\", \"intervals\": null, \"noLog\": true, \"outputFormat\": \"vcf\", \"snpDb\": {\"genomeSrc\": \"named\", \"__current_case__\": 2, \"genome_version\": \"hg38\"}, \"spliceRegion\": {\"setSpliceRegions\": \"no\", \"__current_case__\": 0}, \"spliceSiteSize\": \"2\", \"transcripts\": null, \"udLength\": \"0\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
SnpEff is configured with a hardcoded named genome database (hg38). Since the workflow also accepts a user-provided reference FASTA and annotation, this can silently produce incorrect annotations (or fail) when the inputs are not hg38. Consider either (a) building a SnpEff database from the provided reference+annotation (as other variant-calling workflows here do) or (b) exposing the SnpEff genome version as an explicit workflow parameter and documenting that the workflow is hg38-specific.
If I may add to the Copilot review, I think it would be good to describe in the README each of the expected inputs and the expected outputs.
3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.
It makes limited sense to hard-code the SnpEff database and, thereby, restrict analysis to just one organism and genome version. The database choice should instead be turned into a workflow parameter.
I would populate the user choices from the "Locally installed SnpEff databases" instead of letting the user download databases on demand.
Please let us know if you need help with this.
# RNA-seq Variant Discovery (RNAvar)

This workflow is an implementation of a standard RNA-seq variant discovery pipeline. It handles the transition from raw reads to annotated variants using industry-standard tools.
Here, the README should clearly state that the aim is to provide a faithful reimplementation of https://nf-co.re/rnavar and that currently there are still limitations.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.
After presenting the current workflow logic, it would be good to have a "Limitations" section, where you discuss which features of the nf-core pipeline are currently not yet implemented.
Things that I can immediately think of here are
- Sample Sheet support
- support for SnpEff alternatives for variant annotation
- optional UMI-Tools-based preprocessing
- no parallelization of the GATK4 SplitNCigarReads step (is this something that should be implemented at the tool level, similar to what's done in the freebayes wrapper?)
- base recalibration and variant filtering are currently non-optional when they are optional in the nf-core pipeline
Basically, compare https://nf-co.re/rnavar/1.2.3/docs/usage/ to what's implemented here.
Of course, if you still want to fix any of these limitations, you're welcome to do so.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.

## Testing and Parity
The workflow is validated against a targeted subset of human RNA-seq data on chromosome 22. It successfully identifies the 54 high-confidence variants required to match the output of the original discovery engine.
Be more precise here and state the origin of that test data.
```json
"9": {
    "annotation": "",
    "content_id": "toolshed.g2.bx.psu.edu/repos/devteam/samtools_sort/samtools_sort/2.0.8",
    "errors": null,
    "id": 9,
    "input_connections": {
        "input1": {
            "id": 7,
            "output_name": "mapped_reads"
        }
    },
    "inputs": [],
    "label": null,
    "name": "Samtools sort",
    "outputs": [
        {
            "name": "output1",
            "type": "bam"
        }
    ],
    "position": {
        "left": 1275.194213718916,
        "top": 1154.8021613220824
    },
    "post_job_actions": {},
    "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/samtools_sort/samtools_sort/2.0.8",
    "tool_shed_repository": {
        "changeset_revision": "69dca890852c",
        "name": "samtools_sort",
        "owner": "devteam",
        "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"input1\": {\"__class__\": \"ConnectedValue\"}, \"minhash\": false, \"prim_key_cond\": {\"prim_key_select\": \"\", \"__current_case__\": 0}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
    "tool_uuid": null,
    "tool_version": "2.0.8",
    "type": "tool",
    "uuid": "27f1d0d4-b06c-4e39-a76b-a705964f4a90",
    "when": null,
    "workflow_outputs": []
},
```
The RNAStar Galaxy wrapper should already generate coordinate-sorted BAM output, so is this step necessary?
Please check and/or clarify.
I verified STAR already handles the sorting. Removed the step: 5146320
```json
"10": {
    "annotation": "",
    "content_id": "toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_AddOrReplaceReadGroups/3.1.1.0",
    "errors": null,
    "id": 10,
    "input_connections": {
        "inputFile": {
            "id": 9,
            "output_name": "output1"
        }
    },
    "inputs": [],
    "label": null,
    "name": "AddOrReplaceReadGroups",
    "outputs": [
        {
            "name": "outFile",
            "type": "bam"
        }
    ],
    "position": {
        "left": 1550.8885393603414,
        "top": 1254.1892548481562
    },
    "post_job_actions": {},
    "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_AddOrReplaceReadGroups/3.1.1.0",
    "tool_shed_repository": {
        "changeset_revision": "3f254c5ced1d",
        "name": "picard",
        "owner": "devteam",
        "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"CN\": null, \"DS\": null, \"DT\": null, \"PI\": null, \"PL\": \"ILLUMINA\", \"PU\": \"run\", \"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"inputFile\": {\"__class__\": \"ConnectedValue\"}, \"read_group_id_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"read_group_lb_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"read_group_sm_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"validation_stringency\": \"LENIENT\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
    "tool_uuid": null,
    "tool_version": "3.1.1.0",
    "type": "tool",
    "uuid": "cf07f182-d8ab-4c43-8df3-9c263101cbe3",
    "when": null,
    "workflow_outputs": []
},
```
Try to understand if and how this step changes the data. Is there a way to use it in a meaningful way, either now, without having implemented sample sheet functionality, or maybe in the future?
This step is required for the GATK4 tools used later in the workflow. Tools like HaplotypeCaller and BaseRecalibrator are designed to look for 'Read Group' metadata in the BAM header. Without these tags, the tools throw an error and won't process the file (as noted in the GATK documentation on Read Groups).
By including this step now with default values, we ensure the workflow is robust and GATK-compliant. It also sets us up nicely for the future. If we implement sample sheet support later, this tool is exactly where that metadata would be injected into the pipeline.
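To make the explanation above concrete, here is a minimal, illustrative Python sketch (not part of the workflow) of the `@RG` header line that AddOrReplaceReadGroups writes and that GATK tools check for. Field tags follow the SAM specification; the function and default values are assumptions for illustration only.

```python
# Illustrative sketch: the read-group metadata GATK expects in the BAM header.
# Tag names (ID, SM, LB, PL, PU) come from the SAM specification; the helper
# names and defaults here are hypothetical, chosen to mirror the tool_state
# above (PL=ILLUMINA, PU=run).
def make_rg_header_line(rg_id, sample, library, platform="ILLUMINA", unit="run"):
    """Build a SAM @RG header line with the fields GATK requires (ID, SM)
    plus the ones BQSR uses for grouping reads (LB, PL, PU)."""
    fields = {"ID": rg_id, "SM": sample, "LB": library, "PL": platform, "PU": unit}
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())

def has_read_group(header_lines):
    """GATK tools such as HaplotypeCaller and BaseRecalibrator refuse BAMs
    whose header lacks any @RG line; this check captures that requirement."""
    return any(line.startswith("@RG") for line in header_lines)
```

If sample sheet support is added later, the sample-specific values (SM, LB) would be injected exactly where the hardcoded defaults sit now.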
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"comment\": \"\", \"export\": false, \"flat\": false, \"image_content_input\": null, \"png_plots\": false, \"results\": [{\"__index__\": 0, \"software_cond\": {\"software\": \"star\", \"__current_case__\": 28, \"output\": [{\"__index__\": 0, \"type\": {\"type\": \"log\", \"__current_case__\": 0, \"input\": {\"__class__\": \"ConnectedValue\"}}}]}}, {\"__index__\": 1, \"software_cond\": {\"software\": \"fastqc\", \"__current_case__\": 8, \"output\": [{\"__index__\": 0, \"type\": \"data\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}, {\"__index__\": 2, \"software_cond\": {\"software\": \"picard\", \"__current_case__\": 17, \"output\": [{\"__index__\": 0, \"type\": \"markdups\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}, {\"__index__\": 3, \"software_cond\": {\"software\": \"snpeff\", \"__current_case__\": 26, \"input\": {\"__class__\": \"ConnectedValue\"}}}, {\"__index__\": 4, \"software_cond\": {\"software\": \"samtools\", \"__current_case__\": 24, \"output\": [{\"__index__\": 0, \"type\": {\"type\": \"flagstat\", \"__current_case__\": 1, \"input\": {\"__class__\": \"ConnectedValue\"}}}]}}, {\"__index__\": 5, \"software_cond\": {\"software\": \"gatk\", \"__current_case__\": 11, \"output\": [{\"__index__\": 0, \"type\": \"base_recalibrator\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}], \"title\": \"\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
Please order the different MultiQC input reports in a logical way proceeding from raw read metrics through alignment, variant calling and annotation.
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"algo\": {\"params\": {\"settingsType\": \"default\", \"__current_case__\": 0}}, \"chimOutType\": \"\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"filter\": {\"basic_filters\": null, \"output_params2\": {\"output_select2\": \"no\", \"__current_case__\": 1}}, \"oformat\": {\"outSAMattributes\": [\"NH\", \"HI\", \"AS\", \"nM\", \"ch\"], \"HI_offset\": \"1\", \"outSAMprimaryFlag\": \"OneBestScore\", \"outSAMmapqUnique\": \"60\", \"wasp_conditional\": {\"waspOutputMode\": \"\", \"__current_case__\": 1}}, \"outWig\": {\"outWigType\": \"None\", \"__current_case__\": 0, \"outWigStrand\": \"false\"}, \"perf\": {\"outBAMsortingBinsN\": \"50\", \"winAnchorMultimapNmax\": \"50\"}, \"refGenomeSource\": {\"geneSource\": \"history\", \"__current_case__\": 1, \"genomeFastaFiles\": {\"__class__\": \"ConnectedValue\"}, \"genomeSAindexNbases\": \"14\", \"GTFconditional\": {\"GTFselect\": \"with-gtf\", \"__current_case__\": 0, \"sjdbGTFfile\": {\"__class__\": \"ConnectedValue\"}, \"sjdbGTFfeatureExon\": \"exon\", \"sjdbOverhang\": \"14\", \"quantmode_output\": {\"quantMode\": \"-\", \"__current_case__\": 0}}, \"diploidconditional\": {\"diploid\": \"No\", \"__current_case__\": 1}}, \"singlePaired\": {\"sPaired\": \"paired\", \"__current_case__\": 1, \"input1\": {\"__class__\": \"ConnectedValue\"}, \"input2\": {\"__class__\": \"ConnectedValue\"}}, \"twopass\": {\"twopassMode\": \"Basic\", \"__current_case__\": 1, \"twopass_read_subset\": \"\", \"sj_precalculated\": \"\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
The primary advantage of building the STAR index on the fly (which otherwise is just very wasteful) is that you can control the --sjdbOverhang parameter (see https://nf-co.re/rnavar/1.2.3/docs/usage/#alignment-options).
This means that you need to expose the sjdbOverhang wrapper parameter as a workflow parameter with a helpful annotation.
For me, sjdbOverhang is always exposed if you use a prebuilt genome reference without a prebuilt gene model.
That's also a good reason, yes.
Do you think we should go this way @lldelisle ? Let the user choose from cached, prebuilt references, but bring their own gene model? Could be a very good compromise.
This is always the way I do it; this is also the way it is done in the RNA-seq IWC workflow.
I think it should be cached prebuilt reference + own gtf.
Therefore I would add a step toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/5.4+galaxy0 to build the snpEff database matching the same reference fasta and the same gtf.
Using the cache option (reuse jobs with identical parameters) will allow to not really rebuild the database if it was already built with the same gtf.
> This is always the way I do it; this is also the way it is done in the RNA-seq IWC workflow.
> I think it should be cached prebuilt reference + own gtf.
Cool, sounds excellent to me, thanks!
In that case, however, how can we make sure that the reference fasta matches the one used for building the STAR index?
Also,
> Therefore I would add a step toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/5.4+galaxy0 to build the snpEff database matching the same reference fasta and the same gtf.
I'm not so convinced this is a good idea, building SnpEff annotation files from gff is a fragile art for eukaryotic genomes and the pre-built cached ones might contain fewer errors.
So, essentially, the question to answer is: what's the least error-prone way to have matching genomes at the STAR, GATK and SnpEff steps.
Potential differences between SnpEff and custom gff annotations are another thing, but wouldn't be catastrophic for the workflow. Users who have a carefully crafted SnpEff db built specifically for their gff could also ask for server-side installation?
For exotic cases where there's no reasonably close existing SnpEff db, we could offer an optional branch of the WF to build the db from the reference and the gff, I just wouldn't make this the default.
Of course, I'm only making all of this up while typing so feel free to disagree with me :-)
I’ve exposed sjdbOverhang as a mandatory workflow parameter. Regarding the SnpEff database discussion, I’ve kept the database selection as a string input to use pre-built/cached local versions, avoiding the potential fragility of building custom databases from GFFs during the run. Let me know if the current version is fine. 5146320
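As background for the new mandatory `sjdbOverhang` parameter, here is a small illustrative sketch (not workflow code) of the rule of thumb from the STAR manual: the ideal value is max(read length) − 1, with 100 as a generic default that works well for most libraries. The function name is hypothetical.

```python
# Sketch of the guidance behind the sjdbOverhang workflow parameter.
# Per the STAR manual, the ideal value is (maximum read length - 1);
# 100 is STAR's generic default, adequate in most cases.
def recommended_sjdb_overhang(max_read_length=None):
    """Return a reasonable sjdbOverhang for a given maximum read length.

    With no read length given, fall back to STAR's generic default of 100.
    """
    if max_read_length is None:
        return 100
    return max_read_length - 1
```

For example, a library of 2x151 bp reads would use an sjdbOverhang of 150; a helpful parameter annotation in the workflow could state exactly this rule.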
2. **Preprocessing:** **MarkDuplicates** and **GATK4 SplitNCigarReads** to prepare the BAM for variant calling by splitting reads into exon segments.
3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
These filters are user-configurable in the nf-core pipeline so should become workflow parameters, too.
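To make the proposed parameterization concrete, here is a minimal Python sketch of the hard-filter logic, with the thresholds lifted out as arguments the way they would become workflow parameters. The defaults mirror the README (`FS > 30.0`, `QD < 2.0`); the function and parameter names are illustrative, not part of the workflow or of bcftools.

```python
# Sketch of the hard-filter decision applied by the bcftools filter step,
# with the thresholds exposed as parameters (fs_max, qd_min) as the review
# suggests. Names are hypothetical; defaults match the README.
def passes_hard_filters(info, fs_max=30.0, qd_min=2.0):
    """Return True if a variant record survives the FS/QD hard filters.

    `info` is a dict of VCF INFO-field values, e.g. {"FS": 1.2, "QD": 20.0}.
    FS (FisherStrand) measures strand bias; QD (QualByDepth) normalizes
    variant quality by depth. Missing fields are treated as passing.
    """
    if info.get("FS", 0.0) > fs_max:
        return False
    if info.get("QD", float("inf")) < qd_min:
        return False
    return True
```

Exposing `fs_max` and `qd_min` as numeric workflow parameters (defaulting to 30.0 and 2.0) would restore parity with the nf-core pipeline's configurable filtering.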
```json
"top": 209.96850976711244
},
"tool_id": null,
"tool_state": "{\"multiple\": false, \"validators\": [], \"parameter_type\": \"text\", \"optional\": false}"
```
Here is my proposed variation: https://usegalaxy.eu/u/delislel/w/rna-seq-variant-discovery-ld-version
Tell me what you think and feel free to change it.
Summary
Adds the RNA-seq Variant Discovery (RNAvar) workflow.
Technical Details
Implementation based on `nf-core/rnavar` logic: `MarkDuplicates`, `SplitNCigarReads`, and `BaseRecalibrator`; `HaplotypeCaller` using RNA-seq parameters.
Validation
FOR CONTRIBUTOR:
FOR REVIEWERS:
- This workflow does/runs/performs … xyz … to generate/analyze/etc …
- `name` field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
- prefer dash (-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id