1 update small variant vcf list by marrip · Pull Request #2 · InPreD/tsoppy

marrip · 2026-03-30T09:47:37Z

Hey guys!

This is my attempt at implementing the update of the small variant VCF list. As a base, I used the code we run at HUS, which is reflected in this Nextflow process.

I have defined classes for the VCF list and the different VCF files so that I can write methods and gather all input values in one place. I also wanted to create a clean interface towards the CLI. Logging is handled by the logger defined in the cli module. Testing is done with table-driven tests: each function gets several test cases, which are collected in a list of objects that we loop over. This keeps things DRY and gives a better overview of what is tested.
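For readers unfamiliar with the table-driven style described above, here is a minimal sketch. The function `is_normal_sample`, the `Case` class, and the "T" sample type are hypothetical and only serve as illustration; they are not taken from the tsoppy code.

```python
from dataclasses import dataclass


def is_normal_sample(sample_type: str) -> bool:
    """Hypothetical function under test: treats "N" as the normal sample type."""
    return sample_type.strip().upper() == "N"


@dataclass
class Case:
    """One row of the test table."""

    name: str
    sample_type: str
    expected: bool


# All cases for one function live in a single list.
CASES = [
    Case(name="normal sample", sample_type="N", expected=True),
    Case(name="tumor sample (assumed code 'T')", sample_type="T", expected=False),
    Case(name="lowercase normal", sample_type="n", expected=True),
]


def test_is_normal_sample() -> None:
    # Loop over the table; the case name identifies a failing row.
    for case in CASES:
        assert is_normal_sample(case.sample_type) == case.expected, case.name
```

Adding a new scenario then means adding one row to `CASES` rather than writing another test function. With pytest, `pytest.mark.parametrize` would be an alternative way to get one reported result per row.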

Please ask and comment so we can figure out if this is a good way to handle things ☺️

tinavisnovska · 2026-04-08T13:48:27Z

These are small variant files produced with LocalApp code, right? Do we have an agreement on which small variant files we want to use when Dragen processes the data? Do you want separate functionality for handling those files, or do you want this code to be expanded so that it works for the Dragen data too?

When I discussed Dragen files containing small variants with @danielvo, he pointed to Results/**/*TMB_trace.tsv, which looks very useful but is not in VCF format.

tinavisnovska · 2026-04-08T18:04:57Z

@@ -0,0 +1,179 @@
+"""
+This module contains the code for the `update_small_variant_vcf_list` command.
+The command takes two arguments, `results_dir`, which is a string that specifies the directory where the results of the latest TSO500 run are stored.


What happens if a patient from the latest TSO500 run also has a sample that was sequenced in one of the older runs (for example, the tumor DNA sample of patient A is sequenced in the latest run but the normal DNA sample in an older run)? Would it be useful for the older VCF file to be added to the VCF list, or not really?

As far as I understand what this function does in TSOPPI, it updates the list according to the latest run, so VCFs from earlier runs should already be included.
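A minimal sketch of that behaviour, assuming the list is a plain text file with one path per line. The file format actually used by tsoppy is not shown in this thread, and `update_vcf_list` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def update_vcf_list(list_file: Path, new_vcfs: list[Path]) -> list[str]:
    """Add paths from the latest run that are not listed yet; return the added entries."""
    existing: list[str] = []
    if list_file.exists():
        existing = [ln.strip() for ln in list_file.read_text().splitlines() if ln.strip()]

    seen = set(existing)
    added: list[str] = []
    for vcf in new_vcfs:
        entry = vcf.as_posix()
        if entry not in seen:  # entries from earlier runs are kept, never duplicated
            seen.add(entry)
            added.append(entry)

    # Rewrite the whole file so the result always ends with exactly one newline.
    list_file.write_text("\n".join(existing + added) + "\n")
    return added


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "vcf_list.txt"
    list_file.write_text("run1/a.vcf\n")  # entry from an earlier run
    latest = [Path("run1/a.vcf"), Path("run2/b.vcf")]
    first = update_vcf_list(list_file, latest)
    second = update_vcf_list(list_file, latest)  # running again adds nothing
    final_lines = list_file.read_text().splitlines()
```

In this sketch the older entry stays in the list and only the new path is appended, so running the update twice on the same run leaves the list unchanged.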

We have now settled some of the input discussions, so let me return to this pull request. :-D

The VCF list is used for expanding the variant recurrence table (VRT). While the VRT is intended to track all possible variant calls (including artifacts/noise), the TMB trace file is meant to hold only confident variant calls. The gVCF files provide the largest variant call set available in the Illumina pipeline output, even though Dragen's "hard-filtered" gVCFs seem to contain only a tiny fraction of what the Local App's gVCFs do. For the first DNA test sample, the Dragen output contained only about 1 % of the Local App output (5 640 vs. 484 405 variants), which is still more than the TMB trace files contain.
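To make the comparison concrete, here is a small sketch of how such counts could be obtained and how the ratio works out. It assumes that counting the non-header lines of a (g)VCF is an acceptable proxy for the number of records; `count_records` and the example lines are made up for illustration, and only the two totals come from the comment above.

```python
from typing import Iterable


def count_records(lines: Iterable[str]) -> int:
    """Count VCF data lines, i.e. non-empty lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Made-up miniature VCF content, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t12\tLowQual\t.",
]
n_example = count_records(example)  # two data lines

# Totals quoted in the comment above for the first DNA test sample.
local_app, dragen = 484_405, 5_640
share = 100 * dragen / local_app
print(f"Dragen holds {share:.2f} % of the Local App records")  # 1.16 %
```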

tinavisnovska · 2026-04-08T18:59:19Z

Overall, it looks great and reads well! It will surely be a useful part of the code.

marrip · 2026-04-09T12:36:03Z

> These are small variant files produced with LocalApp code, right? Do we have an agreement on which small variant files we want to use when Dragen processes the data? Do you want separate functionality for handling those files, or do you want this code to be expanded so that it works for the Dragen data too?
>
> When I discussed Dragen files containing small variants with @danielvo, he pointed to Results/**/*TMB_trace.tsv, which looks very useful but is not in VCF format.

Very good point. We agreed on collecting the necessary information to cover all possible input. Issue #8 may serve as a reminder.

marrip · 2026-04-14T08:03:56Z

@molversmyr thanks for your review. All valid points so I made all the suggested changes. Cheers! 🙏

danielvo · 2026-06-09T13:31:08Z

+            help="Glob pattern to search for small variant VCF files in the results directory.",
+            callback=glob_pattern_callback,
+        ),
+    ] = "**/Results/**/*_MergedSmallVariants.genome.vcf",


Should we have all these values in config files? The Dragen and LocalApp pipelines will need their own config files with different paths.

Yes, we need those in a config file. I have started on one in #30; let me know if that makes sense. Once we have the input classes settled, I will rework this PR to use them.
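As a rough sketch of the idea, the pattern could be looked up per pipeline instead of being hard-coded as a CLI default. The dict layout, the `"localapp"` key, and `find_inputs` are assumptions for illustration; the actual config format is whatever #30 ends up defining. Only the LocalApp pattern is taken from the diff above.

```python
import tempfile
from pathlib import Path

# Pipeline name -> glob pattern (hypothetical layout).
# A Dragen entry would be added once the input files are agreed on.
GLOB_PATTERNS = {
    "localapp": "**/Results/**/*_MergedSmallVariants.genome.vcf",
}


def find_inputs(results_dir: Path, pipeline: str) -> list[Path]:
    """Return all files below results_dir that match the pipeline's pattern."""
    pattern = GLOB_PATTERNS[pipeline]
    # set() guards against a path being reported more than once.
    return sorted({p for p in results_dir.glob(pattern) if p.is_file()})


# Small demonstration with a made-up directory layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_dir = root / "run1" / "Results" / "sampleA"
    sample_dir.mkdir(parents=True)
    (sample_dir / "sampleA_MergedSmallVariants.genome.vcf").touch()
    (sample_dir / "sampleA_other.vcf").touch()  # should not match
    found = [p.name for p in find_inputs(root, "localapp")]
```

Keeping the patterns in one mapping means a second pipeline only needs a new entry, not a new code path.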

danielvo · 2026-06-09T13:48:23Z

+        )
+
+
+class Vcf:


Should we select a different name here, considering that we also have the Vcf(BaseInput) class? This object represents a file path (which also happens to carry information about the patient and tumor type), while the other VCF object is intended to track the VCF content/variants. Would VCF_path be a good choice here?

Yes, I think we need to rename and restructure a good deal to fit our newly implemented classes.

marrip added 8 commits March 30, 2026 11:18

chore: add pandas to deps

2011b98

feat: add update small variant vcf list subpackage

b178f0f

test: add tests and test data for update small variant vcf list subpackage

74d4541

feat: add logger and set defaults

a55e39c

feat: expose update small variant vcf list subpackage via command in cli

141c519

style: include comments

02fb98f

chore: lint packages

8f625af

chore: ruff lint

938c6e7

marrip requested review from danielvo, molversmyr and tinavisnovska March 30, 2026 09:47

marrip self-assigned this Mar 30, 2026

tinavisnovska reviewed Apr 8, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

tinavisnovska reviewed Apr 8, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

tinavisnovska reviewed Apr 8, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

marrip added 4 commits April 9, 2026 09:22

style: make clear that N stands for normal

35a3774

feat: move patient_id and sample_type parsing into Vcf class

d0ba93b

docs: update module description

6cdec52

test: update unit tests to reflect new Vcf class definition

ae14586

marrip added 3 commits April 9, 2026 14:37

chore: lint

4f245c5

feat: switch from pandas to polars to improve performance

954286b

test: adapt tests to using polars

6585e51

molversmyr reviewed Apr 10, 2026

Comment thread src/tsoppy/cli.py Outdated

molversmyr reviewed Apr 10, 2026

Comment thread tests/test_update_small_variant_vcf_list_main.py Outdated

molversmyr reviewed Apr 10, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

molversmyr reviewed Apr 10, 2026

Comment thread src/tsoppy/cli.py Outdated

marrip and others added 3 commits April 13, 2026 13:25

fix: make regex a raw string

d703129

fix: use polars equals to compare dataframes

f750434

feat: rm None as input type for results_dir

982ebd7

Co-authored-by: Håvard Molversmyr <54852797+molversmyr@users.noreply.github.com>

danielvo reviewed Jun 9, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

danielvo reviewed Jun 9, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

danielvo reviewed Jun 9, 2026

docs: apply @danielvo 's suggestion

68c849f

Co-authored-by: danielvo <7126118+danielvo@users.noreply.github.com>

marrip commented Jun 12, 2026

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated

fix: correct __eq__ function for both classes according to @danielvo 's suggestion

b17acdd

Co-authored-by: Martin Rippin <74295098+marrip@users.noreply.github.com>

4 participants