Skip to content

1 update small variant vcf list#2

Open
marrip wants to merge 20 commits into
developfrom
1-update-small-variant-vcf-list
Open

1 update small variant vcf list#2
marrip wants to merge 20 commits into
developfrom
1-update-small-variant-vcf-list

Conversation

@marrip

@marrip marrip commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Hey guys!

This is my attempt at implementing the updating of the small variant vcf list. I have used the code we run at HUS, which is reflected in this nextflow process, as a base.

I have defined classes for the vcf list and the different vcf files to be able to write methods and gather all input values in one place. Also I wanted to create a clean interface towards the cli. Logging is handled by the logger defined in the cli module. Testing is done with table-driven tests, meaning we have different test cases for each function and collect them in a list of objects which we loop over. This makes things DRYer and provides a better overview of what is tested.

Please ask and comment so we can figure out if this is a good way to handle things ☺️

@marrip marrip self-assigned this Mar 30, 2026
Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
@tinavisnovska

Copy link
Copy Markdown
Collaborator

These are small variant files produced with LocalApp code, right? Do we have an agreement on what small variant files we want to use when Dragen processes the data? Do you want to have separate functionality for handling those files or you want this code to be expanded so that it works for the Dragen data too?

When I discussed with @danielvo about Dragen files containing small variants, he pointed to Results/**/*TMB_trace.tsv, which looks very useful but is not in vcf format.

@@ -0,0 +1,179 @@
"""
This module contains the code for the `update_small_variant_vcf_list` command.
The command takes two arguments, `results_dir`, which is a string that specifies the directory where the results of the latest TSO500 run are stored.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there is a sample associated with a patient from the latest TSO500 run sequenced in one of the older runs (for example tumor DNA sample of patient A is sequenced in the latest run but normal DNA sample in one of the older runs)? Would it be useful for the older vcf file be added in the vcf list or not really?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understood what this function does in TSOPPI, it updates the list according to the latest run so earlier run vcfs should be already be included.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have now settled some of the input discussions, so let me return to this pull request. :-D

The VCF list is used for expanding the variant recurrence table (VRT). While the VRT is intended to track all possible variant calls (including artifacts/noise), the TMB trace file is meant to only hold information about confident variant calls. The gVCF files provide the largest variant call set available in the Illumina pipeline output, even though it looks like the Dragen's "hard-filtered" gVCFs contain only a tiny fraction of what the Local App's gVCFs do (in case of the first DNA test sample, the Dragen output contained literally only 1 % of the Local App output, 484 405 vs. 5 640 variants, but that is still more than what the TMB trace files contain).

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
@tinavisnovska

Copy link
Copy Markdown
Collaborator

Overall, it looks great and reads well! Surely will be useful part of the code.

@marrip

marrip commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

These are small variant files produced with LocalApp code, right? Do we have an agreement on what small variant files we want to use when Dragen processes the data? Do you want to have separate functionality for handling those files or you want this code to be expanded so that it works for the Dragen data too?

When I discussed with @danielvo about Dragen files containing small variants, he pointed to Results/**/*TMB_trace.tsv, which looks very useful but is not in vcf format.

very good point, we agreed on collecting the necessary information to cover all possible input. The issue may serve as a reminder #8

Comment thread src/tsoppy/cli.py Outdated
Comment thread tests/test_update_small_variant_vcf_list_main.py Outdated
Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
Comment thread src/tsoppy/cli.py Outdated
marrip and others added 3 commits April 13, 2026 13:25
@marrip

marrip commented Apr 14, 2026

Copy link
Copy Markdown
Contributor Author

@molversmyr thanks for your review. All valid points so I made all the suggested changes. Cheers! 🙏

Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
Comment thread src/tsoppy/cli.py
help="Glob pattern to search for small variant VCF files in the results directory.",
callback=glob_pattern_callback,
),
] = "**/Results/**/*_MergedSmallVariants.genome.vcf",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have all these values in config files? The Dragen and LocalApp pipelines will need own config files with different paths.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we need those in a config file, I have started on one in #30, let me know if that makes sense. Once we have the input classes settled I will rework this PR to use them.

)


class Vcf:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we select a different name here, considering that we also have the Vcf(BaseInput) class? This object represents a file path (which happens to also inform about the patient and tumor type), while the other VCF object intends to track the VCF content/variants. Would VCF_path be a good choice here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think we need to rename and restructure a good deal to fit our newly implemented classes.

Co-authored-by: danielvo <7126118+danielvo@users.noreply.github.com>
Comment thread src/tsoppy/update_small_variant_vcf_list/main.py Outdated
…'s suggestion

Co-authored-by: Martin Rippin <74295098+marrip@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants