1 update small variant vcf list #2
These are small variant files produced with the LocalApp code, right? Have we agreed on which small variant files to use when Dragen processes the data? Do you want separate functionality for handling those files, or should this code be expanded so that it works for Dragen data too? When I discussed Dragen files containing small variants with @danielvo, he pointed to Results/**/*TMB_trace.tsv, which looks very useful but is not in VCF format.
| @@ -0,0 +1,179 @@ | |||
| """ | |||
| This module contains the code for the `update_small_variant_vcf_list` command. | |||
| The command takes two arguments, `results_dir`, which is a string that specifies the directory where the results of the latest TSO500 run are stored. | |||
What happens if a patient from the latest TSO500 run also has a sample that was sequenced in an older run (for example, the tumor DNA sample of patient A is sequenced in the latest run but the normal DNA sample in one of the older runs)? Would it be useful for the older VCF file to be added to the VCF list, or not really?
As far as I understand what this function does in TSOPPI, it updates the list according to the latest run, so VCFs from earlier runs should already be included.
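To make the behaviour described above concrete, here is a minimal sketch of such an incremental update. It assumes the VCF list is a plain text file with one path per line; the function name `update_vcf_list` and that file layout are assumptions for illustration, not the TSOPPI or PR implementation.

```python
from pathlib import Path


def update_vcf_list(vcf_list_file: Path, new_vcfs: list[Path]) -> list[str]:
    """Add paths that are not yet listed; return the entries that were added.

    Assumption: the list file holds one VCF path per line.
    """
    existing: list[str] = []
    if vcf_list_file.exists():
        existing = [
            line.strip()
            for line in vcf_list_file.read_text().splitlines()
            if line.strip()
        ]

    seen = set(existing)
    added: list[str] = []
    for vcf in new_vcfs:
        entry = str(vcf)
        if entry not in seen:  # entries from earlier runs are kept, not duplicated
            seen.add(entry)
            added.append(entry)

    if added:
        # Rewrite the whole file so the result always ends with one newline.
        vcf_list_file.write_text("\n".join(existing + added) + "\n")
    return added
```

With this shape, running the update for the latest run leaves all entries from earlier runs in place and only appends what is new.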
We have now settled some of the input discussions, so let me return to this pull request. :-D
The VCF list is used for expanding the variant recurrence table (VRT). While the VRT is intended to track all possible variant calls (including artifacts/noise), the TMB trace file is meant to hold only information about confident variant calls. The gVCF files provide the largest variant call set available in the Illumina pipeline output, even though Dragen's "hard-filtered" gVCFs seem to contain only a tiny fraction of what the Local App's gVCFs do. For the first DNA test sample, the Dragen output contained only about 1 % of the Local App output (5 640 vs. 484 405 variants), but that is still more than what the TMB trace files contain.
Overall, it looks great and reads well! It will surely be a useful part of the code.
Very good point. We agreed on collecting the necessary information to cover all possible inputs. Issue #8 may serve as a reminder.
Co-authored-by: Håvard Molversmyr <54852797+molversmyr@users.noreply.github.com>
@molversmyr thanks for your review. All valid points, so I made all the suggested changes. Cheers! 🙏
| help="Glob pattern to search for small variant VCF files in the results directory.", | ||
| callback=glob_pattern_callback, | ||
| ), | ||
| ] = "**/Results/**/*_MergedSmallVariants.genome.vcf", |
Should we have all these values in config files? The Dragen and LocalApp pipelines will each need their own config file with different paths.
Yes, we need those in a config file. I have started on one in #30, let me know if that makes sense. Once we have the input classes settled, I will rework this PR to use them.
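As a rough illustration of the per-pipeline config idea, the sketch below keeps the glob pattern next to the pipeline name and uses it to search a results directory. `PipelineConfig` and `find_small_variant_vcfs` are hypothetical names, not the structure from #30. The LocalApp pattern is the default shown in the diff above; the Dragen pattern is only a placeholder, since no Dragen pattern has been agreed on in this thread.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical per-pipeline settings (not the layout proposed in #30)."""

    name: str
    small_variant_vcf_glob: str


LOCAL_APP = PipelineConfig(
    name="LocalApp",
    # Default pattern from the diff in this PR.
    small_variant_vcf_glob="**/Results/**/*_MergedSmallVariants.genome.vcf",
)

DRAGEN = PipelineConfig(
    name="Dragen",
    # Placeholder only: the actual Dragen pattern is still an open question.
    small_variant_vcf_glob="**/Results/**/*.hard-filtered.gvcf.gz",
)


def find_small_variant_vcfs(results_dir: Path, config: PipelineConfig) -> list[Path]:
    """Return all files below results_dir that match the pipeline's pattern."""
    # Path.glob treats "**" as "zero or more directory levels".
    return sorted(results_dir.glob(config.small_variant_vcf_glob))
```

Keeping the pattern in a config object means the CLI default can come from whichever pipeline is selected instead of being hard-coded in the command signature.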
| ) | ||
| class Vcf: |
Should we select a different name here, considering that we also have the Vcf(BaseInput) class? This object represents a file path (which happens to also inform about the patient and tumor type), while the other VCF object intends to track the VCF content/variants. Would VCF_path be a good choice here?
yes, I think we need to rename and restructure a good deal to fit our newly implemented classes.
Co-authored-by: danielvo <7126118+danielvo@users.noreply.github.com>
…'s suggestion Co-authored-by: Martin Rippin <74295098+marrip@users.noreply.github.com>
Hey guys!
This is my attempt at implementing the updating of the small variant vcf list. I have used the code we run at HUS, which is reflected in this nextflow process, as a base.
I have defined classes for the vcf list and the different vcf files to be able to write methods and gather all input values in one place. Also I wanted to create a clean interface towards the cli. Logging is handled by the logger defined in the cli module. Testing is done with table-driven tests, meaning we have different test cases for each function and collect them in a list of objects which we loop over. This makes things DRYer and provides a better overview of what is tested.
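For readers unfamiliar with the table-driven style mentioned above, here is a small sketch of the pattern. The helper `sample_id_from_vcf_name` and the test cases are made up for illustration and are not taken from this PR; only the file name suffix comes from the default glob pattern in the diff.

```python
from dataclasses import dataclass
from typing import Optional


def sample_id_from_vcf_name(file_name: str) -> str:
    """Hypothetical helper: strip the LocalApp small variant suffix."""
    suffix = "_MergedSmallVariants.genome.vcf"
    if not file_name.endswith(suffix):
        raise ValueError(f"unexpected file name: {file_name}")
    return file_name[: -len(suffix)]


@dataclass
class Case:
    name: str
    file_name: str
    want: Optional[str]  # None means a ValueError is expected


CASES = [
    Case("plain sample", "sampleA_MergedSmallVariants.genome.vcf", "sampleA"),
    Case(
        "underscores in sample id",
        "patient_1_T_MergedSmallVariants.genome.vcf",
        "patient_1_T",
    ),
    Case("wrong suffix", "sampleA.vcf", None),
]


def test_sample_id_from_vcf_name() -> None:
    # One loop covers every case; adding a scenario is one more table row.
    for case in CASES:
        if case.want is None:
            try:
                sample_id_from_vcf_name(case.file_name)
            except ValueError:
                continue
            raise AssertionError(f"{case.name}: expected ValueError")
        got = sample_id_from_vcf_name(case.file_name)
        assert got == case.want, f"{case.name}: got {got!r}, want {case.want!r}"
```

Naming each case keeps failure messages readable even though all cases share one test function.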
Please ask and comment so we can figure out if this is a good way to handle things☺️