Update FreeBayes parameters for nf-core/eager compatibility#7850
Update FreeBayes parameters for nf-core/eager compatibility#7850mertydn wants to merge 2 commits intogalaxyproject:mainfrom
Conversation
| #end if | ||
|
|
||
| ##If custom_discovery mode is selected, run the script directly without splitting the genome and cut the script! | ||
| #if str( $options_type.options_type_selector ) == "custom_discovery": |
There was a problem hiding this comment.
Can this be moved down to the big if statement that we already have for the $options_type.options_type_selector cases.
There was a problem hiding this comment.
@bernt-matthias I kept this block separate on purpose. The custom_discovery mode needs to run directly, without the --region i loop. If I move it inside the main block, it will add the --region parameters and break the command. Is it okay to keep it separate to avoid the region split?
There was a problem hiding this comment.
How about adding a 4th mode to target_limit_type_selector. That is splitting the do_not_limit in one where sequences are processed separately in parallel and one where all sequences on a bam file are processed jointly (I think it might be important to do distinguish this anyway).
There was a problem hiding this comment.
@mertydn can you explain why this particular set of options is incompatible with splitting? I'm a bit lost.
There was a problem hiding this comment.
@wm75 I am trying to run the nf-core/eager workflow on Galaxy and get the exact same results.
In eager, this step runs all at once without splitting the data:
freebayes --bam 'b_0.bam' --fasta-reference 'localref.fa' -p 2 -C 1 -g 0
I think galaxy splits the data into pieces (--region).
But this causes changes in the vcf file, so my results do not match nextflow. I needed a way to run it without --region parametre
There was a problem hiding this comment.
I can't generate the result that only the QUAL values of the Y chromosomes are different in usegalaxy. because I had changed the XML file to restrict it to only use that parameters.
There was a problem hiding this comment.
Just share one of the prior freebayes runs, or even just the inputs and the eager-produced vcf in a history and I can play with freebayes myself.
There was a problem hiding this comment.
There was a problem hiding this comment.
@wm75 The freebayes in usegalaxy includes a large number of parameters. If you try running it by restricting it to only the parameters I mentioned, I think you will get a result showing a difference only in the Y chromosome.
|
Ok, that was very helpful!
the freebayes help says the default for theta is 0.001 (which makes biological sense, I think) and this is also what the wrapper uses as default, but the freebayes source code says the default is 10e-3, which at first glance seems to be cool, just that this is really 0.01 !! @mertydn for you this means: try running the Galaxy job with the existing wrapper again but set theta to 0.01 explicitly, and you should get identical QUAL values to the eager pipeline. Please report back here if that's not the case. @bernt-matthias @bgruening @nsoranzo can this really be the case? Has freebayes always used a different theta default from what it's stating in its command line help? I just have a hard time believing this so please double-check. |
|
haha: freebayes/freebayes#16 |
|
@wm75 Hello, the problem is definitely related to the --regions parameter. I've added a very simple Freebayes XML to the existing history for you. Or you can run it from the terminal with these parameters. I can't use Freebayes in its current state with these simple parameters. JK2067 is eager's output. Latest VCF had ran by Galaxy. |
|
@wm75 simple parameters ##commandline="freebayes --region Y:0..59373566 --bam b_0.bam --fasta-reference localref.fa --vcf ./vcf_output/part_Y:0..59373566.vcf -p 2 -C 1 -g 0" |
Updated FreeBayes tool version and modified processing options for joint analysis.
|
Sorry, but above you're comparing freebayes 1.3.5 to 1.3.10. |
|
@wm75 for theta 0.01 Unfortunately, They are not same. Please check |
|
Ok, I can confirm that even with the same version of freebayes (1.3.10) a comparison of even the most basic command with and without regions:
produces tiny QUAL score differences, while specifying regions or not does not affect any other chromosome. Your patch here does not solve this problem at the root because you'd only skip using --region in one special mode, but the difference for Y variants between standard non-splitting command lines and this wrapper occurs independent of all other settings. At the same time all other QUAL score differences (for variants not on Y) disappear if you just set --theta to 0.01, correct? In that case I think the remaining tiny Y variants differences between nf-core eager and an equivalent Galaxy WF using the freebayes wrapper are simply something we need to accept and acknowledge for now and don't justify an entire rewrite of the wrapper. It's great though that through this discussion we learnt the following about the command line tool:
Thanks for raising this @mertydn ! |
Yes, that sadly seems to be the case! |
|
@wm75 Yes, that is correct. Setting --theta=0.01 resolves the QUAL score differences for variants except those on the Y chromosome. Thank you so much for taking the time ! |
|
Asked by AI to check for more problems of this kind:
Guess this needs to be reported upstream. |
|
@bernt-matthias as stated above, Lance reported this in 2011 and it was autoclosed in 2020 because nobody cared :/ |
|
Thanks all for taking a look!
I will report it again - maybe Pjotr Prins has time to take care of it.
@bernt-matthias yes, I also found this to be a rather good use of AI for other packages. Even though also AI can miss things, it's a very fast second pair of eyes. |
|
Thanks @bernt-matthias ! |
FreeBayes: Updated the wrapper and test data (
freebayes-phix174-test8.vcf).BBDuk and FreeBayes tools couldn't be run with the correct parameters for use in the copy of nf-core/eager workflow i created on Galaxy. That's why i had to try updating them. The fastq and vcf files they generated were producing very different results compared to the reference nfcore outputs. With the changes I made, I can get the same results in my local working environment. However, the changes I made may be against the principles. I await your feedback on making the updates.