Skip to content

Update FreeBayes parameters for nf-core/eager compatibility#7850

Open
mertydn wants to merge 2 commits intogalaxyproject:mainfrom
mertydn:update-freebayes
Open

Update FreeBayes parameters for nf-core/eager compatibility#7850
mertydn wants to merge 2 commits intogalaxyproject:mainfrom
mertydn:update-freebayes

Conversation

@mertydn
Copy link
Copy Markdown
Contributor

@mertydn mertydn commented Mar 30, 2026

  • FreeBayes: Updated the wrapper and test data (freebayes-phix174-test8.vcf).

  • BBDuk and FreeBayes tools couldn't be run with the correct parameters for use in the copy of nf-core/eager workflow i created on Galaxy. That's why i had to try updating them. The fastq and vcf files they generated were producing very different results compared to the reference nfcore outputs. With the changes I made, I can get the same results in my local working environment. However, the changes I made may be against the principles. I await your feedback on making the updates.

Comment thread tools/freebayes/freebayes.xml Outdated
#end if

##If custom_discovery mode is selected, run the script directly without splitting the genome and cut the script!
#if str( $options_type.options_type_selector ) == "custom_discovery":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved down to the big if statement that we already have for the $options_type.options_type_selector cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bernt-matthias I kept this block separate on purpose. The custom_discovery mode needs to run directly, without the --region i loop. If I move it inside the main block, it will add the --region parameters and break the command. Is it okay to keep it separate to avoid the region split?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a 4th mode to target_limit_type_selector. That is splitting the do_not_limit in one where sequences are processed separately in parallel and one where all sequences on a bam file are processed jointly (I think it might be important to do distinguish this anyway).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mertydn can you explain why this particular set of options is incompatible with splitting? I'm a bit lost.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wm75 I am trying to run the nf-core/eager workflow on Galaxy and get the exact same results.
In eager, this step runs all at once without splitting the data:
freebayes --bam 'b_0.bam' --fasta-reference 'localref.fa' -p 2 -C 1 -g 0

I think galaxy splits the data into pieces (--region).
But this causes changes in the vcf file, so my results do not match nextflow. I needed a way to run it without --region parametre

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't generate the result that only the QUAL values ​​of the Y chromosomes are different in usegalaxy. because I had changed the XML file to restrict it to only use that parameters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just share one of the prior freebayes runs, or even just the inputs and the eager-produced vcf in a history and I can play with freebayes myself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wm75 The freebayes in usegalaxy includes a large number of parameters. If you try running it by restricting it to only the parameters I mentioned, I think you will get a result showing a difference only in the Y chromosome.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll investigate, thanks!

Comment thread tools/freebayes/freebayes.xml Outdated
@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 30, 2026

Ok, that was very helpful!

  1. just like I thought this has nothing to do with using --regions or not, but
  2. when you're switching to the Full options, you're automatically populating all parameters with what the wrapper thinks are the default values for them, and this is really surprising:

the freebayes help says the default for theta is 0.001 (which makes biological sense, I think) and this is also what the wrapper uses as default, but the freebayes source code says the default is 10e-3, which at first glance seems to be cool, just that this is really 0.01 !!

@mertydn for you this means: try running the Galaxy job with the existing wrapper again but set theta to 0.01 explicitly, and you should get identical QUAL values to the eager pipeline. Please report back here if that's not the case.


@bernt-matthias @bgruening @nsoranzo can this really be the case? Has freebayes always used a different theta default from what it's stating in its command line help? I just have a hard time believing this so please double-check.

@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 30, 2026

haha: freebayes/freebayes#16
just that @lparsons has shifted everything by a digit when reporting.

@mertydn
Copy link
Copy Markdown
Contributor Author

mertydn commented Mar 30, 2026

@wm75 Hello, the problem is definitely related to the --regions parameter. I've added a very simple Freebayes XML to the existing history for you. Or you can run it from the terminal with these parameters. I can't use Freebayes in its current state with these simple parameters.

JK2067 is eager's output. Latest VCF had ran by Galaxy.

@mertydn
Copy link
Copy Markdown
Contributor Author

mertydn commented Mar 30, 2026

@wm75 simple parameters
1
2

current wrapper
3
4

##commandline="freebayes --region Y:0..59373566 --bam b_0.bam --fasta-reference localref.fa --vcf ./vcf_output/part_Y:0..59373566.vcf -p 2 -C 1 -g 0"

Updated FreeBayes tool version and modified processing options for joint analysis.
@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 30, 2026

Sorry, but above you're comparing freebayes 1.3.5 to 1.3.10.
Please just go to your own shared history. Rerun the freebayes job there leaving everything as before and only change --theta to 0.01. Do you then get the same results as in the uploaded JK2067. I guess you will.

@mertydn
Copy link
Copy Markdown
Contributor Author

mertydn commented Mar 30, 2026

@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 30, 2026

Ok, I can confirm that even with the same version of freebayes (1.3.10) a comparison of even the most basic command with and without regions:

freebayes -f hs37d5_chr21-MT.fa in.bam | grep "^Y" > out_simple.vcf vs
freebayes -f hs37d5_chr21-MT.fa --region Y in.bam | grep "^Y" > out_region.vcf

produces tiny QUAL score differences, while specifying regions or not does not affect any other chromosome.
At the moment I don't know what is causing this, but the effect on Y variants seems really tiny and in most analyses nobody cares about Y variants anyway.

Your patch here does not solve this problem at the root because you'd only skip using --region in one special mode, but the difference for Y variants between standard non-splitting command lines and this wrapper occurs independent of all other settings.

At the same time all other QUAL score differences (for variants not on Y) disappear if you just set --theta to 0.01, correct?

In that case I think the remaining tiny Y variants differences between nf-core eager and an equivalent Galaxy WF using the freebayes wrapper are simply something we need to accept and acknowledge for now and don't justify an entire rewrite of the wrapper.

It's great though that through this discussion we learnt the following about the command line tool:

  1. the unexpected --theta=0.01 default
  2. the strange interaction between the --region param and variants on the Y chromosome

Thanks for raising this @mertydn !

@nsoranzo
Copy link
Copy Markdown
Member

the freebayes help says the default for theta is 0.001 (which makes biological sense, I think) and this is also what the wrapper uses as default, but the freebayes source code says the default is 10e-3, which at first glance seems to be cool, just that this is really 0.01 !!

Yes, that sadly seems to be the case!

@mertydn
Copy link
Copy Markdown
Contributor Author

mertydn commented Mar 31, 2026

@wm75 Yes, that is correct. Setting --theta=0.01 resolves the QUAL score differences for variants except those on the Y chromosome.

Thank you so much for taking the time !

@bernt-matthias
Copy link
Copy Markdown
Contributor

Asked by AI to check for more problems of this kind:

Option Documented default (help) Actual code default
-T --theta 0.001 TH = 10e‑3 (0.01)
-7 --genotyping-max-banddepth 6 genotypingMaxBandDepth = 7

Guess this needs to be reported upstream.

@martenson
Copy link
Copy Markdown
Member

@bernt-matthias as stated above, Lance reported this in 2011 and it was autoclosed in 2020 because nobody cared :/

@bernt-matthias
Copy link
Copy Markdown
Contributor

freebayes/freebayes#845

@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 31, 2026

Thanks all for taking a look!

Lance reported this in 2011 and it was autoclosed in 2020 because nobody cared :/

I will report it again - maybe Pjotr Prins has time to take care of it.
Will also try to report the --region / Y chromosome weirdness.

Asked by AI to check for more problems of this kind

@bernt-matthias yes, I also found this to be a rather good use of AI for other packages. Even though also AI can miss things, it's a very fast second pair of eyes.

@wm75
Copy link
Copy Markdown
Member

wm75 commented Mar 31, 2026

freebayes/freebayes#845

Thanks @bernt-matthias !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants