fix(performance): Use bounded HTTP Range requests for indexed BAM queries #1998

Open

TechIsCool wants to merge 1 commit into samtools:develop from
Conversation
When reading indexed BAM files from remote URLs (HTTP, S3, etc.), seeking to a chunk offset would then read unbounded to EOF. For small queries against large files, this downloads far more data than needed.

This adds bgzf_seek_limit(), which accepts the chunk end offset from the BAM index, enabling bounded Range requests (`bytes=X-Y`) instead of unbounded ones (`bytes=X-`) in the libcurl backend.

Changes:

- hfile.h/hfile.c: Add readahead_limit field and setter
- bgzf.h/bgzf.c: Add bgzf_seek_limit() that passes limit to hfile
- hfile_libcurl.c: Use CURLOPT_RANGE with bounds when limit is set
- hts.c: Call bgzf_seek_limit() with chunk end in hts_itr_next()

The limit is cleared after each hseek(), so it only affects reads immediately following a seek.

Signed-off-by: David Beck <techiscool@gmail.com>
Member

This changes the public … This problem has already been fixed in the …
Problem
When reading remote BAM files with an index, htslib seeks to each chunk's start offset but issues unbounded Range requests. The server advertises gigabytes of Content-Length even though we only need kilobytes:
```
bytes=8224425-
bytes=1631423494-
bytes=7287649006-
```

The client terminates early, but "early termination" isn't free: data already in flight still transfers. We have also found that being specific about what is needed improves S3 responsiveness.
Solution
The BAM index already contains chunk end offsets. Pass them through to the HTTP layer:
EC2 Benchmark (35 MB/s bandwidth)

- Environment: EC2 m8azn.medium (up to 25 Gbps bandwidth), us-east-1
- Test file: s3://1000genomes/.../NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam (17.3 GB)
- Measurement: Wall clock time + actual wire transfer via /sys/class/net/<iface>/statistics/rx_bytes

It appears S3 optimizes resource allocation for bounded requests, leading to much faster responses. The time improvement exceeds the bandwidth savings, suggesting that S3 can serve bounded requests more efficiently.
Per-Request Comparison (5 regions)

Unbounded (`Range: bytes=X-`) vs bounded (`Range: bytes=X-Y`) requests; per-request figures not recovered.

Local Benchmark (2 MB/s bandwidth)

- Environment: macOS M1 (up to 100 Mbps), California
- Test file: s3://1000genomes/.../NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam (17.3 GB)

Reproduction