Skip to content

qc command refactor#22

Merged
Claptar merged 2 commits into
mainfrom
dev
Apr 22, 2026
Merged

qc command refactor#22
Claptar merged 2 commits into
mainfrom
dev

Conversation

@Claptar

@Claptar Claptar commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

PR Notes — starsolo qc subcommand overhaul

Summary

Refactors the starsolo qc subcommand from a standalone shell script (scripts/solo_qc.sh) into a proper sourced library (lib/qc.sh), consistent with all other platform libs. Adds explicit directory arguments, new CLI flags, improved whitelist detection, and full README documentation.

Changes

lib/qc.sh (new, replaces scripts/solo_qc.sh)

  • Moved logic into run_qc() and show_qc_help() functions, sourced by bin/starsolo — consistent with lib/platform_*.sh pattern
  • Now uses log_info / log_warn / die from common.sh instead of raw >&2 echo
  • Helper functions prefixed (_qc_guess_species, _qc_guess_wl) to avoid namespace collisions

New CLI interface

  • Takes explicit output directories as positional args: starsolo qc sample1/ sample2/ …
  • Shows help when called with no arguments
  • New --species flag to override the species label in output (previously always guessed)
  • New --no-contamination-check flag to skip cross-sample barcode overlap check (runs by default)

Whitelist detection fix

  • Now matches on the basename of the whitelist file only — path-independent
  • Added 10x v4 chemistry support: 3M-3pgex-may-2023.txtv4-3p, 3M-5pgex-jan-2023.txtv4-5p

Cross-contamination check

  • Temp files now written to mktemp -d instead of the working directory (no more sorted.*.barcodes.tsv left behind)

bin/starsolo

  • Sources lib/qc.sh directly; removed the exec bash scripts/solo_qc.sh wrapper
  • starsolo qc --help now works consistently with other subcommands

README.md

  • Updated QC section with new usage, options table, checks performed, and full output column reference
  • Updated project structure to reflect lib/qc.sh

Copilot AI review requested due to automatic review settings April 22, 2026 11:54
@Claptar Claptar self-assigned this Apr 22, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the starsolo qc subcommand from a standalone script into a sourced library, updating the CLI and documentation accordingly.

Changes:

  • Replaced scripts/solo_qc.sh with lib/qc.sh implementing run_qc() / show_qc_help() and updated whitelist/species detection.
  • Updated bin/starsolo to source the QC library and dispatch qc consistently with other subcommands.
  • Updated README QC documentation and project structure listing.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.

File Description
scripts/solo_qc.sh Removed legacy standalone QC script.
lib/qc.sh Added new QC library implementing argument parsing, completion checks, contamination checks, and TSV output.
bin/starsolo Sources lib/qc.sh, updates help text/examples, and dispatches qc via run_qc.
README.md Documents new starsolo qc usage/options/output and updates project structure section.
Comments suppressed due to low confidence (1)

README.md:283

  • The project structure section still lists scripts/solo_qc.sh, but this PR removes that script and replaces it with lib/qc.sh. Update the tree under scripts/ to remove solo_qc.sh (or reflect its new location) so the README matches the repository contents.
│   └── defaults.conf          # Default configuration
├── scripts/
│   ├── solo_qc.sh             # QC aggregation
│   ├── bbduk_trim.sh          # Adapter trimming helper

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md

| Column | Description |
|:--|:--|
| `Sample` | Directory name |

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says the Sample column is the "Directory name", but run_qc currently prints the directory argument verbatim (which may include paths like results/GSM123). Either update the docs to say it's the provided path, or change QC output to print basename(<dir>) (and ensure uniqueness if doing so).

Suggested change
| `Sample` | Directory name |
| `Sample` | Provided directory path |

Copilot uses AI. Check for mistakes.
Comment thread lib/qc.sh
Comment on lines +48 to +52
-h|--help)
show_qc_help; exit 0 ;;
-s|--species|--specie)
FORCE_SPECIES="$2"; shift 2 ;;
--no-contamination-check)

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --species option assumes a following argument (FORCE_SPECIES="$2"; shift 2). With set -u, calling starsolo qc --species (missing value) will raise an unbound parameter error / shift out of range instead of a clean message. Validate that $2 is present and not another flag before shifting, and fail with die if missing.

Also, --specie is accepted here but not documented in show_qc_help; either document it as an alias or remove it to avoid an accidental/typo API surface.

Copilot uses AI. Check for mistakes.
Comment thread lib/qc.sh
Comment on lines +91 to +96
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv"
local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv"

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp filenames are derived from basename "$i" / basename "$j". Since starsolo qc now accepts arbitrary directory arguments, two different directories with the same basename (e.g., /run1/sampleA and /run2/sampleA) will collide and corrupt comparisons. Use a collision-free naming scheme (e.g., an index in CLEAN_DIRS, a hash of the full path, or mktemp per sample).

Suggested change
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv"
local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv"
declare -A SORTED_BARCODE_FILES=()
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
if [[ -z "${SORTED_BARCODE_FILES["$i"]}" ]]; then
SORTED_BARCODE_FILES["$i"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv")
fi
if [[ -z "${SORTED_BARCODE_FILES["$j"]}" ]]; then
SORTED_BARCODE_FILES["$j"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv")
fi
local_i="${SORTED_BARCODE_FILES["$i"]}"
local_j="${SORTED_BARCODE_FILES["$j"]}"

Copilot uses AI. Check for mistakes.
Comment thread lib/qc.sh
N2=$(wc -l < "$local_j")
MIN=$(( N1 <= N2 ? N1 : N2 ))
COMM=$(comm -12 "$local_i" "$local_j" | wc -l)
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PCT divides by MIN, but MIN can be 0 if either barcode list is empty (wc -l == 0). That will cause an awk division-by-zero error under set -e and abort QC. Guard against MIN==0 (e.g., treat overlap as 0% and continue).

Suggested change
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')
if (( MIN == 0 )); then
PCT=0
else
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')
fi

Copilot uses AI. Check for mistakes.
Comment thread lib/qc.sh
Comment on lines +86 to +89
log_info "Checking potential sample cross-contamination …"
local TMPDIR_SORT
TMPDIR_SORT=$(mktemp -d)

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TMPDIR_SORT is cleaned up only at the end via rm -rf "$TMPDIR_SORT". If any command in the contamination loop fails (e.g., zcat/sort/comm) and the script exits due to set -e, the temp directory will be left behind. Consider installing a trap (e.g., on RETURN within run_qc) to ensure cleanup on all exit paths.

Copilot uses AI. Check for mistakes.
Comment thread lib/qc.sh
Comment on lines +41 to +44
run_qc() {
local FORCE_SPECIES="" CHECK_CONTAMINATION=1
local -a DIRS=()

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI currently exercises each platform via .github/workflows/test.yml, but there is no coverage for the newly refactored starsolo qc subcommand. Add a workflow step (or a small fixture under data/test) that runs bin/starsolo qc ... and asserts basic output/header behavior (and optionally --no-contamination-check) to prevent regressions in argument parsing and file detection.

Copilot uses AI. Check for mistakes.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c73bcb27ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/qc.sh
Comment on lines +95 to +96
local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv"
local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use unique temp-file keys for contamination inputs

Naming the sorted barcode cache files with basename conflates different directories that share a sample folder name (for example batch1/sampleA and batch2/sampleA). In that case both samples write/read the same temp path, so comm can compare a file to itself and report near-100% overlap (plus concurrent writes to one file), producing false cross-contamination warnings for unrelated samples.

Useful? React with 👍 / 👎.

Comment thread lib/qc.sh
CF=$(echo "$R1" | awk -v v="$R2" '{printf "%.3f\n",v/$1}')
R3=$(grep "UMIs in Cells," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
GC=$(grep "Median GeneFull per Cell," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
ST=$(grep "^soloStrand" "$i/Log.out" | grep RE-DEFINED | awk '{print $2}')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle absent RE-DEFINED strand markers safely

starsolo enables set -euo pipefail, so this pipeline returns non-zero when Log.out does not contain RE-DEFINED; the command then exits before the fallback ST="Undef" logic can run. That turns a previously tolerated condition into a hard failure and causes starsolo qc to abort on valid logs that omit this marker.

Useful? React with 👍 / 👎.

@Claptar Claptar merged commit c4fbf61 into main Apr 22, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants