Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the starsolo qc subcommand from a standalone script into a sourced library, updating the CLI and documentation accordingly.
Changes:
- Replaced
scripts/solo_qc.shwithlib/qc.shimplementingrun_qc()/show_qc_help()and updated whitelist/species detection. - Updated
bin/starsoloto source the QC library and dispatchqcconsistently with other subcommands. - Updated README QC documentation and project structure listing.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/solo_qc.sh | Removed legacy standalone QC script. |
| lib/qc.sh | Added new QC library implementing argument parsing, completion checks, contamination checks, and TSV output. |
| bin/starsolo | Sources lib/qc.sh, updates help text/examples, and dispatches qc via run_qc. |
| README.md | Documents new starsolo qc usage/options/output and updates project structure section. |
Comments suppressed due to low confidence (1)
README.md:283
- The project structure section still lists
scripts/solo_qc.sh, but this PR removes that script and replaces it withlib/qc.sh. Update the tree underscripts/to removesolo_qc.sh(or reflect its new location) so the README matches the repository contents.
│ └── defaults.conf # Default configuration
├── scripts/
│ ├── solo_qc.sh # QC aggregation
│ ├── bbduk_trim.sh # Adapter trimming helper
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| | Column | Description | | ||
| |:--|:--| | ||
| | `Sample` | Directory name | |
There was a problem hiding this comment.
The documentation says the Sample column is the "Directory name", but run_qc currently prints the directory argument verbatim (which may include paths like results/GSM123). Either update the docs to say it's the provided path, or change QC output to print basename(<dir>) (and ensure uniqueness if doing so).
| | `Sample` | Directory name | | |
| | `Sample` | Provided directory path | |
| -h|--help) | ||
| show_qc_help; exit 0 ;; | ||
| -s|--species|--specie) | ||
| FORCE_SPECIES="$2"; shift 2 ;; | ||
| --no-contamination-check) |
There was a problem hiding this comment.
The --species option assumes a following argument (FORCE_SPECIES="$2"; shift 2). With set -u, calling starsolo qc --species (missing value) will raise an unbound parameter error / shift out of range instead of a clean message. Validate that $2 is present and not another flag before shifting, and fail with die if missing.
Also, --specie is accepted here but not documented in show_qc_help; either document it as an alias or remove it to avoid an accidental/typo API surface.
| for i in "${CLEAN_DIRS[@]}"; do | ||
| if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then | ||
| for j in "${CLEAN_DIRS[@]}"; do | ||
| if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then | ||
| local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv" | ||
| local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv" |
There was a problem hiding this comment.
Temp filenames are derived from basename "$i" / basename "$j". Since starsolo qc now accepts arbitrary directory arguments, two different directories with the same basename (e.g., /run1/sampleA and /run2/sampleA) will collide and corrupt comparisons. Use a collision-free naming scheme (e.g., an index in CLEAN_DIRS, a hash of the full path, or mktemp per sample).
| for i in "${CLEAN_DIRS[@]}"; do | |
| if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then | |
| for j in "${CLEAN_DIRS[@]}"; do | |
| if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then | |
| local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv" | |
| local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv" | |
| declare -A SORTED_BARCODE_FILES=() | |
| for i in "${CLEAN_DIRS[@]}"; do | |
| if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then | |
| for j in "${CLEAN_DIRS[@]}"; do | |
| if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then | |
| if [[ -z "${SORTED_BARCODE_FILES["$i"]}" ]]; then | |
| SORTED_BARCODE_FILES["$i"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv") | |
| fi | |
| if [[ -z "${SORTED_BARCODE_FILES["$j"]}" ]]; then | |
| SORTED_BARCODE_FILES["$j"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv") | |
| fi | |
| local_i="${SORTED_BARCODE_FILES["$i"]}" | |
| local_j="${SORTED_BARCODE_FILES["$j"]}" |
| N2=$(wc -l < "$local_j") | ||
| MIN=$(( N1 <= N2 ? N1 : N2 )) | ||
| COMM=$(comm -12 "$local_i" "$local_j" | wc -l) | ||
| PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}') |
There was a problem hiding this comment.
PCT divides by MIN, but MIN can be 0 if either barcode list is empty (wc -l == 0). That will cause an awk division-by-zero error under set -e and abort QC. Guard against MIN==0 (e.g., treat overlap as 0% and continue).
| PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}') | |
| if (( MIN == 0 )); then | |
| PCT=0 | |
| else | |
| PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}') | |
| fi |
| log_info "Checking potential sample cross-contamination …" | ||
| local TMPDIR_SORT | ||
| TMPDIR_SORT=$(mktemp -d) | ||
|
|
There was a problem hiding this comment.
TMPDIR_SORT is cleaned up only at the end via rm -rf "$TMPDIR_SORT". If any command in the contamination loop fails (e.g., zcat/sort/comm) and the script exits due to set -e, the temp directory will be left behind. Consider installing a trap (e.g., on RETURN within run_qc) to ensure cleanup on all exit paths.
| run_qc() { | ||
| local FORCE_SPECIES="" CHECK_CONTAMINATION=1 | ||
| local -a DIRS=() | ||
|
|
There was a problem hiding this comment.
CI currently exercises each platform via .github/workflows/test.yml, but there is no coverage for the newly refactored starsolo qc subcommand. Add a workflow step (or a small fixture under data/test) that runs bin/starsolo qc ... and asserts basic output/header behavior (and optionally --no-contamination-check) to prevent regressions in argument parsing and file detection.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c73bcb27ab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv" | ||
| local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv" |
There was a problem hiding this comment.
Use unique temp-file keys for contamination inputs
Naming the sorted barcode cache files with basename conflates different directories that share a sample folder name (for example batch1/sampleA and batch2/sampleA). In that case both samples write/read the same temp path, so comm can compare a file to itself and report near-100% overlap (plus concurrent writes to one file), producing false cross-contamination warnings for unrelated samples.
Useful? React with 👍 / 👎.
| CF=$(echo "$R1" | awk -v v="$R2" '{printf "%.3f\n",v/$1}') | ||
| R3=$(grep "UMIs in Cells," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}') | ||
| GC=$(grep "Median GeneFull per Cell," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}') | ||
| ST=$(grep "^soloStrand" "$i/Log.out" | grep RE-DEFINED | awk '{print $2}') |
There was a problem hiding this comment.
Handle absent RE-DEFINED strand markers safely
starsolo enables set -euo pipefail, so this pipeline returns non-zero when Log.out does not contain RE-DEFINED; the command then exits before the fallback ST="Undef" logic can run. That turns a previously tolerated condition into a hard failure and causes starsolo qc to abort on valid logs that omit this marker.
Useful? React with 👍 / 👎.
PR Notes —
starsolo qcsubcommand overhaulSummary
Refactors the
starsolo qcsubcommand from a standalone shell script (scripts/solo_qc.sh) into a proper sourced library (lib/qc.sh), consistent with all other platform libs. Adds explicit directory arguments, new CLI flags, improved whitelist detection, and full README documentation.Changes
lib/qc.sh(new, replacesscripts/solo_qc.sh)run_qc()andshow_qc_help()functions, sourced bybin/starsolo— consistent withlib/platform_*.shpatternlog_info/log_warn/diefromcommon.shinstead of raw>&2 echo_qc_guess_species,_qc_guess_wl) to avoid namespace collisionsNew CLI interface
starsolo qc sample1/ sample2/ …--speciesflag to override the species label in output (previously always guessed)--no-contamination-checkflag to skip cross-sample barcode overlap check (runs by default)Whitelist detection fix
3M-3pgex-may-2023.txt→v4-3p,3M-5pgex-jan-2023.txt→v4-5pCross-contamination check
mktemp -dinstead of the working directory (no moresorted.*.barcodes.tsvleft behind)bin/starsololib/qc.shdirectly; removed theexec bash scripts/solo_qc.shwrapperstarsolo qc --helpnow works consistently with other subcommandsREADME.mdlib/qc.sh