This tutorial is specifically tailored for Streptococcus pneumoniae and provides a step-by-step workflow for transforming raw read sequencing files (FASTQ) into an interactive Microreact visualisation. The tutorial begins with software environment setup. It then covers quality control and the generation of in silico data (such as GPSCs, serotypes, MLSTs, and AMR profiles) using the GPS Pipeline, followed by phylogenetic tree construction, and Microreact instance creation.
- Prerequisites & Environment Setup
- Quality Control & Generating in silico Data
- Building Phylogenetic Tree
- Creating Microreact Instance
You can follow this tutorial on Linux, Windows (with WSL2), or macOS. We recommend using a system with at least 16 GB of RAM.
- Install dependencies
- Install OpenJDK 17 (or later, up to 25); use SDKMAN! to install an appropriate version of the Temurin distribution
- Install Docker or Singularity/Apptainer
- Linux: Docker Engine / Apptainer
- macOS: Docker Desktop for macOS
- Windows with WSL2: Docker Desktop for WSL2
- Clone and initialise the pipeline
  git should come preinstalled on most systems. If not, follow the official installation guide.
- Go to the directory where you want to keep the pipeline
    cd /path/of/your/choice
- Clone the repository
    git clone https://github.com/GlobalPneumoSeq/gps-pipeline.git
- Go into the pipeline directory
    cd gps-pipeline
- (Optional) Initialise the pipeline so that it can be used at any time without Internet access
  If using Singularity/Apptainer, add -profile singularity to the command below.
    ./run_pipeline --init
- Install Conda
- Install snippy, snp-sites, fasttree, and ete3 in a Conda environment named buildtree
  If using a Mac with Apple Silicon, append --platform osx-64 to the command below.
    conda create -n buildtree "snippy>=4.6.0" "snp-sites>=2.5.1" "fasttree>=2.2.0" "ete3>=3.1.3"
- Install gubbins in a Conda environment named gubbins
  gubbins needs to be installed in a separate Conda environment because of package conflicts with the other tools.
    conda create -n gubbins "gubbins>=3.4.3"
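Once both environments exist, it can help to confirm that the executables used later in this tutorial are visible on PATH. The following is a minimal sketch using only the Python standard library. The helper name `missing_tools` and the grouping of executables by environment are assumptions made for illustration, based on the commands that appear in this tutorial.

```python
import shutil

# Executables invoked later in this tutorial, grouped by the Conda
# environment expected to provide them (the grouping is an assumption).
BUILDTREE_TOOLS = [
    "snippy",
    "snippy-multi",
    "snippy-clean_full_aln",
    "snp-sites",
    "FastTree",
]
GUBBINS_TOOLS = ["run_gubbins.py"]


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

After `conda activate buildtree`, `missing_tools(BUILDTREE_TOOLS)` should return an empty list; the same applies to `missing_tools(GUBBINS_TOOLS)` after `conda activate gubbins`.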
We use the GPS Pipeline to process the raw read sequencing files (FASTQ) of Streptococcus pneumoniae samples. It performs quality control (QC) and in silico typing (including serotype, MLST, GPSC, and AMR profile) fully automatically.
- Save the FASTQ files of all samples in a single directory
- Run the pipeline, specifying where the FASTQ files are stored (--reads) and the output directory (--output)
  If using Singularity/Apptainer, add -profile singularity to the command below.
    ./run_pipeline --reads /path/to/fastqs --output /path/to/output
- Once the run is complete, you will find results.csv in your specified output directory
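Before starting a long run, you may want to confirm that every sample has both read files. The sketch below checks file names only and assumes the `<ID>_1.fastq.gz` / `<ID>_2.fastq.gz` naming that the tree-building steps of this tutorial rely on; the pipeline itself may accept other naming patterns, which this example does not cover. The function name `pair_reads` is hypothetical.

```python
import re

# Naming convention assumed from the later steps of this tutorial:
# <ID>_1.fastq.gz and <ID>_2.fastq.gz
READ_PATTERN = re.compile(r"^(?P<id>.+)_(?P<mate>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Split sample IDs into (paired, unpaired) based on file names only."""
    mates = {}
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            # Record which mates (1 and/or 2) were seen for this sample ID
            mates.setdefault(match.group("id"), set()).add(match.group("mate"))
    paired = sorted(i for i, seen in mates.items() if seen == {"1", "2"})
    unpaired = sorted(i for i, seen in mates.items() if seen != {"1", "2"})
    return paired, unpaired
```

For example, `pair_reads(os.listdir("/path/to/fastqs"))` (after `import os`) returns the IDs with both mates and the IDs missing one of them.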
Before building a tree, we filter out samples that failed QC in the GPS Pipeline. We save a copy of results.csv containing only QC-passed samples as results_qcpass.csv, and a list of those samples as ids_qcpass.txt.
- Go into the GPS Pipeline output directory
    cd /path/to/output
- Extract the GPS Pipeline results of QC-passed samples; results_qcpass.csv will be uploaded to your Microreact instance as metadata
    awk -F , ' FNR==1 || $6=="PASS" ' results.csv > results_qcpass.csv
- Extract the names of QC-passed samples
    awk -F , ' FNR >1 { print $1 } ' results_qcpass.csv > ids_qcpass.txt
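The awk commands above select columns by position. As an alternative, the following sketch selects them by header name, which keeps working if the column order changes. The column names `Overall_QC` and `Sample_ID` are assumptions about the header of results.csv and should be checked against your own file; the function names are hypothetical.

```python
import csv


def split_qc_pass(lines, qc_column="Overall_QC", id_column="Sample_ID"):
    """Return (header, passed_rows, passed_ids) from CSV lines.

    `qc_column` and `id_column` are assumed header names; adjust them
    to match the header of your results.csv.
    """
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    for column in (qc_column, id_column):
        if column not in header:
            raise ValueError(f"column {column!r} not found in header")
    passed = [row for row in reader if row[qc_column] == "PASS"]
    return header, passed, [row[id_column] for row in passed]


def write_csv(path, header, rows):
    """Write the selected rows back out with the original header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
```

Opening results.csv with `open("results.csv", newline="")` and passing the handle to `split_qc_pass` gives the rows for results_qcpass.csv and the IDs for ids_qcpass.txt.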
We use the snippy-multi script to align the QC-passed reads of all samples against the same reference.
- Run the following command to create an input file named snippy_multi_input.tsv
  Update the values of READS_DIR and IDS_QCPASS to the correct ones.
    READS_DIR="/path/to/fastqs"
    IDS_QCPASS="/path/to/ids_qcpass.txt"
    while read -r ID; do
      printf '%s\t%s\t%s\n' "${ID}" "${READS_DIR}/${ID}_1.fastq.gz" "${READS_DIR}/${ID}_2.fastq.gz"
    done < "${IDS_QCPASS}" > snippy_multi_input.tsv
- Get a reference sequence
  The command below downloads a general-purpose reference sequence. You should use a suitable reference genome if you are working on a lineage/GPSC-specific analysis.
    wget https://raw.githubusercontent.com/GlobalPneumoSeq/gps-pipeline/refs/heads/master/data/ATCC_700669_v1.fa
- Activate the buildtree Conda environment
    conda activate buildtree
- Run snippy-multi to generate the script runme.sh, which will create the alignment in the next step
  --cpus $(nproc) uses all available CPU cores when running the generated script.
  Update the value of --ref if you are using an alternative reference sequence.
    snippy-multi snippy_multi_input.tsv --ref ATCC_700669_v1.fa --cpus $(nproc) > runme.sh
- Grant execution permission to the newly created file so that you can run it
    chmod +x runme.sh
- Run the script
    ./runme.sh
- Clean the alignment (see here for why the cleanup is needed)
    snippy-clean_full_aln core.full.aln > clean.full.aln
- Remove intermediate directories
    cut -f1 snippy_multi_input.tsv | xargs rm -rf
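Before building a tree, a quick sanity check of clean.full.aln can catch a truncated or incomplete alignment. The sketch below is a minimal FASTA reader using only the standard library. It assumes the reference sequence is named `Reference`, as in the pruning commands later in this tutorial; the function names are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return {sequence name: sequence length} from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence name
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_alignment(lengths, reference_name="Reference"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if reference_name not in lengths:
        problems.append(f"sequence {reference_name!r} not found")
    if len(set(lengths.values())) > 1:
        problems.append("sequences differ in length")
    return problems
```

Passing an open handle on clean.full.aln to `read_fasta_lengths` also gives the number of sequences, which should be the number of QC-passed samples plus one for the reference.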
Depending on the purpose of your tree building and the nature of your samples, you might choose one of the following paths:
- Path A:
- Species-wide analysis (as gubbins only works on samples within a strain or lineage), or a quick preliminary test
- Faster
- Path B:
- Strain/lineage-specific analysis
- Slower
Steps for Path A:
- Run snp-sites to identify and extract the SNPs from the alignment
    snp-sites -o clean.full.SNPs.aln clean.full.aln
- Run fasttree to build a tree
    FastTree -nt -gtr clean.full.SNPs.aln > clean.full.SNPs.aln.tree
- Run ete3 via Python to prune the reference from the tree file
    echo -e "from ete3 import Tree\nt = Tree('clean.full.SNPs.aln.tree')\nt.search_nodes(name='Reference')[0].detach()\nt.write(outfile='clean.full.SNPs.aln.noref.tree')" | python3
- The resulting clean.full.SNPs.aln.noref.tree will be uploaded to your Microreact instance as the tree file
Steps for Path B:
- Activate the gubbins Conda environment
    conda activate gubbins
- Run gubbins to build a tree with recombination filtering
  --threads $(nproc) uses all available CPU cores when running gubbins.
    run_gubbins.py --mar --threads $(nproc) -p gubbins_output clean.full.aln
- Activate the buildtree Conda environment
    conda activate buildtree
- Run ete3 via Python to prune the reference from the tree file
    echo -e "from ete3 import Tree\nt = Tree('gubbins_output.final_tree.tre')\nt.search_nodes(name='Reference')[0].detach()\nt.write(outfile='gubbins_output.final_tree.noref.tre')" | python3
- The resulting gubbins_output.final_tree.noref.tre will be uploaded to your Microreact instance as the tree file
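Whichever path you follow, the leaves of the pruned tree should match the sample IDs in ids_qcpass.txt, with `Reference` no longer present. The following is a rough sketch that extracts leaf names from a Newick string with a regular expression instead of ete3. It assumes unquoted leaf names without spaces or Newick punctuation, so treat it as a quick check and not a full Newick parser; the function names are hypothetical.

```python
import re

# A leaf name directly follows "(" or "," and runs until the next Newick
# delimiter. Internal node labels and support values follow ")" and are
# therefore not matched.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return the set of leaf names found in a Newick string."""
    return set(LEAF_PATTERN.findall(newick))


def compare_leaves(newick, expected_ids):
    """Return (missing, unexpected) leaf names relative to `expected_ids`."""
    leaves = leaf_names(newick)
    expected = set(expected_ids)
    return sorted(expected - leaves), sorted(leaves - expected)
```

Reading the tree file into a string and the IDs into a list, `compare_leaves(tree_text, ids)` should return two empty lists for a correctly pruned tree.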
You can refer to the official Microreact documentation for details on how to use it. In brief, you can create an informative instance in just a few simple steps:
- Go to https://microreact.org/upload using your preferred web browser
- Drag and drop results_qcpass.csv, and clean.full.SNPs.aln.noref.tree (Path A) or gubbins_output.final_tree.noref.tre (Path B)
- Keep clicking Continue until you see the visualisation
- Pick a suitable colour column to change what the node colours represent (e.g. Serotype)
  In the GPS Project, we have standardised colour palettes for GPSCs and serotypes. You can add a <FIELD NAME>__colour column (e.g. GPSC__colour) to your metadata (results_qcpass.csv in this case) to specify the colours used for a column.
- Add metadata blocks to visualise the in silico data generated by the GPS Pipeline and provide context for the tree
- Save and share the project
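The `<FIELD NAME>__colour` convention mentioned above can be applied to the metadata before upload. The sketch below adds such a column to rows read with `csv.DictReader`. The function name `add_colour_column`, the default grey, and any palette you pass in are assumptions for illustration; the example does not reproduce the GPS Project's standardised palette.

```python
def add_colour_column(rows, field, palette, default="#cccccc"):
    """Return copies of `rows` with a `<field>__colour` column appended.

    `palette` maps values of `field` to colour codes; values not in the
    palette receive `default` (an arbitrary grey chosen for this sketch).
    """
    colour_field = f"{field}__colour"
    return [
        {**row, colour_field: palette.get(row[field], default)}
        for row in rows
    ]
```

For example, `add_colour_column(rows, "GPSC", {"1": "#ff0000"})` adds a `GPSC__colour` column in which GPSC 1 is red and every other GPSC is grey. The hex values here are placeholders.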