cgMLST Guide
This guide covers gmlst typing cgmlst, including mode selection, FASTA and FASTQ behavior, prefiltering, CDS-aware classification, and performance tuning. For the full CLI surface, see commands.md. For broader implementation context, see architecture.md.
Overview
cgMLST extends the classical seven-locus MLST idea to hundreds or thousands of loci across a core genome set. That gives you much finer resolution for outbreak analysis, surveillance, and population structure work.
In practical terms:
- MLST is small, stable, and fast for lineage naming
- cgMLST is larger, more discriminating, and better suited to high-resolution comparison
gmlst keeps both workflows under the same CLI, but typing cgmlst has extra logic for large schemes, CDS-aware classification, and backend-specific speed paths.
Quick Start
The default backend for cgMLST is minimap2.
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fna
Typical TSV-style output:
FILE SCHEME ST dnaA ftsZ gyrB ...
sample.fna vparahaemolyticus_3 - 12 44 109 ...
For many cgMLST schemes, ST may be - because the main value is the per-locus profile, not a compact legacy ST label.
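Because the per-locus profile carries the result, a quick completeness check is often more useful than the ST column. The sketch below assumes the output is tab-separated with a header row, that locus columns start at the fourth field as in the sample above, and that a called allele is a plain integer. This guide does not guarantee any of these, and the names count_called and profile.tsv are hypothetical.

```shell
# count_called FILE
# For each sample row, print: file name, number of loci with an integer
# allele call, total number of locus columns.
# Assumption: tab-separated output with FILE, SCHEME, ST in fields 1-3.
count_called() {
    awk -F'\t' 'NR > 1 {
        called = 0
        for (i = 4; i <= NF; i++) if ($i ~ /^[0-9]+$/) called++
        printf "%s\t%d\t%d\n", $1, called, NF - 3
    }' "$1"
}

# Hypothetical three-locus profile, used only to demonstrate the function.
printf 'FILE\tSCHEME\tST\tdnaA\tftsZ\tgyrB\n' > profile.tsv
printf 'sample.fna\tvparahaemolyticus_3\t-\t12\t44\t-\n' >> profile.tsv
count_called profile.tsv
```

For the demonstration row, two of the three locus columns hold integer calls, so the function reports 2 called loci out of 3.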
cgMLST Modes
Choose a mode based on dataset size and how much rescue logic you want.
standard
Conservative mode. No forced chew-style overrides.
Use it when:
- you want the plainest behavior
- you are comparing results across backends
- you are working with FASTQ, because chew-style modes are forced back to standard there
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode standard sample.fna
chew-fast
Adds exact-hash resolution, minimap2 hash prefiltering, automatic missing-locus refinement, and targeted blastn evidence fallback. The refinement and the fallback are each capped at 500 loci by default.
Use it when:
- you want a strong speed and recovery balance for FASTA assemblies
- your scheme is large enough that early filtering helps
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-fast sample.fna
chew-ultrafast
Builds on chew-fast with representative-only main alignment, disabled CIGAR emission, an ultrafast minimap2 speed profile, a strict low-confidence rescue cap of 120 loci, and an adaptive second pass.
Use it when:
- your scheme has 1000 or more loci
- you need throughput first, then a focused rescue pass
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-ultrafast sample.fna
chew-bsr
Adds protein-level exact-hash pre-resolution on top of chew-fast, with optional BSR confirmation controlled by environment variables.
Use it when:
- you need extra protein-aware confirmation behavior
- you are investigating edge cases where DNA-only pre-resolution is not enough
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-bsr sample.fna
chew-balanced
Uses exact-hash resolution, minimap2 hash prefiltering, and targeted blastn fallback without the more aggressive ultrafast path.
Use it when:
- you want a middle ground between standard and the most aggressive acceleration profiles
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-balanced sample.fna
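The "Use it when" lists above can be folded into a small helper for wrapper scripts. This is only a sketch of one reasonable starting policy, not a rule that gmlst enforces: the function name suggest_mode is hypothetical, and choosing chew-balanced for FASTA schemes under 1000 loci is an assumption (standard is an equally valid starting point).

```shell
# suggest_mode INPUT_KIND LOCI_COUNT
# INPUT_KIND is "fasta" or "fastq"; LOCI_COUNT is the number of loci in the scheme.
suggest_mode() {
    kind=$1
    loci=$2
    if [ "$kind" = "fastq" ]; then
        # Chew-style modes are forced back to standard for FASTQ input.
        echo standard
    elif [ "$loci" -ge 1000 ]; then
        # Large FASTA schemes: throughput first, then a focused rescue pass.
        echo chew-ultrafast
    else
        # Assumed starting point for small to medium FASTA schemes.
        echo chew-balanced
    fi
}

suggest_mode fasta 2500
suggest_mode fastq 2500
suggest_mode fasta 400

# Example use (not run here):
#   gmlst typing cgmlst -s vparahaemolyticus_3 \
#     --cgmlst-mode "$(suggest_mode fasta 2500)" sample.fna
```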
FASTA cgMLST
FASTA assemblies get the richest optimization path. This is where chew-style modes are active and where prefiltering, exact-hash resolution, and targeted evidence fallback have the most impact.
Common FASTA examples:
# Default path
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fna
# Enable a custom prefilter shape
gmlst typing cgmlst -s vparahaemolyticus_3 \
--prefilter-k 31 \
--prefilter-top-n 20 \
--prefilter-min-loci-fraction 0.2 \
sample.fna
# Disable prefilter explicitly
gmlst typing cgmlst -s vparahaemolyticus_3 \
--no-prefilter sample.fna
# Use a faster chew-style mode for a large scheme
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-ultrafast sample.fna
Why FASTA gets more options:
- assemblies give stable locus boundaries for exact-hash and CDS-aware logic
- minimap2 prefiltering is more effective on assembled contigs
- second-pass rescue is cheaper when the candidate space is already narrow
FASTQ cgMLST
FASTQ follows a KMA-first policy.
If you request -b minimap2 with FASTQ inputs, gmlst automatically switches to -b kma. Chew-style modes are compatibility-only in FASTQ mode and are forced back to standard.
gmlst typing cgmlst -s vparahaemolyticus_3 \
-b minimap2 reads_R1.fastq.gz reads_R2.fastq.gz
Effective behavior:
- backend is auto-switched to kma
- cgMLST mode is treated as standard
- per-sample threads may be auto-raised through GMLST_CGMLST_FASTQ_KMA_AUTO_THREADS
- --call-policy chewbbaca is rejected for FASTQ input
Recommended FASTQ run:
gmlst typing cgmlst -s vparahaemolyticus_3 \
-b kma -t 8 reads_R1.fastq.gz reads_R2.fastq.gz
Why this happens:
- chew-style accelerations are designed around FASTA assemblies
- KMA is a stronger fit for read mapping in cgMLST FASTQ workflows
- automatic switching avoids a slow or misleading configuration
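If a wrapper script needs to know in advance which backend a run will effectively use, the switch described above can be mirrored by file extension. This sketches the documented outcome only, not how gmlst detects input types internally. The extension list and the name effective_backend are assumptions.

```shell
# effective_backend INPUT_FILE
# Prints the backend a default cgMLST run is expected to end up using.
effective_backend() {
    case "$1" in
        # Assumed FASTQ extensions; FASTQ input is switched to kma.
        *.fastq|*.fq|*.fastq.gz|*.fq.gz) echo kma ;;
        # Everything else is treated as FASTA, where minimap2 is the default.
        *) echo minimap2 ;;
    esac
}

effective_backend reads_R1.fastq.gz
effective_backend sample.fna
```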
Prefiltering
Prefiltering reduces the number of loci that need full downstream alignment or rescue work.
CLI options:
gmlst typing cgmlst -s vparahaemolyticus_3 \
--prefilter-k 31 \
--prefilter-top-n 20 \
--prefilter-min-loci-fraction 0.15 \
sample.fna
Meaning of the options:
- --prefilter-k: k-mer size for the prefilter
- --prefilter-top-n: how many candidate loci to keep per query step
- --prefilter-min-loci-fraction: minimum fraction of loci required to keep the prefilter path active
Important behavior:
- schemes larger than the auto-skip threshold can bypass prefiltering
- the default auto-skip threshold is 3000 loci
- for -b kma and default -b minimap2, cgMLST prefilter can be skipped in favor of the persistent full-index path
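The auto-skip rule can be checked before a run with a few lines of shell. The sketch below assumes that GMLST_CGMLST_PREFILTER_MAX_LOCI is the variable holding the auto-skip threshold and that "larger than" means strictly greater. The helper name prefilter_may_autoskip is hypothetical.

```shell
# prefilter_may_autoskip LOCI_COUNT
# Succeeds (exit status 0) when the scheme is larger than the auto-skip threshold.
prefilter_may_autoskip() {
    # Assumption: this variable carries the threshold; 3000 is the documented default.
    threshold=${GMLST_CGMLST_PREFILTER_MAX_LOCI:-3000}
    [ "$1" -gt "$threshold" ]
}

if prefilter_may_autoskip 3500; then
    echo "prefilter may be auto-skipped"
else
    echo "prefilter stays eligible"
fi
```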
CDS Prediction
gmlst uses Pyrodigal for cgMLST CDS prediction in the exact-hash and chew-style path.
To export predicted CDS coordinates:
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cds-coordinates-out cds_coordinates.tsv \
sample.fna
Why coordinate export is useful:
- it helps compare predicted coding regions between runs
- it supports debugging chewBBACA-style locus classification differences
- it gives you a concrete record of what the CDS gate saw
Example coordinate TSV:
sample contig start end strand locus
sample.fna contig_1 1042 1983 + dnaA
sample.fna contig_1 4021 4899 - gyrB
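A coordinate export like this is easy to summarise with awk. The sketch below assumes the file is tab-separated, has the six columns shown above, and uses 1-based inclusive coordinates. None of that is stated explicitly in this guide, and cds_lengths is a hypothetical helper name.

```shell
# cds_lengths FILE
# Print locus name and CDS length for every data row of a coordinate TSV.
# Assumed columns: sample, contig, start, end, strand, locus.
cds_lengths() {
    # Length is end - start + 1 under the inclusive-coordinate assumption.
    awk -F'\t' 'NR > 1 { printf "%s\t%d\n", $6, $4 - $3 + 1 }' "$1"
}

# Rebuild the example table above as a real TSV, for demonstration only.
printf 'sample\tcontig\tstart\tend\tstrand\tlocus\n' > cds_coordinates.tsv
printf 'sample.fna\tcontig_1\t1042\t1983\t+\tdnaA\n' >> cds_coordinates.tsv
printf 'sample.fna\tcontig_1\t4021\t4899\t-\tgyrB\n' >> cds_coordinates.tsv
cds_lengths cds_coordinates.tsv
```

Under those assumptions the two example rows come out at 942 and 879 bases. Both are multiples of three, as expected for complete coding sequences.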
Call Policy
The call policy controls how locus results are classified for reporting.
Default policy
gmlst typing cgmlst -s vparahaemolyticus_3 \
--call-policy default sample.fna
Use this when you want the normal gmlst interpretation without chewBBACA-style output classes.
chewBBACA-compatible policy
gmlst typing cgmlst -s vparahaemolyticus_3 \
--call-policy chewbbaca sample.fna
Important limits:
- --call-policy chewbbaca requires FASTA assemblies
- raw calls stay unchanged, but output uses chew-style per-locus class labels
- CDS-gated classification is enabled by default
CDS Gating
CDS gating controls whether chewBBACA-style classification considers only matched sequence that passes the CDS prediction filter.
# Default behavior when using --call-policy chewbbaca
gmlst typing cgmlst -s vparahaemolyticus_3 \
--call-policy chewbbaca \
--chew-cds-gate \
sample.fna
# Relax the gate
gmlst typing cgmlst -s vparahaemolyticus_3 \
--call-policy chewbbaca \
--no-chew-cds-gate \
sample.fna
Use the default gate when you want classification to stay close to coding-sequence expectations. Turn it off only when you are deliberately reviewing edge cases in lower-confidence sequence context.
Evidence Fallback
Low-confidence loci can go through targeted fallback paths depending on mode and environment settings.
Supported evidence fallback backends are:
- blastn
- kma
- nucmer
These fallbacks are targeted, not full reruns. They are meant to rescue uncertain loci after the main pass has narrowed the search space.
Representative configuration:
export GMLST_CGMLST_EVIDENCE_FALLBACK_BACKEND=blastn
export GMLST_CGMLST_EVIDENCE_FALLBACK_MAX_LOCI=300
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fna
Performance Tuning
Use these rules as a starting point:
- small to medium FASTA schemes, start with standard or chew-balanced
- large FASTA schemes, start with chew-ultrafast
- FASTQ, use -b kma and set -t 8 to -t 16 when CPU is available
- many samples, add --max-workers for sample-level parallelism
Examples:
# Large scheme, fast assembly path
gmlst typing cgmlst -s vparahaemolyticus_3 \
--cgmlst-mode chew-ultrafast \
--max-workers 4 \
samples/*.fna
# FASTQ run with tuned threads
gmlst typing cgmlst -s vparahaemolyticus_3 \
-b kma -t 12 reads_R1.fastq.gz reads_R2.fastq.gz
Operational caveats:
- --max-workers > 1 switches to sample-level parallelism, which means per-sample backend threads are reduced to 1
- --cds-coordinates-out is intended for single-worker runs
- nucmer ignores thread settings
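Since workers and backend threads trade off against each other, a wrapper can pick one or the other from the core and sample counts. This is a sketch of one simple policy, not gmlst's own scheduling. The helper name plan_flags is hypothetical, and capping workers at the core count is an assumption.

```shell
# plan_flags CORES SAMPLES
# Prints either a thread flag or a worker flag for a gmlst run.
plan_flags() {
    cores=$1
    samples=$2
    if [ "$samples" -le 1 ]; then
        # Single sample: give every core to the backend.
        echo "-t $cores"
    else
        # Several samples: one worker per sample, capped at the core count,
        # because each worker's backend runs with a single thread.
        workers=$samples
        if [ "$workers" -gt "$cores" ]; then
            workers=$cores
        fi
        echo "--max-workers $workers"
    fi
}

plan_flags 8 1
plan_flags 8 20

# Example use (not run here; the substitution is left unquoted on purpose
# so that the flag and its value split into two words):
#   gmlst typing cgmlst -s vparahaemolyticus_3 $(plan_flags 8 20) samples/*.fna
```

With only a handful of samples on a many-core machine this policy leaves cores idle, so running the samples one after another with a higher -t may be the better choice there.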
Large Scheme Handling
For schemes with 1000 or more loci:
- prefer chew-ultrafast for FASTA assemblies
- expect the prefilter to matter more when the candidate space is still manageable
- remember that prefilter auto-skip can trigger above the configured threshold
- use --max-workers carefully, because each sample can still be expensive on its own
For very large schemes, the fastest run is not always the one with the most workers. You usually get better throughput by balancing sample-level parallelism with enough threads for the active backend.
Environment Variables
These variables affect cgMLST-specific behavior.
FASTA and minimap2 tuning
- GMLST_MINIMAP2_FASTA_SPEED_PROFILE=default|fast|ultrafast
- GMLST_CGMLST_MINIMAP2_ULTRA_SECOND_PASS_MAX_LOCI=adaptive|<int>
- GMLST_CGMLST_MINIMAP2_HASH_REFINE_MAX_LOCI=<int>
- GMLST_CGMLST_MINIMAP2_HASH_PREFILTER=1
FASTQ and KMA tuning
- GMLST_CGMLST_FASTQ_KMA_AUTO_THREADS=<int>
- GMLST_CGMLST_KMA_FASTQ_MEM_MODE=1|0
- GMLST_CGMLST_KMA_FASTQ_MEM_CONFIRM_MAX_LOCI=<int>
Prefilter and exact-hash controls
- GMLST_CGMLST_PREFILTER_MAX_LOCI=<int>
- GMLST_CGMLST_EXACT_HASH_PREFILTER=1
CDS prediction and export
- GMLST_CGMLST_CDS_PREDICTION_MODE=single|meta
- GMLST_CGMLST_CDS_TRAINING_FILE=/path/to/pyrodigal_training.trn
- GMLST_CGMLST_CDS_CLOSED_ENDS=1|0
- GMLST_CGMLST_CDS_COORDINATES_OUT=/path/to/cds_coordinates.tsv
Other related settings
- GMLST_CACHE_DIR=/path/to/cache
- GMLST_TMPDIR=/path/to/tmp
- GMLST_MINIMAP2_KMER_ENGINE=python|kmc|auto
Evidence fallback
- GMLST_CGMLST_EVIDENCE_FALLBACK_BACKEND=none|blastn|kma|nucmer
- GMLST_CGMLST_EVIDENCE_FALLBACK_MAX_LOCI=<int>
Example environment setup
export GMLST_MINIMAP2_FASTA_SPEED_PROFILE=ultrafast
export GMLST_CGMLST_PREFILTER_MAX_LOCI=3000
export GMLST_CGMLST_FASTQ_KMA_AUTO_THREADS=8
export GMLST_CGMLST_EVIDENCE_FALLBACK_BACKEND=blastn
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fna