Backend Comparison Guide¶
gmlst supports multiple alignment backends because MLST and cgMLST workloads are not all the same. Some runs need maximum confidence on assembled genomes, some need direct FASTQ support, and some need the fastest possible throughput on very large schemes. The best backend depends on your input type, scheme size, and whether you are optimizing for sensitivity, speed, or read mapping behavior.
In short, use blastn when you want the classic reference path on FASTA assemblies, kma when you are typing FASTQ reads and want a strong routine backend, minimap2 when you need very fast cgMLST on large datasets, and nucmer when you are working with more divergent organisms or unusually distant allele matches.
Backend Comparison Table¶
| Name | Input Types | Speed | Sensitivity | Best For | External Tool |
|---|---|---|---|---|---|
blastn |
FASTA | Moderate | Highest | Reference MLST, validation, assembled genomes | blastn, makeblastdb |
kma |
FASTA, FASTQ | Fast | High | FASTQ typing, routine MLST, cgMLST with reads | kma |
minimap2 |
FASTA, FASTQ | Very fast | High | Large cgMLST schemes, high throughput typing, fast FASTA pipelines | minimap2 (plus samtools for some FASTQ temp outputs when available) |
nucmer |
FASTA | Moderate | Very high on divergent matches | Distant organisms, sensitive cross-species matching | nucmer, show-coords |
How to Choose¶
| Scenario | Recommended Backend | Why |
|---|---|---|
| Small or routine MLST on assembled genomes | blastn |
Highest confidence, simple behavior, familiar gold-standard path |
| FASTQ MLST or FASTQ cgMLST | kma |
Direct paired-end support, read mapping workflow, strong routine behavior |
| Very large cgMLST on FASTA assemblies | minimap2 |
Fastest mainline path, multiple speed profiles, prefilter support |
| FASTQ typing with fast candidate generation and targeted validation | minimap2 |
Two-stage FASTQ path balances speed and confirmation |
| Divergent alleles or non-standard organisms | nucmer |
Sensitive for distant sequence matches |
| Result confirmation after a fast pass | blastn |
Useful as a slower second pass on flagged samples |
BLASTN (blastn)¶
blastn is the conservative reference backend for assembled genomes. It only accepts FASTA input and uses the BLAST+ database workflow, so it is usually the easiest backend to trust when you want a validation run or a final comparison point.
When to use¶
- Reference MLST on assembled genomes
- Validation after a fast screening pass
- Cases where you prefer sensitivity over raw speed
- Samples with borderline allele calls that need a familiar alignment path
Example commands¶
# Standard MLST on an assembly
gmlst typing mlst -s saureus_1 -b blastn sample.fasta
# Batch typing with multiple threads
gmlst typing mlst -s saureus_1 -b blastn -t 8 samples/*.fasta -o results.tsv
# Targeted cgMLST re-check on a flagged assembly
gmlst typing cgmlst -s vparahaemolyticus_3 -b blastn flagged_sample.fasta
Notes¶
- Input type: FASTA only
- Index builder:
makeblastdb - Threading: supported with
-t/--threads - Best role: highest-confidence typing on assemblies
Performance tips¶
- Use
-tfor larger runs. BLASTN benefits from multiple threads. - Prefer batch execution with
-oto keep output handling simple. - For large cgMLST schemes, BLASTN is often better as a fallback or review pass than as the first pass on every sample.
KMA (kma)¶
kma is the strongest general-purpose backend for FASTQ typing. It supports both FASTA and FASTQ, but it is especially valuable for direct read mapping, paired-end workflows, and routine cgMLST from reads.
When to use¶
- FASTQ MLST or cgMLST
- Paired-end read sets that you want to type directly
- Routine production workflows where speed and stable read mapping matter
- cgMLST runs where KMA behavior is preferred over minimap2 candidate scoring
Example commands¶
# MLST from paired-end reads
gmlst typing mlst -s saureus_1 -b kma reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
# cgMLST from FASTQ reads
gmlst typing cgmlst -s vparahaemolyticus_3 -b kma -t 8 reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
# FASTA typing with KMA
gmlst typing mlst -s saureus_1 -b kma sample.fasta
FASTQ behavior¶
gmlstauto-detects common paired-end naming patterns such as_R1/_R2,_1/_2, and.1/.2.- For FASTQ cgMLST, the CLI can auto-raise KMA threads with
GMLST_CGMLST_FASTQ_KMA_AUTO_THREADSwhen the run would otherwise stay on one thread. GMLST_CGMLST_KMA_FASTQ_MEM_MODE=1enables KMA-mem_modefor faster read mapping.GMLST_CGMLST_KMA_FASTQ_MEM_CONFIRM_MAX_LOCIcontrols a strict re-check stage for a limited number ofclosestloci after the fast-mem_modepass.
Notes¶
- Input type: FASTA and FASTQ
- Index builder: KMA index
- Threading: supported with
-t/--threads - Best role: preferred backend for FASTQ cgMLST and routine read-based typing
Performance tips¶
- For cgMLST, avoid
-t 1on large schemes. The CLI even warns that one-thread KMA can be very slow. - If you want fast routine FASTQ typing, keep
GMLST_CGMLST_KMA_FASTQ_MEM_MODE=1. - If exact recovery matters more than speed, increase the strict re-check budget with
GMLST_CGMLST_KMA_FASTQ_MEM_CONFIRM_MAX_LOCI.
minimap2 (minimap2)¶
minimap2 is the high-throughput backend for large cgMLST workloads. It supports both FASTA and FASTQ, but the internal path differs by input type.
For FASTA assemblies, gmlst uses a fast allele-to-genome path with optional exact-hash pre-resolution, representative prefiltering, adaptive rescue, and speed profiles. For FASTQ reads, gmlst uses a two-stage workflow: candidate generation first, then targeted remapping validation on uncertain loci.
When to use¶
- Large cgMLST on FASTA assemblies
- High-throughput batch typing
- FASTQ typing where you want fast shortlist generation plus targeted confirmation
- chewBBACA-style cgMLST workflows on FASTA assemblies
Example commands¶
# Default cgMLST on FASTA, minimap2 is the default backend here
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fasta
# Explicit minimap2 backend with a chew-compatible mode
gmlst typing cgmlst -s vparahaemolyticus_3 -b minimap2 --cgmlst-mode chew-fast sample.fasta
# FASTQ typing with minimap2 candidate generation and targeted validation
gmlst typing mlst -s saureus_1 -b minimap2 reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
# Ultrafast large-scale cgMLST tuning
GMLST_MINIMAP2_FASTA_SPEED_PROFILE=ultrafast \
gmlst typing cgmlst -s vparahaemolyticus_3 -b minimap2 --cgmlst-mode chew-ultrafast samples/*.fasta -o cgmlst.tsv
FASTA path¶
The FASTA path is where most minimap2 cgMLST acceleration lives.
GMLST_MINIMAP2_FASTA_SPEED_PROFILE=default|fast|ultrafasttunes minimap2 seeding and chaining behavior.GMLST_CGMLST_EXACT_HASH_PREFILTER=1enables DNA exact-match pre-resolution before alignment.GMLST_CGMLST_MINIMAP2_HASH_PREFILTER=1enables hash-first candidate reduction.GMLST_CGMLST_PREFILTER_MAX_LOCIskips prefiltering automatically when the scheme is too large for the configured threshold.GMLST_CGMLST_MINIMAP2_HASH_REFINE_MAX_LOCIadds targeted missing-locus refinement.GMLST_CGMLST_MINIMAP2_HASH_LOCI_TOP_Nlimits how many loci are carried forward from the hash stage.GMLST_CGMLST_MINIMAP2_REPRESENTATIVE_MAIN_ALIGNMENT=1uses representative-only main alignment, useful in ultrafast-style workflows.GMLST_MINIMAP2_FASTA_EMIT_CIGAR=0disables FASTA CIGAR emission, which can help speed in representative-focused workflows.
FASTQ path¶
The FASTQ path has two stages.
- Candidate generation from read mappings and k-mer support scoring.
- Targeted remapping validation on uncertain loci before final allele calls.
Related knobs:
GMLST_MINIMAP2_KMER_ENGINE=python|kmc|autochooses the k-mer support scoring engine.autoprefers KMC when available and otherwise falls back to the built-in Python scorer.- When
samtoolsis installed, targeted validation can write BAM temp files. GMLST_TMPDIRcontrols where temporary files are created.
chewBBACA-compatible cgMLST modes¶
typing cgmlst supports the following mode values:
| Mode | Typical role |
|---|---|
standard |
Conservative default behavior |
chew-fast |
Faster chew-style pipeline with exact-hash, hash prefilter, refinement, and fallback |
chew-ultrafast |
Most aggressive FASTA-oriented throughput mode with representative alignment and second-pass rescue |
chew-bsr |
Adds protein-level exact-hash pre-resolution on top of the fast path |
chew-balanced |
Middle ground between speed and confirmation |
Important: these chew-style optimizations are FASTA-oriented. For FASTQ inputs, typing cgmlst auto-switches -b minimap2 to -b kma, and --cgmlst-mode becomes compatibility-only behavior.
Evidence fallback¶
Low-confidence loci can be sent to a targeted fallback backend.
GMLST_CGMLST_EVIDENCE_FALLBACK_BACKEND=none|blastn|kma|nucmerGMLST_CGMLST_EVIDENCE_FALLBACK_MAX_LOCI=<int>
This is useful when you want minimap2 to handle the main throughput path but still want a limited second opinion on uncertain loci.
MUMmer4 / nucmer (nucmer)¶
nucmer is the sensitive backend for FASTA assemblies when sequence divergence is a concern. It is a good fit for non-standard organisms, cross-species comparisons, and cases where allele matches may be more distant than usual MLST work.
When to use¶
- Distant or divergent assemblies
- Non-standard organisms or exploratory work
- Cross-species comparison where exact reference behavior matters less than sensitive matching
Example commands¶
# Sensitive MLST on a divergent assembly
gmlst typing mlst -s saureus_1 -b nucmer sample.fasta
# cgMLST review on a difficult assembly
gmlst typing cgmlst -s vparahaemolyticus_3 -b nucmer flagged_sample.fasta
Notes¶
- Input type: FASTA only
- Index builder: nucmer reference index workflow
- Threading: limited compared with BLASTN and KMA,
gmlstwarns that thread settings may be ignored - Best role: sensitive fallback for divergent sequences
Backend Selection Guide¶
Quick decision table¶
| If you have... | Choose... | Reason |
|---|---|---|
| Assembled genomes, final reference calls | blastn |
Highest-confidence classic path |
| Paired-end FASTQ reads | kma |
Direct read mapping and strong routine behavior |
| Thousands of cgMLST loci on FASTA | minimap2 |
Best throughput and acceleration features |
| Divergent assemblies | nucmer |
Sensitive matching for distant alleles |
| Fast screening, then selective confirmation | minimap2 or kma, then blastn |
Good speed first, slower validation second |
Simple decision tree¶
- Are your inputs FASTQ reads?
- Yes: start with
kma. - If you specifically want minimap2 FASTQ validation behavior, use
minimap2for MLST. For cgMLST FASTQ, the CLI will still normalize toward KMA. - Are your inputs FASTA assemblies?
- If you want the fastest large cgMLST path, use
minimap2. - If you want a conservative reference path, use
blastn. - If the organism or allele space looks more divergent than usual, try
nucmer.
Performance Tips¶
- Use
-t/--threadson BLASTN, KMA, and minimap2 runs that process large schemes or batches. - Use
--max-workersfor sample-level parallelism when typing many samples. - For large cgMLST on FASTA, start with
minimap2and only fall back on flagged loci or flagged samples. - For FASTQ cgMLST with KMA, avoid single-thread runs unless you are debugging.
- Keep output on disk with
-oduring large runs instead of printing everything to the terminal. - Reuse cached schemes and indexes for repeat typing. The first run pays the indexing cost, later runs are cheaper.
- If temporary storage is slow or cramped, move temp files with
GMLST_TMPDIR.
FASTQ-specific Notes¶
Paired-end auto detection¶
gmlst auto-detects paired FASTQ files by common naming patterns:
sample_R1.fastq.gz+sample_R2.fastq.gzsample_1.fq.gz+sample_2.fq.gzsample.1.fastq.gz+sample.2.fastq.gz
Example:
gmlst typing mlst -s saureus_1 -b kma reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
k-mer support scoring¶
For minimap2 FASTQ typing, GMLST_MINIMAP2_KMER_ENGINE controls the scoring engine:
python, built-in scorerkmc, KMC/KMC toolsauto, prefer KMC when installed, otherwise use Python
Targeted validation¶
minimap2 FASTQ mode is not just a single mapping pass. It builds candidate alleles, scores them, then remaps uncertain loci in a targeted validation stage. That makes it a good choice when you want speed without giving up all post-filter confirmation.
Related call types¶
Backend choice changes how evidence is collected, but per-locus calls still end up in the same five categories:
| Call type | Rule |
|---|---|
exact |
identity 100%, coverage >= 1.0 |
closest |
identity >= 95%, coverage >= 0.95 |
novel |
coverage >= 0.95, identity < 95% |
partial |
coverage > 0 and < 0.95 |
missing |
no match |