Understanding typo detection¶
This page explains the mental model behind the typos analyzer: how it detects typo-fix identifier pairs, what Levenshtein distance measures, and how the aggregation pipeline is structured. For configuration keys and the output schema, see the Typos reference.
What it measures¶
The typos analyzer uses Levenshtein distance to detect typo-fix identifier pairs in commit diffs. By analyzing UAST identifier changes across Git history, it builds a dataset of probable typos and their corrections.
Typo-fix pair detection¶
For each commit, the analyzer:
- Computes the diff between the old and new versions of each changed file
- Identifies delete/insert hunk pairs of equal size (same number of lines)
- Compares corresponding lines using Levenshtein distance
- For line pairs within the distance threshold, extracts UAST identifiers from the old and new versions
- Records a typo-fix pair if exactly one identifier changed between the two versions
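The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the analyzer's code: a regular expression stands in for UAST identifier extraction, "exactly one identifier changed" is interpreted as a set difference between the two lines, and the `Typo` fields and the helper names (`identSet`, `onlyIn`, `singleIdentifierChange`) are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe is a crude stand-in for UAST identifier extraction (assumption):
// it matches any identifier-like token, including language keywords.
var identRe = regexp.MustCompile(`[A-Za-z_][A-Za-z0-9_]*`)

// Typo is a hypothetical record for a typo-fix pair.
type Typo struct {
	Wrong   string
	Correct string
}

// identSet returns the set of identifier-like tokens found on a line.
func identSet(line string) map[string]bool {
	set := make(map[string]bool)
	for _, tok := range identRe.FindAllString(line, -1) {
		set[tok] = true
	}
	return set
}

// onlyIn returns the keys of a that are absent from b.
func onlyIn(a, b map[string]bool) []string {
	var out []string
	for k := range a {
		if !b[k] {
			out = append(out, k)
		}
	}
	return out
}

// singleIdentifierChange reports a typo-fix pair when exactly one identifier
// disappears from the old line and exactly one new identifier appears.
func singleIdentifierChange(oldLine, newLine string) (Typo, bool) {
	oldIDs, newIDs := identSet(oldLine), identSet(newLine)
	removed := onlyIn(oldIDs, newIDs)
	added := onlyIn(newIDs, oldIDs)
	if len(removed) != 1 || len(added) != 1 {
		return Typo{}, false // zero or several identifiers changed: skip
	}
	return Typo{Wrong: removed[0], Correct: added[0]}, true
}

func main() {
	pair, found := singleIdentifierChange(
		"total := calcualte(items)",
		"total := calculate(items)",
	)
	fmt.Println(pair.Wrong, pair.Correct, found)

	// Several identifiers differ here, so no pair is recorded.
	_, found = singleIdentifierChange("a := foo(b)", "x := bar(y)")
	fmt.Println(found)
}
```

Tokens shared by both lines, including keywords, cancel out in the set difference, so only the renamed identifier remains.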
Levenshtein distance¶
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. A small distance between the old and new lines suggests a typo fix rather than a semantic change.
Example
- `recieve` to `receive` -- distance 2 (transposed characters)
- `lenght` to `length` -- distance 2 (transposed characters)
- `calcualte` to `calculate` -- distance 2 (transposed characters)
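For reference, the distance can be computed with the textbook two-row dynamic-programming algorithm. The sketch below is a generic implementation, not necessarily the one the analyzer uses; the names `levenshtein` and `min3` are illustrative. Plain Levenshtein distance has no transposition operation, so a swapped pair of adjacent characters costs two substitutions, which is why each example above has distance 2.

```go
package main

import "fmt"

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev is the previous DP row: prev[j] is the distance between the
	// first i-1 runes of a and the first j runes of b.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(
				prev[j]+1,      // deletion
				curr[j-1]+1,    // insertion
				prev[j-1]+cost, // substitution or match
			)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	pairs := [][2]string{
		{"recieve", "receive"},
		{"lenght", "length"},
		{"calcualte", "calculate"},
	}
	for _, p := range pairs {
		fmt.Printf("%s -> %s: distance %d\n", p[0], p[1], levenshtein(p[0], p[1]))
	}
}
```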
Architecture¶
The typos analyzer follows the TC/Aggregator pattern:
- Consume phase: For each commit, `Consume()` computes diffs, identifies line pairs within Levenshtein distance, and extracts UAST identifier changes. Per-commit typos are returned as `TC{Data: []Typo}`. The analyzer retains no per-commit state; only the `lcontext` (Levenshtein context) is kept as working state.
- Aggregation phase: A `typos.Aggregator` collects TCs into a `SliceSpillStore[Typo]`. `FlushTick()` deduplicates typos by `wrong|correct` key (keeping the first occurrence), returning a `TickData` with the unique set.
- Serialization phase: `SerializeTICKs()` assembles all tick data into an `analyze.Report{"typos": allTypos}`, then delegates to `ComputeAllMetrics()` for JSON, YAML, binary, or HTML plot output.
This separation enables streaming output, budget-aware memory spilling, and decoupled aggregation.
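The deduplication rule in the aggregation phase can be illustrated with a small standalone sketch. It does not reproduce `FlushTick()` or the spill store: the `Typo` fields (`Wrong`, `Correct`, `Commit`) and the `dedupe` helper are assumptions made for illustration, and only the `wrong|correct` key and the keep-first behavior come from the description above.

```go
package main

import "fmt"

// Typo is a hypothetical record; the real struct may carry other fields.
type Typo struct {
	Wrong   string
	Correct string
	Commit  string
}

// dedupe keeps the first occurrence of each wrong|correct key and
// preserves the input order of the survivors.
func dedupe(typos []Typo) []Typo {
	seen := make(map[string]struct{}, len(typos))
	unique := make([]Typo, 0, len(typos))
	for _, t := range typos {
		key := t.Wrong + "|" + t.Correct
		if _, dup := seen[key]; dup {
			continue // a later fix of the same typo is dropped
		}
		seen[key] = struct{}{}
		unique = append(unique, t)
	}
	return unique
}

func main() {
	typos := []Typo{
		{Wrong: "recieve", Correct: "receive", Commit: "c1"},
		{Wrong: "lenght", Correct: "length", Commit: "c2"},
		{Wrong: "recieve", Correct: "receive", Commit: "c3"},
	}
	for _, t := range dedupe(typos) {
		fmt.Printf("%s -> %s (first seen in %s)\n", t.Wrong, t.Correct, t.Commit)
	}
}
```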
Use cases¶
- Typo dataset building: Build a corpus of real-world typos and their corrections from your project's history. The corpus can be used to train spell-checking tools or IDE plugins.
- Code quality auditing: Identify patterns of common misspellings in your codebase to add to a linting dictionary.
- API consistency: Detect identifier typos that may cause confusion (e.g., `getUserByNmae` vs `getUserByName`).
- Automated fix suggestions: Use the typo dataset to build automated correction rules for CI pipelines.
- Research: Academic research on developer typo patterns and their prevalence across different languages.
Limitations¶
- UAST required: Only languages with UAST parser support are analyzed. Identifiers in unsupported languages are not extracted.
- Single-identifier changes only: The analyzer only records a typo when exactly one identifier changes between the old and new lines. Multi-identifier changes are skipped to avoid false positives.
- Equal-length hunks only: Only delete/insert hunk pairs with the same number of lines are considered. A typo fix that also adds or removes lines will be missed.
- False positives: Intentional identifier renames with small Levenshtein distance (e.g., `idx` to `jdx`) will be reported as typos.
- Deduplication: Typo pairs are deduplicated by the `wrong|correct` key. The same typo fixed in multiple commits is reported only once.
- CPU intensive: Like all UAST-based analyzers, the typos analyzer parses both file versions for every changed file in every commit.
See also¶
- Typos reference — configuration keys and output schema.
- Quick start — run history analysis.