Typos Analyzer
The typos analyzer detects typo-fix identifier pairs from source code in commit diffs using Levenshtein distance. It builds a dataset of probable typos and their corrections by analyzing UAST identifier changes across Git history.
Quick Start
With a custom distance threshold:
Requires UAST
The typos analyzer needs UAST support to extract identifiers from source code. It is automatically enabled when the UAST pipeline is available.
Architecture
The typos analyzer follows the TC/Aggregator pattern:
- Consume phase: For each commit, `Consume()` computes diffs, identifies line pairs within the Levenshtein distance threshold, and extracts UAST identifier changes. Per-commit typos are returned as `TC{Data: []Typo}`. The analyzer retains no per-commit state; only the `lcontext` (Levenshtein context) is kept as working state.
- Aggregation phase: A `typos.Aggregator` collects TCs into a `SliceSpillStore[Typo]`. `FlushTick()` deduplicates typos by `wrong|correct` key (keeping the first occurrence), returning a `TickData` with the unique set.
- Serialization phase: `SerializeTICKs()` assembles all tick data into an `analyze.Report{"typos": allTypos}`, then delegates to `ComputeAllMetrics()` for JSON, YAML, binary, or HTML plot output.
This separation enables streaming output, budget-aware memory spilling, and decoupled aggregation.
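The first-occurrence deduplication described for the aggregation phase can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the `Typo` struct simply mirrors the fields shown in this page's example output, and `dedupe` is a hypothetical helper name.

```go
package main

// Typo mirrors the fields of the example output on this page.
// The analyzer's real struct may differ (assumption).
type Typo struct {
	Wrong   string
	Correct string
	File    string
	Commit  string
	Line    int
}

// dedupe is a hypothetical helper. It keeps the first occurrence of each
// wrong|correct pair and preserves the input order of the survivors.
func dedupe(typos []Typo) []Typo {
	seen := make(map[string]struct{}, len(typos))
	unique := make([]Typo, 0, len(typos))
	for _, t := range typos {
		key := t.Wrong + "|" + t.Correct
		if _, ok := seen[key]; ok {
			continue // a later duplicate of an already recorded pair
		}
		seen[key] = struct{}{}
		unique = append(unique, t)
	}
	return unique
}
```

Because only the first occurrence survives, the `file`, `commit`, and `line` of later fixes of the same pair are dropped in this sketch.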
What It Measures
Typo-Fix Pair Detection
For each commit, the analyzer:
- Computes the diff between the old and new versions of each changed file
- Identifies delete/insert hunk pairs of equal size (same number of lines)
- Compares corresponding lines using Levenshtein distance
- For line pairs within the distance threshold, extracts UAST identifiers from the old and new versions
- Records a typo-fix pair if exactly one identifier changed between the two versions
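The final step, accepting a line pair only when exactly one identifier changed, might look like the sketch below. It assumes the identifiers of the old and new line have already been extracted (UAST extraction is out of scope here), and `singleIdentifierChange` is a hypothetical name. The set-difference rule is one plausible reading of "exactly one identifier changed", not necessarily the analyzer's exact logic.

```go
package main

// toSet converts a slice of identifiers into a set.
func toSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// singleIdentifierChange is a hypothetical helper. It returns the
// (wrong, correct) pair when exactly one identifier disappeared from the
// old line and exactly one new identifier appeared on the new line.
func singleIdentifierChange(oldIDs, newIDs []string) (wrong, correct string, ok bool) {
	oldSet, newSet := toSet(oldIDs), toSet(newIDs)

	var removed, added []string
	for id := range oldSet {
		if _, found := newSet[id]; !found {
			removed = append(removed, id)
		}
	}
	for id := range newSet {
		if _, found := oldSet[id]; !found {
			added = append(added, id)
		}
	}

	// Anything other than a one-for-one swap is skipped.
	if len(removed) != 1 || len(added) != 1 {
		return "", "", false
	}
	return removed[0], added[0], true
}
```

For example, the identifier lists `resp, recieve, err` and `resp, receive, err` yield the pair `recieve`/`receive`, while a line where two identifiers changed is rejected.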
Levenshtein Distance
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. A small distance between the old and new line suggests a typo fix rather than a semantic change.
Example
- `recieve` to `receive`: distance 2 (transposed characters)
- `lenght` to `length`: distance 2 (transposed characters)
- `calcualte` to `calculate`: distance 2 (transposed characters)
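For reference, the distance can be computed with the standard two-row dynamic-programming method. The sketch below is a generic textbook implementation, not the analyzer's own code, and `levenshtein` and `min3` are illustrative names. Under plain Levenshtein distance a swap of two adjacent characters counts as two substitutions, which is why each transposition above has distance 2.

```go
package main

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b. It works on runes,
// so multi-byte characters count as one character each.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(
				prev[j]+1,      // deletion
				curr[j-1]+1,    // insertion
				prev[j-1]+cost, // substitution or match
			)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}
```

Calling `levenshtein("recieve", "receive")` returns 2, matching the first example above.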
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| `TyposDatasetBuilder.MaximumAllowedDistance` | `int` | `4` | Maximum Levenshtein distance between two lines to consider them a typo-fix candidate. Lower values produce fewer but higher-confidence results. |
Tuning the distance
- Distance 1-2: Very high confidence. Catches single-character typos and adjacent-character transpositions (a transposition costs 2 edits).
- Distance 3-4 (default is 4): Good balance. Also catches short misspellings that need several edits.
- Distance 5+: Lower confidence. May produce false positives from intentional renames.
Example Output

    {
      "typos": [
        {
          "wrong": "recieve",
          "correct": "receive",
          "file": "pkg/api/handler.go",
          "commit": "a1b2c3d4e5f6...",
          "line": 42
        },
        {
          "wrong": "calcualte",
          "correct": "calculate",
          "file": "pkg/math/stats.go",
          "commit": "f6e5d4c3b2a1...",
          "line": 15
        },
        {
          "wrong": "reponse",
          "correct": "response",
          "file": "pkg/api/client.go",
          "commit": "1a2b3c4d5e6f...",
          "line": 88
        }
      ]
    }
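A downstream tool could load the JSON output with Go's standard `encoding/json` package, as in the sketch below. The `Typo` and `Report` types and the `parseReport` function are hypothetical names whose fields are inferred from the example output above. They are not the analyzer's exported types.

```go
package main

import "encoding/json"

// Typo matches one entry of the "typos" array in the example output.
type Typo struct {
	Wrong   string `json:"wrong"`
	Correct string `json:"correct"`
	File    string `json:"file"`
	Commit  string `json:"commit"`
	Line    int    `json:"line"`
}

// Report is the assumed top-level shape of the JSON output.
type Report struct {
	Typos []Typo `json:"typos"`
}

// parseReport decodes a JSON report produced by the typos analyzer.
func parseReport(data []byte) (Report, error) {
	var r Report
	if err := json.Unmarshal(data, &r); err != nil {
		return Report{}, err
	}
	return r, nil
}
```

Fields not listed in the struct are ignored by `json.Unmarshal`, so this reader tolerates extra keys if the real output carries more data than shown here.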
Use Cases
- Typo dataset building: Build a corpus of real-world typos and their corrections from your project's history. This can train spell-checking tools or IDE plugins.
- Code quality auditing: Identify patterns of common misspellings in your codebase to add to a linting dictionary.
- API consistency: Detect identifier typos that may cause confusion (e.g., `getUserByNmae` vs `getUserByName`).
- Automated fix suggestions: Use the typo dataset to build automated correction rules for CI pipelines.
- Research: Academic research on developer typo patterns and their prevalence across different languages.
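As one illustration of the linting-dictionary use case, the sketch below turns a list of typo pairs into sorted `wrong->correct` lines. The line format and the `dictionaryLines` name are assumptions made for this example, so adapt them to whatever your linter expects.

```go
package main

import "sort"

// Typo holds only the two fields this sketch needs.
type Typo struct {
	Wrong   string
	Correct string
}

// dictionaryLines is a hypothetical helper. It emits one "wrong->correct"
// line per distinct misspelling, keeping the first correction seen for
// each one, and returns the lines in sorted order for stable diffs.
func dictionaryLines(typos []Typo) []string {
	first := make(map[string]string, len(typos))
	for _, t := range typos {
		if _, ok := first[t.Wrong]; !ok {
			first[t.Wrong] = t.Correct
		}
	}
	lines := make([]string, 0, len(first))
	for wrong, correct := range first {
		lines = append(lines, wrong+"->"+correct)
	}
	sort.Strings(lines)
	return lines
}
```

Because intentional renames can appear in the dataset, review the generated entries before using them as automated correction rules.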
Limitations
- UAST required: Only languages with UAST parser support are analyzed. Identifiers in unsupported languages are not extracted.
- Single-identifier changes only: The analyzer only records a typo when exactly one identifier changes between the old and new lines. Multi-identifier changes are skipped to avoid false positives.
- Equal-length hunks only: Only delete/insert hunk pairs with the same number of lines are considered. A typo fix that also adds or removes lines will be missed.
- False positives: Intentional identifier renames with small Levenshtein distance (e.g., `idx` to `jdx`) will be reported as typos.
- Deduplication: Typo pairs are deduplicated by the `wrong|correct` key. The same typo fixed in multiple commits is reported only once.
- CPU intensive: Like all UAST-based analyzers, the typos analyzer parses both file versions for every changed file in every commit.