Upcoming release¶
These notes describe the upcoming first release line of Codefang. No v* tag has been cut yet, so this page tracks everything staged in the [Unreleased] section of the changelog. When the first tag ships, the GoReleaser pipeline in the CI/CD workflow publishes the matching GitHub Release, and these notes are promoted to a versioned page.
Highlights¶
- Codefang is a complete rewrite of the original
src-d/herculesproject into a modern, idiomatic Go codebase with clean architecture and comprehensive tests. - The tool ships as two composable binaries:
uast(a Tree-sitter-based Universal AST parser for 60+ languages) andcodefang(the static and history analysis engine). - Several output formats target both humans and data warehouses:
timeseries,plot(interactive HTML), andndjson(streaming, warehouse-friendly). - An MCP server (
codefang mcp) exposes analysis to AI agents over the Model Context Protocol. - A streaming pipeline with checkpointing and crash recovery makes bounded-memory analysis of very large repositories practical.
- OpenTelemetry observability provides distributed tracing, RED metrics, and structured logging.
Upgrade notes¶
This is the first release line, so there is no prior version to upgrade from. If you tracked an unreleased build, note these behavior changes before adopting the release:
- Default file filtering changed. Analysis now excludes vendored and generated files by default, matching mature multi-language analyzers. To restore the previous behavior, pass
--include-vendored --include-generated. --languagesis now cross-phase. It narrows both static and history analysis through a single source of truth and fails fast on unknown language tokens. Pipelines that relied on static analysis ignoring this flag must drop it or supply valid tokens.- Output schema changes. Several fields moved from maps to sorted arrays so they can be UNNEST'd in columnar warehouses:
developers[].languages,activity[].by_developer, andfile_contributors[].contributors. Clone-pairfunc_a/func_bpaths are now relative. Update any consumers that read these fields. - Deprecated flags.
--skip-blacklistis now a no-op, and--blacklisted-prefixesis superseded by--extra-excluded-prefixes(identical semantics). Both emit a deprecation warning when passed.
Added¶
- Temporal anomaly detection analyzer (
history/anomaly) using Z-score statistical analysis over sliding windows to detect sudden quality degradation in commit history. - TimeSeries output format (
--format timeseries) that merges all analyzer outputs into a single chronologically ordered JSON array. - Plot output format (
--format plot) generating interactive HTML charts via go-echarts. - NDJSON output format (
--format ndjson) emitting one JSON line per analyzer result (with an optional metadata line) for streaming ingestion into warehouses such as ClickHouse. - MCP server (
codefang mcp) exposing analysis capabilities as tools for AI agents via the Model Context Protocol (stdio transport, JSON-RPC 2.0), with thecodefang_analyze,uast_parse, andcodefang_historytools. - Docker support with a multi-stage Dockerfile and a
debian:bookworm-slimruntime image that runs as a non-rootcodefanguser by default. - GitHub Actions integration (
action.yml) for automated code-quality checks in CI pipelines with configurable analyzers, formats, and quality gates. - OpenTelemetry observability with distributed tracing, RED metrics (rate, errors, duration), and structured logging with trace-context injection. Supports Jaeger, Prometheus, and OTLP collectors.
- Streaming pipeline for bounded-memory processing of large repositories via chunk-based processing with hibernate/boot cycles.
- Double-buffered chunk pipelining that overlaps pipeline prefetch with analyzer consumption for higher throughput on multi-chunk workloads.
- Checkpointing with automatic save after each processed chunk and crash recovery via the
--resumeflag. Checkpoint format v2 preserves aggregator spill state so resumed runs produce identical output. - Configuration system (
.codefang.yaml) with file, environment-variable, and CLI-flag support. Merge priority: CLI > env > file > defaults. - Large-scale scanning support for fleet analysis of thousands of repositories, with bare-repo support, GNU Parallel and Kubernetes orchestration patterns, and data-warehouse loading guides (Athena, Snowflake, Spark).
- Deep context propagation that eliminates
context.Background()calls in production hot paths for end-to-end tracing. - Attribute-filter span processor that enforces an allow-list of attribute key prefixes to prevent PII leakage to collectors.
- Health-check endpoints (
/healthz,/readyz,/metrics) for server-mode deployments. - Incremental scanning with the
--sinceflag, supporting Go durations, date strings, and RFC 3339 timestamps. - Memory-budget auto-tuning via the
--memory-budgetflag with automatic chunk-size calculation. - Watchdog stall detection with a configurable
worker_timeoutfor identifying hung workers. --include-vendoredflag (bool, defaultfalse) to re-include paths detected as vendored by enry/Linguist (vendor/,node_modules/,third_party/,testdata/,dist/, minified bundles, and more).--include-generatedflag (bool, defaultfalse) to re-include auto-generated files (*.pb.go,zz_generated_*.go,*_pb2.py,*.min.js, and content-header markers such asDO NOT EDIT,Code generated,@generated).--extra-excluded-prefixesflag (strings, default[]) for additional UNIX path prefixes to exclude for ecosystems enry does not know (for example.venv/,target/,.gradle/).source_filefield on every function record (relative path, for examplepkg/kubelet/kubelet.go) forstatic/complexity,static/halstead,static/cohesion, andstatic/comments.languagefield on every function record (for examplego,bash) for the same static analyzers.directoryfield on every function record (for examplepkg/kubelet) for the same static analyzers.start_timeandend_timefields (RFC 3339) on every history time-series tick forhistory/sentiment,history/anomaly,history/quality,history/devs, andhistory/file-history.emailfield on developer records, plusprimary_dev_email/secondary_dev_emailon bus-factor records anddeveloper1_email/developer2_emailon developer-coupling records, via a newSplitIdentityhelper.- Top-level
metadatasection in the report envelope withrepo_path,repo_name,analyzed_at(RFC 3339), andcodefang_version. - Per-analyzer
schemamanifest in each analyzer result with fieldtype,grain, anddescriptionfor automated ETL generation across all 17 analyzers. clone_type_distributioncomputed from the full pair population rather than the capped 1,000-pair sample, so Type-½/3 percentages are accurate on large codebases.
Changed¶
- Complete rewrite from the original
src-d/herculesproject into a modern Go codebase with idiomatic patterns, clean architecture, and comprehensive test coverage. - Split the tool into two binaries:
uast(Universal AST parser using Tree-sitter for 60+ languages) andcodefang(analysis engine for static and history analysis). Unix philosophy: small tools joined by pipes. - Replaced the original Babelfish/bblfsh parser with a Tree-sitter-based UAST that is faster, more reliable, and locally compiled, supporting 60+ programming languages.
- Added a DSL-based UAST mapping layer with a custom domain-specific language for transforming Tree-sitter ASTs into standardized UAST nodes.
- Vendored libgit2 (
third_party/libgit2), compiled as a static library for reproducible builds without external dependencies. - Replaced all
log.Printfandfmt.Printfcalls with structured logging vialog/slog. Instance loggers are enforced bysloglint. - Removed all backward-compatibility fallbacks in the quality, anomaly, sentiment, and devs
ParseReportDatapaths; formalized shotness shallow extraction; and integrated VADER (GoVader) for real sentiment analysis. - Default analysis output across both phases now excludes vendor and generated files, matching the convention of mature multi-language analyzers. Pass
--include-vendored --include-generatedto restore the previous behavior. Breaking change. - Made
--languagescross-phase. Static analysis previously ignored the flag; it now narrows both static and history phases through a singleinternal/analyzers/plumbing/langpathsource of truth, with fail-fast errors on unknown language tokens. - Pushed the
--languagesfilter down into libgit2 via a newcf_tree_diff_v2C ABI that forwards a pathspec togit_diff_options.pathspec, cutting tree-diff work on polyglot repositories with a narrow filter (about 34% wall-time and 36%cgocallCPU reduction on a synthetic fixture). - Replaced the
FileContentAnalyzer+WalksAllFilesmarker-interface pattern with explicitStaticAnalyzer(UAST) andRawFileAnalyzer(raw file) hierarchies sharing aFormattableAnalyzerbase, driven by explicit pipeline stages. This enabled relative source paths and per-record language stamping. - Changed
developers[].languagesfrom a map keyed by language name to a sorted[]LanguageStatsEntryarray so it can be UNNEST'd in columnar warehouses. Empty language strings becomeOther. Breaking change to output schema. - Changed
activity[].by_developerfrom amap[int]intto a sorted[]DeveloperCommitsarray of{dev_id, commits}. Breaking change to output schema. - Changed
file_contributors[].contributorsfrom amap[int]LineStatsto a sorted[]ContributorEntryarray of{dev_id, added, removed, changed}. Breaking change to output schema. - Changed clone-pair
func_a/func_bpaths from absolute to relative.
Deprecated¶
--skip-blacklistis now a no-op (the new default already excludes vendor and generated files); a Cobra deprecation warning fires when it is passed.--blacklisted-prefixesis superseded by--extra-excluded-prefixes(identical semantics); a Cobra deprecation warning fires when it is passed.
Removed¶
- Removed
// FRD: specs/frds/FRD-...mdcomments from all.gofiles;specs/is gitignored, so those references broke for anyone cloning the repo. Traceability now lives in FRDs and PR descriptions.
Fixed¶
- Fixed a data race in
internal/framework.PipelineSampler:t1Capturedwas a plainboolread by the sampler goroutine and written by the caller, causing intermittent races undergo test -race. It is now async/atomic.BoolusingCompareAndSwap, so at most one t1 heap profile is captured. Removed the unusedt0Capturedfield.
Security¶
- No security fixes are staged for this release.
Known issues¶
- No tagged release exists yet, so the changelog
[Unreleased]reference link points at the commits view rather than a version compare base. The link is replaced with a compare base once the firstv*tag ships. - Several output-schema changes listed under Changed are breaking. Consumers that parsed the previous map-shaped fields must migrate to the sorted-array shapes before adopting this release.