Large-Scale Repository Scanning¶
This guide covers using Codefang to periodically scan large heterogeneous codebases — thousands of repositories — and feed results into a data warehouse for analytics.
Overview¶
Codefang operates on one repository at a time. Scanning many repositories requires external orchestration that invokes codefang run per repo and collects the JSON output.
graph LR
A[Repo List] --> B[Orchestrator]
B --> C1[codefang run repo1]
B --> C2[codefang run repo2]
B --> C3[codefang run repoN]
C1 --> D[JSON Output]
C2 --> D
C3 --> D
D --> E[S3 / Object Storage]
E --> F[Data Warehouse]

The streaming pipeline, checkpointing, and bounded-memory execution make each individual run safe on large repositories without manual tuning.
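A minimal serial sketch of that loop, assuming a newline-delimited list of repository paths (repos.txt and the out/ directory are placeholders):

# Serial orchestration sketch: one codefang run per listed repository
mkdir -p out
while read -r repo; do
  name=$(basename "$repo" .git)
  codefang run "$repo" -a history/* --format json --silent > "out/${name}.json"
done < repos.txt

The Orchestration section below shows parallel variants of the same loop.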
Bare Repository Support¶
Codefang uses libgit2, which opens both normal and bare repositories transparently, so no cloning is required.
# Point directly at a bare repo
codefang run /data/backups/repositories/@hashed/ab/cd/abcdef.git \
-a history/* --format json --silent
This is relevant for:

- GitLab backups - contain bare repos under repositories/@hashed/
- Gitolite mirrors - bare repos by convention
- Mirrors created with git clone --bare
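Before scanning, each candidate path can be validated with plain git; the path below is illustrative:

# Prints "true" for a bare repository, "false" for a normal checkout
git -C /data/backups/repositories/@hashed/ab/cd/abcdef.git rev-parse --is-bare-repository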
GitLab Backup Layout¶
GitLab backup tarballs contain bare repositories at predictable paths:
backup.tar
├── repositories/
│ └── @hashed/
│ ├── ab/cd/<sha256>.git # bare repo
│ ├── ab/cd/<sha256>.wiki.git # wiki (skip)
│ └── ...
└── db/
└── database.sql.gz # project name mapping
Extract and scan:
tar xf backup.tar -C /data/extract repositories/
find /data/extract/repositories -name "*.git" \
! -name "*.wiki.git" \
! -name "*.design.git" \
-type d > /tmp/repo_paths.txt
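The hashed directory names do not reveal project names. GitLab's hashed storage derives each path from the SHA-256 of the numeric project ID, so a mapping can be computed once the IDs are extracted from database.sql.gz; a sketch with a hypothetical project ID:

# Hashed storage layout: @hashed/<h[0:2]>/<h[2:4]>/<h>.git, where h = sha256(project_id)
project_id=1234
h=$(printf '%s' "$project_id" | sha256sum | awk '{print $1}')
echo "repositories/@hashed/${h:0:2}/${h:2:4}/${h}.git"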
Output Format for DWH Ingestion¶
Use --format json for structured output suitable for data warehouse loading:
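For example, writing one repository's results to a local file:

codefang run /path/to/repo -a history/* --format json --silent > results.json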
Clean output
The --silent flag suppresses progress output on stderr, keeping stdout clean for piping.
Adding Metadata¶
Wrap the output with repository metadata using jq before uploading. The -c flag keeps each wrapped record on a single line, which line-oriented loaders such as Athena's JsonSerDe expect:
codefang run "$REPO_PATH" -a history/* --format json --silent \
  | jq -c --arg repo "$REPO_NAME" --arg date "$(date -u +%Y-%m-%d)" \
    '{repo: $repo, scan_date: $date, results: .}'
TimeSeries Format¶
For time-series analytics, use --format timeseries to merge all analyzer outputs into a single chronologically-ordered JSON array:
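The invocation mirrors the JSON one; the output file name is arbitrary:

codefang run /path/to/repo -a history/* --format timeseries --silent > timeseries.json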
Orchestration¶
GNU Parallel (Simple)¶
For a flat list of repository paths:
REPOS="/tmp/repo_paths.txt"
S3_BUCKET="s3://analytics/codefang"
DATE=$(date +%Y-%m-%d)
mkdir -p /tmp/codefang-logs

cat "$REPOS" | parallel -j 8 --joblog /tmp/codefang-jobs.log '
  REPO_NAME=$(basename {} .git)
  codefang run {} \
    -a history/* \
    --format json \
    --memory-budget 4GiB \
    --workers 4 \
    --silent \
    2>/tmp/codefang-logs/${REPO_NAME}.log \
    | aws s3 cp - '"${S3_BUCKET}/${DATE}"'/${REPO_NAME}.json
'
This produces s3://analytics/codefang/2025-01-15/repo-name.json, partitioned by date.
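The joblog doubles as a retry manifest: GNU Parallel can replay only the jobs that exited non-zero, reading the original command lines back from the log:

# Rerun failed jobs recorded in the joblog; successful jobs are skipped
parallel --retry-failed --joblog /tmp/codefang-jobs.log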
Kubernetes Jobs¶
For large fleets (thousands of repos), use Kubernetes Jobs or Argo Workflows:
apiVersion: batch/v1
kind: Job
metadata:
  name: codefang-scan-myrepo
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
        - name: codefang
          image: codefang:latest
          command:
            - sh
            - -c
            - |
              codefang run /workspace \
                -a history/* \
                --format json \
                --memory-budget 4GiB \
                --workers 4 \
                --checkpoint \
                --silent \
                | aws s3 cp - s3://bucket/results.json
          volumeMounts:
            - name: repo
              mountPath: /workspace
              readOnly: true
            - name: checkpoint
              mountPath: /tmp/codefang-checkpoints
          resources:
            requests:
              cpu: "4"
              memory: 6Gi
            limits:
              memory: 8Gi
      volumes:
        - name: repo
          persistentVolumeClaim:
            claimName: repo-myrepo
        - name: checkpoint
          emptyDir: {}
      restartPolicy: OnFailure
Key considerations

- Resource requests: 4 CPU / 6 Gi memory per pod handles most repositories
- Checkpoint volume: use emptyDir for ephemeral runs, or a PVC for crash recovery
- Parallelism: 50-100 concurrent pods is typical for 60k repositories
- Scheduling: use a CronJob or Argo CronWorkflow for periodic scans
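Generating one Job per repository can be scripted from the repo list; a hypothetical sketch, assuming the manifest above is saved as job-template.yaml with myrepo as the substitution placeholder:

# Stamp out one Job per repository path
while read -r repo_path; do
  # Lowercase and strip characters that are invalid in DNS-1123 names
  name=$(basename "$repo_path" .git | tr 'A-Z._' 'a-z--')
  sed "s|myrepo|${name}|g" job-template.yaml | kubectl apply -f -
done < /tmp/repo_paths.txt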
Memory Budget and Streaming¶
The --memory-budget flag controls how Codefang splits a repository's commit history into chunks. See Streaming Pipeline for details.
| Repository Size | Memory Budget | Expected Behavior |
|---|---|---|
| < 1k commits | 2 GiB | Single chunk, no hibernation |
| 1k-10k commits | 4 GiB | 2-10 chunks |
| 10k-100k commits | 4-8 GiB | 10-100 chunks with checkpoints |
| 100k+ commits | 8 GiB | Many chunks, checkpointing essential |
When unset, the budget defaults to 50% of system memory (capped at 4 GiB).
Incremental Scanning with --since¶
For periodic scans, use --since to analyze only new commits since the last run:
# Only commits from the last 7 days
codefang run /path/to/repo -a history/* --format json --since 168h
# Only commits after a specific date
codefang run /path/to/repo -a history/* --format json --since 2025-01-01
# RFC3339 timestamp
codefang run /path/to/repo -a history/* --format json --since 2025-01-01T00:00:00Z
| Format | Example | Meaning |
|---|---|---|
| Go duration | 168h | 7 days before now |
| Date only | 2025-01-01 | Midnight UTC on that date |
| RFC3339 | 2025-01-01T00:00:00Z | Exact timestamp |
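For fleet-wide periodic scans, the cutoff can be derived from the newest date partition already in the bucket; a sketch assuming the layout produced by the GNU Parallel example above:

# Use the most recent date partition as the incremental cutoff
last=$(aws s3 ls s3://analytics/codefang/ | awk '{print $2}' | tr -d '/' | sort | tail -1)
codefang run /path/to/repo -a history/* --format json --since "${last:-2020-01-01}"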
Checkpointing and Crash Recovery¶
Checkpointing is enabled by default. After each fully processed chunk, the pipeline saves analyzer state to disk. If a run is interrupted (OOM kill, pod eviction, timeout), the next invocation with --resume automatically restarts from the last completed chunk.
codefang run /path/to/repo \
-a history/* \
--format json \
--checkpoint \
--resume \
--checkpoint-dir /tmp/codefang-checkpoints
Kubernetes
Mount the checkpoint directory on a PVC to survive pod restarts.
DWH Loading¶
Amazon Athena¶
Upload JSON to S3 with date partitioning, then create an external table:
CREATE EXTERNAL TABLE codefang_results (
repo STRING,
scan_date STRING,
results STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://analytics/codefang/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE codefang_results;
Snowflake¶
CREATE STAGE codefang_stage URL='s3://analytics/codefang/'
CREDENTIALS=(AWS_KEY_ID='...' AWS_SECRET_KEY='...');
COPY INTO codefang_raw FROM @codefang_stage
FILE_FORMAT=(TYPE=JSON);
-- Flatten nested analyzer results
SELECT
raw:repo::STRING AS repo,
raw:scan_date::DATE AS scan_date,
f.value:author::STRING AS author,
f.value:commits::INT AS commits
FROM codefang_raw,
LATERAL FLATTEN(input => raw:results:devs:authors) f;
Spark / AWS Glue¶
from pyspark.sql.functions import explode

df = spark.read.json("s3://analytics/codefang/2025-01-15/")
devs_df = df.select("repo", "scan_date", explode("results.devs.authors").alias("author"))
devs_df.write.partitionBy("scan_date").parquet("s3://warehouse/codefang/devs/")
Observability¶
Codefang includes OpenTelemetry support for monitoring scan fleet health. See Observability for full details.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
codefang run /path/to/repo -a history/* --format json --silent
| Metric | Use |
|---|---|
| codefang.request.duration.seconds | Identify slow repositories |
| codefang.errors.total | Track failure rate across the fleet |
| codefang.cache.hits / codefang.cache.misses | Tune cache sizes |
Sizing Estimates¶
| Parameter | Typical Value |
|---|---|
| Per-repo runtime (4 cores, 4 GiB) | 1-10 min |
| JSON output per repo | 100 KB - 50 MB |
| 60k repos at 100 concurrent pods | 10-100 hours per full scan |
| Total S3 per full scan | 10-100 GB |
| Incremental scan (7 days) | 5-20x faster than full |
The primary bottleneck is I/O (reading Git objects). Bare repos on local SSD or NVMe-backed EBS deliver the best throughput.