Large-Scale Repository Scanning

This guide covers using Codefang to periodically scan large, heterogeneous codebases (thousands of repositories) and feed the results into a data warehouse for analytics.

Overview

Codefang operates on one repository at a time. Scanning many repositories requires external orchestration that invokes codefang run per repo and collects the JSON output.

graph LR
    A[Repo List] --> B[Orchestrator]
    B --> C1[codefang run repo1]
    B --> C2[codefang run repo2]
    B --> C3[codefang run repoN]
    C1 --> D[JSON Output]
    C2 --> D
    C3 --> D
    D --> E[S3 / Object Storage]
    E --> F[Data Warehouse]

The streaming pipeline, checkpointing, and bounded-memory execution make each individual run safe on large repositories without manual tuning.

Bare Repository Support

Codefang uses libgit2, which opens both normal and bare repositories transparently. No cloning is required.

# Point directly at a bare repo
codefang run /data/backups/repositories/@hashed/ab/cd/abcdef.git \
  -a 'history/*' --format json --silent

This is relevant for:

  • GitLab backups: bare repos under repositories/@hashed/
  • Gitolite mirrors: bare repos by convention
  • Mirrors created with git clone --bare

GitLab Backup Layout

GitLab backup tarballs contain bare repositories at predictable paths:

backup.tar
├── repositories/
│   └── @hashed/
│       ├── ab/cd/<sha256>.git        # bare repo
│       ├── ab/cd/<sha256>.wiki.git   # wiki (skip)
│       └── ...
└── db/
    └── database.sql.gz               # project name mapping

Extract and scan:

mkdir -p /data/extract
tar xf backup.tar -C /data/extract repositories/

find /data/extract/repositories -name "*.git" \
  ! -name "*.wiki.git" \
  ! -name "*.design.git" \
  -type d > /tmp/repo_paths.txt
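
The database dump is the authoritative map from hashed paths back to project names. If you only need the forward direction, GitLab's hashed storage derives the path from the SHA256 of the decimal project ID; a minimal sketch (the project ID is a placeholder):

# Sketch: compute the hashed-storage path for a GitLab project ID.
# Assumes GitLab's standard hashed storage scheme (SHA256 of the
# decimal project ID); IDs come from the restored database.sql.gz.
project_id=1234   # placeholder
h=$(printf '%s' "$project_id" | sha256sum | awk '{print $1}')
echo "repositories/@hashed/${h:0:2}/${h:2:2}/${h}.git"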

Output Format for DWH Ingestion

Use --format json for structured output suitable for data warehouse loading:

codefang run /path/to/repo -a 'history/*' --format json --silent

Clean output

The --silent flag suppresses progress output on stderr, keeping stdout clean for piping.

Adding Metadata

Wrap the output with repository metadata using jq before uploading. The -c flag keeps each record on a single line, which newline-delimited consumers such as the Athena JSON SerDe expect:

codefang run "$REPO_PATH" -a 'history/*' --format json --silent \
  | jq -c --arg repo "$REPO_NAME" --arg date "$(date -u +%Y-%m-%d)" \
       '{repo: $repo, scan_date: $date, results: .}'

TimeSeries Format

For time-series analytics, use --format timeseries to merge all analyzer outputs into a single chronologically ordered JSON array:

codefang run /path/to/repo -a history/devs,history/burndown \
  --format timeseries --silent
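
Warehouse loaders that expect newline-delimited JSON can unwrap the array with jq; a small sketch:

# Convert the chronologically ordered array to NDJSON, one event per line
codefang run /path/to/repo -a history/devs,history/burndown \
  --format timeseries --silent | jq -c '.[]' > events.ndjson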

Orchestration

GNU Parallel (Simple)

For a flat list of repository paths:

REPOS="/tmp/repo_paths.txt"
S3_BUCKET="s3://analytics/codefang"
DATE=$(date +%Y-%m-%d)

cat "$REPOS" | parallel -j 8 --joblog /tmp/codefang-jobs.log '
  REPO_NAME=$(basename {} .git)
  codefang run {} \
    -a "history/*" \
    --format json \
    --memory-budget 4GiB \
    --workers 4 \
    --silent \
    2>/tmp/codefang-logs/${REPO_NAME}.log \
  | aws s3 cp - '"${S3_BUCKET}/${DATE}"'/${REPO_NAME}.json
'

This produces s3://analytics/codefang/2025-01-15/repo-name.json, partitioned by date.
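
The joblog doubles as a retry manifest: GNU Parallel can replay only the failed entries, so transient failures don't force a full rescan.

# Re-run only the jobs marked failed in the joblog; the original
# commands are replayed from the log itself.
parallel -j 8 --retry-failed --joblog /tmp/codefang-jobs.log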

Kubernetes Jobs

For large fleets (thousands of repos), use Kubernetes Jobs or Argo Workflows:

apiVersion: batch/v1
kind: Job
metadata:
  name: codefang-scan-myrepo
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
        - name: codefang
          image: codefang:latest
          command:
            - sh
            - -c
            - |
              codefang run /workspace \
                -a 'history/*' \
                --format json \
                --memory-budget 4GiB \
                --workers 4 \
                --checkpoint \
                --silent \
              | aws s3 cp - s3://bucket/results.json
          volumeMounts:
            - name: repo
              mountPath: /workspace
              readOnly: true
            - name: checkpoint
              mountPath: /tmp/codefang-checkpoints
          resources:
            requests:
              cpu: "4"
              memory: 6Gi
            limits:
              memory: 8Gi
      volumes:
        - name: repo
          persistentVolumeClaim:
            claimName: repo-myrepo
        - name: checkpoint
          emptyDir: {}
      restartPolicy: OnFailure

Key considerations

  • Resource requests: 4 CPU / 6 Gi memory per pod handles most repositories
  • Checkpoint volume: use emptyDir for ephemeral or a PVC for crash-recovery
  • Parallelism: 50-100 concurrent pods is typical for 60k repositories
  • Scheduling: use a CronJob or Argo CronWorkflow for periodic scans
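
To fan a repository list out into one Job per repo, the manifest above can be templated from the orchestrator. A minimal sketch, assuming a job-template.yaml copy of the manifest with a __NAME__ placeholder (both are illustrative, not part of Codefang):

# Sketch: stamp out one Job per repository from the path list.
# job-template.yaml and its __NAME__ placeholder are assumptions.
while read -r repo_path; do
  # Derive a DNS-safe Job name from the repository directory name
  name=$(basename "$repo_path" .git | tr 'A-Z_.' 'a-z--')
  sed "s/__NAME__/${name}/g" job-template.yaml | kubectl apply -f -
done < /tmp/repo_paths.txt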

Memory Budget and Streaming

The --memory-budget flag controls how Codefang splits a repository's commit history into chunks. See Streaming Pipeline for details.

Repository Size     Memory Budget   Expected Behavior
< 1k commits        2 GiB           Single chunk, no hibernation
1k-10k commits      4 GiB           2-10 chunks
10k-100k commits    4-8 GiB         10-100 chunks with checkpoints
100k+ commits       8 GiB           Many chunks, checkpointing essential

When unset, the budget defaults to 50% of system memory (capped at 4 GiB).
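
When the budget must be pinned explicitly, for example to keep it consistent with Kubernetes resource requests, one option is to derive the tier from the commit count. A rough sketch; the thresholds mirror the table above and are illustrative:

# Sketch: choose a --memory-budget tier from the commit count.
# Tier boundaries follow the table above; adjust to taste.
commits=$(git -C "$REPO_PATH" rev-list --count HEAD)
if   [ "$commits" -lt 1000 ];   then budget=2GiB
elif [ "$commits" -lt 100000 ]; then budget=4GiB
else                                 budget=8GiB
fi
codefang run "$REPO_PATH" -a 'history/*' --format json \
  --memory-budget "$budget" --silent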

Incremental Scanning with --since

For periodic scans, use --since to analyze only new commits since the last run:

# Only commits from the last 7 days
codefang run /path/to/repo -a 'history/*' --format json --since 168h

# Only commits after a specific date
codefang run /path/to/repo -a 'history/*' --format json --since 2025-01-01

# RFC3339 timestamp
codefang run /path/to/repo -a 'history/*' --format json --since 2025-01-01T00:00:00Z

Format        Example                Meaning
Go duration   168h                   7 days before now
Date only     2025-01-01             Midnight UTC on that date
RFC3339       2025-01-01T00:00:00Z   Exact timestamp
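
For periodic fleets, one approach is to persist each repository's last successful scan time and feed it back as --since on the next run. A minimal sketch using a per-repo marker file (the marker location is an assumption, not a Codefang convention):

# Sketch: incremental scan driven by a per-repo marker file.
MARKER="/var/lib/codefang/last-scan/${REPO_NAME}"
mkdir -p "$(dirname "$MARKER")" out
SINCE=$(cat "$MARKER" 2>/dev/null || echo 2020-01-01)

codefang run "$REPO_PATH" -a 'history/*' --format json --silent \
  --since "$SINCE" > "out/${REPO_NAME}.json" \
  && date -u +%Y-%m-%dT%H:%M:%SZ > "$MARKER"   # advance only on success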

Checkpointing and Crash Recovery

Checkpointing is enabled by default. After each fully processed chunk, the pipeline saves analyzer state to disk. If a run is interrupted (OOM kill, pod eviction, timeout), the next invocation with --resume picks up from the last completed chunk.

codefang run /path/to/repo \
  -a 'history/*' \
  --format json \
  --checkpoint \
  --resume \
  --checkpoint-dir /tmp/codefang-checkpoints

Kubernetes

Mount the checkpoint directory on a PVC to survive pod restarts.
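
Outside Kubernetes, a plain retry loop gets the same recovery behavior: each attempt resumes from the last completed chunk rather than starting over. A sketch:

# Sketch: retry interrupted runs; --resume continues from the
# last completed chunk recorded in the checkpoint directory.
for attempt in 1 2 3; do
  codefang run "$REPO_PATH" -a 'history/*' --format json --silent \
    --checkpoint --resume \
    --checkpoint-dir /tmp/codefang-checkpoints > out.json && break
  echo "attempt ${attempt} failed; retrying with --resume" >&2
done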

DWH Loading

Amazon Athena

Upload JSON to S3 with date partitioning, then create an external table:

CREATE EXTERNAL TABLE codefang_results (
  repo STRING,
  scan_date STRING,
  results STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://analytics/codefang/'
TBLPROPERTIES ('has_encrypted_data'='false');

MSCK REPAIR TABLE codefang_results;

Note that MSCK REPAIR only discovers Hive-style partition paths (dt=2025-01-15/); if objects are uploaded under plain date prefixes as in the earlier examples, register each partition explicitly with ALTER TABLE codefang_results ADD PARTITION.
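
If you control the upload step, writing Hive-style prefixes from the start keeps MSCK REPAIR working; a sketch adapting the earlier upload command ($REPO_PATH and $REPO_NAME as before):

# Upload under a dt= prefix so MSCK REPAIR discovers the partition
codefang run "$REPO_PATH" -a 'history/*' --format json --silent \
  | aws s3 cp - "s3://analytics/codefang/dt=$(date +%Y-%m-%d)/${REPO_NAME}.json"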

Snowflake

-- Single VARIANT column to receive the raw JSON documents
CREATE TABLE codefang_raw (raw VARIANT);

CREATE STAGE codefang_stage URL='s3://analytics/codefang/'
  CREDENTIALS=(AWS_KEY_ID='...' AWS_SECRET_KEY='...');

COPY INTO codefang_raw FROM @codefang_stage
  FILE_FORMAT=(TYPE=JSON);

-- Flatten nested analyzer results
SELECT
  raw:repo::STRING AS repo,
  raw:scan_date::DATE AS scan_date,
  f.value:author::STRING AS author,
  f.value:commits::INT AS commits
FROM codefang_raw,
  LATERAL FLATTEN(input => raw:results:devs:authors) f;

Spark / AWS Glue

# Flatten per-author rows out of the nested results
from pyspark.sql.functions import explode

df = spark.read.json("s3://analytics/codefang/2025-01-15/")
devs_df = df.select("repo", "scan_date",
                    explode("results.devs.authors").alias("author"))
devs_df.write.partitionBy("scan_date").parquet("s3://warehouse/codefang/devs/")

Observability

Codefang includes OpenTelemetry support for monitoring scan fleet health. See Observability for full details.

export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
codefang run /path/to/repo -a 'history/*' --format json --silent

Metric                                        Use
codefang.request.duration.seconds             Identify slow repositories
codefang.errors.total                         Track failure rate across the fleet
codefang.cache.hits / codefang.cache.misses   Tune cache sizes

Sizing Estimates

Parameter                            Typical Value
Per-repo runtime (4 cores, 4 GiB)    1-10 min
JSON output per repo                 100 KB - 50 MB
60k repos at 100 concurrent pods     10-100 hours per full scan
Total S3 per full scan               10-100 GB
Incremental scan (7 days)            5-20x faster than full

The primary bottleneck is I/O (reading Git objects). Bare repos on local SSD or NVMe-backed EBS deliver the best throughput.