acquire-codebase-knowledge

Use this skill when the user explicitly asks to map, document, or onboard into an existing codebase. Trigger for prompts like \"map this codebase\", \"document this architecture\", \"onboard me to this repo\", or \"create codebase docs\". Do not trigger for routine feature implementation, bug fixes, or narrow code edits unless the user asks for repository-level discovery.

Version 1.0.0 draft GNU GPL v3

Last synced: 25 Jun, 2026, 5:09 PM UTC

Open SKILL.md on GitHub

Version

1.0.0

Maturity

draft

Repository

agent-skills

License

GNU GPL v3

Repository: matt-riley/agent-skills
Source file: skills/acquire-codebase-knowledge/SKILL.md
Version: 1.0.0
Maturity: draft
Compatibility: Agent Skills-compatible coding agents.
License: GNU GPL v3

Acquire Codebase Knowledge

Produces seven populated documents in docs/codebase/ covering everything needed to work effectively on the project. Only document what is verifiable from files or terminal output — never infer or assume.

Use this skill when

The user explicitly asks to map, document, or onboard into an existing codebase.
Prompts include: "map this codebase", "document this architecture", "onboard me to this repo", or "create codebase docs".

Do not use this skill when

The task is a routine feature implementation, bug fix, or narrow code edit without a repository-level discovery request.
The user wants a quick file summary — just read the file directly.

Inputs to gather

Optional: specific area to focus on, e.g. "architecture only" or "testing and concerns". Default to all seven areas.

First move

Run scripts/scan.py from the project root first, then read intent documents (README.md, ARCHITECTURE.md, top-level docs) before opening source files.

Outputs

Before finishing, all of the following must be true:

Exactly these files exist in docs/codebase/: STACK.md, STRUCTURE.md, ARCHITECTURE.md, CONVENTIONS.md, INTEGRATIONS.md, TESTING.md, CONCERNS.md.
Every claim is traceable to source files, config, or terminal output.
Unknowns are marked as [TODO]; intent-dependent decisions are marked [ASK USER].
Every document includes a short "evidence" list with concrete file paths.
Final response includes numbered [ASK USER] questions and intent-vs-reality divergences.

Workflow

Copy and track this checklist:

- [ ] Phase 1: Run scan, read intent documents
- [ ] Phase 2: Investigate each documentation area
- [ ] Phase 3: Populate all seven docs in docs/codebase/
- [ ] Phase 4: Validate docs, present findings, resolve all [ASK USER] items

Focus Area Mode

If the user supplies a focus area (for example: "architecture only" or "testing and concerns"):

Always run Phase 1 in full.
Fully complete focus-area documents first.
For non-focus documents not yet analyzed, keep required sections present and mark unknowns as [TODO].
Still run the Phase 4 validation loop on all seven documents before final output.

Phase 1: Scan and Read Intent

Run the scan script from the target project root using the skill's repo-relative path. In repositories that keep skills at the top level, use:
```
python3 skills/acquire-codebase-knowledge/scripts/scan.py --output docs/codebase/.codebase-scan.txt
```
If the repository stores skills under .github/skills/, use the equivalent repo-relative path instead:
```
python3 .github/skills/acquire-codebase-knowledge/scripts/scan.py --output docs/codebase/.codebase-scan.txt
```
Keep the path repo-relative; do not hard-code an absolute machine-local skill path.
Search for PRD, TRD, README, ROADMAP, SPEC, DESIGN files and read them.
Summarise the stated project intent before reading any source code.

Phase 2: Investigate

Use the scan output to answer questions for each of the seven templates. Load references/inquiry-checkpoints.md for the full per-template question list.

If the stack is ambiguous (multiple manifest files, unfamiliar file types, no package.json), load references/stack-detection.md.

Phase 3: Populate Templates

Copy each template from assets/templates/ into docs/codebase/. Fill in this order:

STACK.md — language, runtime, frameworks, all dependencies
STRUCTURE.md — directory layout, entry points, key files
ARCHITECTURE.md — layers, patterns, data flow
CONVENTIONS.md — naming, formatting, error handling, imports
INTEGRATIONS.md — external APIs, databases, auth, monitoring
TESTING.md — frameworks, file organization, mocking strategy
CONCERNS.md — tech debt, bugs, security risks, perf bottlenecks

Use [TODO] for anything that cannot be determined from code. Use [ASK USER] where the right answer requires team intent.

Phase 4: Validate, Repair, Verify

Run this mandatory validation loop before finalizing:

Validate each doc against references/inquiry-checkpoints.md.
For each non-trivial claim, confirm at least one evidence reference exists.
If any required section is missing or unsupported:

Fix the document.
Re-run validation.

Repeat until all seven docs pass.

Then present a summary of all seven documents, list every [ASK USER] item as a numbered question, and highlight any Intent vs. Reality divergences from Phase 1.

Validation pass criteria:

No unsupported claims.
No empty required sections.
Unknowns use [TODO] rather than assumptions.
Team-intent gaps are explicitly marked [ASK USER].

Gotchas

Monorepos: Root package.json may have no source — check for workspaces, packages/, or apps/ directories. Each workspace may have independent dependencies and conventions. Map each sub-package separately.

Outdated README: README often describes intended architecture, not the current one. Cross-reference with actual file structure before treating any README claim as fact.

TypeScript path aliases: tsconfig.json paths config means imports like @/foo don't map directly to the filesystem. Map aliases to real paths before documenting structure.

Generated/compiled output: Never document patterns from dist/, build/, generated/, .next/, out/, or __pycache__/. These are artefacts — document source conventions only.

.env.example reveals required config: Secrets are never committed. Read .env.example, .env.template, or .env.sample to discover required environment variables.

devDependencies ≠ production stack: Only dependencies (or equivalent, e.g. [tool.poetry.dependencies]) runs in production. Document linters, formatters, and test frameworks separately as dev tooling.

Test TODOs ≠ production debt: TODOs inside test/, tests/, __tests__/, or spec/ are coverage gaps, not production technical debt. Separate them in CONCERNS.md.

High-churn files = fragile areas: Files appearing most in recent git history have the highest modification rate and likely hidden complexity. Always note them in CONCERNS.md.

Anti-Patterns

❌ Don't	✅ Do instead
"Uses Clean Architecture with Domain/Data layers." (when no such directories exist)	State only what directory structure actually shows.
"This is a Next.js project." (without checking `package.json`)	Check `dependencies` first. State what's actually there.
Guess the database from a variable name like `dbUrl`	Check manifest for `pg`, `mysql2`, `mongoose`, `prisma`, etc.
Document `dist/` or `build/` naming patterns as conventions	Source files only.

Enhanced Scan Output Sections

The scan.py script now produces the following sections in addition to the original output:

CODE METRICS — Total files, lines of code by language, largest files (complexity signals)
CI/CD PIPELINES — Detected GitHub Actions, GitLab CI, Jenkins, CircleCI, etc.
CONTAINERS & ORCHESTRATION — Docker, Docker Compose, Kubernetes, Vagrant configs
SECURITY & COMPLIANCE — Snyk, Dependabot, SECURITY.md, SBOM, security policies
PERFORMANCE & TESTING — Benchmark configs, profiling markers, load testing tools

Use these sections during Phase 2 to inform investigation questions and identify tool-specific patterns.

Guardrails

Must not invent or assume facts that cannot be traced to source files or terminal output.
Must not document files in dist/, build/, or other generated output directories as conventions.
Should use [TODO] for unknowns and [ASK USER] for intent-dependent decisions.
Should run scripts/scan.py before opening source files.
Should complete all seven required documents before finalizing.
May skip optional sections in templates when the repository does not have that complexity.

Validation

Confirm docs/codebase/ contains all seven required files after Phase 3 completes.
Spot-check two or three claims per document against the source files cited in the evidence list.
Review the final [ASK USER] list — every item should be a real intent ambiguity, not a missing terminal command.
Run scripts/scan.py again on the target repo and confirm the output is consistent with STACK.md.
Smoke test:
- should trigger: "Map this repo's architecture and onboarding path for a new engineer."
- should not trigger: "Write a README for this CLI for first-time users." (→ doc-coauthoring)

Examples

"Map this repo for a new developer" → run Phase 1 scan, produce all seven docs/codebase/ files with evidence lists and a numbered [ASK USER] block.
"Document only the architecture and testing layers" → pass "architecture and testing" as the focus area; produce only ARCHITECTURE.md and TESTING.md (use Focus Area Mode).
"Onboard me to this TypeScript monorepo" → run scan.py, read workspace manifests and CI config, then produce STACK.md calling out each package's runtime and toolchain separately.

Reference files

Asset	When to load
`scripts/scan.py`	Phase 1 — run first, before reading any code (Python 3.8+ required)
`references/inquiry-checkpoints.md`	Phase 2 — load for per-template investigation questions
`references/stack-detection.md`	Phase 2 — only if stack is ambiguous
`assets/templates/STACK.md`	Phase 3 step 1
`assets/templates/STRUCTURE.md`	Phase 3 step 2
`assets/templates/ARCHITECTURE.md`	Phase 3 step 3
`assets/templates/CONVENTIONS.md`	Phase 3 step 4
`assets/templates/INTEGRATIONS.md`	Phase 3 step 5
`assets/templates/TESTING.md`	Phase 3 step 6
`assets/templates/CONCERNS.md`	Phase 3 step 7

Template usage mode:

Default mode: complete only the "Core Sections (Required)" in each template.
Extended mode: add optional sections only when the repo complexity justifies them.

Skill metadata

SKILL.md

Acquire Codebase Knowledge

Use this skill when

Do not use this skill when

Inputs to gather

First move

Outputs

Workflow

Focus Area Mode

Phase 1: Scan and Read Intent

Phase 2: Investigate

Phase 3: Populate Templates

Phase 4: Validate, Repair, Verify

Gotchas

Anti-Patterns

Enhanced Scan Output Sections

Guardrails

Validation

Examples

Reference files