Master Plan: Sepia Infrastructure Overhaul
Status
State: Active
Started: 2026-05-14
Context
This master plan sequences all 7 sub-plans into a logical execution order. Several plans depend on each other, so applying them in the wrong order creates regressions or unnecessary rework.
The plans were derived from a gap analysis against the shuttle branch, which went through the same hardening process on a similar Debian/Docker Compose stack.
Dependency Graph
Phase 1 ─────────────────────────────────────────────────────
│
├── 1. Infrastructure Hardening ◄── depends on nothing;
│ │ everything else depends on it
│ ├── healthchecks → Phase 4 depends_on conditions
│ ├── cap_drop → verify after container updates
│ ├── resource limits → apply early, verify after updates
│ └── autoheal → works once healthchecks exist
│
└── 2. Dockerfile Hygiene ◄── independent of hardening
│ but HEALTHCHECK in Dockerfiles
│ feeds into hardening plan
│
Phase 2 ─┼────────────────────────────────────────────────────
│
└── 3. Container Updates ◄── needs healthchecks (Phase 1)
│ to verify post-update health
│
Phase 3 ─┼────────────────────────────────────────────────────
│
└── 4. DNS Ad-Blocker → Blocky ◄── independent
│ but best before networking
│
Phase 4 ─┼────────────────────────────────────────────────────
│
├── 5. Compose Best Practices ◄── needs healthchecks (Phase 1)
│ │ for depends_on conditions
│ │
├── 6. Networking Hardening ◄── after DNS migration settled
│ │ after all services stable
│ │
└── 7. Update Checker Polish ◄── feedback from Phase 2
needs healthchecks for verify
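The graph above can also be written down as data and checked mechanically. The following is a minimal sketch, assuming Python 3.9+ and its standard-library graphlib module. The plan identifiers are borrowed from the sub-plan file names, and the edge list is one reading of the diagram, not an authoritative source.

```python
from graphlib import TopologicalSorter

# Each key lists the plans it must wait for (one reading of the diagram above).
DEPENDS_ON = {
    "infrastructure-hardening": set(),
    "dockerfile-hygiene": set(),
    "dns-ad-blocker-migration": set(),
    "container-updates": {"infrastructure-hardening"},
    "compose-best-practices": {"infrastructure-hardening"},
    "networking-hardening": {"infrastructure-hardening", "dns-ad-blocker-migration"},
    "update-checker-polish": {"infrastructure-hardening", "container-updates"},
}


def execution_waves(depends_on):
    """Group plans into waves; every plan in a wave has all blockers done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The waves show only what the dependencies strictly allow. The phases in this plan are more conservative: the DNS migration, for example, gets its own later phase even though nothing blocks it.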
Execution Plan
Phase 1 — Foundation (Safety First)
These two plans can run in parallel.
1. Infrastructure Hardening → PLANS/active/infrastructure-hardening.md
Estimated effort: 1 session (2–3 hours)
Risk: Low (adding healthchecks and resource limits is safe to apply)
Order within the plan:
1. Configure log rotation in /etc/docker/daemon.json (requires a Docker restart)
2. Add healthchecks to 10 services (one at a time, verify after each)
3. Set resource limits on high-impact services (timescaledb, dsmrdb, homeassistant, seafile, grafana)
4. Add cap_drop: [all] to HTTP-only services (grafana, docs, dsmr, influxdb; test s6 images)
5. Add depends_on with condition: service_healthy (dsmr, seafile, grafana)
6. Add autoheal container and labels
7. Docker socket hardening review
Blockers: None
Unblocks: Phases 2 and 4 (both need healthchecks; Phase 3 is independent)
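As an illustration of step 1, the sketch below merges log-rotation settings into an existing daemon.json structure without discarding other keys. The keys log-driver, log-opts, max-size and max-file are standard Docker daemon options; the values "10m" and "3", the sample live-restore entry, and the helper name with_log_rotation are assumptions chosen for the example, not values taken from the sub-plan.

```python
import json


def with_log_rotation(config, max_size="10m", max_file="3"):
    """Return a copy of a daemon.json dict with json-file log rotation set.

    max_size and max_file are illustrative defaults, not the plan's values.
    """
    updated = dict(config)
    updated["log-driver"] = "json-file"
    log_opts = dict(updated.get("log-opts", {}))  # keep unrelated log options
    log_opts["max-size"] = max_size
    log_opts["max-file"] = max_file
    updated["log-opts"] = log_opts
    return updated


# Hypothetical existing configuration, used only to show the merge.
existing = {"live-restore": True}
print(json.dumps(with_log_rotation(existing), indent=2))
```

Daemon-level logging defaults apply only to containers created after the change, so existing containers pick up rotation once they are recreated.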
2. Dockerfile Hygiene → PLANS/active/dockerfile-hygiene.md
Estimated effort: 1 session (1–2 hours)
Risk: Medium (rebuilding collectd; test on non-production first)
Order within the plan:
1. Rewrite collectd Dockerfile (multi-stage, non-root, combined RUNs, healthcheck)
2. Add HEALTHCHECK to caddy Dockerfile
3. Add HEALTHCHECK to docs Dockerfile
4. Rebuild all custom images: docker compose build
5. Verify each container starts and functions correctly
Blockers: None (independent)
Unblocks: Phase 1 healthchecks (caddy + docs healthchecks)
Phase 2 — Service Updates
3. Container Updates → PLANS/active/container-updates-may-2026.md
Estimated effort: 1 session (1–2 hours)
Risk: Medium (test each update, have rollback ready)
Order within the plan:
1. Patch-level (safe): grafana 11.4.0→11.6.14, borgmatic, seafile-mysql
2. Minor-level: influxdb 2.8.0→2.9.1, seafile-redis 8.4.0→8.6.3
3. Medium: homeassistant 2025.1.2→2025.12.5
4. Major (plan separately): esphome 2022.12.8→2026.4.5, timescaledb 2.7.1→2.19.3
Verification: Every container shows (healthy) in docker ps after update.
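The verification step can be scripted instead of read off the terminal. The sketch below parses the output of docker ps with a Names/Status format string and lists every container whose status lacks "(healthy)". The function names and the sample data are hypothetical; only the parser is exercised here, and check_running_containers assumes a local Docker CLI.

```python
import subprocess

# Tab-separated name and status for each running container.
PS_COMMAND = ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"]


def not_healthy(ps_output):
    """Return names of containers whose status does not contain "(healthy)"."""
    failing = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        if "(healthy)" not in status:
            failing.append(name)
    return failing


def check_running_containers():
    """Run docker ps and report containers that are not healthy."""
    completed = subprocess.run(
        PS_COMMAND, capture_output=True, text=True, check=True
    )
    return not_healthy(completed.stdout)


# Hypothetical sample output, not taken from a real host.
SAMPLE = (
    "grafana\tUp 2 minutes (healthy)\n"
    "influxdb\tUp 2 minutes (health: starting)\n"
    "borgmatic\tUp 3 hours\n"
)
print(not_healthy(SAMPLE))  # ['influxdb', 'borgmatic']
```

Containers without any healthcheck are reported as well, which matches the rule that every container must show (healthy). Because docker ps lists only running containers, a container that exited after the update will not appear and needs a separate check.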
Blockers: Phase 1 healthchecks (to verify post-update health)
Unblocks: Phase 4 polish (feedback for update-checker)
Phase 3 — DNS Migration
4. DNS Ad-Blocker → Blocky → PLANS/active/dns-ad-blocker-migration.md
Estimated effort: 1 session (1–2 hours)
Risk: High (a DNS outage means no internet for the LAN during cutover; schedule carefully)
Order within the plan:
1. Create Blocky config (/opt/blocky/config.yml)
2. Create compose file (/opt/compose.blocky.yaml)
3. Test on alternate IP/port
4. Cutover: stop dns-ad-blocker, start Blocky
5. Verify DNS resolution from LAN
6. Update docs
7. Cleanup old files after 2 weeks
Blockers: None (independent)
Unblocks: Phase 4 networking (clean network topology)
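For steps 3 and 5, a resolver on an alternate IP/port cannot be tested through the system resolver, so the query has to be sent to it directly. The sketch below builds a plain DNS A-record query by hand and checks that the server returns at least one answer. It uses only the standard library; the address 192.0.2.53, port 5353 and the function names are placeholders for illustration, not values from the Blocky sub-plan.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def answers(server, port, hostname, timeout=2.0, query_id=0x1234):
    """Return True if the server replies with NOERROR and at least one answer."""
    query = build_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _, answer_count, _, _ = struct.unpack("!HHHHHH", reply[:12])
    return reply_id == query_id and (flags & 0x000F) == 0 and answer_count > 0


# Example call against a placeholder test address and port:
# answers("192.0.2.53", 5353, "example.com")
```

This only confirms that the server answers. It does not distinguish a real record from a blocked-domain response, so a few known-good and known-blocked names are still worth checking by hand.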
Phase 4 — Structure & Polish
5. Compose Best Practices → PLANS/active/compose-best-practices.md
Estimated effort: 1 session (1 hour)
Risk: Low (labels and container_name changes are cosmetic/operational)
Order within the plan:
1. Add standardized labels to all services
2. Ensure consistent container_name on all services
3. Add depends_on chains (now that healthchecks exist from Phase 1)
Blockers: Phase 1 (healthchecks for depends_on conditions)
Unblocks: None (cosmetic/operational)
6. Networking Hardening → PLANS/active/networking-hardening.md
Estimated effort: 1 session (2–3 hours)
Risk: High (changing network topology can break inter-service communication)
Order within the plan:
1. Define networks in compose.yaml
2. Assign each service to appropriate networks
3. Deploy and verify:
- Caddy can proxy to all frontend services
- Grafana can query InfluxDB + TimescaleDB
- DSMR can reach dsmrdb + influxdb
- Seafile can reach mysql + redis
4. Document network_mode: host exceptions
Blockers: Phase 3 (DNS settled), Phase 1 (healthchecks)
Unblocks: None (final infrastructure layer)
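Before deploying, the planned network assignments can be checked against the connectivity requirements from step 3. The sketch below treats two services as able to talk when they share at least one network. The network names (frontend, metrics, dsmr, seafile), the choice of which services count as frontend, and the exact assignments are assumptions for illustration; the real ones belong in compose.yaml.

```python
# Hypothetical service -> networks assignment (not the plan's actual layout).
NETWORKS = {
    "caddy": {"frontend"},
    "grafana": {"frontend", "metrics"},
    "influxdb": {"metrics"},
    "timescaledb": {"metrics"},
    "dsmr": {"frontend", "metrics", "dsmr"},
    "dsmrdb": {"dsmr"},
    "seafile": {"frontend", "seafile"},
    "seafile-mysql": {"seafile"},
    "seafile-redis": {"seafile"},
}

# (client, server) pairs derived from the verification list in step 3.
REQUIRED = [
    ("caddy", "grafana"),
    ("caddy", "dsmr"),
    ("caddy", "seafile"),
    ("grafana", "influxdb"),
    ("grafana", "timescaledb"),
    ("dsmr", "dsmrdb"),
    ("dsmr", "influxdb"),
    ("seafile", "seafile-mysql"),
    ("seafile", "seafile-redis"),
]


def isolated_pairs(networks, required):
    """Return required (client, server) pairs that share no network."""
    return [
        (client, server)
        for client, server in required
        if not networks.get(client, set()) & networks.get(server, set())
    ]


print(isolated_pairs(NETWORKS, REQUIRED))  # [] when every pair shares a network
```

Services that run with network_mode: host sit outside this model and should be left out of the mapping, which is one reason to document those exceptions separately.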
7. Update Checker Polish → PLANS/backlog/update-checker-polish.md
Estimated effort: 1 session (1 hour)
Risk: Low
Order within the plan:
1. Schedule weekly report generation
2. Add report retention/pruning
3. Implement post-update health check verification
4. Add release notes URLs and grouping
Blockers: Phase 2 (feedback from running updates), Phase 1 (healthchecks for verify)
Unblocks: Ongoing maintenance automation
Timeline
| Phase | Effort | Risk | When |
|---|---|---|---|
| Phase 1 — Hardening | 3–5 hours | Low–Med | This week |
| Phase 2 — Updates | 1–2 hours | Med | After Phase 1 |
| Phase 3 — DNS | 1–2 hours | High | After Phase 1 |
| Phase 4 — Structure | 4–5 hours | Low–High | After Phases 1–3 |
Rollback Strategy
- Phase 1: git checkout -- compose.*.yaml reverts all compose changes. Restoring daemon.json requires a Docker restart.
- Phase 2: git checkout -- compose.*.yaml reverts image tags; docker compose pull followed by docker compose up -d restores the old versions.
- Phase 3: Stop Blocky, restart dns-ad-blocker. Swap compose includes back.
- Phase 4 networking: Remove network assignments and revert to the default bridge, then run docker compose down && docker compose up -d on all services.
Related
All sub-plans in PLANS/active/ and PLANS/backlog/:
- infrastructure-hardening.md
- dockerfile-hygiene.md
- container-updates-may-2026.md
- dns-ad-blocker-migration.md
- compose-best-practices.md
- networking-hardening.md
- update-checker-polish.md (backlog)
Created: 2026-05-14