Master Plan: Sepia Infrastructure Overhaul

Status

State: Active
Started: 2026-05-14

Context

This master plan sequences all 7 sub-plans into an execution order. Several plans depend on each other, so applying them in the wrong order causes regressions or unnecessary rework.

The plans were derived from a gap analysis against the shuttle branch, which went through the same hardening process on a similar Debian/Docker Compose stack.

Dependency Graph

Phase 1 ─────────────────────────────────────────────────────
  │
  ├── 1. Infrastructure Hardening  ◄── depends on nothing,
  │       │                             everything depends on it
  │       ├── healthchecks → Phase 3 depends_on conditions
  │       ├── cap_drop → verify after container updates
  │       ├── resource limits → apply early, verify after updates
  │       └── autoheal → works once healthchecks exist
  │
  └── 2. Dockerfile Hygiene  ◄── independent of hardening
          │                       but HEALTHCHECK in Dockerfiles
          │                       feeds into hardening plan
          │
Phase 2 ─┼────────────────────────────────────────────────────
          │
          └── 3. Container Updates  ◄── needs healthchecks (Phase 1)
                  │                     to verify post-update health
                  │
Phase 3 ─┼────────────────────────────────────────────────────
          │
          └── 4. DNS Ad-Blocker → Blocky  ◄── independent
                  │                           but best before networking
                  │
Phase 4 ─┼────────────────────────────────────────────────────
          │
          ├── 5. Compose Best Practices  ◄── needs healthchecks (Phase 1)
          │       │                          for depends_on conditions
          │       │
          ├── 6. Networking Hardening  ◄── after DNS migration settled
          │       │                       after all services stable
          │       │
          └── 7. Update Checker Polish  ◄── feedback from Phase 2
                                            needs healthchecks for verify
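
The ordering in the graph can be cross-checked mechanically. The following is a minimal sketch, assuming the hard blockers are exactly those listed under each plan's "Blockers" entry; the short plan labels and the function name `execution_waves` are chosen for this example only. It computes the earliest wave in which each plan could start, which is not the same as the planned phase: DNS migration and networking are scheduled later than strictly necessary on purpose.

```python
from graphlib import TopologicalSorter

# Hard blockers as listed under each plan's "Blockers" entry.
# The labels are shorthand for this sketch, not file names.
BLOCKERS = {
    "infrastructure-hardening": set(),
    "dockerfile-hygiene": set(),
    "container-updates": {"infrastructure-hardening"},
    "dns-ad-blocker-migration": set(),
    "compose-best-practices": {"infrastructure-hardening"},
    "networking-hardening": {"dns-ad-blocker-migration", "infrastructure-hardening"},
    "update-checker-polish": {"container-updates", "infrastructure-hardening"},
}


def execution_waves(blockers):
    """Group plans into waves; every plan in a wave has all its blockers done."""
    sorter = TopologicalSorter(blockers)
    sorter.prepare()  # raises graphlib.CycleError if the blockers are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(BLOCKERS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

A cycle in the blockers would surface as a `CycleError`, which is a cheap way to catch a contradictory edit to the sub-plans.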

Execution Plan

Phase 1 — Foundation (Safety First)

These two plans can run in parallel.

1. Infrastructure Hardening → `PLANS/active/infrastructure-hardening.md`

Estimated effort: 1 session (2–3 hours)
Risk: Low (adding healthchecks and resource limits is safe to apply)

Order within the plan:
1. Configure log rotation in `/etc/docker/daemon.json` (requires a Docker restart)
2. Add healthchecks to 10 services (one at a time, verify after each)
3. Set resource limits on high-impact services (timescaledb, dsmrdb, homeassistant, seafile, grafana)
4. Add `cap_drop: [ALL]` to HTTP-only services (grafana, docs, dsmr, influxdb; test s6 images)
5. Add `depends_on` with `condition: service_healthy` (dsmr, seafile, grafana)
6. Add the autoheal container and labels
7. Review Docker socket hardening

Blockers: None
Unblocks: Phases 2, 3, 4 (everything needs healthchecks)
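
For the log rotation step, one way to avoid clobbering existing daemon settings is to merge the rotation keys into whatever `daemon.json` already contains. This is a sketch only: the `max-size` and `max-file` values are assumptions (the plan does not specify them), the `live-restore` entry is a made-up example of an existing setting, and `with_log_rotation` is a hypothetical helper name. The keys `log-driver` and `log-opts` are the standard Docker daemon options for the `json-file` driver.

```python
import json

# Assumed rotation values for this sketch; the plan does not specify them.
LOG_ROTATION = {
    "log-driver": "json-file",
    "log-opts": {"max-size": "10m", "max-file": "3"},
}


def with_log_rotation(daemon_config):
    """Return a copy of daemon_config with the log rotation settings merged in.

    Unrelated keys are kept. Existing log-opts are kept as well, except for
    the two rotation keys, which are overwritten.
    """
    merged = dict(daemon_config)
    merged["log-driver"] = LOG_ROTATION["log-driver"]
    merged["log-opts"] = {
        **daemon_config.get("log-opts", {}),
        **LOG_ROTATION["log-opts"],
    }
    return merged


if __name__ == "__main__":
    existing = {"live-restore": True}  # hypothetical current daemon.json content
    print(json.dumps(with_log_rotation(existing), indent=2, sort_keys=True))
```

Daemon-level log options only apply to containers created after the restart, so existing containers need to be recreated before rotation takes effect for them.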

2. Dockerfile Hygiene → `PLANS/active/dockerfile-hygiene.md`

Estimated effort: 1 session (1–2 hours)
Risk: Medium (rebuilding collectd; test on non-production first)

Order within the plan:
1. Rewrite collectd Dockerfile (multi-stage, non-root, combined RUNs, healthcheck)
2. Add HEALTHCHECK to caddy Dockerfile
3. Add HEALTHCHECK to docs Dockerfile
4. Rebuild all custom images: `docker compose build`
5. Verify each container starts and functions correctly

Blockers: None (independent)
Unblocks: Phase 1 healthchecks (caddy + docs healthchecks)

Phase 2 — Service Updates

3. Container Updates → `PLANS/active/container-updates-may-2026.md`

Estimated effort: 1 session (1–2 hours)
Risk: Medium (test each update, have rollback ready)

Order within the plan:
1. Patch-level (safe): grafana 11.4.0→11.6.14, borgmatic, seafile-mysql
2. Minor-level: influxdb 2.8.0→2.9.1, seafile-redis 8.4.0→8.6.3
3. Medium: homeassistant 2025.1.2→2025.12.5
4. Major (plan separately): esphome 2022.12.8→2026.4.5, timescaledb 2.7.1→2.19.3

Verification: Every container shows (healthy) in `docker ps` after update.

Blockers: Phase 1 healthchecks (to verify post-update health)
Unblocks: Phase 4 polish (feedback for update-checker)
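
The verification step for this phase (every container reports healthy after an update) could be scripted roughly as follows. This is a sketch, not the update checker's actual code: `not_healthy` and `docker_ps` are hypothetical helper names, and it assumes `docker ps` is called with a `--format` template that prints the name and status separated by a tab. Containers without a healthcheck are reported too, since their status never contains `(healthy)`.

```python
import subprocess


def not_healthy(ps_output):
    """Return {name: status} for every container whose status lacks "(healthy)".

    Expects one "name<TAB>status" line per container.
    """
    problems = {}
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        # "(unhealthy)" and "(health: starting)" do not contain "(healthy)".
        if "(healthy)" not in status:
            problems[name] = status
    return problems


def docker_ps():
    """Run docker ps and return one "name<TAB>status" line per running container."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the Docker host, `not_healthy(docker_ps())` returning an empty dict would correspond to the "all healthy" condition. A container that is still in its start period shows up as `(health: starting)`, so the check may need to be repeated after a short wait.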

Phase 3 — DNS Migration

4. DNS Ad-Blocker → Blocky → `PLANS/active/dns-ad-blocker-migration.md`

Estimated effort: 1 session (1–2 hours)
Risk: High (a DNS outage means no internet for the LAN during cutover; schedule carefully)

Order within the plan:
1. Create Blocky config (`/opt/blocky/config.yml`)
2. Create compose file (`/opt/compose.blocky.yaml`)
3. Test on alternate IP/port
4. Cutover: stop dns-ad-blocker, start Blocky
5. Verify DNS resolution from LAN
6. Update docs
7. Clean up old files after 2 weeks

Blockers: None (independent)
Unblocks: Phase 4 networking (clean network topology)
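
Steps 3 and 5 both need a way to query one specific DNS server, possibly on a non-standard port, which the system resolver cannot do. Below is a minimal sketch that sends a single A-record query over UDP using only the standard library. The helper names (`build_query`, `parse_header`, `resolves`) are hypothetical, and the sketch only inspects the response header; it does not decode the answer records.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the response to the query


def build_query(hostname, query_id=QUERY_ID):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in hostname.rstrip(".").split(".")
    )
    # Question: name, terminating zero byte, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_header(response):
    """Return (query_id, rcode, answer_count) from a DNS response."""
    query_id, flags, _, answer_count, _, _ = struct.unpack("!HHHHHH", response[:12])
    return query_id, flags & 0x000F, answer_count


def resolves(server, port, hostname, timeout=2.0):
    """True if server:port answers with NOERROR and at least one record."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, port))
            response, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable servers
            return False
    if len(response) < 12:
        return False
    query_id, rcode, answer_count = parse_header(response)
    return query_id == QUERY_ID and rcode == 0 and answer_count > 0
```

For example, `resolves("192.0.2.53", 5353, "example.com")` would test an instance on an alternate address and port before cutover; the address and port here are placeholders, not values from the plan.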

Phase 4 — Structure & Polish

5. Compose Best Practices → `PLANS/active/compose-best-practices.md`

Estimated effort: 1 session (1 hour)
Risk: Low (labels and container_name changes are cosmetic/operational)

Order within the plan:
1. Add standardized labels to all services
2. Ensure consistent container_name on all services
3. Add depends_on chains (now that healthchecks exist from Phase 1)

Blockers: Phase 1 (healthchecks for depends_on conditions)
Unblocks: None (cosmetic/operational)

6. Networking Hardening → `PLANS/active/networking-hardening.md`

Estimated effort: 1 session (2–3 hours)
Risk: High (changing network topology can break inter-service communication)

Order within the plan:
1. Define networks in compose.yaml
2. Assign each service to appropriate networks
3. Deploy and verify:
   - Caddy can proxy to all frontend services
   - Grafana can query InfluxDB + TimescaleDB
   - DSMR can reach dsmrdb + influxdb
   - Seafile can reach mysql + redis
4. Document network_mode: host exceptions

Blockers: Phase 3 (DNS settled), Phase 1 (healthchecks)
Unblocks: None (final infrastructure layer)
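
Before deploying a new topology, the required flows from step 3 can be checked against the planned network assignments: two services can only talk to each other if they share at least one network. The sketch below is illustrative only. The network names and the whole `SERVICE_NETWORKS` layout are invented, the two caddy flows are assumptions (the plan only says "all frontend services"), and `missing_flows` is a hypothetical helper name.

```python
# Flows listed under "Deploy and verify" in the networking plan.
# The caddy entries are assumptions; the plan only says "all frontend services".
REQUIRED_FLOWS = [
    ("caddy", "grafana"),
    ("caddy", "docs"),
    ("grafana", "influxdb"),
    ("grafana", "timescaledb"),
    ("dsmr", "dsmrdb"),
    ("dsmr", "influxdb"),
    ("seafile", "seafile-mysql"),
    ("seafile", "seafile-redis"),
]

# Hypothetical layout; the network names are invented for this sketch.
SERVICE_NETWORKS = {
    "caddy": {"frontend"},
    "docs": {"frontend"},
    "grafana": {"frontend", "metrics"},
    "influxdb": {"metrics"},
    "timescaledb": {"metrics"},
    "dsmr": {"frontend", "metrics", "dsmr-db"},
    "dsmrdb": {"dsmr-db"},
    "seafile": {"frontend", "seafile-backend"},
    "seafile-mysql": {"seafile-backend"},
    "seafile-redis": {"seafile-backend"},
}


def missing_flows(service_networks, required_flows):
    """Return the (source, target) pairs that share no network."""
    missing = []
    for source, target in required_flows:
        shared = service_networks.get(source, set()) & service_networks.get(target, set())
        if not shared:
            missing.append((source, target))
    return missing


if __name__ == "__main__":
    for source, target in missing_flows(SERVICE_NETWORKS, REQUIRED_FLOWS):
        print(f"{source} cannot reach {target}: no shared network")
```

Services that use `network_mode: host` do not join Compose networks at all, so they fall outside this check and belong in the documented exceptions from step 4.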

7. Update Checker Polish → `PLANS/backlog/update-checker-polish.md`

Estimated effort: 1 session (1 hour)
Risk: Low

Order within the plan:
1. Schedule weekly report generation
2. Add report retention/pruning
3. Implement post-update health check verification
4. Add release notes URLs and grouping

Blockers: Phase 2 (feedback from running updates), Phase 1 (healthchecks for verify)
Unblocks: Ongoing maintenance automation
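
Report retention (step 2) can be as simple as keeping the newest N reports and deleting the rest. The sketch below assumes reports are plain files in one directory, matched by a glob pattern and ordered by modification time. The default of 8 reports, the `*.md` pattern, and both function names are assumptions for illustration, not part of the existing update checker.

```python
from pathlib import Path


def reports_to_delete(reports, keep=8):
    """Given (name, modified_timestamp) pairs, return names beyond the newest `keep`."""
    newest_first = sorted(reports, key=lambda item: item[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]


def prune_report_dir(report_dir, keep=8, pattern="*.md"):
    """Delete all but the `keep` newest matching files; return the deleted names."""
    directory = Path(report_dir)
    reports = [(path.name, path.stat().st_mtime) for path in directory.glob(pattern)]
    doomed = reports_to_delete(reports, keep)
    for name in doomed:
        (directory / name).unlink()
    return doomed
```

Keeping the selection logic in `reports_to_delete` separate from the file deletion makes it possible to dry-run the pruning before enabling it on a schedule.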

Timeline

| Phase | Effort | Risk | When |
|---|---|---|---|
| Phase 1 (Hardening) | 3–5 hours | Low–Med | This week |
| Phase 2 (Updates) | 1–2 hours | Med | After Phase 1 |
| Phase 3 (DNS) | 1–2 hours | High | After Phase 1 |
| Phase 4 (Structure) | 4–5 hours | Low–High | After Phases 1–3 |

Rollback Strategy

Phase 1: `git checkout -- compose.*.yaml` reverts all compose changes. Restoring the previous `/etc/docker/daemon.json` requires a Docker restart.
Phase 2: `git checkout -- compose.*.yaml` reverts image tags. `docker compose pull` followed by `docker compose up -d` then restores the old versions.
Phase 3: Stop Blocky, restart dns-ad-blocker, and swap the compose includes back.
Phase 4 networking: Remove the network assignments to revert to the default bridge, then run `docker compose down && docker compose up -d` for all services.

All sub-plans in `PLANS/active/` and `PLANS/backlog/`:
- infrastructure-hardening.md
- dockerfile-hygiene.md
- container-updates-may-2026.md
- dns-ad-blocker-migration.md
- compose-best-practices.md
- networking-hardening.md
- update-checker-polish.md (backlog)

Created: 2026-05-14
