Plan: Infrastructure Hardening

Status

State: Active
Started: 2026-05-14

Context

Shuttle underwent a comprehensive hardening audit covering restart policies, healthchecks, secrets management, resource limits, capability dropping, and autoheal coverage. Sepia has not yet been through the same audit: apart from restart policies and a handful of healthchecks, none of these measures are in place. The remaining gaps leave services vulnerable to cascading failures and reduce observability.

Current state of sepia:
- Restart policies: All 14 services have restart: unless-stopped ✓
- Healthchecks: Only 4/14 services have healthchecks (docs, dsmr, dsmrdb, influxdb, seafile)
- Resource limits: None set on any service
- cap_drop: None applied to any service
- Autoheal: No autoheal container or labels exist
- Docker socket mounts: Not reviewed
- Docker daemon config: No /etc/docker/daemon.json (no global log rotation)
- depends_on: Minimal, no condition: service_healthy used

Goals

[ ] Add healthchecks to all 10 services currently missing them
[ ] Add cap_drop: [all] to HTTP-only services (where compatible)
[ ] Set resource limits on high-impact services
[ ] Add depends_on with condition: service_healthy chains
[ ] Add Docker log rotation globally
[ ] Review and harden Docker socket mounts
[ ] Add autoheal container for automatic restart of unhealthy services
[ ] Review secrets pattern (.env usage is mostly already in place ✓)

Steps

Step 1: Healthchecks

Add healthchecks to services that currently have none:

| Service | Image | Suggested Check | Notes |
|---|---|---|---|
| caddy | `caddy` (local build) | `wget --spider http://localhost:2019/config/` | Alpine-based, has wget |
| collectd | `collectd:bookworm` (local build) | `pidof collectd` | No HTTP endpoint |
| dns-ad-blocker | `oznu/dns-ad-blocker` | None | Skip, being replaced (separate plan) |
| esphome | `esphome/esphome:2022.12.8` | `wget -qO- http://localhost:6052/` | Dashboard runs on 6052, has wget |
| grafana | `grafana/grafana:11.4.0` | `wget -qO- http://localhost:3000/api/health` | Grafana has a health API |
| homeassistant | `homeassistant/home-assistant:2025.1.2` | `wget -qO- http://localhost:8123/` | Checks that the web frontend responds |
| seafile-mysql | `mariadb:11.8.5` | `mariadb-admin ping -h localhost` or `healthcheck.sh --connect --innodb_initialized` | MariaDB 11 images ship `healthcheck.sh`; the `mysql*` command names are no longer included |
| seafile-redis | `redis:8.4.0` | `redis-cli ping` | Redis has built-in |
| borgmatic | `b3vis/borgmatic:v1.1.10-1.4.21` | None | Skip, oneshot/scheduled job |
| timescaledb | `timescale/timescaledb:2.7.1-pg14` | `pg_isready -U postgres` | PostgreSQL-based |
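
As a rough sketch, three of the checks from the table could be written in Compose as follows. The interval, timeout, retries, and start_period values are assumptions chosen for illustration, not values taken from the shuttle audit, and should be tuned per service.

```yaml
services:
  grafana:
    healthcheck:
      # Exec form: runs wget directly, no shell needed in the image
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/api/health"]
      interval: 30s      # assumed value
      timeout: 5s        # assumed value
      retries: 3         # assumed value
      start_period: 30s  # assumed grace period while the service boots

  seafile-redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3

  timescaledb:
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s
```

The exec form (`CMD`) avoids depending on a shell inside the image; `CMD-SHELL` is only needed when the check uses pipes or `||`.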

Verification: docker ps shows (healthy) next to each container

Step 2: Capability Hardening

Add cap_drop: [all] to HTTP-only services that don't need elevated privileges:

Candidates:
- grafana: HTTP-only, no special caps needed
- docs: mkdocs server, HTTP-only
- influxdb: HTTP+TCP API, no special caps
- dsmr: Django app, HTTP-only
- seafile-server: HTTP+file sync, may need CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID for s6-overlay
- dns-ad-blocker: skip (being replaced)

Caution: Check base images for s6-overlay (LinuxServer, some Seafile images). These need specific POSIX capabilities (CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID). docker compose config only validates the file, so after validating, start each service and monitor its logs for capability-related errors (typically "Operation not permitted").
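
A minimal sketch of both cases, assuming grafana runs cleanly with no capabilities and that seafile-server turns out to need the s6-overlay set listed above (neither has been tested on sepia yet):

```yaml
services:
  grafana:
    cap_drop:
      - ALL

  seafile-server:
    cap_drop:
      - ALL
    # Assumption: only re-add these if the logs show the init system
    # failing without them; trim the list to what is actually needed.
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - SETUID
      - SETGID
```

Dropping everything and adding back only the observed requirements keeps the final capability set explicit in the compose file.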

Verification: Services start successfully and serve requests correctly

Step 3: Resource Limits

Set deploy.resources.limits on high-impact services:

| Service | RAM Limit | CPU Limit | Reason |
|---|---|---|---|
| timescaledb | 2G | 2.0 | Primary time-series DB |
| dsmrdb (postgres) | 1G | 1.0 | DSMR reader DB |
| homeassistant | 1G | 1.0 | Core automation |
| seafile-server | 2G | 2.0 | File sync, memory-heavy |
| grafana | 512M | 1.0 | Visualization |
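
For illustration, the first and last rows of the table would translate to the following Compose fragment; the other services follow the same pattern.

```yaml
services:
  timescaledb:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"

  grafana:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
```

A container that exceeds its memory limit is OOM-killed, so the restart policy and healthchecks from Step 1 should be in place before limits are tightened.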

Verification: Run docker stats after restart, confirm limits are applied

Step 4: `depends_on` with Health Conditions

Update depends_on relationships to use condition: service_healthy:

Current:

# compose.dsmr.yaml
depends_on:
  - dsmrdb
  - influxdb

# compose.seafile.yaml
depends_on:
  - seafile-mysql
  - seafile-redis

Target:

# compose.dsmr.yaml
depends_on:
  dsmrdb:
    condition: service_healthy
  influxdb:
    condition: service_healthy
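
The same pattern for the Seafile stack might look like the sketch below. It assumes the service is named seafile-server in compose.seafile.yaml and that seafile-mysql and seafile-redis already have the healthchecks from Step 1; without a healthcheck on the dependency, service_healthy cannot be satisfied and Compose refuses to start the dependent service.

```yaml
# compose.seafile.yaml
services:
  seafile-server:   # assumed service name
    depends_on:
      seafile-mysql:
        condition: service_healthy
      seafile-redis:
        condition: service_healthy
```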

Also add missing depends_on chains:
- caddy → services it proxies to (once they have healthchecks)
- grafana → influxdb, timescaledb (data sources)
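
A sketch of the grafana chain is shown below. One assumption to verify first: depends_on can only reference services in the same Compose project, so if grafana, influxdb, and timescaledb live in separate compose files, those files have to be loaded together (multiple -f flags or an include) for this to resolve.

```yaml
services:
  grafana:
    depends_on:
      influxdb:
        condition: service_healthy
      timescaledb:
        condition: service_healthy   # requires the Step 1 healthcheck
```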

Verification: docker compose up starts dependencies in correct order, waits for health before starting dependent services

Step 5: Docker Daemon Log Rotation

Create /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

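Then run systemctl restart docker to apply. The new logging defaults only apply to containers created after the restart, so recreate existing containers (for example with docker compose up -d --force-recreate) to pick them up.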

Caution: This restarts the Docker daemon and all containers. Schedule during maintenance window.
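
Since REFERENCE/ansible.md covers daemon.json setup via Ansible, the change could be rolled out with a play along these lines. This is only a sketch: the host pattern sepia and the handler name are assumptions, and the play overwrites any existing daemon.json rather than merging into it.

```yaml
- name: Configure Docker daemon log rotation
  hosts: sepia          # assumed inventory name
  become: true
  tasks:
    - name: Write /etc/docker/daemon.json
      ansible.builtin.copy:
        dest: /etc/docker/daemon.json
        owner: root
        group: root
        mode: "0644"
        content: |
          {
            "log-driver": "json-file",
            "log-opts": {
              "max-size": "10m",
              "max-file": "3"
            }
          }
      notify: Restart docker

  handlers:
    - name: Restart docker
      ansible.builtin.service:
        name: docker
        state: restarted
```

Because the restart is a handler, it only fires when the file content actually changes, which keeps repeat runs from bouncing the daemon.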

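Verification: docker info | grep "Logging Driver" shows json-file. docker info does not list the log-opts, so confirm rotation on a recreated container with docker inspect --format '{{json .HostConfig.LogConfig}}' <container>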

Step 6: Docker Socket Hardening

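Audit which services mount the Docker socket, whether they need it at all, and whether a read-only mount is sufficient. A read-only mount protects only the socket file itself; a container that can connect to it still has full access to the Docker API.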

| Service | Mount | Current | Target |
|---|---|---|---|
| caddy | Needs docker socket? | Not mounted | N/A |
| portainer/autoheal | Not running on sepia | N/A | Consider adding autoheal |

If adding autoheal:

services:
  autoheal:
    image: willfarrell/autoheal
    container_name: autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    restart: unless-stopped

Then add autoheal: true labels to all services with healthchecks.
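
As an example, opting grafana in would look roughly like this. The label key matches AUTOHEAL_CONTAINER_LABEL above; the value is quoted so YAML passes it as the string "true" rather than a boolean.

```yaml
services:
  grafana:
    labels:
      autoheal: "true"
```

Autoheal only acts on containers that report unhealthy, so the label has no effect on a service that lacks a healthcheck.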

Verification: docker ps --filter "label=autoheal" shows all targeted containers

Rollback

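Healthchecks: git checkout -- compose.*.yaml discards all uncommitted changes to the compose files (not only the healthchecks); follow with docker compose up -d to recreate the containers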
cap_drop: Remove the cap_drop line from compose files
Resource limits: Remove the deploy.resources.limits block
daemon.json: Remove the file and restart docker
autoheal: docker compose -f compose.autoheal.yaml down

JOURNAL/ via shuttle branch: 2026-05-07-infrastructure-hardening.md
REFERENCE/services.md
REFERENCE/ansible.md (for daemon.json setup via ansible)

Created: 2026-05-14