Plan: Infrastructure Hardening

Status

  • State: Active
  • Started: 2026-05-14

Context

Shuttle underwent a comprehensive hardening audit covering restart policies, healthchecks, secrets management, resource limits, capability dropping, and autoheal coverage. Sepia has not had an equivalent pass and has only a few of these improvements in place. The remaining gaps leave services vulnerable to cascading failures and reduce observability.

Current state of sepia:

  • Restart policies: All 14 services have restart: unless-stopped ✓
  • Healthchecks: Only 4/14 services have healthchecks (docs, dsmr, dsmrdb, influxdb, seafile)
  • Resource limits: None set on any service
  • cap_drop: None applied to any service
  • Autoheal: No autoheal container or labels exist
  • Docker socket mounts: Not reviewed
  • Docker daemon config: No /etc/docker/daemon.json (no global log rotation)
  • depends_on: Minimal, no condition: service_healthy used

Goals

  • [ ] Add healthchecks to the services currently missing them (10 listed in Step 1; dns-ad-blocker and borgmatic are deliberately skipped)
  • [ ] Add cap_drop: [all] to HTTP-only services (where compatible)
  • [ ] Set resource limits on high-impact services
  • [ ] Add depends_on with condition: service_healthy chains
  • [ ] Add Docker log rotation globally
  • [ ] Review and harden Docker socket mounts
  • [ ] Add autoheal container for automatic restart of unhealthy services
  • [ ] Review secrets pattern (.env usage is mostly already in place ✓)

Steps

Step 1: Healthchecks

Add healthchecks to services that currently have none:

| Service | Image | Suggested check | Notes |
|---|---|---|---|
| caddy | caddy (local build) | wget --spider http://localhost:2019/config/ | Alpine-based, has wget; requires the admin API on port 2019 to be enabled |
| collectd | collectd:bookworm (local build) | pidof collectd | No HTTP endpoint |
| dns-ad-blocker | oznu/dns-ad-blocker | None | Skip, being replaced (separate plan) |
| esphome | esphome/esphome:2022.12.8 | wget -qO- http://localhost:6052/ | Dashboard runs on 6052, has wget |
| grafana | grafana/grafana:11.4.0 | wget -qO- http://localhost:3000/api/health | Grafana has a health API |
| homeassistant | homeassistant/home-assistant:2025.1.2 | wget -qO- http://localhost:8123/ | Checks that the web frontend responds |
| seafile-mysql | mariadb:11.8.5 | healthcheck.sh --connect --innodb_initialized (or mariadb-admin ping -h localhost) | The MariaDB image ships healthcheck.sh; 11.x images use mariadb-* binary names, so mysqladmin may be absent |
| seafile-redis | redis:8.4.0 | redis-cli ping | Redis has built-in |
| borgmatic | b3vis/borgmatic:v1.1.10-1.4.21 | None needed | Skip, runs on a schedule (oneshot) |
| timescaledb | timescale/timescaledb:2.7.1-pg14 | pg_isready -U postgres | PostgreSQL-based |

  • Verification: docker ps shows (healthy) next to each container
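
As a rough sketch of how the checks in the table could look in a compose file: the test commands come from the table, while the interval, timeout, retries and start_period values are placeholder assumptions that the plan does not specify. Service keys are assumed to match the names in the table.

```yaml
services:
  grafana:
    healthcheck:
      # CMD-SHELL runs the command through the container's shell.
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health || exit 1"]
      interval: 30s        # assumption: tune per service
      timeout: 5s          # assumption
      retries: 3           # assumption
      start_period: 30s    # assumption: grace period during startup

  seafile-redis:
    healthcheck:
      # CMD form executes the binary directly, without a shell.
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3

  timescaledb:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 5
```

A generous start_period is probably worth having on the slower services (homeassistant, the databases) so that failures during startup do not count toward retries.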

Step 2: Capability Hardening

Add cap_drop: [all] to HTTP-only services that don't need elevated privileges:

Candidates:

  • grafana: HTTP-only, no special caps needed
  • docs: mkdocs server, HTTP-only
  • influxdb: HTTP+TCP API, no special caps
  • dsmr: Django app, HTTP-only
  • seafile-server: HTTP+file sync, may need CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID for s6-overlay
  • dns-ad-blocker: skip (being replaced)

Caution: Check base images for s6-overlay (LinuxServer, some Seafile images). These need specific POSIX capabilities (CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID). docker compose config only validates the file syntax, so after validating, recreate one service at a time and monitor its logs for capability-related errors (typically "Operation not permitted").

  • Verification: Services start successfully and serve requests correctly
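
A minimal sketch of the two cases described above, assuming the service keys match the candidate names. The cap_add list for seafile-server is the one named in the caution and still needs to be confirmed against the actual image.

```yaml
services:
  grafana:
    # HTTP-only service: drop every capability.
    cap_drop:
      - ALL

  seafile-server:
    # Drop everything, then add back only what s6-overlay is expected to need.
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - SETUID
      - SETGID
```

If a service fails to start after the change, adding capabilities back one at a time keeps the final set as small as possible.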

Step 3: Resource Limits

Set deploy.resources.limits on high-impact services:

| Service | RAM limit | CPU limit | Reason |
|---|---|---|---|
| timescaledb | 2G | 2.0 | Primary time-series DB |
| dsmrdb (postgres) | 1G | 1.0 | DSMR reader DB |
| homeassistant | 1G | 1.0 | Core automation |
| seafile-server | 2G | 2.0 | File sync, memory-heavy |
| grafana | 512M | 1.0 | Visualization |

  • Verification: Run docker stats after restart, confirm limits are applied
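
For illustration, two rows of the table could be expressed as follows. The values are taken from the table; the service keys are assumed to match the names used there.

```yaml
services:
  timescaledb:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"   # quoted so YAML keeps it as a string

  grafana:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
```

A container that exceeds its memory limit is killed by the kernel OOM killer and then restarted by the restart: unless-stopped policy, so it is worth watching docker stats for a while before treating the limits as final.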

Step 4: depends_on with Health Conditions

Update depends_on relationships to use condition: service_healthy:

Current:

# compose.dsmr.yaml
depends_on:
  - dsmrdb
  - influxdb

# compose.seafile.yaml
depends_on:
  - seafile-mysql
  - seafile-redis

Target:

depends_on:
  dsmrdb:
    condition: service_healthy
  influxdb:
    condition: service_healthy

Also add missing depends_on chains:

  • caddy → services it proxies to (once they have healthchecks)
  • grafana → influxdb, timescaledb (data sources)

  • Verification: docker compose up starts dependencies in correct order, waits for health before starting dependent services
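
The target above covers only the dsmr service. The same pattern applied to the seafile and grafana dependencies might look like the sketch below. The service key seafile-server is an assumption about how the service is named in compose.seafile.yaml, and the grafana entry assumes that influxdb and timescaledb belong to the same compose project.

```yaml
services:
  seafile-server:          # assumption: actual service key may differ
    depends_on:
      seafile-mysql:
        condition: service_healthy
      seafile-redis:
        condition: service_healthy

  grafana:
    depends_on:
      influxdb:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
```

condition: service_healthy only works once the dependency defines a healthcheck, so seafile-mysql, seafile-redis and timescaledb need their Step 1 checks in place first.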

Step 5: Docker Daemon Log Rotation

Create /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

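Then run systemctl restart docker to apply. The new log options apply only to containers created after the restart; existing containers keep their old logging configuration until they are recreated (for example with docker compose up -d --force-recreate).

Caution: This restarts the Docker daemon and all containers. Schedule it during a maintenance window.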

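  • Verification: docker info --format '{{.LoggingDriver}}' shows json-file, and docker inspect -f '{{.HostConfig.LogConfig}}' <container> on a recreated container shows the max-size and max-file options (docker info reports only the driver, not its options)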
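
If the daemon restart has to wait for a maintenance window, the same rotation settings can also be set per service in the compose files as an interim measure. This is an alternative sketch and not part of the plan itself; the service key is only an example.

```yaml
services:
  grafana:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # same values as the daemon.json above
        max-file: "3"
```

Per-service settings take effect when that container is recreated and override the daemon default for that service only.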

Step 6: Docker Socket Hardening

Audit which services mount the Docker socket and whether read-only is sufficient:

| Service | Mount | Current | Target |
|---|---|---|---|
| caddy | Needs docker socket? | Not mounted | N/A |
| portainer/autoheal | Not running on sepia | N/A | Consider adding autoheal |

If adding autoheal:

services:
  autoheal:
    image: willfarrell/autoheal
    container_name: autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    restart: unless-stopped

Then add the label autoheal=true to every service that has a healthcheck. Autoheal restarts only containers that carry this label and report an unhealthy status.

  • Verification: docker ps --filter "label=autoheal" shows all targeted containers
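
A short sketch of the label on one service, assuming AUTOHEAL_CONTAINER_LABEL=autoheal as configured above. The service key is only an example.

```yaml
services:
  grafana:
    labels:
      # Quoted so the value is the string "true" rather than a YAML boolean.
      autoheal: "true"
```

The list form (- autoheal=true) is equivalent and avoids the quoting question.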

Rollback

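  • Healthchecks: git checkout -- compose.*.yaml discards all uncommitted changes to the compose files (not only the healthchecks); run docker compose up -d afterwards to recreate the containers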
  • cap_drop: Remove the cap_drop line from compose files
  • Resource limits: Remove the deploy.resources.limits block
  • daemon.json: Remove the file and restart docker
  • autoheal: docker compose -f compose.autoheal.yaml down

Related

  • JOURNAL/ via shuttle branch: 2026-05-07-infrastructure-hardening.md
  • REFERENCE/services.md
  • REFERENCE/ansible.md (for daemon.json setup via ansible)

Created: 2026-05-14