Plan: Infrastructure Hardening
Status
State: Active
Started: 2026-05-14
Context
Shuttle underwent a comprehensive hardening audit covering restart policies, healthchecks, secrets management, resource limits, capability dropping, and autoheal coverage. Sepia has not yet received the same treatment: apart from restart policies and .env-based secrets, the improvements listed below are missing. The remaining gaps leave services vulnerable to cascading failures and reduce observability.
Current state of sepia:
- Restart policies: All 14 services have restart: unless-stopped ✓
- Healthchecks: Only 5 services have healthchecks (docs, dsmr, dsmrdb, influxdb, seafile)
- Resource limits: None set on any service
- cap_drop: None applied to any service
- Autoheal: No autoheal container or labels exist
- Docker socket mounts: Not reviewed
- Docker daemon config: No /etc/docker/daemon.json (no global log rotation)
- depends_on: Minimal, no condition: service_healthy used
Goals
- [ ] Add healthchecks to the services currently missing them (except dns-ad-blocker and borgmatic, which are skipped)
- [ ] Add cap_drop: [all] to HTTP-only services (where compatible)
- [ ] Set resource limits on high-impact services
- [ ] Add depends_on with condition: service_healthy chains
- [ ] Add Docker log rotation globally
- [ ] Review and harden Docker socket mounts
- [ ] Add autoheal container for automatic restart of unhealthy services
- [ ] Review secrets pattern (.env usage is mostly already in place ✓)
Steps
Step 1: Healthchecks
Add healthchecks to services that currently have none:
| Service | Image | Suggested Check | Notes |
|---|---|---|---|
| caddy | caddy (local build) | wget --spider http://localhost:2019/config/ | Alpine-based, has wget |
| collectd | collectd:bookworm (local build) | pidof collectd | No HTTP endpoint |
| dns-ad-blocker | oznu/dns-ad-blocker | None (being replaced, separate plan) | Skip, planned for replacement |
| esphome | esphome/esphome:2022.12.8 | wget -qO- http://localhost:6052/ | Dashboard runs on 6052, has wget |
| grafana | grafana/grafana:11.4.0 | wget -qO- http://localhost:3000/api/health | Grafana has a health API |
| homeassistant | homeassistant/home-assistant:2025.1.2 | wget -qO- http://localhost:8123/ | Checks that the web frontend responds on 8123 |
| seafile-mysql | mariadb:11.8.5 | mariadb-admin ping -h localhost or healthcheck.sh | MariaDB image ships healthcheck.sh; 11.x images provide mariadb-admin rather than mysqladmin |
| seafile-redis | redis:8.4.0 | redis-cli ping | Redis has built-in |
| borgmatic | b3vis/borgmatic:v1.1.10-1.4.21 | None needed (oneshot/scheduled) | Skip, runs on schedule |
| timescaledb | timescale/timescaledb:2.7.1-pg14 | pg_isready -U postgres | PostgreSQL-based |
- Verification: docker ps shows (healthy) next to each container
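As a rough illustration of how the checks in the table could be expressed in a compose file, the sketch below covers one HTTP check, one CLI check, and one process check. The interval, timeout, retries, and start_period values are assumptions and not taken from the shuttle audit; tune them per service.

```yaml
# Sketch only: timing values are assumptions, not audited settings.
services:
  grafana:
    image: grafana/grafana:11.4.0
    healthcheck:
      # wget exits non-zero when the connection fails or the server returns an error status
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s

  seafile-redis:
    image: redis:8.4.0
    healthcheck:
      # Exec form, no shell needed; assumes Redis runs without a password
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3

  timescaledb:
    image: timescale/timescaledb:2.7.1-pg14
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s

  collectd:
    healthcheck:
      # No HTTP endpoint: only confirms the process exists, not that it is collecting
      test: ["CMD-SHELL", "pidof collectd"]
      interval: 60s
      timeout: 5s
      retries: 3
```

The process check for collectd is the weakest of these, since a hung daemon still has a PID; it is included because the table lists no better probe.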
Step 2: Capability Hardening
Add cap_drop: [all] to HTTP-only services that don't need elevated privileges:
Candidates:
- grafana — HTTP-only, no special caps needed
- docs — mkdocs server, HTTP-only
- influxdb — HTTP+TCP API, no special caps
- dsmr — Django app, HTTP-only
- seafile-server — HTTP+file sync, may need CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID for s6-overlay
- dns-ad-blocker — skip (being replaced)
Caution: Check base images for s6-overlay (LinuxServer, some Seafile images). These need specific POSIX capabilities (CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID). Validate the syntax with docker compose config, then start each service and monitor its logs for capability-related errors.
- Verification: Services start successfully and serve requests correctly
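A minimal sketch of the two patterns described in this step: dropping everything for a plain HTTP service, and dropping everything but adding back the capabilities listed in the caution for an s6-overlay image. Whether seafile-server really needs all five is an assumption to confirm by testing.

```yaml
# Sketch only: the cap_add list for seafile-server is the candidate set from
# the caution above and has not been verified against the actual image.
services:
  grafana:
    cap_drop:
      - ALL

  seafile-server:
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - SETUID
      - SETGID
```

Starting from a full drop and adding capabilities back one at a time, guided by the errors in the logs, usually produces a shorter list than granting the whole set up front.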
Step 3: Resource Limits
Set deploy.resources.limits on high-impact services:
| Service | RAM Limit | CPU Limit | Reason |
|---|---|---|---|
| timescaledb | 2G | 2.0 | Primary time-series DB |
| dsmrdb (postgres) | 1G | 1.0 | DSMR reader DB |
| homeassistant | 1G | 1.0 | Core automation |
| seafile-server | 2G | 2.0 | File sync, memory-heavy |
| grafana | 512M | 1.0 | Visualization |
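- Verification: Recreate the containers with docker compose up -d (a plain restart does not apply compose changes), then run docker stats and confirm the memory limits appear in the MEM USAGE / LIMIT column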
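For reference, two rows of the table above could be written as follows. This is a sketch that assumes Docker Compose v2, which applies deploy.resources.limits outside of Swarm; the values are the ones from the table.

```yaml
# Sketch only: assumes Docker Compose v2.
services:
  timescaledb:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"

  grafana:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
```

A container that exceeds its memory limit is killed by the kernel OOM killer, so a limit set too low shows up as repeated restarts rather than as slow performance.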
Step 4: depends_on with Health Conditions
Update depends_on relationships to use condition: service_healthy:
Current:
    # compose.dsmr.yaml
    depends_on:
      - dsmrdb
      - influxdb
    # compose.seafile.yaml
    depends_on:
      - seafile-mysql
      - seafile-redis
Target:
    depends_on:
      dsmrdb:
        condition: service_healthy
      influxdb:
        condition: service_healthy
Also add missing depends_on chains:
- caddy → services it proxies to (once they have healthchecks)
- grafana → influxdb, timescaledb (data sources)
- Verification: docker compose up starts dependencies in the correct order and waits for them to become healthy before starting dependent services
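The same conversion for the Seafile stack might look like the sketch below. It assumes the dependent service is named seafile-server and that seafile-mysql and seafile-redis have received the healthchecks from Step 1; without a healthcheck on the dependency, Compose refuses to start the dependent service.

```yaml
# compose.seafile.yaml
# Sketch only: the service name seafile-server is an assumption.
services:
  seafile-server:
    depends_on:
      seafile-mysql:
        condition: service_healthy
      seafile-redis:
        condition: service_healthy
```

Note that depends_on only resolves services within the same Compose project, so chains such as caddy or grafana to their backends work only if those services are loaded together.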
Step 5: Docker Daemon Log Rotation
Create /etc/docker/daemon.json:
    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
      }
    }
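Then run systemctl restart docker to apply. Containers created before the change keep their old logging configuration until they are recreated.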
Caution: This restarts the Docker daemon and all containers. Schedule during maintenance window.
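- Verification: docker info | grep "Logging Driver" shows json-file; docker inspect -f '{{.HostConfig.LogConfig}}' <container> on a recreated container shows the max-size and max-file options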
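If daemon.json is rolled out through Ansible rather than by hand, a playbook along these lines could write the file and restart Docker only when it changes. The host group name sepia and the handler name are hypothetical; adapt them to the existing inventory and roles.

```yaml
# Sketch only: host group and handler names are assumptions.
- name: Configure Docker daemon log rotation
  hosts: sepia
  become: true
  tasks:
    - name: Write /etc/docker/daemon.json
      ansible.builtin.copy:
        dest: /etc/docker/daemon.json
        owner: root
        group: root
        mode: "0644"
        content: |
          {
            "log-driver": "json-file",
            "log-opts": {
              "max-size": "10m",
              "max-file": "3"
            }
          }
      notify: Restart docker

  handlers:
    # Runs only if the file content changed; this restarts all containers
    - name: Restart docker
      ansible.builtin.systemd:
        name: docker
        state: restarted
```

Because the handler fires only on change, re-running the playbook outside a maintenance window is safe once the file is in place.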
Step 6: Docker Socket Hardening
Audit which services mount the Docker socket and whether read-only is sufficient:
| Service | Mount | Current | Target |
|---|---|---|---|
| caddy | Needs docker socket? | Not mounted | N/A |
| portainer/autoheal | Not running on sepia | N/A | Consider adding autoheal |
If adding autoheal:
    services:
      autoheal:
        image: willfarrell/autoheal
        container_name: autoheal
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        environment:
          - AUTOHEAL_CONTAINER_LABEL=autoheal
        restart: unless-stopped
Then add the label autoheal: "true" to all services with healthchecks.
- Verification: docker ps --filter "label=autoheal" shows all targeted containers
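Opting a service in could look like the following sketch, shown for grafana as an example. The label key must match the AUTOHEAL_CONTAINER_LABEL value set on the autoheal container; quoting the value keeps it a string rather than a YAML boolean.

```yaml
# Sketch only: grafana is used as an example; repeat for each service with a healthcheck.
services:
  grafana:
    labels:
      autoheal: "true"
```

Autoheal acts on the container's health status, so the label has no effect on services that lack a healthcheck.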
Rollback
- Healthchecks: git checkout -- compose.*.yaml reverts all changes
- cap_drop: Remove the cap_drop line from compose files
- Resource limits: Remove the deploy.resources.limits block
- daemon.json: Remove the file and restart docker
- autoheal: docker compose -f compose.autoheal.yaml down
Related
- JOURNAL/ via shuttle branch: 2026-05-07-infrastructure-hardening.md
- REFERENCE/services.md
- REFERENCE/ansible.md (for daemon.json setup via ansible)
Created: 2026-05-14