Operational Resilience and Monitoring
tradedesk is designed for production use with built-in resilience mechanisms and operational monitoring.
Resilience Features
Asyncio-Based Retry Scheduler
Subscription retries (price stream reconnects) use an asyncio-based RetryScheduler that is cancellable when the streamer closes. This replaces earlier threading-based approaches and ensures clean, deterministic shutdown without orphaned threads.
Behavior:
- Linear backoff: delay = attempt * STREAM_SUB_RETRY_BASE_DELAY_S
- Maximum retries: STREAM_SUB_MAX_RETRIES (default: 3)
- Automatic cancellation on streamer shutdown
Configuration: See docs/settings.md for STREAM_SUB_* environment variables.
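The backoff and cancellation behavior can be illustrated with a minimal sketch. `RetrySketch`, `backoff_delay`, and the base delay value below are hypothetical stand-ins chosen for illustration. They are not tradedesk's actual RetryScheduler or its defaults.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical values for illustration; real defaults are in docs/settings.md.
STREAM_SUB_RETRY_BASE_DELAY_S = 0.01
STREAM_SUB_MAX_RETRIES = 3


def backoff_delay(attempt: int, base: float = STREAM_SUB_RETRY_BASE_DELAY_S) -> float:
    """Linear backoff: attempt 1 waits 1x base, attempt 2 waits 2x base, ..."""
    return attempt * base


class RetrySketch:
    """Illustrative stand-in for a cancellable retry scheduler."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    def schedule(self, subscribe: Callable[[], Awaitable[bool]]) -> asyncio.Task:
        # Must be called from inside a running event loop.
        self._task = asyncio.create_task(self._run(subscribe))
        return self._task

    async def _run(self, subscribe: Callable[[], Awaitable[bool]]) -> bool:
        for attempt in range(1, STREAM_SUB_MAX_RETRIES + 1):
            # Cancellation is delivered at this await, so no thread is left behind.
            await asyncio.sleep(backoff_delay(attempt))
            if await subscribe():
                return True
        return False  # retries exhausted

    def close(self) -> None:
        # Called when the streamer shuts down.
        if self._task is not None and not self._task.done():
            self._task.cancel()
```

Because each delay is an awaited `asyncio.sleep`, cancelling the task interrupts the wait immediately rather than after the delay has elapsed.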
Single-Flight OAuth Refresh
When multiple concurrent callers need to refresh an expired session token, they share a single /session authentication request instead of racing and creating duplicate tokens. This avoids:
- IG API rate limits on session creation
- Transient auth failures during high concurrency
- Wasted network requests
Automatic: No configuration needed. Token state is managed internally by the TokenState lifecycle.
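The single-flight idea can be sketched as follows. `SingleFlightRefresher` and the injected `do_refresh` coroutine are hypothetical names used only for illustration; TokenState itself may be structured differently.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


class SingleFlightRefresher:
    """Illustrative only: concurrent callers share one in-flight refresh."""

    def __init__(self, do_refresh: Callable[[], Coroutine[Any, Any, str]]) -> None:
        self._do_refresh = do_refresh  # e.g. a coroutine that calls /session
        self._inflight: Optional[asyncio.Task] = None

    async def refresh(self) -> str:
        if self._inflight is None:
            # The first caller starts the request; later callers reuse the task.
            self._inflight = asyncio.create_task(self._do_refresh())
            self._inflight.add_done_callback(self._clear)
        # shield() stops one cancelled caller from cancelling the shared request.
        return await asyncio.shield(self._inflight)

    def _clear(self, _task: asyncio.Task) -> None:
        # Allow the next expiry to trigger a fresh request.
        self._inflight = None
```

With five concurrent `refresh()` calls, the injected coroutine runs once and all five callers receive the same token.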
SIGTERM/SIGHUP Signal Handling
Live runners install signal handlers that cleanly shut down the portfolio when receiving SIGTERM (container stop, systemd shutdown) or SIGHUP (terminal disconnect):
- Signal handler sets shutdown event
- Event loop processes pending orders and closes connections gracefully
- Open positions remain open (not closed automatically)
- Process exits cleanly without orphaned tasks
Example: Kubernetes container stop → SIGTERM → portfolio shutdown → graceful exit.
Platform note: Signal handlers are installed on POSIX systems (Linux, macOS). Windows runners fall back to KeyboardInterrupt only.
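A minimal sketch of this pattern with the standard library is shown here. The function names are hypothetical and the actual handler in tradedesk/runner.py may differ. It assumes the event loop runs in the main thread, which `loop.add_signal_handler` requires.

```python
import asyncio
import signal
import sys


def install_shutdown_handlers(loop: asyncio.AbstractEventLoop,
                              shutdown: asyncio.Event) -> bool:
    """Set `shutdown` on SIGTERM/SIGHUP. Returns False where unsupported."""
    if sys.platform == "win32":
        return False  # the caller relies on KeyboardInterrupt instead
    for sig in (signal.SIGTERM, signal.SIGHUP):
        loop.add_signal_handler(sig, shutdown.set)
    return True


async def run_until_shutdown(shutdown: asyncio.Event) -> str:
    install_shutdown_handlers(asyncio.get_running_loop(), shutdown)
    await shutdown.wait()
    # A real runner would process pending orders and close connections here;
    # open positions are deliberately left untouched.
    return "stopped"
```

Setting an event from the handler keeps the handler itself trivial and lets the shutdown work run as ordinary coroutine code on the event loop.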
Stale Stream Detection and Reconnect
The Lightstreamer price stream includes a heartbeat monitor that detects stale connections:
- Detection: Monitor checks stream age every STREAM_HEARTBEAT_SLEEP_S seconds
- Threshold: If no data for STREAM_MAX_STALE_S seconds (default: 300s), reconnect is initiated
- Reconnect delay: STREAM_RECONNECT_DELAY_S between attempts (default: 5s)
- Log suppression: During expected market closures (>300s silence), warnings are suppressed to avoid spam
Tuning: Adjust STREAM_MAX_STALE_S for high-latency or unreliable networks.
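The detection loop can be sketched roughly as below. `StaleMonitorSketch` is a hypothetical class whose parameters mirror the settings named above. It omits the reconnect delay and log suppression, and it assumes the threshold is a strict "greater than" comparison.

```python
import asyncio
import time
from typing import Awaitable, Callable


class StaleMonitorSketch:
    """Illustrative heartbeat monitor, not tradedesk's implementation."""

    def __init__(self, max_stale_s: float, sleep_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_stale_s = max_stale_s  # plays the role of STREAM_MAX_STALE_S
        self.sleep_s = sleep_s          # plays the role of STREAM_HEARTBEAT_SLEEP_S
        self._clock = clock
        self._last_update = clock()

    def on_price_update(self) -> None:
        # Call this whenever the stream delivers data.
        self._last_update = self._clock()

    def is_stale(self) -> bool:
        return self._clock() - self._last_update > self.max_stale_s

    async def watch(self, reconnect: Callable[[], Awaitable[None]]) -> None:
        # Runs until the task is cancelled.
        while True:
            await asyncio.sleep(self.sleep_s)
            if self.is_stale():
                await reconnect()
                self.on_price_update()  # restart the staleness clock
```

Using a monotonic clock avoids false positives when the system wall clock is adjusted.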
Operational Metrics
tradedesk emits Prometheus metrics for operational visibility. Metrics are lazily imported — no hard dependency on prometheus_client.
Available Metrics
Counter: tradedesk_ig_auth_refreshes_total
Incremented each time a session token is refreshed. Labels: outcome (success/failure).
Gauge: tradedesk_ig_auth_refresh_inflight
Number of in-flight authentication refresh requests. Tracks single-flight OAuth mechanism effectiveness. No labels.
Counter: tradedesk_ig_subscription_retries_total
Incremented for each subscription retry attempt on the price stream. Tracks unreliable subscription requests. Labels: kind (retry type).
Counter: tradedesk_ig_stream_reconnects_total
Incremented each time the price stream reconnects. High counts may indicate network issues. Labels: reason (why reconnect was triggered).
Histogram: tradedesk_ig_stream_stale_seconds
Duration of stream silence before reconnect. Tracks impact of connectivity issues on data delivery.
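Enabling Prometheus Metrics
Metrics are available whenever the prometheus_client package is installed (for example, with pip install prometheus-client).
They are collected lazily and require no explicit configuration. Access them through the standard Prometheus client, as in the scraping example that follows.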
Example: Scraping in Prometheus
Expose a metrics endpoint from your runner so that Prometheus can scrape it:
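import asyncio
from prometheus_client import start_http_server

async def main() -> None:
    # Start the Prometheus metrics HTTP server on port 8000
    start_http_server(8000)
    # Now run your portfolio (MyPortfolio is your own portfolio class)
    portfolio = MyPortfolio(...)
    await portfolio.run(...)

asyncio.run(main())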
Then configure Prometheus to scrape http://localhost:8000/metrics.
Operational Deployment Patterns
Container Orchestration (Kubernetes)
Deploy tradedesk runners in containers with these best practices:
- Mount IG credentials via secrets: Inject them as environment variables from Secrets (and non-sensitive settings from ConfigMaps) rather than hardcoding them
- Graceful shutdown: Set terminationGracePeriodSeconds to 60+ (allow time for orders to clear)
- Resource requests: Set appropriate CPU/memory limits
- Signal handling: SIGTERM is automatically handled; no custom shutdown scripts needed
Example Kubernetes manifest excerpt:
apiVersion: v1
kind: Pod
metadata:
  name: tradedesk-runner
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: runner
      image: my-tradedesk-runner:latest
      env:
        - name: IG_API_KEY
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: api-key
        - name: IG_USERNAME
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: username
        - name: IG_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: password
        - name: IG_ENVIRONMENT
          value: "DEMO"
        - name: TRADEDESK_STREAM_MAX_STALE_S
          value: "300"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
Systemd Service
For systemd-managed services, SIGTERM is sent on shutdown:
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/tradedesk/runner.py
Environment="IG_ENVIRONMENT=LIVE"
# Allow 60 seconds for graceful shutdown
TimeoutStopSec=60
# Restart on failure, but not immediately
Restart=on-failure
RestartSec=30
Failure Modes and Recovery
Stream Disconnection
Symptom: Prices stop updating, log shows "stream stale".
Response:
1. Monitor detects silence after STREAM_MAX_STALE_S
2. Streamer initiates reconnect
3. Subscriptions are retried up to STREAM_SUB_MAX_RETRIES times
4. If max retries exceeded, error is logged and positions remain open
Action: Check network connectivity, IG service status, and review logs.
Authentication Failure
Symptom: Log shows "auth failed" or "invalid session".
Response:
1. TokenState triggers refresh on next call
2. Single-flight mechanism ensures one /session request per refresh cycle
3. If refresh fails, subsequent orders may be rejected
Action: Verify IG credentials are valid and account permissions allow API access.
Order Confirmation Timeout
Symptom: Order placed but deal confirmation never arrives.
Response:
1. IG_DEAL_CONFIRM_TIMEOUT_S expires (default 10 seconds)
2. TimeoutError is raised to caller
3. Position state is inconsistent until manual intervention
Action: Increase IG_DEAL_CONFIRM_TIMEOUT_S on slow connections, or investigate IG service issues.
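On the caller side, one cautious way to handle this timeout is to treat the outcome as unknown and reconcile before resubmitting, since the order may or may not have been filled. The sketch below is an assumption about how a caller might do that. `confirm_or_flag` and the returned dictionary shape are hypothetical, and the constant simply mirrors the default quoted above.

```python
import asyncio
from typing import Awaitable

IG_DEAL_CONFIRM_TIMEOUT_S = 10.0  # mirrors the default quoted above


async def confirm_or_flag(confirmation: Awaitable[dict],
                          timeout_s: float = IG_DEAL_CONFIRM_TIMEOUT_S) -> dict:
    """Wait for a deal confirmation; report UNKNOWN instead of guessing."""
    try:
        return await asyncio.wait_for(confirmation, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The order may or may not have filled: reconcile before retrying.
        return {"status": "UNKNOWN", "needs_reconcile": True}
```

Flagging the result rather than retrying automatically avoids placing a duplicate order when the original was in fact accepted.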
See Also
- docs/settings.md: Tunable parameters and environment variables
- tradedesk/execution/ig/metrics.py: Prometheus metric definitions
- tradedesk/runner.py: Signal handler and portfolio lifecycle
- tradedesk/execution/ig/price_streamer.py: Stream reconnection logic