Operational Resilience and Monitoring¶
tradedesk is designed for production use with built-in resilience mechanisms and operational monitoring.
Resilience Features¶
Asyncio-Based Retry Scheduler¶
Subscription retries (price stream reconnects) use an asyncio-based RetryScheduler that is cancellable when the streamer closes. This replaces earlier threading-based approaches and ensures clean, deterministic shutdown without orphaned threads.
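The cancellation property described above can be illustrated with plain asyncio tasks. This is a minimal sketch, not the RetryScheduler source: the class name CancellableRetries and its schedule/close methods are hypothetical stand-ins chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable, Set


class CancellableRetries:
    """Hypothetical stand-in for a retry scheduler built on asyncio tasks."""

    def __init__(self) -> None:
        self._tasks: Set[asyncio.Task] = set()

    def schedule(self, delay_s: float, action: Callable[[], Awaitable[None]]) -> None:
        """Run `action` after `delay_s` seconds unless close() is called first."""

        async def _later() -> None:
            await asyncio.sleep(delay_s)
            await action()

        task = asyncio.create_task(_later())
        self._tasks.add(task)
        # Drop the reference once the task finishes or is cancelled.
        task.add_done_callback(self._tasks.discard)

    async def close(self) -> None:
        """Cancel every pending retry and wait until each task has finished."""
        tasks = list(self._tasks)
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

Because each retry is a task on the event loop rather than a thread, awaiting close() is enough to guarantee that nothing fires after shutdown.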
Behavior:
- Exponential backoff with multiplicative jitter: delay = min(STREAM_SUB_RETRY_BASE_DELAY_S * 2 ** retry, STREAM_SUB_RETRY_MAX_DELAY_S) * uniform(0.5, 1.5). The jitter desynchronises retries across instruments so a group-wide subscription error does not trigger a thundering herd of simultaneous resubscriptions.
- Backoff ceiling: STREAM_SUB_RETRY_MAX_DELAY_S (default: 30.0s) caps the deterministic term before jitter, so an individual delay can reach 1.5x the ceiling (45s at the default)
- Maximum retries: STREAM_SUB_MAX_RETRIES (default: 3)
- Automatic cancellation on streamer shutdown
Configuration: See docs/settings.md for STREAM_SUB_* environment variables.
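The delay formula in the list above can be written as a small function. This is a sketch for illustration only: the function name retry_delay and its parameter names are assumptions, and the base delay is passed explicitly because its default is not stated here. Only the 30.0s ceiling is a documented default.

```python
import random
from typing import Optional


def retry_delay(
    retry: int,
    base_delay_s: float,
    max_delay_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay in seconds for the given retry index."""
    if rng is None:
        rng = random.Random()
    # Deterministic term: exponential growth, capped at the ceiling.
    capped = min(base_delay_s * 2 ** retry, max_delay_s)
    # Jitter is applied after the cap, so the result lies in
    # [0.5 * capped, 1.5 * capped].
    return capped * rng.uniform(0.5, 1.5)
```

Passing a seeded random.Random makes the delays reproducible in tests, while production code can rely on the default generator.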
Structural Subscription Errors — Abandon Without Retry¶
Not every subscription error is transient. IG Lightstreamer error code 21 ("Invalid group") is raised when a subscription item is malformed or names a non-existent group, for example a CHART item that requests an unsupported candle scale (IG CHART scales are SECOND / 1MINUTE / 5MINUTE / HOUR; there is no DAY scale). Retrying the identical item can never succeed. The jittered-backoff retry above would therefore loop indefinitely (retrying (1/3)...(3/3), retries exhausted, then the same cycle again on every reconnect) and leave the affected sleeve permanently dark while the process otherwise reported healthy.
Both the market and chart subscription listeners now detect these structural errors and:
- Abandon the item immediately — no retry is scheduled. Transient errors (e.g. code 503) keep their normal jittered-backoff retry.
- Log a distinct "STRUCTURAL error ... will NOT retry ... Manual intervention required" marker at ERROR level, naming the affected items/instrument so ops can identify and correct the offending subscription.
- Increment the dedicated tradedesk_ig_subscription_rejected_total{kind,code} counter (see Operational Metrics), so alerting can surface a dark sleeve instead of it hiding behind an otherwise-healthy process.
Detection matches the numeric code (21) or an "invalid group" substring in the error message, so it is robust to IG returning the code as an int or a string.
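The detection rule described above can be sketched as a single predicate. The function name is_structural_error, the STRUCTURAL_CODES set, and the case-insensitive substring match are assumptions made for this example, not the listeners' actual code.

```python
from typing import Union

# Assumption: code 21 is the only structural code named in this document.
STRUCTURAL_CODES = {21}


def is_structural_error(code: Union[int, str, None], message: str = "") -> bool:
    """Return True when a subscription error cannot be fixed by retrying."""
    try:
        # Accept the code as an int (21) or as a string ("21").
        numeric = int(code) if code is not None else None
    except (TypeError, ValueError):
        numeric = None
    if numeric in STRUCTURAL_CODES:
        return True
    # Fall back to the message text when the code is missing or unparseable.
    return "invalid group" in (message or "").lower()
```

A listener would call this before scheduling a retry and abandon the item when it returns True.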
Single-Flight OAuth Refresh¶
When multiple concurrent callers need to refresh an expired session token, they share a single /session authentication request instead of racing and creating duplicate tokens. This avoids:
- IG API rate limits on session creation
- Transient auth failures during high concurrency
- Wasted network requests
Automatic: No configuration needed. Token state is managed internally by the TokenState lifecycle.
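The single-flight idea can be sketched with one shared asyncio task. This is an illustration under stated assumptions, not the TokenState implementation: the SingleFlight class and its get method are hypothetical, and the refresh callable stands in for the real /session request.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SingleFlight:
    """Share one in-flight refresh among all concurrent callers."""

    def __init__(self, refresh: Callable[[], Awaitable[str]]) -> None:
        self._refresh = refresh
        self._inflight: Optional[asyncio.Task] = None

    async def get(self) -> str:
        task = self._inflight
        if task is None:
            # The first caller starts the refresh; later callers reuse the task.
            task = asyncio.create_task(self._run())
            self._inflight = task
        # shield() keeps one cancelled caller from cancelling the shared refresh.
        return await asyncio.shield(task)

    async def _run(self) -> str:
        try:
            return await self._refresh()
        finally:
            # Clear the slot so the next expiry triggers a new refresh.
            self._inflight = None
```

Every caller that arrives while the refresh is running receives the same result, or the same exception if the refresh fails.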
SIGTERM/SIGHUP Signal Handling¶
Live runners install signal handlers that cleanly shut down the portfolio when receiving SIGTERM (container stop, systemd shutdown) or SIGHUP (terminal disconnect):
- Signal handler sets shutdown event
- Event loop processes pending orders and closes connections gracefully
- Open positions remain open (not closed automatically)
- Process exits cleanly without orphaned tasks
Example: Kubernetes container stop → SIGTERM → portfolio shutdown → graceful exit.
Platform note: Signal handlers are installed on POSIX systems (Linux, macOS). Windows runners fall back to KeyboardInterrupt only.
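The handler installation and the platform fallback can be sketched as follows. The function names install_shutdown_handlers and run_until_signalled are hypothetical, and the cleanup step is only marked by a comment; the real lifecycle lives in the runner.

```python
import asyncio
import signal
from typing import List


def install_shutdown_handlers(
    loop: asyncio.AbstractEventLoop, shutdown: asyncio.Event
) -> List[int]:
    """Set `shutdown` on SIGTERM/SIGHUP; return the signals actually installed."""
    installed: List[int] = []
    for name in ("SIGTERM", "SIGHUP"):
        sig = getattr(signal, name, None)  # SIGHUP does not exist on Windows
        if sig is None:
            continue
        try:
            loop.add_signal_handler(sig, shutdown.set)
        except NotImplementedError:
            # Windows event loops do not support add_signal_handler.
            continue
        installed.append(sig)
    return installed


async def run_until_signalled() -> None:
    """Block until a shutdown signal arrives, then return for cleanup."""
    shutdown = asyncio.Event()
    install_shutdown_handlers(asyncio.get_running_loop(), shutdown)
    await shutdown.wait()
    # Hypothetical cleanup point: process pending orders, close connections.
```

The handler only sets an event, so all real shutdown work runs on the event loop rather than inside the signal handler itself.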
Stale Stream Detection and Reconnect¶
The Lightstreamer price stream includes a heartbeat monitor that detects stale connections:
- Detection: Monitor checks stream age every STREAM_HEARTBEAT_SLEEP_S seconds
- Threshold: If no data for STREAM_MAX_STALE_S seconds (default: 300s), reconnect is initiated
- Reconnect delay: STREAM_RECONNECT_DELAY_S between attempts (default: 5s)
- Log suppression: During expected market closures (>300s silence), warnings are suppressed to avoid spam
Tuning: Adjust STREAM_MAX_STALE_S for high-latency or unreliable networks.
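The staleness check reduces to comparing the age of the last update against the threshold. A minimal sketch, assuming a hypothetical StaleMonitor class, a strict greater-than comparison, and an injectable clock for testing:

```python
import time
from typing import Callable


class StaleMonitor:
    """Track the time of the last update and report when the stream is stale."""

    def __init__(
        self,
        max_stale_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._last_update = clock()

    def record_update(self) -> None:
        """Call whenever the stream delivers data."""
        self._last_update = self._clock()

    def age_s(self) -> float:
        """Seconds since the last update."""
        return self._clock() - self._last_update

    def is_stale(self) -> bool:
        # Assumption: the stream counts as stale once the age exceeds the threshold.
        return self.age_s() > self._max_stale_s
```

A heartbeat loop would sleep for the configured interval, call is_stale(), and start a reconnect when it returns True. A monotonic clock is used so wall-clock adjustments cannot trigger a false reconnect.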
Pre-Reconnect Session Refresh and Unproductive-Reconnect Cap¶
Every reconnect attempt first refreshes the IG REST /session (CST/XST) before recreating the Lightstreamer client. The new tokens are then used to authenticate the fresh LS connection. This prevents the in-process reconnect loop from re-using stale session tokens that IG silently accepts but no longer streams data against.
A reconnect is treated as unproductive when:
- The LS connection reaches CONNECTED:* but no real-time updates arrive within STREAM_UNPRODUCTIVE_GRACE_S (default 60s), or
- The pre-reconnect /session refresh raises (network error, rate limit, etc.).
After STREAM_UNPRODUCTIVE_RECONNECT_CAP (default 3) consecutive unproductive reconnects, the streamer raises UnproductiveReconnectError. The supervising process (systemd, orchestrator) is expected to restart the container — which forces a completely fresh IG session and LS connection — instead of allowing the in-process loop to spin forever.
A productive session (any session that received at least one update before going stale) resets the consecutive-unproductive counter, so routine IG session-rollover does not accumulate toward the cap.
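The counting rule can be sketched as a small tracker. This is an illustration, not the streamer's code: ReconnectTracker and its methods are hypothetical, the exception class below is a local stand-in for the UnproductiveReconnectError named above, and raising on the attempt that reaches the cap is an assumption about the exact boundary.

```python
class UnproductiveReconnectError(RuntimeError):
    """Local stand-in for the error the streamer raises at the cap."""


class ReconnectTracker:
    """Count consecutive unproductive reconnects and surrender at the cap."""

    def __init__(self, cap: int = 3) -> None:
        self._cap = cap
        self.consecutive_unproductive = 0

    def record_productive(self) -> None:
        # Any session that delivered at least one update resets the count.
        self.consecutive_unproductive = 0

    def record_unproductive(self, reason: str) -> None:
        # Covers both cases: no updates within the grace period, or a
        # failed pre-reconnect session refresh.
        self.consecutive_unproductive += 1
        if self.consecutive_unproductive >= self._cap:
            raise UnproductiveReconnectError(
                f"{self.consecutive_unproductive} consecutive unproductive "
                f"reconnects (last: {reason})"
            )
```

Letting the exception propagate out of the process is what hands control to the supervisor, which then restarts with a fresh session.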
Structured Loki markers emitted on the reconnect path:
| Marker | Level | Fields |
|---|---|---|
| reauth_attempted | INFO | reason, attempt |
| reauth_result | INFO (ok) / ERROR (fail) | attempt, status, error, cst_refreshed |
| reconnect_attempt | INFO | attempt, cap |
| reconnect_unproductive | WARNING | attempt, grace_seconds, bars_received |
| reconnect_surrender | ERROR | attempts, last_error |
Tuning: Increase STREAM_UNPRODUCTIVE_GRACE_S for chart-only streams with long bar periods where the first bar close can legitimately exceed 60s.
Operational Metrics¶
tradedesk emits Prometheus metrics for operational visibility. Metrics are lazily imported — no hard dependency on prometheus_client.
Available Metrics¶
Counter: tradedesk_ig_auth_refreshes_total¶
Incremented each time a session token is refreshed. Labels: outcome (success/failure).
Gauge: tradedesk_ig_auth_refresh_inflight¶
Number of in-flight authentication refresh requests. Tracks single-flight OAuth mechanism effectiveness. No labels.
Counter: tradedesk_ig_subscription_retries_total¶
Incremented for each subscription retry attempt on the price stream. Tracks unreliable subscription requests. Labels: kind (retry type).
Counter: tradedesk_ig_subscription_rejected_total¶
Incremented when a subscription is rejected with a structural error (e.g. code 21 "Invalid group") that retries cannot fix; the item is abandoned and the affected sleeve stays dark until corrected. Unlike tradedesk_ig_subscription_retries_total, any non-zero value here demands manual intervention. Labels: kind (market / chart), code (the IG error code, e.g. 21).
Counter: tradedesk_ig_stream_reconnects_total¶
Incremented each time the price stream reconnects. High counts may indicate network issues. Labels: reason (why reconnect was triggered).
Histogram: tradedesk_ig_stream_stale_seconds¶
Duration of stream silence before reconnect. Tracks impact of connectivity issues on data delivery.
Enabling Prometheus Metrics¶
Metrics are available whenever the prometheus_client package is installed. They are collected lazily and require no explicit configuration; read or expose them through the standard Prometheus client.
Example: Scraping in Prometheus¶
Configure a Prometheus scrape endpoint in your runner:
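import asyncio
from prometheus_client import start_http_server

async def main() -> None:
    # Start the Prometheus metrics HTTP server on port 8000
    start_http_server(8000)
    # Now run your portfolio (MyPortfolio is your own portfolio class)
    portfolio = MyPortfolio(...)
    await portfolio.run(...)

asyncio.run(main())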
Then configure Prometheus to scrape http://localhost:8000/metrics.
Operational Deployment Patterns¶
Container Orchestration (Kubernetes)¶
Deploy tradedesk runners in containers with these best practices:
- Mount IG credentials via secrets: Use environment variables from ConfigMaps/Secrets, not hardcoded
- Graceful shutdown: Set terminationGracePeriodSeconds to 60+ (allow time for orders to clear)
- Resource requests: Set appropriate CPU/memory limits
- Signal handling: SIGTERM is automatically handled; no custom shutdown scripts needed
Example Kubernetes manifest excerpt:
apiVersion: v1
kind: Pod
metadata:
  name: tradedesk-runner
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: runner
      image: my-tradedesk-runner:latest
      env:
        - name: IG_API_KEY
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: api-key
        - name: IG_USERNAME
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: username
        - name: IG_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ig-credentials
              key: password
        - name: IG_ENVIRONMENT
          value: "DEMO"
        - name: TRADEDESK_STREAM_MAX_STALE_S
          value: "300"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
Systemd Service¶
For systemd-managed services, SIGTERM is sent on shutdown:
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/tradedesk/runner.py
Environment="IG_ENVIRONMENT=LIVE"
# Allow 60 seconds for graceful shutdown
TimeoutStopSec=60
# Restart on failure, but not immediately
Restart=on-failure
RestartSec=30
Failure Modes and Recovery¶
Stream Disconnection¶
Symptom: Prices stop updating, log shows "stream stale".
Response:
1. Monitor detects silence after STREAM_MAX_STALE_S
2. Streamer initiates reconnect
3. Subscriptions are retried up to STREAM_SUB_MAX_RETRIES times
4. If max retries exceeded, error is logged and positions remain open
Action: Check network connectivity, IG service status, and review logs.
Authentication Failure¶
Symptom: Log shows "auth failed" or "invalid session".
Response:
1. TokenState triggers refresh on next call
2. Single-flight mechanism ensures one /session request per refresh cycle
3. If refresh fails, subsequent orders may be rejected
Action: Verify IG credentials are valid and account permissions allow API access.
Order Confirmation Timeout¶
Symptom: Order placed but deal confirmation never arrives.
Response:
1. IG_DEAL_CONFIRM_TIMEOUT_S expires (default 10 seconds)
2. TimeoutError is raised to caller
3. Position state is inconsistent until manual intervention
Action: Increase IG_DEAL_CONFIRM_TIMEOUT_S on slow connections, or investigate IG service issues.
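On the caller's side, the timeout can be handled by treating the order as unconfirmed rather than failed. A minimal sketch, assuming a hypothetical confirm_or_flag helper and an awaitable that resolves with the confirmation payload; the real client API may differ.

```python
import asyncio
from typing import Any, Awaitable, Optional, Tuple


async def confirm_or_flag(
    confirmation: Awaitable[Any], timeout_s: float = 10.0
) -> Tuple[str, Optional[Any]]:
    """Wait for a deal confirmation; flag the order as UNKNOWN on timeout."""
    try:
        payload = await asyncio.wait_for(confirmation, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The order may still have been filled, so it must be reconciled
        # against the broker before any state is assumed.
        return ("UNKNOWN", None)
    return ("CONFIRMED", payload)
```

Returning an explicit UNKNOWN status makes the inconsistent state visible to the caller instead of leaving it implicit in an unhandled exception.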
See Also¶
- docs/settings.md: Tunable parameters and environment variables
- tradedesk/execution/ig/metrics.py: Prometheus metric definitions
- tradedesk/runner.py: Signal handler and portfolio lifecycle
- tradedesk/execution/ig/price_streamer.py: Stream reconnection logic