Scan catalogue¶

Overview¶

The v1 catalogue exposes 22 requirement-IDs across three families: 11 statistical-anomaly scans on single instruments (ANOM), 5 cross-instrument scans (CROSS), and 6 seasonality scans (SEAS). These map to 23 distinct scan_id@version pairs because cross.corr.*_rolling ships both Pearson and Spearman variants under requirement CROSS-02, and stats.autocorr.ljung_box ships both raw-returns and squared-returns variants under requirement ANOM-04.

Each scan emits a Finding::Result envelope per (scan x instrument(s) x timeframe x window x param-point). The envelope shape is locked — see findings_envelope.md. Per-scan variation is confined to effect.metric, effect.value, and the effect.extra keys documented below.

Invoke any scan from the CLI:

miner scan <scan_id@version> --instrument SYM:side --timeframe 15m \
  --window 2024-06-12:2024-06-13 --params lags=5

CROSS scans take two --instrument flags (one per leg). All scans are byte-identical reproducible across re-runs when given the same --seed — see the repro envelope in findings_envelope.md.

miner scans introspects the registered catalogue and emits one JSONL row per scan with the canonical scan_id@version, param_schema, arity, and finding_fields() shape.

ANOM — Single-instrument statistical scans¶

Eleven REQUIREMENTS rows (ANOM-01..11), twelve distinct scan_id@version pairs (the Ljung-Box family ships raw + squared-returns variants under ANOM-04).

stats.returns.profile@1¶

Returns-series profile dispatching log / simple / intraday / overnight returns by params.variant.

Arity: Single
effect.metric: returns_{variant}_mean (dynamic per-invocation)
effect.value: arithmetic mean of the chosen returns variant
effect.extra keys: mean, n, returns_vector, std, variant_label
Raw arrays: returns, timestamps_ms
Reference: numpy.diff (log returns) + numpy.mean / numpy.std
When to reach for it: baseline characterisation of any new instrument / timeframe before deeper tests
Requirement: ANOM-01

stats.summary.welford@1¶

Welford running-moments + IQR summary statistics on log returns or closes.

Arity: Single
effect.metric: summary_welford_mean
effect.value: arithmetic mean
effect.extra keys: excess_kurtosis, iqr, max, min, n, skew, std
Raw arrays: returns, timestamps_ms (or closes, timestamps_ms under series=close)
Reference: scipy.stats.describe
When to reach for it: quick distributional summary including skew + excess kurtosis
Requirement: ANOM-02

stats.vol.rolling@1¶

Rolling realised volatility + vol-of-vol over log returns.

Arity: Single
effect.metric: vol_rolling_last
effect.value: last-window realised volatility
effect.extra keys: values, vol_of_vol, window_length, window_starts_ms
Raw arrays: returns, timestamps_ms
Reference: pandas.Series.rolling(window).std (vol) + nested rolling std on vol (vol-of-vol)
When to reach for it: detecting volatility regime shifts and vol-of-vol clustering
Requirement: ANOM-03

stats.autocorr.ljung_box@1¶

Ljung-Box Q-statistic on raw log returns, testing serial correlation at lags 1..L.

Arity: Single
effect.metric: ljung_box_q
effect.value: Q-statistic at max lag (chi-squared with lags degrees of freedom)
effect.extra keys: acf, lags, p_values, q_stats
Raw arrays: returns, timestamps_ms
Reference: statsmodels.stats.diagnostic.acorr_ljungbox
When to reach for it: detecting linear autocorrelation in returns at short lags (predictability red flag for naive trend models)
Requirement: ANOM-04 (raw returns variant)

stats.autocorr.ljung_box_sq@1¶

Ljung-Box Q-statistic on squared log returns, testing ARCH effects / volatility clustering.

Arity: Single
effect.metric: ljung_box_q_squared
effect.value: Q[lags-1] on squared returns
effect.extra keys: acf, lags, p_values, q_stats, series_kind
Raw arrays: returns_squared, timestamps_ms
Reference: statsmodels.stats.diagnostic.acorr_ljungbox (applied to squared returns)
When to reach for it: ARCH / GARCH effect detection; non-trivial Q[sq] with trivial Q[raw] is the classic volatility-clustering signature
Requirement: ANOM-04 (squared returns variant)

stats.stationarity.adf@1¶

Augmented Dickey-Fuller unit-root test with AIC lag selection.

Arity: Single
effect.metric: adf_statistic
effect.value: τ (the test statistic)
effect.extra keys: crit_values, lag_selected, nobs, p_value, regression
Raw arrays: returns, timestamps_ms
Reference: statsmodels.tsa.stattools.adfuller (maxlag=None, regression='c'|'ct'|'ctt'|'nc', default AIC search)
When to reach for it: unit-root / mean-reversion screening before fitting any mean-reverting model
Requirement: ANOM-05

stats.stationarity.kpss@1¶

KPSS stationarity test with Schwert / Hobijn-Franses-Ooms auto-lag formula.

Arity: Single
effect.metric: kpss_statistic
effect.value: KPSS statistic (always non-negative)
effect.extra keys: crit_values, lag_truncation, p_value, regression
Raw arrays: returns, timestamps_ms
Reference: statsmodels.tsa.stattools.kpss (regression='c', nlags='auto')
When to reach for it: stationarity confirmation; KPSS rejects-with-low-p where ADF fails-to-reject is the textbook stationary case
Requirement: ANOM-06

stats.variance_ratio.lo_mackinlay@1¶

Lo-MacKinlay multi-k variance ratio test for random-walk behaviour.

Arity: Single
effect.metric: variance_ratio_max_k
effect.value: VR at the largest k in k_values
effect.extra keys: k_values, p_values, vr_values, z_stats
Raw arrays: returns, timestamps_ms
Reference: Lo & MacKinlay (1988); the arch Python package's variance-ratio module is the canonical reference (no direct statsmodels equivalent)
When to reach for it: detecting mean-reversion (VR<1) or trending (VR>1) deviations from random walk across multiple horizons
Requirement: ANOM-07

stats.heteroskedasticity.arch_lm@1¶

Engle ARCH-LM test for conditional heteroskedasticity.

Arity: Single
effect.metric: arch_lm_statistic
effect.value: LM statistic
effect.extra keys: f_p_value, f_statistic, lag, p_value
Raw arrays: returns, timestamps_ms
Reference: statsmodels.stats.diagnostic.het_arch (with nlags=L)
When to reach for it: detecting time-varying conditional variance; complements ljung_box_sq with a proper LM test
Requirement: ANOM-08

stats.normality.jarque_bera@1¶

Jarque-Bera test of normality based on sample skewness + excess kurtosis.

Arity: Single
effect.metric: jarque_bera_statistic
effect.value: JB statistic (chi-squared with 2 df under H0)
effect.extra keys: excess_kurtosis, n, p_value, skew
Raw arrays: returns, timestamps_ms
Reference: scipy.stats.jarque_bera
When to reach for it: sanity-checking the normality assumption baked into many parametric tests; fat-tailed returns reject hard
Requirement: ANOM-09

stats.outliers.z_and_mad@1¶

Combined z-score + Iglewicz-Hoaglin modified-z outlier detector.

Arity: Single
effect.metric: outliers_count
effect.value: number of bars flagged as outliers
effect.extra keys: mad, median, modified_z_threshold, outlier_indices, outlier_values_modified_z, outlier_values_z, z_threshold
Raw arrays: returns, timestamps_ms (or closes, timestamps_ms under series=close)
Reference: Iglewicz & Hoaglin (1993) — 0.6745 * (x - median) / MAD threshold
When to reach for it: identifying anomalous bars before any moment-based summary; MAD-based detector is robust where z-score is not
Requirement: ANOM-10

stats.drawdown.profile@1¶

Drawdown profile of the cumulative-returns equity curve.

Arity: Single
effect.metric: max_drawdown
effect.value: maximum drawdown (negative number; 0.0 if monotonically non-decreasing)
effect.extra keys: dd_distribution_p50_p95_p99, drawdown_durations_ms, equity_curve, peaks, time_to_recover_ms, troughs
Raw arrays: returns, timestamps_ms
Reference: standard running-maximum drawdown formulation; equivalent to pandas (cumsum.cummax - cumsum) on log returns
When to reach for it: worst-case loss profiling and recovery-time analysis; complements vol-based risk summaries
Requirement: ANOM-11

stats.regime.cusum_break@1¶

CUSUM structural-break / regime detection on the log-return series. Binary segmentation ("Bai-Perron-lite") finds breaks in the mean (drift) and/or volatility; emits one located finding per detected break.

Arity: Single
effect.metric: cusum_break
effect.value: the normalized CUSUM statistic at the detected break
effect.extra keys: break_index, break_ts_ms, post_mean, post_median, post_n, post_std, pre_mean, pre_median, pre_n, pre_std, regime_index, target
effect.effect_size: mean_shift (standardized mean shift) or vol_shift (post/pre std ratio), per break
params: target (mean/vol/both, default both), threshold (default 1.358 — Kolmogorov 5% critical value; lower is more sensitive), min_segment (default 30)
Raw arrays: closes, timestamps_ms
Reference: standardized CUSUM (sup of Brownian bridge) for mean; Inclán & Tiao (1994) CUSUM-of-squares for variance; binary segmentation for multiple breaks
When to reach for it: as a conditioning filter / meta-signal — gate entries to a single drift/volatility regime, or strengthen cointegration-breakdown detection. A stationary series flags no breaks.

CROSS — Two-instrument scans¶

Five REQUIREMENTS rows (CROSS-01..05), six distinct scan_id@version pairs (the rolling-correlation family ships Pearson + Spearman under CROSS-02). All five scan-IDs are Pair-arity and emit a length-2 instruments vector.

CROSS-01 (time-alignment primitive) is a reusable building block, NOT a stand-alone scan_id — it lives at crates/miner-core/src/scan/primitives/time_alignment.rs and is used by every two-instrument scan via inner_join.

cross.corr.pearson_rolling@1¶

Rolling Pearson correlation over aligned log returns.

Arity: Pair
effect.metric: pearson_corr_last
effect.value: last-window Pearson r
effect.extra keys: threshold, threshold_crossings, values, window_length, window_starts_ms
Raw arrays: returns_a, returns_b, timestamps_ms
Reference: hand-rolled rolling numpy.corrcoef (tolerance 1e-10 at goldens level)
When to reach for it: dynamic correlation tracking; threshold crossings flag regime changes
Requirement: CROSS-02 (Pearson variant)

cross.corr.spearman_rolling@1¶

Rolling Spearman ρ over aligned log returns with method='average' tie correction.

Arity: Pair
effect.metric: spearman_corr_last
effect.value: last-window Spearman ρ
effect.extra keys: threshold, threshold_crossings, values, window_length, window_starts_ms
Raw arrays: returns_a, returns_b, timestamps_ms
Reference: scipy.stats.spearmanr (default method='average'; tolerance 1e-8)
When to reach for it: monotonic-association tracking robust to extreme moves where Pearson breaks
Requirement: CROSS-02 (Spearman variant)

cross.ols.rolling@1¶

Rolling OLS regression on aligned log returns, emitting per-window β, α, R², residual_std.

Arity: Pair
effect.metric: ols_rolling_beta_last
effect.value: last-window β (hedge ratio)
effect.extra keys: alphas, betas, r2s, residual_stds, window_length, window_starts_ms
Raw arrays: returns_a, returns_b, timestamps_ms
Reference: statsmodels.regression.rolling.RollingOLS (tolerance 1e-9)
When to reach for it: time-varying hedge-ratio estimation; β instability is a pairs-trading red flag
Requirement: CROSS-03

cross.lead_lag.ccf@1¶

Cross-correlation function on a symmetric ±max_lag grid, with absolute-value argmax-lag.

Arity: Pair
effect.metric: lead_lag_argmax_lag
effect.value: argmax lag (integer-valued, cast to f64)
effect.extra keys: argmax_lag, argmax_value, ccf_values, lags, max_lag
Raw arrays: returns_a, returns_b, timestamps_ms
Reference: scipy.signal.correlate (mode='full') with normalisation, or statsmodels.tsa.stattools.ccf (tolerance 1e-10)
When to reach for it: detecting which instrument leads; non-zero argmax-lag is the canonical lead-lag signature
Requirement: CROSS-04

cross.cointegration.engle_granger@1¶

Engle-Granger two-step cointegration test (OLS hedge-ratio + ADF on residuals + OU half-life).

Arity: Pair
effect.metric: engle_granger_hedge_ratio
effect.value: β (the hedge ratio per D4-09)
effect.extra keys: adf_stat, hedge_ratio_alpha, ou_half_life, residual_std, residuals
Raw arrays: close_a, close_b, timestamps_ms (operates on LEVELS, not returns)
Reference: statsmodels.tsa.stattools.coint (y0=leg_a, y1=leg_b, trend='c')
When to reach for it: pairs-trading viability screen; cointegration + short OU half-life is the canonical statistical-arbitrage signal
Requirement: CROSS-05

SEAS — Seasonality scans¶

Six REQUIREMENTS rows (SEAS-01..06), six distinct scan_id@version pairs. All six are Single-arity.

seas.bucket.hour_of_day@1¶

Hour-of-day bucketing of returns into 24 buckets with per-bucket summary stats + t-stats.

Arity: Single
effect.metric: hour_of_day_max_abs_t_stat
effect.value: max-abs t-statistic across the 24 buckets
effect.extra keys: buckets, counts, iqrs, means, stds, t_stats
Raw arrays: returns, timestamps_ms
Reference: pandas.DataFrame.groupby(hour).agg(mean / std / count) + scipy.stats one-sample t-statistic per bucket
When to reach for it: intraday seasonality detection (FX session opens / closes; equities market open / close)
Requirement: SEAS-01

seas.bucket.day_of_week@1¶

Day-of-week bucketing (Mon-Sun) with per-bucket summary stats + t-stats.

Arity: Single
effect.metric: day_of_week_max_abs_t_stat
effect.value: max-abs t-statistic across the 7 buckets
effect.extra keys: buckets, counts, iqrs, means, stds, t_stats
Raw arrays: returns, timestamps_ms
Reference: pandas.DataFrame.groupby(dayofweek).agg(...) + scipy.stats one-sample t-statistic per bucket
When to reach for it: weekly seasonality (Monday-effect, weekend-gap effects)
Requirement: SEAS-02

seas.bucket.session@1¶

Trading-session bucketing with configurable session definitions (default FX-major: Asia / London / NY / Overlap per RESEARCH §1.8).

Arity: Single
effect.metric: session_max_abs_t_stat
effect.value: max-abs t-statistic across configured sessions
effect.extra keys: bucket_labels, counts, iqrs, means, session_boundaries_utc, stds, t_stats
Raw arrays: returns, timestamps_ms
Note: sessions are independent buckets; overlapping windows mean a bar is counted in every matching session. MAX_SESSIONS = 100 DOS-mitigation ceiling.
When to reach for it: FX-specific seasonality where time-of-day overlaps multiple regional sessions
Requirement: SEAS-03

seas.bucket.eom_som@1¶

End-of-month / start-of-month bucketing with cutoff_n trading days at each month edge (default 3 → 6 buckets EOM-3..EOM-1, SOM-1..SOM-3).

Arity: Single
effect.metric: eom_som_max_abs_t_stat
effect.value: max-abs t-statistic across the configured buckets
effect.extra keys: bucket_labels, counts, cutoff_n, iqrs, means, stds, t_stats
Raw arrays: returns, timestamps_ms
When to reach for it: month-end portfolio-rebalancing effects and month-start positioning flows
Requirement: SEAS-04

seas.test.anova_kruskal@1¶

Bundled parametric (ANOVA F-test) + non-parametric (Kruskal-Wallis) test for cross-bucket mean differences, with bucket source selected via buckets_via ∈ {hour_of_day, day_of_week, session, eom_som}.

Arity: Single
effect.metric: anova_f_statistic
effect.value: F-statistic
effect.extra keys: anova_p_value, group_count, kw_p_value, kw_stat, total_n
Raw arrays: returns, timestamps_ms
Reference: scipy.stats.f_oneway (parametric F-stat) and scipy.stats.kruskal (rank-based non-parametric)
Note: the bundled-output design (Plan 04-10) emits BOTH the F-stat and KW-stat in one envelope so consumers can compare parametric vs non-parametric significance without re-running.
When to reach for it: confirming the bucketed-seasonality screens with a formal hypothesis test
Requirement: SEAS-05

seas.event.pre_post_window@1¶

Pre / post event-window aggregation around caller-supplied event timestamps.

Arity: Single
effect.metric: event_post_window_mean
effect.value: arithmetic mean of post-event-window returns
effect.extra keys: event_count, post_window_bars, post_window_means, post_window_stds, pre_window_bars, pre_window_means, pre_window_stds
Raw arrays: returns, timestamps_ms
Note: caller supplies event_timestamps (UTC ms-since-epoch); events outside the bar range or with insufficient pre/post bars are silently skipped. MAX_EVENT_TIMESTAMPS = 100_000 DOS mitigation (Plan 04-10).
When to reach for it: event-study analysis around NFP / FOMC / ECB / earnings; arbitrary caller-defined event timestamps
Requirement: SEAS-06

seas.gap.overnight@1¶

Session/overnight gap statistics: detects close→next-open jumps that straddle a session boundary, then reports the gap-size distribution and gap-fill probability conditioned on direction × size bucket.

Arity: Single
effect.metric: overnight_gap_fill_rate
effect.value: overall fill rate (filled / total gaps; 0 when no gaps)
effect.extra keys: bucket_labels, down_counts, down_fill_counts, down_fill_prob, gap_count, gap_size_quantile_probs, gap_size_quantiles, hold_floor_caveat, median_bars_to_fill, size_bucket_edges, sparse_gaps, up_counts, up_fill_counts, up_fill_prob
Raw arrays: gap_bars_to_fill, gap_directions, gap_filled, gap_sizes, timestamps_ms (one entry per detected gap)
Params: boundary_gap_minutes (session/day-boundary; default 1.5× timeframe), size_bucket_edges (default [5e-4, 1e-3, 2e-3]), min_gap_threshold (default 0), resolution_hint (advisory), fill_lookahead_bars (default 48), min_obs_per_bucket (default 5), hold_floor_bars (default 12), sparse_gap_min_count (default 20)
Note: resolution-parameterized — boundaries come from the inter-bar time delta, so a continuous 24×5 FX series produces zero gaps and trips the sparse_gaps flag rather than emitting weak candidates. hold_floor_caveat trips when the data-driven median bars-to-fill is below the 12-bar arena floor (a gap-fill trade holds too few bars at this resolution — build at a finer resolution). Most valuable on equity-index / commodity CFDs; most useful at 15-min / 30-min bars.
When to reach for it: overnight / session gap-fill edges on index & commodity CFDs (Scout-ranked candidate, RAD-3548)
Requirement: RAD-3840

License¶

Licensed under the Apache License, Version 2.0. See: https://www.apache.org/licenses/LICENSE-2.0