Overview
Most supervised-learning pipelines for financial time series start with the same brittle step: pick a fixed look-ahead horizon, compute the forward return, then threshold it into a label. If the horizon is too short, labels are dominated by noise. If it is too long, the regime may change before the prediction materializes.
Trend scanning, introduced by Marcos López de Prado, replaces that fixed choice with an optimization over a set of candidate horizons. At each point in time, it fits a simple linear regression of price against time for every candidate window length, selects the horizon that produces the largest absolute t-statistic, and uses the sign of that t-statistic as the label.
The result is a label that adapts to the local structure of the price path. When a strong trend is running, the method naturally selects a longer window. When the signal is fleeting, it selects a shorter one. The t-statistic provides a built-in confidence score that can be used downstream for sample weighting or filtering.
Visual
Price Path Colored by Trend-Scan t-Value
Each point on the synthetic price path is colored by the trend-scanning t-statistic at that location. Strongly positive t-values (uptrend) appear in light blue; strongly negative t-values (downtrend) appear in dark blue. Flat or ambiguous zones sit in between.
Visual
|t-stat| vs Forward Horizon
For a single starting point, this chart shows the absolute t-statistic from the linear trend regression across every candidate horizon h. The selected horizon h* is the one that maximizes |t|, shown by the highlighted marker.
Visual
Fixed-Horizon vs Trend-Scanning Labels
A direct comparison of labels generated by a fixed 20-day forward return versus trend scanning. Fixed labeling assigns many ambiguous or contradictory labels in choppy zones, while trend scanning concentrates confident labels where the trend evidence is strongest.
Article Section
Why fixed-horizon labeling fails
The standard approach in financial machine learning is to compute the forward return over a fixed number of bars and classify it as up, down, or flat. The problem is that the choice of horizon is arbitrary and has a large impact on label quality.
A 5-day return label mostly captures noise and microstructure effects. A 60-day return label may span two or more distinct regimes. Neither is clearly correct, and the downstream model is forced to learn from labels that do not reflect the actual trend structure of the data.
This is not a minor nuisance. Label quality is the ceiling for any supervised model. If labels are noisy or misaligned with the real trend, no amount of feature engineering or model complexity will recover the lost signal.
label quality is the ceiling for any supervised model
Fixed forward return
rₜ = (Pₜ₊ₕ − Pₜ) / Pₜ
Label rule
yₜ = sign(rₜ) if |rₜ| > τ, else 0
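For contrast, a minimal fixed-horizon labeler can be sketched in a few lines (the horizon h = 20 and dead zone τ = 0.02 are illustrative choices, not prescriptions from the text):

```python
import numpy as np

def fixed_horizon_labels(prices, h=20, tau=0.02):
    """Label each bar by the sign of its h-bar forward return,
    with a dead zone of width tau around zero."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    labels = np.zeros(n)
    fwd = np.full(n, np.nan)
    # r_t = (P_{t+h} - P_t) / P_t; undefined (NaN) for the last h bars
    fwd[: n - h] = prices[h:] / prices[: n - h] - 1.0
    labels[fwd > tau] = 1.0
    labels[fwd < -tau] = -1.0
    return labels, fwd

# A monotonically rising path gets labeled +1 wherever a forward window
# exists, regardless of whether 20 bars is the "right" horizon.
prices = 100 * 1.005 ** np.arange(60)
labels, fwd = fixed_horizon_labels(prices, h=20, tau=0.02)
```

Note that the last h observations can never be labeled, and every labeled point is judged over the same window length no matter what the local price path looks like.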
Article Section
The trend scanning procedure
Trend scanning works by fitting a linear regression of the price path against time for every candidate horizon in a specified range. At each observation, it evaluates all windows from h_min to h_max, computes the t-statistic of the slope coefficient, and selects the horizon where |t| is maximized.
The t-statistic measures how many standard errors the slope is away from zero. A large positive t means the price path over that window is well-described by an upward trend. A large negative t means a clear downtrend. Selecting the horizon with the largest |t| is equivalent to choosing the window where the linear trend explains the most variance relative to noise.
Regression model
Pₜ₊ⱼ = β₀ + β₁ · j + εⱼ, j = 0, 1, …, h
t-statistic of slope
t(β̂₁) = β̂₁ / SE(β̂₁)
Selected horizon
h* = argmaxₕ |t(β̂₁; h)|
Final label
yₜ = sign(t(β̂₁; h*))
Article Section
Python implementation
The core implementation requires only NumPy. The inner function computes the t-statistic for a single (start, horizon) pair using the standard OLS formula. The outer function loops over observations and candidate horizons, selecting the best one at each point.
import numpy as np


def _linear_trend_t_value(prices, start, horizon):
    """
    Compute the t-statistic of the slope for a simple
    linear regression of prices[start : start + horizon + 1]
    against an integer time index.
    """
    y = prices[start : start + horizon + 1]
    n = len(y)
    if n < 3:
        return 0.0
    x = np.arange(n, dtype=np.float64)
    x_bar = x.mean()
    y_bar = y.mean()
    ss_xx = np.sum((x - x_bar) ** 2)
    ss_xy = np.sum((x - x_bar) * (y - y_bar))
    if ss_xx == 0:
        return 0.0
    beta_1 = ss_xy / ss_xx
    y_hat = y_bar + beta_1 * (x - x_bar)
    residuals = y - y_hat
    sse = np.sum(residuals ** 2)
    mse = sse / (n - 2)
    se_beta = np.sqrt(mse / ss_xx) if mse > 0 else 0.0
    if se_beta == 0:
        return 0.0
    return beta_1 / se_beta


def trend_scanning_labels(prices, h_min=5, h_max=20):
    """
    For each observation, scan forward horizons from h_min
    to h_max, pick the one with the largest |t-stat|,
    and return the label (+1 / -1) plus metadata.
    Observations too close to the end of the series to fit
    any candidate window keep label 0 and horizon 0.
    """
    n = len(prices)
    labels = np.zeros(n)
    t_values = np.zeros(n)
    best_horizons = np.zeros(n, dtype=int)
    for i in range(n):
        best_t = 0.0
        best_h = 0
        for h in range(h_min, h_max + 1):
            if i + h >= n:
                break
            t = _linear_trend_t_value(prices, i, h)
            if abs(t) > abs(best_t):
                best_t = t
                best_h = h
        labels[i] = np.sign(best_t)
        t_values[i] = best_t
        best_horizons[i] = best_h
    return labels, t_values, best_horizons


# Generate a synthetic price path
np.random.seed(42)
returns = np.random.normal(0, 0.01, 200)
returns[40:80] += 0.005    # inject uptrend
returns[120:160] -= 0.005  # inject downtrend
prices = 100 * np.cumprod(1 + returns)

# Run trend scanning
labels, t_vals, horizons = trend_scanning_labels(
    prices, h_min=5, h_max=30
)

labeled = horizons > 0  # exclude the unlabeled tail from the summary
print(f"Avg selected horizon: {horizons[labeled].mean():.1f}")
print(f"Fraction labeled up: {(labels == 1).mean():.2%}")
print(f"Fraction labeled dn: {(labels == -1).mean():.2%}")
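As a sanity check on the closed-form t-statistic: for simple linear regression, the slope t-value satisfies the identity t = r · √((n − 2) / (1 − r²)), where r is the Pearson correlation between price and the time index. A standalone sketch verifying the identity on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.arange(n, dtype=float)
y = 0.3 * x + rng.normal(0.0, 1.0, n)  # noisy upward trend

# t-stat via the OLS formulas used in the implementation above.
ss_xx = np.sum((x - x.mean()) ** 2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_xx
resid = y - (y.mean() + beta_1 * (x - x.mean()))
se = np.sqrt(np.sum(resid ** 2) / (n - 2) / ss_xx)
t_ols = beta_1 / se

# t-stat via the correlation identity.
r = np.corrcoef(x, y)[0, 1]
t_corr = r * np.sqrt((n - 2) / (1.0 - r ** 2))
```

The two values agree to floating-point precision, which is a cheap regression test to keep next to any hand-rolled OLS code.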
Article Section
Why the t-statistic is the right metric
Using the t-statistic rather than raw return or slope magnitude has a specific advantage: it normalizes for volatility. A small slope in a low-volatility regime can have a higher t-value than a large slope in a high-volatility regime. This means the method naturally adjusts for local noise levels.
The t-statistic also provides a built-in confidence measure. Labels with |t| > 2 correspond roughly to significance at the 5% level under the null of no trend. Labels with |t| near zero are ambiguous and can be filtered or down-weighted in training.
This is a key practical advantage. Most labeling methods produce binary outputs with no confidence score. Trend scanning gives you a continuous measure of label reliability that can be fed directly into sample-weighted loss functions.
Confidence filter
|t| > 2 → include in training set (practical rule)
Sample weight
wᵢ = |tᵢ| / Σ|tⱼ|
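The filter and weighting rules can be applied directly to the arrays a trend-scanning pass returns. A minimal sketch with illustrative t-values (the threshold 2.0 follows the practical rule above):

```python
import numpy as np

# Illustrative trend-scan outputs: per-observation t-statistics and labels.
t_values = np.array([3.1, -0.4, -2.7, 0.9, 4.2, -3.5])
labels = np.sign(t_values)

# Confidence filter: keep only observations with |t| > 2.
mask = np.abs(t_values) > 2.0
train_labels = labels[mask]

# Sample weights: normalize |t| over the retained observations.
abs_t = np.abs(t_values[mask])
weights = abs_t / abs_t.sum()
```

The resulting `weights` vector sums to one and can be passed as per-sample weights to any loss function or estimator that supports them.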
Article Section
Comparison with triple-barrier labeling
The triple-barrier method, also from López de Prado, constructs labels by defining take-profit and stop-loss barriers plus a time barrier. The label depends on which barrier is hit first. It produces path-dependent labels that account for risk management.
Trend scanning is different in philosophy. It does not impose barriers. Instead, it asks: over what forward window is the linear trend evidence strongest? The two methods answer different questions and can be complementary in a pipeline.
Triple-barrier is useful when you need labels that reflect executable trading outcomes. Trend scanning is useful when you need labels that reflect directional structure for feature learning or regime classification.
Triple-barrier method
Labels based on which barrier (profit, loss, or time) is hit first. Path-dependent and execution-aware.
Trend scanning
Labels based on the horizon with strongest linear trend evidence. Confidence-weighted and regime-adaptive.
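For concreteness, a minimal sketch of the triple-barrier idea, with fixed percentage barriers for illustration (López de Prado's full version sets the horizontal barriers from estimated volatility; `tp`, `sl`, and `max_h` here are hypothetical parameters, not his API):

```python
import numpy as np

def triple_barrier_label(prices, start, tp=0.02, sl=0.02, max_h=20):
    """Return +1 / -1 / 0 depending on which barrier the path hits first:
    take-profit (+tp), stop-loss (-sl), or the time barrier at max_h bars."""
    p0 = prices[start]
    end = min(start + max_h, len(prices) - 1)
    for j in range(start + 1, end + 1):
        ret = prices[j] / p0 - 1.0
        if ret >= tp:
            return 1   # profit barrier hit first
        if ret <= -sl:
            return -1  # loss barrier hit first
    return 0           # time barrier: neither level was reached

prices = np.array([100.0, 100.5, 101.0, 102.5, 101.0, 99.0])
label = triple_barrier_label(prices, 0, tp=0.02, sl=0.02, max_h=5)
```

Note the path dependence: the label depends on the order in which levels are touched, not on where the path ends up, which is exactly the property trend scanning does not have.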
Article Section
Practical considerations
The choice of h_min and h_max defines the range of horizons the method can select from. Setting h_min too low exposes the method to noise. Setting h_max too high makes it slow and may span multiple regimes.
A reasonable starting point for daily equity data is h_min = 5 and h_max = 20. For intraday data with 5-minute bars, h_min = 12 and h_max = 48 covers one to four hours. These should be tuned to the frequency and the features being used.
Computational cost of the naive double loop is O(n · H · h_max), where H = h_max − h_min + 1 is the number of candidate horizons, because each OLS fit costs O(h). The slope and its standard error are closed-form, so running sums over the window reduce the per-horizon cost to O(1), for O(n · H) total; the inner loop can also be vectorized or parallelized.
the inner loop is closed-form OLS and can be fully vectorized
Complexity
O(n · H · h_max) naive; O(n · H) with running sums, H = h_max − h_min + 1
Typical daily range
h_min = 5, h_max = 20 (starting point)
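One way to realize the vectorization: with running sums, the t-statistics for every candidate horizon at a single start index come out of one pass over the window, with no per-horizon regression loop. A sketch under those assumptions, not a canonical implementation:

```python
import numpy as np

def t_values_all_horizons(prices, start, h_min=5, h_max=30):
    """Vectorized slope t-stats for every horizon h in [h_min, h_max]
    at one start index, built from cumulative sums of y, j*y, y^2."""
    y = np.asarray(prices[start : start + h_max + 1], dtype=float)
    h = np.arange(h_min, min(h_max, len(y) - 1) + 1)  # feasible horizons
    n = h + 1.0                                       # points per window
    j = np.arange(len(y), dtype=float)
    cy, cjy, cyy = np.cumsum(y), np.cumsum(j * y), np.cumsum(y * y)
    S_y, S_jy, S_yy = cy[h], cjy[h], cyy[h]           # sums over y[0..h]
    ss_xx = n * (n * n - 1.0) / 12.0                  # sum of (j - h/2)^2
    beta1 = (S_jy - (h / 2.0) * S_y) / ss_xx          # OLS slope per window
    sse = np.maximum(S_yy - S_y ** 2 / n - beta1 ** 2 * ss_xx, 0.0)
    se = np.sqrt(sse / (n - 2.0) / ss_xx)
    t = np.zeros_like(beta1)
    ok = se > 0
    t[ok] = beta1[ok] / se[ok]
    return h, t

# Usage: all horizon t-stats at index 0 in one shot.
rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0.001, 0.01, 60))
h, t = t_values_all_horizons(prices, 0, h_min=5, h_max=30)

# Direct check for one window (h = 10) with an explicit regression.
w, x = prices[:11], np.arange(11.0)
b1 = np.sum((x - x.mean()) * (w - w.mean())) / np.sum((x - x.mean()) ** 2)
res = w - (w.mean() + b1 * (x - x.mean()))
t_direct = b1 / np.sqrt(np.sum(res ** 2) / 9.0 / np.sum((x - x.mean()) ** 2))
```

The running-sum form is algebraically identical to the per-window regression, so the two computations should agree to floating-point precision.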
Article Section
Integration into an ML pipeline
Trend scanning labels integrate cleanly into standard financial ML workflows. The labels and t-values become the target variable and sample weights respectively. Features are computed as of time t, and the model learns to predict the direction and strength of the trend that will unfold.
Because the method selects different horizons at different points in time, it naturally produces a mixture of short-term and longer-term labels. This can improve model robustness compared to a fixed-horizon approach that forces the model to predict over one timescale.
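A minimal end-to-end sketch, assuming scikit-learn is available; the synthetic features and t-values below stand in for a real feature matrix and a real trend-scanning pass, and the confidence threshold of 2 follows the practical rule from earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: one informative feature drives the trend t-stats.
rng = np.random.default_rng(7)
n = 300
signal = rng.normal(size=n)
t_values = 3.0 * signal + rng.normal(0.0, 0.5, n)  # synthetic trend t-stats
labels = np.sign(t_values)
X = np.column_stack([signal, rng.normal(size=n)])  # column 0 is informative

# Filter weak labels, then fit with |t| as per-sample weights.
keep = np.abs(t_values) > 2.0
model = LogisticRegression()
model.fit(X[keep], labels[keep], sample_weight=np.abs(t_values[keep]))
acc = model.score(X[keep], labels[keep])
```

The same three arrays (filtered features, signed labels, |t| weights) drop into any estimator that accepts a `sample_weight` argument in `fit`.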
Conclusion
Why the framework still holds up
Trend scanning solves a specific and important problem in financial machine learning: how to generate directional labels without committing to an arbitrary fixed horizon.
By selecting the look-ahead window where the linear trend evidence is strongest, the method produces labels that are both adaptive and confidence-scored. The t-statistic gives a natural measure of label quality that can be used for sample weighting and filtering.
The approach is simple to implement, computationally cheap, and has a clear statistical interpretation. It is not a forecasting model itself, but a better way to define the target variable that forecasting models are trained on. That distinction matters: the ceiling on any supervised learner is set by the quality of its labels, and trend scanning raises that ceiling.