Engineering alpha factors that predict returns
Based on a conceptual understanding of key factor categories, their rationale, and popular metrics, a key task is to identify new factors that may better capture the risks embodied by the return drivers laid out previously, or to find new ones. In either case, it will be important to compare the performance of innovative factors to that of known factors to identify incremental signal gains.
Key tools that facilitate the transformation of data into factors include the Python libraries for numerical computing, NumPy and pandas, as well as the Python wrapper around the specialized library for technical analysis, TA-Lib. Alternatives include the expression alphas developed in Zura Kakushadze's 2016 paper, 101 Formulaic Alphas, and implemented by the alphatools library. In addition, the Quantopian platform provides a large number of built-in factors to speed up the research process.
To apply one or more factors to an investment universe, we can use the Zipline backtesting library (which also includes some built-in factors) and evaluate their performance using the Alphalens library using metrics discussed in the following section.
How to engineer factors using pandas and NumPy
NumPy and pandas are the key tools for custom factor computations. This section demonstrates how they can be used to quickly compute the transformations that yield various alpha factors. If you are not familiar with these libraries, in particular pandas, which we will use throughout this book, please see the README for this chapter in the GitHub repo for links to documentation and tutorials.
The notebook feature_engineering.ipynb in the alpha_factors_in_practice directory contains examples of how to create various factors. The notebook uses data generated by the create_data.ipynb notebook in the data folder in the root directory of the GitHub repo, which is stored in HDF5 format for faster access. See the notebook storage_benchmarks.ipynb in the directory for Chapter 2, in the GitHub repo for a comparison of parquet, HDF5, and CSV storage formats for pandas DataFrames.
The NumPy library for scientific computing was created by Travis Oliphant in 2005 by integrating the older Numeric and Numarray libraries that had been developed since the mid-1990s. It is organized in a high-performance n-dimensional array data structure called ndarray, which enables functionality comparable to MATLAB.
The pandas library emerged in 2008 when Wes McKinney was working at AQR Capital Management. It provides the DataFrame data structure, which is based on NumPy's ndarray, but allows for more user-friendly data manipulation with label-based indexing. It includes a wide array of computational tools particularly well-suited to financial data, including rich time-series operations with automatic date alignment, which we will explore here.
The following sections illustrate some steps in transforming raw stock price data into selected factors. See the notebook feature_engineering.ipynb for additional detail and visualizations that we have omitted here to save some space. See the resources listed in the README for this chapter on GitHub for links to the documentation and tutorials on how to use pandas and NumPy.
Loading, slicing, and reshaping the data
After loading the Quandl Wiki stock price data on US equities, we select the 2000-18 time slice by applying pd.IndexSlice to pd.MultiIndex, which contains timestamp and ticker information. We then select and unpivot the adjusted close price column using the .stack() method to convert the DataFrame into wide format, with tickers in the columns and timestamps in the rows:
idx = pd.IndexSlice
with pd.HDFStore('../../data/assets.h5') as store:
prices = (store['quandl/wiki/prices']
.loc[idx['2000':'2018', :], 'adj_close']
.unstack('ticker'))
prices.info()
DatetimeIndex: 4706 entries, 2000-01-03 to 2018-03-27
Columns: 3199 entries, A to ZUMZ
Resampling – from daily to monthly frequency
To reduce training time and experiment with strategies for longer time horizons, we convert the business-daily data into month-end frequency using the available adjusted close price:
monthly_prices = prices.resample('M').last()
How to compute returns for multiple historical periods
To capture time-series dynamics like momentum patterns, we compute historical multi-period returns using the pct_change(n_periods) method, where n_periods identifies the number of lags. We then convert the wide result back into long format using .stack(), use .pipe() to apply the .clip() method to the resulting DataFrame, and winsorize returns at the [1%, 99%] levels; that is, we cap outliers at these percentiles.
Finally, we normalize returns using the geometric average. After using .swaplevel() to change the order of the MultiIndex levels, we obtain the compounded monthly returns over six different periods, ranging from 1 to 12 months:
outlier_cutoff = 0.01
data = pd.DataFrame()
lags = [1, 2, 3, 6, 9, 12]
for lag in lags:
data[f'return_{lag}m'] = (monthly_prices
.pct_change(lag)
.stack()
.pipe(lambda x:
x.clip(lower=x.quantile(outlier_cutoff),
upper=x.quantile(1-outlier_cutoff)))
.add(1)
.pow(1/lag)
.sub(1)
)
data = data.swaplevel().dropna()
data.info()
MultiIndex: 521806 entries, (A, 2001-01-31 00:00:00) to (ZUMZ, 2018-03-
31 00:00:00)
Data columns (total 6 columns):
return_1m 521806 non-null float64
return_2m 521806 non-null float64
return_3m 521806 non-null float64
return_6m 521806 non-null float64
return_9m 521806 non-null float64
return_12m 521806 non-null float6
We can use these results to compute momentum factors based on the difference between returns over longer periods and the most recent monthly return, as well as for the difference between 3- and 12-month returns, as follows:
for lag in [2,3,6,9,12]:
data[f'momentum_{lag}'] = data[f'return_{lag}m'].sub(data.return_1m)
data[f'momentum_3_12'] = data[f'return_12m'].sub(data.return_3m)
Using lagged returns and different holding periods
To use lagged values as input variables or features associated with the current observations, we use the .shift() method to move historical returns up to the current period:
for t in range(1, 7):
data[f'return_1m_t-{t}'] = data.groupby(level='ticker').return_1m.shift(t)
Similarly, to compute returns for various holding periods, we use the normalized period returns computed previously and shift them back to align them with the current financial features:
for t in [1,2,3,6,12]:
data[f'target_{t}m'] = (data.groupby(level='ticker')
[f'return_{t}m'].shift(-t))
The notebook also demonstrates how to compute various descriptive statistics for the different return series and visualize their correlation using the seaborn library.
Computing factor betas
We will introduce the Fama–French data to estimate the exposure of assets to common risk factors using linear regression in Chapter 7, Linear Models – From Risk Factors to Return Forecasts. The five Fama–French factors, namely market risk, size, value, operating profitability, and investment, have been shown empirically to explain asset returns. They are commonly used to assess the exposure of a portfolio to well-known drivers of risk and returns, where the unexplained portion is then attributed to the manager's idiosyncratic skill. Hence, it is natural to include past factor exposures as financial features in models that aim to predict future returns.
We can access the historical factor returns using the pandas-datareader and estimate historical exposures using the PandasRollingOLS rolling linear regression functionality in the pyfinance library, as follows:
factors = ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']
factor_data = web.DataReader('F-F_Research_Data_5_Factors_2x3',
'famafrench', start='2000')[0].drop('RF', axis=1)
factor_data.index = factor_data.index.to_timestamp()
factor_data = factor_data.resample('M').last().p(100)
factor_data.index.name = 'date'
factor_data = factor_data.join(data['return_1m']).sort_index()
T = 24
betas = (factor_data
.groupby(level='ticker', group_keys=False)
.apply(lambda x: PandasRollingOLS(window=min(T, x.shape[0]-1), y=x.return_1m, x=x.drop('return_1m', axis=1)).beta))
As mentioned previously, we will explore both the Fama–French factor model and linear regression in Chapter 7, Linear Models – From Risk Factors to Return Forecasts, in more detail. See the notebook feature_engineering.ipynb for additional examples, including the computation of lagged and forward returns.
How to add momentum factors
We can use the 1-month and 3-month results to compute simple momentum factors. The following code example shows how to compute the difference between returns over longer periods and the most recent monthly return, as well as for the difference between 3- and 12-month returns:
for lag in [2,3,6,9,12]:
data[f'momentum_{lag}'] = data[f'return_{lag}m'].sub(data.return_1m)
data[f'momentum_3_12'] = data[f'return_12m'].sub(data.return_3m)
Adding time indicators to capture seasonal effects
Basic factors also include seasonal anomalies like the January effect, which has been observed to cause higher returns for stocks during this month, possibly for tax reasons. This and other seasonal effects can be modeled through indicator variables that represent specific time periods such as the year and/or the month. These can be generated as follows:
dates = data.index.get_level_values('date')
data['year'] = dates.year
data['month'] = dates.month
How to create lagged return features
If you want to use lagged returns, that is, returns from previous periods as input variables or features to train a model that learns return patterns to predict future returns, you can use the .shift() method to move historical returns up to the current period. The following example moves the returns for the periods 1 to 6 months ago up by the corresponding lag so that they are associated with the observation for the current month:
for t in range(1, 7):
data[f'return_1m_t-{t}'] = data.groupby(level='ticker').return_1m.shift(t)
How to create forward returns
Similarly, you can create forward returns for the current period, that is, returns that will occur in the future, using .shift() with a negative period (assuming your data is sorted in ascending order):
for t in [1,2,3,6,12]:
data[f'target_{t}m'] = (data.groupby(level='ticker')
[f'return_{t}m'].shift(-t))
We will use forward returns when we train ML models starting in Chapter 6, The Machine Learning Process.
How to use TA-Lib to create technical alpha factors
TA-Lib is an open source library written in C++ with a Python interface that is widely used by trading software developers. It contains standardized implementations of over 200 popular indicators for technical analysis; that is, these indicators only use market data, namely price and volume information.
TA-Lib is compatible with pandas and NumPy, rendering its usage very straightforward. The following examples demonstrate how to compute two popular indicators.
Bollinger Bands consist of a simple moving average (SMA) surrounded by bands two rolling standard deviations below and above the SMA. It was introduced for the visualization of potential overbought/oversold conditions when the price dipped outside the two bands on the upper or lower side, respectively. The inventor, John Bollinger, actually recommended a trading system of 22 rules that generate trade signals.
We can compute the Bollinger Bands and, for comparison, the relative strength index described earlier in this section on popular alpha factors as follows.
We load the adjusted close for a single stock—in this case, AAPL:
with pd.HDFStore(DATA_STORE) as store:
data = (store['quandl/wiki/prices']
.loc[idx['2007':'2010', 'AAPL'],
['adj_open', 'adj_high', 'adj_low', 'adj_close',
'adj_volume']]
.unstack('ticker')
.swaplevel(axis=1)
.loc[:, 'AAPL']
.rename(columns=lambda x: x.replace('adj_', '')))
Then, we pass the one-dimensional pd.Series through the relevant TA-Lib functions:
from talib import RSI, BBANDS
up, mid, low = BBANDS(data.close, timeperiod=21, nbdevup=2, nbdevdn=2,
matype=0)
rsi = RSI(adj_close, timeperiod=14)
Then, we collect the results in a DataFrame and plot the Bollinger Bands with the AAPL stock price and the RSI with the 30/70 lines, which suggest long/short opportunities:
data = pd.DataFrame({'AAPL': data.close, 'BB Up': up, 'BB Mid': mid,
'BB down': low, 'RSI': rsi})
fig, axes= plt.subplots(nrows=2, figsize=(15, 8))
data.drop('RSI', axis=1).plot(ax=axes[0], lw=1, title='Bollinger Bands')
data['RSI'].plot(ax=axes[1], lw=1, title='Relative Strength Index')
axes[1].axhline(70, lw=1, ls='--', c='k')
axes[1].axhline(30, lw=1, ls='--', c='k')
The result, shown in Figure 4.4, is rather mixed—both indicators suggested overbought conditions during the early post-crisis recovery when the price continued to rise:
Figure 4.4: Bollinger Bands and relative strength index
Denoising alpha factors with the Kalman filter
The concept of noise in data relates to the domain of signal processing, which aims to retrieve the correct information from a signal sent, for example, through the air in the form of electromagnetic waves. As the waves move through space, environmental interference can be added to the originally pure signal in the form of noise, making it necessary to separate the two once received.
The Kalman filter was introduced in 1960 and has become very popular for many applications that require processing noisy data because it permits more accurate estimates of the underlying signal.
This technique is widely used to track objects in computer vision, to support the localization and navigation of aircraft and spaceships, and to control robotic motion based on noisy sensor data, besides its use in time series analysis.
Noise is used similarly in data science, finance, and other domains, implying that the raw data contains useful information, for instance, in terms of trading signals, that needs to be extracted and separated from irrelevant, extraneous information. Clearly, the fact that we do not know the true signal can make this separation rather challenging at times.
We will first review how the Kalman filter works and which assumptions it makes to achieve its objectives. Then, we will demonstrate how to apply it to financial data using the pykalman library.
How does the Kalman filter work?
The Kalman filter is a dynamic linear model of sequential data like a time series that adapts to new information as it arrives. Rather than using a fixed-size window like a moving average or a given set of weights like an exponential moving average, it incorporates new data into its estimates of the current value of the time series based on a probabilistic model.
More specifically, the Kalman filter is a probabilistic model of a sequence of observations z1, z2, …, zT and a corresponding sequence of hidden states x1, x2, …, xT (with the notation used by the pykalman library that we will demonstrate here). This can be represented by the following graph:
Figure 4.5: Kalman filter as a graphical model
Technically speaking, the Kalman filter takes a Bayesian approach that propagates the posterior distribution of the state variables x given their measurements z over time (see Chapter 10, Bayesian ML – Dynamic Sharpe Ratios and Pairs Trading, for more details on Bayesian inference). We can also view it as an unsupervised algorithm for tracking a single object in a continuous state space, where we will take the object to be, for example, the value of or returns on a security, or an alpha factor (see Chapter 13, Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning).
To recover the hidden states from a sequence of observations that may become available in real time, the algorithm iterates between two steps:
- Prediction step: Estimate the current state of the process.
- Measurement step: Use noisy observations to update its estimate by averaging the information from both steps in a way that weighs more certain estimates higher.
The basic idea behind the algorithm is as follows: certain assumptions about a dynamic system and a history of corresponding measurements will allow us to estimate the system's state in a way that maximizes the probability of the previous measurements.
To achieve its objective of recovering the hidden state, the Kalman filter makes the following assumptions:
- The system that we are modeling behaves in a linear fashion.
- The hidden state process is a Markov chain so that the current hidden state xt depends only on the most recent prior hidden state xt-1.
- Measurements are subject to Gaussian, uncorrelated noise with constant covariance.
As a result, the Kalman filter is similar to a hidden Markov model, except that the state space of the latent variables is continuous, and both hidden and observed variables have normal distributions, denoted as with mean and standard
In mathematical terms, the key components of the model (and corresponding parameters in the pykalman implementation) are:
- The initial hidden state has a normal distribution: with initial_state_mean, and initial_state_covariance, .
- The hidden state xt+1 is an affine transformation of xt with transition_matrix A, transition_offset b, and added Gaussian noise with transition_covariance Q: .
- The observation zt is an affine transformation of the hidden state xt with observation_matrix C, observation_offset d, and added Gaussian noise with observation_covariance R: .
Among the advantages of a Kalman filter is that it flexibly adapts to non-stationary data with changing distributional characteristics (see Chapter 9, Time-Series Models for Volatility Forecasts and Statistical Arbitrage, for more details on stationarity).
Key disadvantages are the assumptions of linearity and Gaussian noise that financial data often violate. To address these shortcomings, the Kalman filter has been extended to systems with nonlinear dynamics in the form of the extended and the unscented Kalman filters. The particle filter is an alternative approach that uses sampling-based Monte Carlo approaches to estimate non-normal distributions.
How to apply a Kalman filter using pykalman
The Kalman filter is particularly useful for rolling estimates of data values or model parameters that change over time. This is because it adapts its estimates at every time step based on new observations and tends to weigh recent observations more heavily.
Except for conventional moving averages, the Kalman filter does not require us to specify the length of a window used for the estimate. Rather, we start out with our estimate of the mean and covariance of the hidden state and let the Kalman filter correct our estimates based on periodic observations. The code examples for this section are in the notebook kalman_filter_and_wavelets.ipynb.
The following code example shows how to apply the Kalman filter to smoothen the S&P 500 stock price series for the 2008-09 period:
with pd.HDFStore(DATA_STORE) as store:
sp500 = store['sp500/stooq'].loc['2008': '2009', 'close']
We initialize the KalmanFilter with unit covariance matrices and zero means (see the pykalman documentation for advice on dealing with the challenges of choosing appropriate initial values):
from pykalman import KalmanFilter
kf = KalmanFilter(transition_matrices = [1],
observation_matrices = [1],
initial_state_mean = 0,
initial_state_covariance = 1,
observation_covariance=1,
transition_covariance=.01)
Then, we run the filter method to trigger the forward algorithm, which iteratively estimates the hidden state, that is, the mean of the time series:
state_means, _ = kf.filter(sp500)
Finally, we add moving averages for comparison and plot the result:
sp500_smoothed = sp500.to_frame('close')
sp500_smoothed['Kalman Filter'] = state_means
for months in [1, 2, 3]:
sp500_smoothed[f'MA ({months}m)'] = (sp500.rolling(window=months * 21)
.mean())
ax = sp500_smoothed.plot(title='Kalman Filter vs Moving Average',
figsize=(14, 6), lw=1, rot=0)
The resulting plot in Figure 4.6 shows that the Kalman filter performs similarly to a 1-month moving average but is more sensitive to changes in the behavior of the time series:
Figure 4.6: Kalman filter versus moving average
How to preprocess your noisy signals using wavelets
Wavelets are related to Fourier analysis, which combines sine and cosine waves at different frequencies to approximate noisy signals. While Fourier analysis is particularly useful to translate signals from the time to the frequency domain, wavelets are useful for filtering out specific patterns that may occur at different scales, which, in turn, may correspond to a frequency range.
Wavelets are functions or wave-like patterns that decompose a discrete or continuous-time signal into components of different scales. A wavelet transform, in turn, represents a function using wavelets as scaled and translated copies of a finite-length waveform. This transform has advantages over Fourier transforms for functions with discontinuities and sharp peaks, and to approximate non-periodic or non-stationary signals.
To denoise a signal, you can use wavelet shrinkage and thresholding methods. First, you choose a specific wavelet pattern to decompose a dataset. The wavelet transform yields coefficients that correspond to details in the dataset.
The idea of thresholding is simply to omit all coefficients below a particular cutoff, assuming that they represent minor details that are not necessary to represent the true signal. These remaining coefficients are then used in an inverse wavelet transformation to reconstruct the (denoised) dataset.
We'll now use the pywavelets library to apply wavelets to noisy stock data. The following code example illustrates how to denoise the S&P 500 returns using a forward and inverse wavelet transform with a Daubechies 6 wavelet and different threshold values.
First, we generate daily S&P 500 returns for the 2008-09 period:
signal = (pd.read_hdf(DATA_STORE, 'sp500/stooq')
.loc['2008': '2009']
.close.pct_change()
.dropna())
Then, we select one of the Daubechies wavelets from the numerous built-in wavelet functions:
import pywt
pywt.families(short=False)
['Haar', 'Daubechies', 'Symlets', 'Coiflets', 'Biorthogonal', 'Reverse biorthogonal', 'Discrete Meyer (FIR Approximation)', 'Gaussian', 'Mexican hat wavelet', 'Morlet wavelet', 'Complex Gaussian wavelets', 'Shannon wavelets', 'Frequency B-Spline wavelets', 'Complex Morlet wavelets']
The Daubechies 6 wavelet is defined by a scaling function and the wavelet function itself (see the PyWavelet documentation for details and the accompanying notebook kalman_filter_and_wavelets.ipynb for plots of all built-in wavelet functions):
Figure 4.7: Daubechies wavelets
Given a wavelet function, we first decompose the return signal using the .wavedec function, which yields the coefficients for the wavelet transform. Next, we filter out all coefficients above a given threshold and then reconstruct the signal using only those coefficients using the inverse transform .waverec:
wavelet = "db6"
for i, scale in enumerate([.1, .5]):
coefficients = pywt.wavedec(signal, wavelet, mode='per')
coefficients[1:] = [pywt.threshold(i, value=scale*signal.max(), mode='soft') for i in coefficients[1:]]
reconstructed_signal = pywt.waverec(coefficients, wavelet, mode='per')
signal.plot(color="b", alpha=0.5, label='original signal', lw=2,
title=f'Threshold Scale: {scale:.1f}', ax=axes[i])
pd.Series(reconstructed_signal, index=signal.index).plot(c='k', label='DWT smoothing}', linewidth=1, ax=axes[i])
The notebook shows how to apply this denoising technique with different thresholds, and the resulting plot, shown in Figure 4.8, clearly shows how a higher threshold value yields a significantly smoother series:
Figure 4.8: Wavelet denoising with different thresholds