9.6. Is signal X useful in predicting returns?

In many “cross-sectional” asset pricing papers, the main results are shown in Table 1, and show whether we can form portfolios that generate alpha using some signal (a variable that describes something about stocks) to sort stocks into portfolios.

To make this analysis more rigorous, we can measure alpha six different ways (against six different asset pricing models). Among these are the CAPM and FF3 models we just estimated in prior pages.

The basic idea of this kind of table is that

  1. We measure something about stocks. Maybe it’s their size, or their recent returns. Let’s call this variable X.

  2. Then we sort stocks on X every month and divide them into 5 or 10 buckets. The stocks in each bucket constitute a portfolio, so we get the returns for that bucket for the next month. We also compute the “Long-Short” portfolio, which is always the return of the highest bucket minus the lowest bucket. The long-short portfolio is what you would get if you short the stocks in the lowest bucket to go long the highest bucket, and are “zero cost” portfolios.

  3. We repeat this across time so we can see how each portfolio/bucket does and show the average return in each bucket.

  4. To be more sophisticated, we do step 3 a few ways: We look at the portfolio returns compared to the market, and we also compute the portfolio alphas against some benchmark factor models.

  5. If we see significant results in the High Minus Low column, practitioners will say that X is an “anomaly”, in that the asset pricing model can’t explain its returns.


The plan for this code file comes from the “Assaying Anomalies” project, which is a protocol to evaluate whether a given factor X is useful in predicting returns.

  • Here is the repo containing that project’s code (mostly in Matlab as of Spring 2025, but will be in Python soon).

  • The paper describing that project, which you should read (it is breezy), and what you should cite is: Novy-Marx, Robert and Velikov, Mihail, Assaying Anomalies (February 13, 2024). Available at SSRN: https://ssrn.com/abstract=4338007 or http://dx.doi.org/10.2139/ssrn.4338007

  • If you want to go beyond the Table 1 we output here, the next things you should output are precisely what Velikov and Novy-Marx show in their paper after Table 1.

image.png

9.6.1. What “Table 1” looks like

In our code below, we are trying to replicate the structure of this table below, from Novy-Marx and Velikov.

image.png

# !pip install pandas-datareader # run this once, then comment

import pandas_datareader.famafrench as ff
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

# datasets = ff.get_available_datasets()
# datasets

9.6.2. Step 1: Get the factor portfolio returns.

df_factors = ff.FamaFrenchReader('F-F_Research_Data_5_Factors_2x3', start='1900-01-01').read()[0]

# add momentum to this
mom = ff.FamaFrenchReader('F-F_Momentum_Factor', start='1900-01-01').read()[0] # add momentum
mom.columns = ['Mom'] # rename
df_factors = pd.merge(df_factors, mom, left_index=True, right_index=True)
df_factors # FYI: contains Mkt-RF and RF, but no Mkt
Mkt-RF SMB HML RMW CMA RF Mom
Date
1963-07 -0.39 -0.41 -0.97 0.68 -1.18 0.27 0.90
1963-08 5.07 -0.80 1.80 0.36 -0.35 0.25 1.01
1963-09 -1.57 -0.52 0.13 -0.71 0.29 0.27 0.19
1963-10 2.53 -1.39 -0.10 2.80 -2.01 0.29 3.12
1963-11 -0.85 -0.88 1.75 -0.51 2.24 0.27 -0.74
... ... ... ... ... ... ... ...
2024-08 1.61 -3.65 -1.13 0.85 0.86 0.48 4.79
2024-09 1.74 -1.02 -2.59 0.04 -0.26 0.40 -0.60
2024-10 -0.97 -0.88 0.89 -1.38 1.03 0.39 2.87
2024-11 6.51 4.78 -0.05 -2.62 -2.17 0.40 0.90
2024-12 -3.17 -3.87 -2.95 1.82 -1.10 0.37 0.05

738 rows × 7 columns

9.6.3. Step 2: Construct portfolio returns from your signal

Above, we said that

  1. We measure something about stocks. Maybe it’s their size, or their recent returns. Let’s call this variable X.

  2. Then we sort stocks on X every month and divide them into 5 or 10 buckets. The stocks in each bucket constitute a portfolio, so we get the returns for that bucket for the next month. We also compute the “Long-Short” portfolio, which is always the return of the highest bucket minus the lowest bucket. The long-short portfolio is what you would get if you short the stocks in the lowest bucket to go long the highest bucket, and are “zero cost” portfolios.

This is where you would do that.

For simplicity, I’m just using pre-constructed portfolio returns on this page. To see one way to construct a signal from predictor variables and how we calculate the portfolio returns, visit the page on trading on stock return predictions.

# now, here we'd develop some "signal" and then create portfolio rets based on it
# I'm skipping... you figure that out

# I'll pretend I did that by grabbing 5 industry portfolio returns
df_portfolios = ff.FamaFrenchReader('5_Industry_Portfolios', start='1900-01-01').read()[0]
df_portfolios.columns = [f'Port{i+1}' for i in range(len(df_portfolios.columns))] # this is my anticipated portfolio number name scheme

df_portfolios.eval("HmL = Port5-Port1", inplace=True)

# Make each portfolio (except for HmL) excess returns
for col in ['Port1', 'Port2', 'Port3', 'Port4', 'Port5']:
    df_portfolios[col] = df_portfolios[col] - df_factors['RF']

portfolios = df_portfolios.columns.tolist()

df_portfolios

9.6.4. Step 3: Run the regressions.

reg_df = pd.merge(df_factors, df_portfolios, left_index=True, right_index=True)
reg_df
Mkt-RF SMB HML RMW CMA RF Mom Port1 Port2 Port3 Port4 Port5 HmL
Date
1975-01 13.66 12.91 8.28 -0.78 -0.90 0.58 -13.82 21.19 11.94 12.90 -1.32 17.31 -3.88
1975-02 5.56 -0.65 -4.45 1.16 -2.11 0.43 -0.61 4.23 4.63 9.34 15.61 2.44 -1.79
1975-03 2.66 4.00 2.38 1.26 -1.33 0.41 -2.04 7.75 1.40 -0.17 0.88 3.74 -4.01
1975-04 4.23 -0.71 -1.14 1.41 -1.34 0.44 1.38 2.75 6.82 2.53 0.21 2.83 0.08
1975-05 5.19 2.89 -4.10 -0.98 -0.60 0.44 -0.58 3.60 5.95 4.56 6.08 5.28 1.68
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2024-08 1.61 -3.65 -1.13 0.85 0.86 0.48 4.79 0.57 0.70 0.86 5.96 2.55 1.98
2024-09 1.74 -1.02 -2.59 0.04 -0.26 0.40 -0.60 4.01 1.54 2.69 -2.21 0.26 -3.75
2024-10 -0.97 -0.88 0.89 -1.38 1.03 0.39 2.87 -2.08 -2.38 -0.48 -3.50 0.72 2.80
2024-11 6.51 4.78 -0.05 -2.62 -2.17 0.40 0.90 10.52 6.01 4.76 -0.98 9.83 -0.69
2024-12 -3.17 -3.87 -2.95 1.82 -1.10 0.37 0.05 -0.82 -8.38 0.56 -5.79 -7.46 -6.64

600 rows × 13 columns

# define factor model formulas (these are the right side of the regression formulas)
# these are how formulas are specified for statsmodel's formula api
factor_models = {
    'r^e':  '1',
    'CAPM': 'Q("Mkt-RF")',
    'FF3':  'Q("Mkt-RF") + SMB + HML',
    'FF4':  'Q("Mkt-RF") + SMB + HML + Mom',
    'FF5':  'Q("Mkt-RF") + SMB + HML + RMW + CMA',
    'FF6':  'Q("Mkt-RF") + SMB + HML + RMW + CMA + Mom'
}
# pre built output table

index = pd.MultiIndex.from_product([factor_models.keys(), ['alpha', 't-stat']], names=['Model', 'Metric'])
results = pd.DataFrame(index=index, columns=portfolios, dtype=float)
results
Port1 Port2 Port3 Port4 Port5 HmL
Model Metric
r^e alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
CAPM alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
FF3 alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
FF4 alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
FF5 alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
FF6 alpha NaN NaN NaN NaN NaN NaN
t-stat NaN NaN NaN NaN NaN NaN
# Run regressions for each portfolio and model

full_reg_output = {} # to save everything, in case we want to access other stuff (like beta loadings)
for portfolio in portfolios:
    for model_name, formula in factor_models.items():
        reg = smf.ols(formula=f'{portfolio} ~ {formula}', data=reg_df).fit()
        # extract the intercept coef and t-stat
        alpha = reg.params['Intercept']
        t_stat = reg.tvalues['Intercept']
        results.at[(model_name, 'alpha'), portfolio] = alpha
        results.at[(model_name, 't-stat'), portfolio] = t_stat
        full_reg_output[(model_name, portfolio)] = reg

results
Port1 Port2 Port3 Port4 Port5 HmL
Model Metric
r^e alpha 0.661057 0.557900 0.646938 0.681369 0.597724 -0.063333
t-stat 3.900987 3.433497 3.230902 3.859817 3.052126 -0.647961
CAPM alpha 0.114768 0.038745 0.010464 0.203102 -0.043658 -0.158426
t-stat 1.630695 0.550711 0.118044 1.773375 -0.584753 -1.670608
FF3 alpha 0.091171 -0.052544 0.137807 0.302106 -0.186326 -0.277498
t-stat 1.290888 -0.815539 1.721098 2.730282 -3.023698 -3.132967
FF4 alpha 0.124516 -0.093283 0.208553 0.235236 -0.152898 -0.277413
t-stat 1.733414 -1.427806 2.585315 2.095277 -2.442501 -3.067951
FF5 alpha -0.074883 -0.184758 0.318594 0.133618 -0.180963 -0.106080
t-stat -1.142738 -2.948850 4.167975 1.209001 -2.945735 -1.222041
FF6 alpha -0.032642 -0.206952 0.362788 0.093130 -0.155887 -0.123244
t-stat -0.496089 -3.265914 4.717103 0.833356 -2.511528 -1.400810