How much historical data to use? #2
Hi @tommedema, thanks for asking! The estimator estimates the average (more precisely, the root mean square) spread in the estimation period. So if you use the last 3 days, you get the average spread in the last 3 days. If you use one month or one year, you get the average spread in the month or in the year. Of course, the less data you use, the less precise the estimate will be, as the spread would be computed from fewer observations.

Put simply: each OHLC (open, high, low, close) candle is one observation, and you need at least a few observations to get a sensible estimate. So if you are using daily prices, you need at least a few days; if you are using minute prices, you need at least a few minutes.

Assuming you are using daily prices, 3 days would give you estimates closer to the spread in the last days but with large estimation uncertainty. Using the last year would give you more precise estimates, but for the average spread in the last year (which may not be your goal). I would say that using one month of daily data is pretty common in the literature, but ultimately there is no one-size-fits-all solution. If you have intraday prices, the best option would be to apply the estimator to all prices in a day to get the average spread in that day. Hope this helps!
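To make the "average (more precisely, the root mean square)" wording concrete, here is a minimal pure-Python sketch of RMS averaging. The spread values are made up for illustration and are not output of the package:

```python
import math

def rms(values):
    """Root mean square: the kind of average the estimator targets."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical per-day spreads of 1.0%, 1.2%, and 0.8%:
daily_spreads = [0.010, 0.012, 0.008]
print(f"{rms(daily_spreads):.4%}")  # slightly above the plain mean of 1.0%
```

The RMS is always at least the arithmetic mean, so dispersed spreads pull the estimate up a little compared to a simple average.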
@eguidotti great answer, thanks for explaining it so well.
Hi Mr. Guidotti. I found myself with the same question as tommedema. I would love to see a table showing the mean errors of daily estimates when calculated from a single day's OHLC (the previous day's), up to a month's. Or if you could at least tell us which period happens to result in the least amount of error. It would be a lot more satisfying than using the recommendation of what's "pretty common in the literature", considering the results of your excellent estimator are not common in this literature!
Hi @DominusMaximus, thanks for reaching out. I'm working on a new version of the paper and will try to include a table like the one you suggest. I'm re-opening this issue and will post here when the update is available. Thanks for your idea!
Awesome. |
Currently the package returns a single value. As a user, the interface I need takes an OHLC matrix (four columns) as input and returns a bid/ask matrix (two columns) as output. Essentially, the bid and ask are to be estimated sequentially for each row, without ever looking at any subsequent row. As per a prior comment:

> As a user, ideally I don't want the average spread across the estimation period. Instead I want the estimated rolling value(s) for each row of the period.
@impredicative As a user, you are free to apply the estimator using a rolling window or any subsample that you like. The estimator does need at least 3 rows to output a spread estimate. And the output is the average (more precisely, the root-mean-square) spread within the estimation period. A value of 0.01 corresponds to a spread of 1%. If you need bid and ask prices for each single row in the period, then you need quote data and not this package.
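To make the "a value of 0.01 corresponds to a spread of 1%" interpretation concrete, here is a small sketch that splits one spread estimate symmetrically around a mid price. The function name and numbers are illustrative, not part of the package:

```python
def quotes_from_spread(mid: float, spread_fraction: float) -> tuple:
    """Split an estimated relative spread symmetrically around a mid price."""
    half = mid * spread_fraction / 2
    return mid - half, mid + half

bid, ask = quotes_from_spread(100.0, 0.01)  # estimate of 0.01 = 1% spread
print(bid, ask)  # → 99.5 100.5
```

Note that these are quotes consistent with the *average* spread over the estimation period, not the actual quotes at any particular time, which is exactly the caveat made above.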
I got this working with pandas with a rolling window:

```python
from typing import Optional

import bidask
import pandas as pd

_WINDOW_DEFAULT = 5  # Note: a lower value of 3 resulted in NaN estimates for SQQQ.

def _estimate_spread(ser: pd.Series, df: pd.DataFrame) -> float:
    df_roll = df.loc[ser.index]
    return bidask.edge(df_roll['open'], df_roll['high'], df_roll['low'], df_roll['close'])

def add_estimated_bid_and_ask(df: pd.DataFrame, /, *, source: str = 'close', min_spread: Optional[float] = 0.01, **kwargs) -> None:
    if 'window' not in kwargs:
        kwargs['window'] = _WINDOW_DEFAULT
    assert kwargs['window'] >= 3  # bidask requires it.
    # Depending on the use case, `center=True` can also be passed for the rolling window.
    est_spread_fraction = df[source].rolling(**kwargs).apply(_estimate_spread, args=(df,))
    df[f'est_{source}_spread'] = df[source] * est_spread_fraction
    if min_spread is not None:
        df[f'est_{source}_spread'].clip(lower=min_spread, inplace=True)
    est_spread_half = df[f'est_{source}_spread'] / 2
    df[f'est_{source}_bid'] = df[source] - est_spread_half  # Sell price
    df[f'est_{source}_bid'].clip(lower=0, inplace=True)
    df[f'est_{source}_ask'] = df[source] + est_spread_half  # Buy price

add_estimated_bid_and_ask(df)
```
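For readers unfamiliar with the `rolling(...).apply` trick above (passing the full frame via `args` and indexing it with the window's index), here is the same pattern with a stand-in estimator, just the mean relative high-low range, so it runs without `bidask` installed. The data and the toy estimator are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'open':  [10.0, 10.2, 10.1, 10.3, 10.4, 10.2],
    'high':  [10.3, 10.4, 10.3, 10.5, 10.6, 10.4],
    'low':   [ 9.9, 10.0, 10.0, 10.1, 10.2, 10.0],
    'close': [10.2, 10.1, 10.3, 10.4, 10.2, 10.3],
})

def _toy_estimator(ser: pd.Series, df: pd.DataFrame) -> float:
    # Stand-in for bidask.edge: mean relative high-low range in the window.
    # With raw=False (the default), `ser` keeps the window's index labels,
    # so we can use them to slice the full OHLC frame.
    win = df.loc[ser.index]
    return ((win['high'] - win['low']) / win['close']).mean()

est = df['close'].rolling(window=3).apply(_toy_estimator, args=(df,))
print(est)  # first two values are NaN, then one estimate per row
```

The `window=3` here mirrors the minimum of three rows the actual estimator needs before it can produce a value.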
> Sure, you can apply

To me this is unclear and ambiguous, leaving much room for interpretation. Code will help.

The package name is bidask. The package should live up to its name. When backtesting, all three matter: bid, ask, spread.
That doesn’t seem related to any issue with the package itself. Feel free to open another issue clarifying your question and I’d be happy to help.
@DominusMaximus The new version of the paper is now available at SSRN! Regarding your question:
One option would be to proceed as follows. First, apply the estimator on the full time series to get a first estimate. Then, compute the standard deviation of the returns to get an estimate of volatility. You should also have an idea of how many times the asset is traded per day. At this point you can simulate a price process with these parameters to find a good estimation window to use.

Below is a code snippet that simulates a price process with a spread of 1% and a volatility of 3% for an asset that is traded once per minute (390 times per day). The code then estimates the spread with several estimation windows and plots the bias, variance, and root mean squared error (RMSE) of the estimates. One month of trading corresponds to 21 days, and the figures show that it is a reasonable default. Indeed, the estimation window must be increased a lot to get any substantial reduction in the estimation error. On the other hand, the error explodes when the estimation window is too short.

Ultimately, the optimal estimation window depends on the spread, volatility, and trading frequency, and it is specific to the use case. I hope the code below helps to play with these parameters and better understand how much data to use for estimation:

```r
library(bidask)
set.seed(123)

spr <- 0.01  # 1% bid-ask spread
vol <- 0.03  # 3% daily volatility
trd <- 390   # 390 trades per day

# Simulate 10000 open/high/low/close daily prices
x <- sim(n = 10000, trades = trd, spread = spr, volatility = vol)

# Estimate the spread with several estimation windows
metrics <- sapply(3:252, function(width){
  s <- spread(x, width)
  c(
    'width' = width,
    'bias' = mean(s - spr),
    'variance' = var(s - spr),
    'rmse' = sqrt(mean((s - spr)^2))
  )
})

# Plot the results
plot(x = metrics['width',], y = metrics['bias',], log = "x", main = "Bias", xlab = "Estimation window (days)", ylab = "Bias")
plot(x = metrics['width',], y = metrics['variance',], log = "x", main = "Variance", xlab = "Estimation window (days)", ylab = "Variance")
plot(x = metrics['width',], y = metrics['rmse',], log = "x", main = "Root Mean Squared Error", xlab = "Estimation window (days)", ylab = "RMSE")
```
Great library, I love the pseudocode and python implementation.
I read the paper at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3892335 but cannot figure out what the recommended number of historical days to use is. Is 3 enough, a month, a year? Is it possible to use too much data if I just want to estimate today's spread?