Consider this example: you have an OHLC(t) price time series for t in [0, 2, 4], where prices for t=1 and t=3 are missing (e.g. the asset was not trading at times t=1 and t=3).
As far as I understand it, EDGE requires at least two pairs of contiguous (consecutive) times (e.g. t, t+1, t+c, t+c+1 for some c>0) to estimate the bid-ask spread for time t. Like, t = [0, 1, 5, 6] or [1, 2, 3].
So in the example, EDGE cannot (correctly) produce an estimate because there are no contiguous/consecutive times. Is there a proper way around this?
It seems it would be incorrect to simply feed the data for t=0, 2, 4 into the EDGE estimator code (without NaNs) because differences in subsequent log prices ("log returns") would be exaggerated. These are used in quantities like:
r2, r3, and r5 would be exaggerated above whenever t skips over a time period because of the time gap.
On the other hand, including NaN / missing values in the data being fed to EDGE doesn't seem right either. For instance, if you replace every odd row of the test dataset with NaN (missing values), then the EDGE estimate would be undefined because r2, r3, and r5 above would be all NaN. Is it true that there is no valid bid-ask estimate in such a case?
If one excludes the missing rows from the data being passed, but also passes a new integer column t, one could try a corrective factor. I have played around with this unsuccessfully, for example:
r2 = (o - m1) / f(t - t1)
# same for r3, r5
Where f(t-t1) is a correction factor function such that f(1) equals 1, and presumably monotonically increasing. For volatility estimators, I know that f(x) = sqrt(x) is usually the appropriate correction factor for log returns (this follows from the Wiener process). I tried sqrt and other powers (like 0 or 1), but all introduce bias.
Is this something where the math behind EDGE breaks down and there is no appropriate correction to be made for non-consecutive t? If so, the same concern applies to any gaps in trading times, like whenever the market is closed.
Sorry for the delay on this, and great point! This sounds like a research idea rather than an issue 🙂 The reason is that, as you say, dropping missing values generally introduces heteroscedasticity and the variance of the estimator can greatly increase (but it should be ok for the bias). What I've noticed in my experience is that it is generally better to keep missing values (so as to keep a regular time grid) and let the estimator work on the remaining data. Dropping even a small fraction of missing values may introduce large log-returns and make the estimation variance unstable. Of course, in the limit case where you have missing values every other period, it is not possible to estimate the spread on the original grid. However, in this (or a similar) case it is definitely possible to take every second observation and estimate the spread from that subset.
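To make the "keep a regular time grid" suggestion concrete, here is a minimal sketch using pandas. The prices, the column names, and the integer index `t` are made up for illustration; they are not taken from the package or its test dataset. The sketch only shows the reindexing step and counts how many one-period log returns survive; the resulting NaN-padded columns are what one would then hand to the estimator.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse OHLC data: bars exist only at t = 0, 1, 5, 6.
df = pd.DataFrame(
    {
        "open": [100.0, 100.5, 101.0, 100.8],
        "high": [100.9, 101.2, 101.6, 101.3],
        "low": [99.6, 100.1, 100.4, 100.2],
        "close": [100.4, 100.9, 100.9, 101.0],
    },
    index=pd.Index([0, 1, 5, 6], name="t"),
)

# Put the data back on a regular grid: missing periods become all-NaN rows
# instead of being silently dropped.
full_index = pd.RangeIndex(df.index.min(), df.index.max() + 1, name="t")
regular = df.reindex(full_index)

# A one-period log return is defined only where both t and t-1 are observed,
# so gaps produce NaN rather than an inflated multi-period return.
one_period = np.log(regular["close"]).diff()
n_valid = int(one_period.notna().sum())

print(regular)
print("valid one-period returns:", n_valid)
```

With bars at t = [0, 1, 5, 6], only the pairs (0, 1) and (5, 6) contribute, which matches the "two pairs of consecutive times" intuition from the question. With bars only at t = [0, 2, 4], no one-period return survives on this grid, and the alternative is to treat the observed bars as a coarser, regular two-period grid.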
In general, I do not know how to optimally weight the observations to compute the spread with missing data. The issue is that they include both the volatility and the spread. So the square root scaling does not work because part of the return is driven by spread and that scaling would apply only to volatility.
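A rough way to see why the square-root scaling fails, under a simplified Roll-style model that is an assumption here and not EDGE's actual moment conditions: let the observed log price be $p_t = m_t + \tfrac{S}{2} q_t$, where $m_t$ is a random walk with per-period variance $\sigma^2$, $S$ is the spread, and $q_t = \pm 1$ with equal probability, independent across time and of $m_t$. For a return spanning a gap of $k$ periods:

$$
\operatorname{Var}(p_t - p_{t-k}) = k\sigma^2 + \frac{S^2}{2},
\qquad
\operatorname{Var}\!\left(\frac{p_t - p_{t-k}}{\sqrt{k}}\right) = \sigma^2 + \frac{S^2}{2k}.
$$

Dividing by $\sqrt{k}$ restores the volatility term to its one-period value but shrinks the spread term by a factor of $k$, while not rescaling at all leaves the spread term intact but inflates the volatility term. In this toy model no single factor $f(k)$ corrects both components at once.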
eguidotti changed the title from "handling gaps where time is not consecutive / contiguous" to "Handling gaps where time is not consecutive / contiguous" on Feb 14, 2025