Catalog v2 #991
Comments
Hi @limx0, my suggestions are as follows.

Some feedback and example usage for the modified catalog using pandas is below. It includes some extra helper functions which could be added as features in the new catalog.

```python
import time

import pandas as pd

# Assumes `catalog` is an instance of the modified (pandas-based) catalog and
# that BarType has been imported from nautilus_trader.
bar_type = BarType.from_str("EUR/USD.IDEALPRO-1-MINUTE-MID-EXTERNAL")
start_time = pd.Timestamp("2005-01-01")
end_time = pd.Timestamp("2022-12-31")
as_nautilus = False

# Load bars for the given BarType
perf_start = time.perf_counter()
bars = catalog.bars(
    instrument_ids=[bar_type.instrument_id.value],
    filter_expr=("bar_type", "=", str(bar_type)),
    as_nautilus=as_nautilus,
    start=start_time,
    end=end_time,
)
print(f"{as_nautilus=}, loaded {len(bars)} Bars, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=False, loaded 6564442 Bars, in 8.4057 seconds
# RESULT: as_nautilus=True, loaded 6564442 Bars, in 89.0130 seconds

# Load bars parquet as QuoteTicks (loads BID and ASK bars and then converts them to ticks)
perf_start = time.perf_counter()
ticks_data = catalog.bars_to_ticks(
    instrument_ids=[bar_type.instrument_id.value],
    start=start_time,
    end=end_time,
)
print(f"as_nautilus=True, loaded {len(ticks_data)} Ticks, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=True, loaded 26252848 Ticks, in 117.1738 seconds

# Resample bars to the given BarType from a smaller timeframe, for loading directly
# into the backtest engine or other analysis
perf_start = time.perf_counter()
bar_type = BarType.from_str("EUR/USD.IDEALPRO-5-MINUTE-MID-EXTERNAL")
bars = catalog.bars_resample(
    instrument_ids=[bar_type.instrument_id.value],
    price_type=bar_type.spec.price_type,
    bar_delta=bar_type.spec.timedelta,
    as_nautilus=as_nautilus,
    start=start_time,
    end=end_time,
    raise_on_empty=False,
    source_bar_delta=pd.Timedelta(minutes=1),
)
print(f"{as_nautilus=}, loaded {len(bars)} Bars, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=False, loaded 1312949 Bars, in 9.0769 seconds
# RESULT: as_nautilus=True, loaded 1312949 Bars, in 28.1769 seconds
```
Hey @rsmb7z, thanks for the feedback. Ultimately we need to pass Nautilus objects into the engine. Taking the pandas route looks more performant above when compared with our current implementation. What we will be taking a look at, which you correctly mention above, is the number of files we generate, which has a big impact on performance. We opted for a one-size-fits-all approach, but I think in v2 we will pass more of this responsibility (and flexibility) onto the user, guiding them to best practices (for example: efficient_querying).
Some totally contrived and very initial timings can be found in a notebook here. More research is required, though, regarding our requirements and actual benchmarks.
We don't need to iterate the files; is there a particular reason to read file by file? Can you try with a single request just like the others:
Also setting
Nice, thanks @rsmb7z - updated notebook here. Pandas is now on par with Polars for a simple read, but I imagine most of this will be lost when we do a proper benchmark.
After a couple more tests I see that DataFusion is leading the way, leaving us with the following challenges:
What about moving the queries and conversion into Nautilus objects to separate thread(s)? Then the backtest engine thread would receive ready data.
Thanks Brad for opening this thread. Below are some of my thoughts on the topic.

I agree Nautilus should strive to leverage other projects for data management, for sustainability. Generally speaking, as long as you adopt a single standard that is well documented and well supported, we'll be okay. I've become a fan of the underlying Parquet storage format, and would be glad to see the project continue with it.

When I think about the catalog, I am primarily concerned with having one set of data that I can reuse for multiple purposes (input to Nautilus, direct data analytics, input to non-Nautilus projects). I get most of my data from a few providers (Polygon, others), and when we're dealing with 10-15 years of tick/trade-level data across 1000s of equities, it is challenging to store that data more than once. I can't afford to store it more than twice (1 raw copy + 1 curated copy).

So, what I want is a storage format that allows me to store all the attributes I get from a provider (even ones Nautilus doesn't natively recognize as being part of a tick or bar), so the stored data can be natively consumed by Nautilus as well as other tools (DuckDB, DVC, etc.). It would be ideal if Nautilus natively provided a way to extend the base Nautilus dataset with my custom attributes. If I had to store extra attributes myself, the Nautilus data storage layout should be transparent enough for me to easily align my non-Nautilus data to my Nautilus-native data in a single dataset, and Nautilus should be tolerant of such extensions to the data schema (e.g. Nautilus should not break just because I added data elements to the catalog which it doesn't make use of).

It would help me if I could represent the catalog as a DVC-style data registry. This would let me build repositories using DVC to index into the master catalog of data, allowing me to tier my storage according to what data gets used the most. It seems like Plateau and DVC could work in a complementary fashion, but I'm not 100% on that since Plateau is new to me. I like the way Plateau abstracts some of the data storage implementation and lets you work with factories and data frames.
Another question that has been on my mind is the handling of data with different frequencies - some features only change on a daily basis, others change at the open/close of a market, etc. For that data, it is inefficient to embed it with higher-frequency data (trade/tick/minute), so it would be good to have frequency-aligned catalogs that get 'merged' with other frequencies during a backtest simulation. Today we've been thinking about accomplishing this with separate catalogs based on frequency, each loaded as a separate data config to the node, and subscribed to separately within our strategies (e.g. the strategy is aware of the data frequency, and subscribes appropriately).
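For readers following along, a rough sketch of the separate-config-per-frequency idea described above. This is purely illustrative: the BacktestDataConfig field names are assumed from the Nautilus configs of the time and may differ between releases, and the catalog paths, instrument ID, and custom data class are hypothetical.

```python
from nautilus_trader.config import BacktestDataConfig  # import path and fields assumed; may differ by release

# Hypothetical: one config per data frequency, each pointing at a frequency-aligned catalog
minute_bars = BacktestDataConfig(
    catalog_path="/data/catalog_minute",          # hypothetical path
    data_cls="nautilus_trader.model.data:Bar",    # fully-qualified class path; exact form depends on release
    instrument_id="EURUSD.IDEALPRO",              # hypothetical instrument ID
)
daily_features = BacktestDataConfig(
    catalog_path="/data/catalog_daily",
    data_cls="my_package.data:DailyFeature",      # hypothetical custom data type
    instrument_id="EURUSD.IDEALPRO",
)

data_configs = [minute_bars, daily_features]  # passed to the backtest node/run config
```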
More features I'd look for in a catalog:
Some of these features may already exist in the catalog; I'm just trying to capture as many as I can remember wanting.
Hey @Harnas - yes, this is something we will definitely look into, but I would like to focus on single-threaded performance first before we try to tackle this. It's also worth noting that much of what we do in the catalog is just to get the performance to an appropriate level (targeting low single-digit microseconds to parse and merge ticks), because the vast majority of the time spent in a backtest will be user-defined code in Python. In my experience thus far, though, I don't think users would notice a 2-3x speed-up from threading if we can hit these benchmarks, because >90% of the backtest time will be spent running user code. We will definitely keep an eye on this though.
Hey @Troubladore, thanks for your thoughts - I'm going to respond to the main points/themes I can see:
We will definitely be striving for this. I will say that the data will be stored for optimal reading by Nautilus (i.e. we will mangle prices to ints, etc.), so some transforms will be required to get it back to a "usable" format such as a DataFrame. But in my experience these workflows (research etc.) have a much lower bar for performance, so we can simply write some wrapper functions that undo any mangling we do, which I am confident will not be noticed by any users from a performance perspective.
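As a rough illustration of the kind of "unmangling" wrapper meant here (the column names and the fixed-point scale are assumptions for illustration, not the actual catalog encoding):

```python
import pandas as pd

# Illustration only: the price columns and the 1e9 fixed-point scale are assumed,
# not the actual catalog schema.
PRICE_SCALE = 1e9

def unmangle_prices(df: pd.DataFrame, columns=("open", "high", "low", "close")) -> pd.DataFrame:
    """Convert fixed-point integer price columns back to floats for research use."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype("float64") / PRICE_SCALE
    return out

# Usage: research-friendly DataFrame from a raw catalog read, e.g.
# bars_df = unmangle_prices(catalog.bars(...))
```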
This should currently be possible via subclassing of existing types (see BetfairTicker) or creating your own data types. Let me know if you have any more specific examples of this.
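For anyone unfamiliar with custom data types, a minimal sketch might look like the following. The import path and base-class constructor signature are assumptions based on the API at the time and may differ in current releases; the extra attributes are hypothetical.

```python
from nautilus_trader.core.data import Data  # import path assumed; check your release


class VendorQuote(Data):
    """
    Hypothetical custom data type carrying extra vendor attributes alongside the
    standard event/init timestamps.
    """

    def __init__(self, instrument_id: str, vendor_flag: int, ts_event: int, ts_init: int):
        super().__init__(ts_event, ts_init)  # constructor signature assumed; newer APIs differ
        self.instrument_id = instrument_id
        self.vendor_flag = vendor_flag
```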
DVC is a great project; I believe it would be relatively easy to integrate the two, but it won't be something we build into Nautilus.
This should be able to be handled in a single data catalog already. The only caveat may be bars of different intervals (I'm not 100% sure as I don't use them, but we may have fixed this recently) - but we should fix this; you should be able to store different bar types in a single catalog. I can't think of any other types where this is applicable (none of the tick types make sense in my mind), but I'm happy to hear if there are any other specific issues on this front.
Thread-safe reads should be fine; writes may be difficult and are likely out of scope (depending on what we settle on for writing the catalog). You will likely have to come up with your own solution here; it's not something we would want to maintain (again, there are plenty of libraries that could help, and our time is better spent improving single-threaded performance, I think).
This currently exists in the catalog but will likely be removed, due to the maintenance burden; it is difficult to foresee the many types of data users may add to the catalog, and given we are building on well known and documented formats, I think this is best handled by the user.
This is currently how data is written to the Catalog, and it will remain this way. We will endeavour to provide a "best-effort" organisation format without any prior knowledge of the data users may write, but as per my comments above, users may benefit from tailoring or reorganising the Catalog to better suit their individual needs. I hope I've covered everything there - let me know if not, or if you have any follow-up questions / comments.
Hey @rsmb7z - I'm not sure exactly what you mean by 1., but we do go straight to the correct partition (bar/instrument_x) and read only the parquet files under that directory. For 2., yes, you're totally spot on; this is the bottleneck, which will soon be replaced by Rust and Cython (little or no Python).
If this is so, then perfect. I was referring to the old Catalog, which would look up from the root and take ages to go through all partitions even when only a specific partition (instrument_id) was specified.
Yes, that should give us significant performance improvements. Let's test once it is released.
Hey @rsmb7z - I just realised that I misled you with my comment:
We used to do this, but then it got changed in a PR - I am going to re-implement it in an upcoming PR. My apologies!
Dear limx0, do you know of the newly released DataFrame database called ArcticDB? It may help with your data management. Regards,
Hi @TianQingX - yes, I'm aware of ArcticDB (and its predecessor Arctic); it's a great product and could be worth investigating for some users. The setup is not dissimilar to parquet datasets, but potentially with a few more bells and whistles. The downside, from what I can see, would be the cost of deserializing the ArcticDB format into Nautilus objects; otherwise it seems very close to what we're looking for. At this stage, though, we have a good understanding of the Arrow/Parquet formats, and I think they fit our use case best.
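For anyone curious what that setup looks like, a minimal ArcticDB sketch (the storage URI, library name, and symbol are illustrative; the API is the public arcticdb package and worth verifying against its docs):

```python
import pandas as pd
from arcticdb import Arctic  # pip install arcticdb

# Illustration only: local LMDB-backed store; library and symbol names are hypothetical
ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library("bars", create_if_missing=True)

df = pd.DataFrame(
    {"open": [1.10], "high": [1.11], "low": [1.09], "close": [1.105]},
    index=pd.to_datetime(["2022-01-01"]),
)
lib.write("EURUSD-1-MINUTE-MID", df)          # store a DataFrame under a symbol
loaded = lib.read("EURUSD-1-MINUTE-MID").data  # returns the DataFrame back
```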
Hello, I am new to GitHub, programming, and Nautilus. Above there are mentions of storing and unifying formats, if I understand correctly. In the past I saw a Rust project for storing order book data; I just wanted to mention this. Also interesting, or in the longer term, an InfluxDB connector. Kind regards, Bram
I'm using QuestDB to ingest data from crypto exchanges, using the InfluxDB Line Protocol to insert at really high throughput. I just put it here as a good option if you guys consider a full-featured database. Naked files + a simple API is still a better choice if we don't actually need all those luxury features.
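To make the ingestion path concrete, here is a minimal sketch (not the commenter's actual setup) that pushes trades into QuestDB over the InfluxDB Line Protocol TCP endpoint (default port 9009). The table and column names are hypothetical.

```python
import socket
import time

def send_trades_ilp(rows, host="localhost", port=9009):
    """Send (symbol, price, size, ts_ns) rows to QuestDB as InfluxDB Line Protocol text."""
    with socket.create_connection((host, port)) as sock:
        for symbol, price, size, ts_ns in rows:
            # ILP line: <table>,<symbol column> <numeric fields> <timestamp in nanoseconds>
            line = f"trades,symbol={symbol} price={price},size={size} {ts_ns}\n"
            sock.sendall(line.encode("utf-8"))

# Example usage with a single hypothetical trade
send_trades_ilp([("BTC-USDT", 43250.5, 0.012, time.time_ns())])
```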
Background
I'd like to propose a refactor of the data catalog, focusing on a couple of goals:
I'm opening this issue to engage our community and encourage users to share their opinions or ideas on the current implementation of the Data Catalog. Fundamentally, the goal of the catalog is to allow users to store their data in a simple format that can be queried in a performant way to load data into the backtest engine for backtesting. It should be relatively easy for users to get their data into the catalog. A secondary goal is storing data for research purposes.
I should also note there is ongoing work to "rustify" the catalog, but this is mostly related to parsing/deserializing the data retrieved from the catalog, and it will continue / won't be affected by this issue.
Current State
Our (mostly my) initial implementation of the catalog has a couple of issues:
All of this has made the maintenance burden quite high, and hacking on the catalog quite tough. We should consider if this is a core competency of the platform or if we can leverage other projects to achieve our desired outcome.
Proposal
The catalog has two main functions:
This seems like a common use case in the broader Python community currently, and I am quite confident we can achieve our stated goals above by leveraging some existing packages in the Python/Rust ecosystem. I'll split the discussion into writing and reading of the parquet files, as existing packages seem to split along similar lines.
Writing data
Users need to be able to take raw data (tick or bar data typically) from an external source (recorded / purchased / downloaded) and write it into one or many parquet files somewhere (local or remote storage) in the correct format to ensure they can be read again in a performant way for backtesting or research.
At a minimum, I believe we need the following features:
We currently leverage pyarrow datasets for writing, which offer partitioning and writing to remote storage, but leave some of the management of the catalog to us.
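For context, a partitioned pyarrow dataset write looks roughly like the sketch below. The schema, partition column, and base path are illustrative assumptions, not the actual catalog layout.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Illustration only: schema, partition column, and base path are assumed
table = pa.table(
    {
        "instrument_id": ["EURUSD.IDEALPRO"] * 3,
        "price": [110500000, 110510000, 110495000],  # prices stored as fixed-point ints
        "ts_event": [1, 2, 3],
    }
)

ds.write_dataset(
    table,
    base_dir="catalog/data/quote_tick.parquet",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("instrument_id", pa.string())]), flavor="hive"),
    existing_data_behavior="overwrite_or_ignore",
)
```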
I am considering using something like plateau (built on top of pyarrow), which offers a few more features for working with datasets and builds on real-world expertise working with datasets similar to ours. There are a couple of appealing features we don't currently have (or have attempted to implement, but not very well) (taken from https://plateau.readthedocs.io/en/latest/spec/format_specification.html):
Plateau currently only works with DataFrames, so we would need to do some work to integrate with Nautilus objects, but this may be worth doing to get the above "for free".
Reading Data
For reading data, we are looking for the most performant way to read a series of parquet files from a storage backend. Because streaming is a strong use case for backtesting, we ideally need to be able to also stream data from our storage, deserialize it, and pass it into the engine. Currently we implement this reading/querying in the data catalog, but there are several projects of interest doing similar things that we could leverage. We need the following features:
The projects that come to mind are:
Each of these libraries is trying to achieve goals similar to ours, and could potentially be leveraged on the querying side. Ideally we would maintain a simple API over the top of one of these libraries, benefiting from their continued development and improvements to their querying framework. The idea being: we would ship a query off to one of the libraries, which would then handle the querying/filtering and return some sort of iterator of raw Arrow bytes that we could convert (via Rust) into Nautilus objects.
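As a rough sketch of the streaming shape being described, using pyarrow purely for illustration (the dataset path, filter values, and the conversion step are assumptions, not the actual implementation):

```python
import pyarrow.dataset as ds

def convert_batch_to_nautilus(batch):
    # Placeholder for the Rust/Cython deserialization step described above (hypothetical)
    return batch.to_pylist()

# Illustration only: stream record batches from a partitioned Parquet dataset,
# filter by partition/time, then hand each batch to the converter.
dataset = ds.dataset("catalog/data/quote_tick.parquet", format="parquet", partitioning="hive")

scanner = dataset.scanner(
    filter=(ds.field("instrument_id") == "EURUSD.IDEALPRO") & (ds.field("ts_event") >= 0),
    batch_size=1_000_000,
)

for batch in scanner.to_batches():
    objects = convert_batch_to_nautilus(batch)
```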
Other considerations
I'm keen to hear any thoughts or ideas on the above, or anything I may have missed or misunderstood.