Enable Caching and Docs Update (#121)
* Docs of caching
* Enable merging of existing storage_options
* Enable protocol overwrite
* Update tests
* Pre-commit
* Remove cache storage opts
* Remove caching fixture
* Address PR comments
* Change caching path
* Better comment
* Implement test_get_fs_from_url
* Make test more explicit
* Add test on updating storage_options
* Implement caching test
* Pre-commit
* Test Messagepack Driver Caching
* Random paths in cache test
* Version bump
Showing 18 changed files with 1,887 additions and 150 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
Caching | ||
======= | ||
|
||
Context | ||
------- | ||
|
||
There are advantages and disadvantages when fetching data from remote storage (e.g., buckets). | ||
Remote storage offers conveniences, such as easy data sharing between applications and users. | ||
Additionally, remote storage is often more cost-effective than local storage for long-term data | ||
repositories, which is particularly beneficial for large datasets. | ||
|
||
Despite these advantages, there are potential drawbacks to fetching data from remote storage. | ||
One significant concern is slower data retrieval, which can impact the performance of your Squirrel | ||
application. Retrieving data over a network connection adds latency compared with accessing | ||
data locally. Another factor to consider is the pricing structure associated with remote storage. | ||
Typically, remote storage costs depend not only on how much data is stored and for how long but | ||
also on the amount of data transferred across the network and the number of requests made when | ||
retrieving the data. Consequently, you may incur extra costs, especially if you need to fetch a | ||
large number of shards across the network connection. | ||
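The cost argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the dataset size, the shard count, and both prices are hypothetical placeholders, not real provider rates. It assumes one request per shard and compares fetching every shard in every epoch with fetching each shard once.

```python
def transfer_cost(dataset_gb, num_shards, epochs, price_per_gb, price_per_10k_requests, cached):
    """Estimate network and request costs for a training run (hypothetical pricing model)."""
    # With a cache, data crosses the network only during the first epoch.
    passes = 1 if cached else epochs
    egress = dataset_gb * passes * price_per_gb
    # Assumption: one request per shard per pass.
    requests = num_shards * passes * price_per_10k_requests / 10_000
    return egress + requests


# Hypothetical numbers: 500 GB in 20,000 shards, 10 epochs.
uncached = transfer_cost(500, 20_000, 10, 0.10, 0.004, cached=False)
cached = transfer_cost(500, 20_000, 10, 0.10, 0.004, cached=True)
print(f"uncached: {uncached:.2f}, cached: {cached:.2f}")
```

Under this simple model, the transfer and request costs scale with the number of epochs unless the data is kept locally after the first pass.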
|
||
Caching to the Rescue | ||
--------------------- | ||
In machine learning workloads, models are often trained over multiple epochs, meaning you may need | ||
to fetch the same data multiple times during a run. To optimize this process, imagine if you could | ||
load the remote data only in the first pass (first epoch), store it locally, and subsequently | ||
access the data from the fast and inexpensive local disk or RAM (e.g., using ``/tmp``, which | ||
often resides in RAM via tmpfs). Caching offers precisely this functionality. | ||
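The idea described above can be sketched in a few lines of plain Python. This is not how Squirrel or ``fsspec`` implement caching; ``ReadThroughCache`` and ``fake_bucket`` are hypothetical names used only to show that remote fetches happen in the first epoch and later epochs are served from local disk.

```python
import os
import tempfile


class ReadThroughCache:
    """Toy read-through cache: fetch from 'remote' once, then serve from local disk."""

    def __init__(self, fetch_remote, cache_dir):
        self.fetch_remote = fetch_remote  # callable: key -> bytes
        self.cache_dir = cache_dir
        self.remote_fetches = 0

    def read(self, key):
        local_path = os.path.join(self.cache_dir, key)
        if not os.path.exists(local_path):
            # Cache miss: pay the network cost once and store the result locally.
            data = self.fetch_remote(key)
            self.remote_fetches += 1
            with open(local_path, "wb") as f:
                f.write(data)
        # Cache hit (or freshly stored): read from the local copy.
        with open(local_path, "rb") as f:
            return f.read()


# A dict stands in for the remote bucket (assumption for illustration).
fake_bucket = {f"shard_{i}": f"payload {i}".encode() for i in range(4)}
cache = ReadThroughCache(fake_bucket.__getitem__, tempfile.mkdtemp())

for epoch in range(3):
    batch = [cache.read(key) for key in sorted(fake_bucket)]

print(cache.remote_fetches)  # one fetch per shard, regardless of the number of epochs
```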
|
||
Squirrel leverages the capabilities of the ``fsspec`` library, which includes a powerful caching | ||
feature out of the box. Each default Squirrel :py:class:`~squirrel.driver.Driver` such as | ||
:py:class:`~squirrel.driver.DataFrameDriver`, :py:class:`~squirrel.driver.FileDriver`, or | ||
:py:class:`~squirrel.driver.StoreDriver` accepts a ``storage_options`` argument, which is a | ||
dictionary passed down to the ``fsspec`` filesystem. This dictionary allows you to configure caching, | ||
among other things. For more detailed information, please refer to the ``fsspec`` | ||
`documentation <https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally>`_ | ||
on local caching. | ||
|
||
The code below shows an example of configuring caching for several drivers. Note that, as per the | ||
``fsspec`` documentation, only ``simplecache`` is "guaranteed thread/process-safe". | ||
|
||
.. code-block:: python

    from squirrel.driver import CsvDriver, FileDriver, MessagepackDriver

    so = {"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
    CsvDriver("gs://bucket/data.csv", engine="pandas", storage_options=so)  # inherits from DataFrameDriver
    MessagepackDriver("gs://bucket/data-dir", storage_options=so)  # inherits from StoreDriver
    FileDriver("gs://bucket/file.txt", storage_options=so)

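To try the same ``simplecache`` options without a cloud bucket, the sketch below uses ``fsspec`` directly. It assumes ``fsspec`` is installed and uses its in-memory filesystem as a stand-in for remote storage; the path ``/bucket/data.txt`` and the temporary cache directory are made up for illustration.

```python
import os
import tempfile

import fsspec

# Stand-in for a remote bucket: fsspec's in-memory filesystem (illustrative assumption).
remote = fsspec.filesystem("memory")
remote.pipe("/bucket/data.txt", b"hello squirrel")

# Same option names as in the storage_options dictionary above,
# except that the target is "memory" instead of "gs".
cache_dir = tempfile.mkdtemp()
fs = fsspec.filesystem("simplecache", target_protocol="memory", cache_storage=cache_dir)

# First read copies the file into cache_dir; later reads open the local copy.
with fs.open("/bucket/data.txt", "rb") as f:
    first_read = f.read()
with fs.open("/bucket/data.txt", "rb") as f:
    second_read = f.read()

print(first_read == second_read, os.listdir(cache_dir))
```

After the first read, the cache directory contains a local copy of the file, which is what subsequent epochs would read from.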
Let's observe the performance benefits of caching in action. The benchmark below compares the performance of | ||
a :py:class:`~squirrel.driver.MessagepackDriver` with and without caching. The generated plot shows that | ||
the non-cached driver has a similar loading speed for all epochs. However, the cached driver stores the | ||
data on the local disk in the first epoch and reads it from the local disk in the subsequent epochs, | ||
making it much faster than the non-cached driver. The code that generates the figure below is available | ||
`here <https://github.com/merantix-momentum/squirrel-core/blob/main/squirrel/benchmark/msgpack_caching.py>`_. | ||
|
||
.. image:: ./msgpack_caching.svg
    :align: center
    :alt: MessagepackDriver with and without Caching
|
||
Storage Options and Catalogs | ||
---------------------------- | ||
|
||
When using the :py:class:`~squirrel.catalog.Catalog` API, other users may already have stored | ||
``storage_options`` in the catalog (e.g., that their Google Cloud Storage (GCS) bucket requires | ||
``requester_pays=True``). As a new user, you might now want to provide additional ``storage_options`` | ||
(e.g., for caching). As shown in the code below, you can do so when you call | ||
:py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver` on the | ||
:py:class:`~squirrel.catalog.catalog.CatalogSource`. Squirrel ensures that the new | ||
``storage_options`` passed to :py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver` | ||
are merged with the pre-existing ``storage_options``. | ||
|
||
.. literalinclude:: ../examples/catalog_so_update.py
    :language: python
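The merging behaviour described above can be pictured with plain dictionaries. The sketch below is not Squirrel's implementation: ``merge_storage_options`` is a hypothetical helper, and it assumes a shallow merge in which newly passed keys take precedence over keys already stored in the catalog. Squirrel's actual precedence rules may differ.

```python
def merge_storage_options(existing, new):
    """Hypothetical shallow merge: keep existing options, add or override with new ones."""
    merged = dict(existing or {})  # copy so the catalog's options are not mutated
    merged.update(new or {})
    return merged


# Options already stored in the catalog by another user.
catalog_so = {"requester_pays": True}
# Options a new user passes for caching.
user_so = {"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

merged = merge_storage_options(catalog_so, user_so)
print(merged)
```

The merged dictionary keeps ``requester_pays`` from the catalog and adds the caching keys, so the driver receives both sets of options.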