Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Caching and Docs Update #121

Merged
merged 18 commits into from
Aug 4, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
75 changes: 75 additions & 0 deletions docs/advanced/caching.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
Caching
=======

Context
-------

There are advantages and disadvantages when fetching data from remote storage (e.g., buckets).
Remote storage offers conveniences, such as easy data sharing between applications and users.
Additionally, remote storage is often more cost-effective than local storage for long-term data
repositories, which is particularly beneficial for large datasets.

Despite these advantages, there are potential drawbacks to fetching data from remote storage.
One significant concern is slower data retrieval, which can impact the performance of your Squirrel
application. Retrieving data over a network connection introduces additional latency over accessing
data locally. Another factor to consider is the pricing structure associated with remote storage.
Typically, remote storage costs are not solely based on how long which amount of data is stored but
also on the amount of data transferred across the network and the number of requests made when
retrieving the data. Consequently, you may incur extra costs, especially if you need to fetch a
large number of shards across the network connection.

Caching to the Rescue
---------------------
In Machine Learning workloads, models are often trained over multiple epochs, meaning you may need
to fetch the same data multiple times during a run. To optimize this process, imagine if you could
load the remote data only in the first pass (first epoch), store it locally, and subsequently
access the data from the fast and inexpensive local disc or RAM (e.g., using ``/tmp`` which is
typically residing on RAM using tmpfs). Precisely this functionality is offered through caching.

Squirrel leverages the capabilities of the ``fsspec`` library, which includes a powerful caching
feature out of the box. Each default Squirrel :py:class:`~squirrel.driver.Driver` such as
:py:class:`~squirrel.driver.DataFrameDriver`, :py:class:`~squirrel.driver.FileDriver`, or
:py:class:`~squirrel.driver.StoreDriver` accepts a ``storage_options`` argument, which is a
dictionary passed down to the ``fsspec`` filesystem. This dictionary allows you to configure caching,
among other things. For more detailed information, please refer to the ``fsspec``
`documentation <https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally>`_
on local caching.

The code below shows an example of configuring caching for several drivers. Note that, as per the
``fsspec`` documentation, only ``simplecache`` is "guaranteed thread/process-safe".

.. code-block:: python

from squirrel.driver import CsvDriver, FileDriver, MessagepackDriver
axkoenig marked this conversation as resolved.
Show resolved Hide resolved

axkoenig marked this conversation as resolved.
Show resolved Hide resolved
so = {"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

CsvDriver("gs://bucket/data.csv", engine="pandas", storage_options=so) # inherits from DataFrameDriver
MessagepackDriver("gs://bucket/data-dir", storage_options=so) # inherits from StoreDriver
FileDriver("gs://bucket/file.txt", storage_options=so)

Let's observe the performance benefits of caching in action. The below code compares the performance of
a :py:class:`~squirrel.driver.MessagepackDriver` with and without caching. The generated plot shows that
the non-cached driver has a similar loading speed for all epochs. However, the cached driver stores the
data on the local disk in the first epoch and reads it from the local disk in the subsequent epochs,
making it much faster than the non-cached driver. You can check the code to generate the below figure
out `here <https://github.com/merantix-momentum/squirrel-core/blob/main/squirrel/benchmark/msgpack_caching.py>`_.

.. image:: ./msgpack_caching.svg
:align: center
:alt: MessagepackDriver with and without Caching

Storage Options and Catalogs
----------------------------

When using the :py:class:`~squirrel.catalog.Catalog`-API, some users will have written some
``storage_options`` to the catalog (e.g., that their Google Cloud Service (GCS) bucket is
``requester_pays=True``). As a new user, you might now want to provide additional ``storage_options``
(e.g., for caching). As shown below in the code, you can do so when you call
:py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver` on the
:py:class:`~squirrel.catalog.catalog.CatalogSource`. Squirrel ensures that the new
``storage_options`` passed to :py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver`
are merged with the pre-existing ``storage_options``.

.. literalinclude:: ../examples/catalog_so_update.py
:language: python
Loading