Enable Caching and Docs Update (#121)

* Docs of caching
* Enable merging of existing storage_options
* Enable protocol overwrite
* Update tests
* Pre-commit
* Remove cache storage opts
* Remove caching fixture
* Address PR comments
* Change caching path
* Better comment
* Implement test_get_fs_from_url
* Make test more explicit
* Add test on updating storage_options
* Implement caching test
* Pre-commit
* Test Messagepack Driver Caching
* Random paths in cache test
* Version bump
merantix-momentum · Aug 4, 2023 · 1fc20b7
1 parent fab86c3
commit 1fc20b7

Showing 18 changed files with 1,887 additions and 150 deletions.
diff --git a/docs/advanced/caching.rst b/docs/advanced/caching.rst
@@ -0,0 +1,75 @@
+Caching
+=======
+
+Context
+-------
+
+There are advantages and disadvantages when fetching data from remote storage (e.g., buckets).
+Remote storage offers conveniences, such as easy data sharing between applications and users.
+Additionally, remote storage is often more cost-effective than local storage for long-term data
+repositories, which is particularly beneficial for large datasets.
+
+Despite these advantages, there are potential drawbacks to fetching data from remote storage.
+One significant concern is slower data retrieval, which can impact the performance of your Squirrel
+application. Retrieving data over a network connection adds latency compared with accessing
+data locally. Another factor to consider is the pricing structure associated with remote storage.
+Typically, remote storage costs are based not solely on how much data is stored and for how long, but
+also on the amount of data transferred across the network and the number of requests made when
+retrieving the data. Consequently, you may incur extra costs, especially if you need to fetch a
+large number of shards across the network connection.
+
+Caching to the Rescue
+---------------------
+In machine learning workloads, models are often trained over multiple epochs, meaning you may need
+to fetch the same data multiple times during a run. To optimize this process, imagine if you could
+load the remote data only in the first pass (first epoch), store it locally, and subsequently
+access the data from the fast and inexpensive local disk or RAM (e.g., using ``/tmp``, which is
+often backed by RAM via tmpfs). Caching offers precisely this functionality.
+
+Squirrel leverages the capabilities of the ``fsspec`` library, which includes a powerful caching
+feature out of the box. Each default Squirrel :py:class:`~squirrel.driver.Driver`, such as
+:py:class:`~squirrel.driver.DataFrameDriver`, :py:class:`~squirrel.driver.FileDriver`, or
+:py:class:`~squirrel.driver.StoreDriver`, accepts a ``storage_options`` argument, which is a
+dictionary passed down to the ``fsspec`` filesystem. This dictionary allows you to configure caching,
+among other things. For more detailed information, please refer to the ``fsspec``
+`documentation <https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally>`_
+on local caching.
+
+The code below shows an example of configuring caching for several drivers. Note that, as per the
+``fsspec`` documentation, only ``simplecache`` is "guaranteed thread/process-safe".
+
+.. code-block:: python
+
+    from squirrel.driver import CsvDriver, FileDriver, MessagepackDriver
+
+    so = {"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
+
+    CsvDriver("gs://bucket/data.csv", engine="pandas", storage_options=so)  # inherits from DataFrameDriver
+    MessagepackDriver("gs://bucket/data-dir", storage_options=so)  # inherits from StoreDriver
+    FileDriver("gs://bucket/file.txt", storage_options=so)
+
+Let's observe the performance benefits of caching in action. The figure below compares the loading
+speed of a :py:class:`~squirrel.driver.MessagepackDriver` with and without caching. The plot shows that
+the non-cached driver loads at a similar speed in every epoch. The cached driver, in contrast, stores the
+data on the local disk during the first epoch and reads it from the local disk in subsequent epochs,
+making it much faster than the non-cached driver. The code that generates the figure is available
+`here <https://github.com/merantix-momentum/squirrel-core/blob/main/squirrel/benchmark/msgpack_caching.py>`_.
+
+.. image:: ./msgpack_caching.svg
+   :align: center
+   :alt: MessagepackDriver with and without Caching
+
+Storage Options and Catalogs
+----------------------------
+
+When using the :py:class:`~squirrel.catalog.Catalog` API, other users may already have written
+``storage_options`` to the catalog (e.g., to declare that their Google Cloud Storage (GCS) bucket is
+``requester_pays=True``). As a new user, you might want to provide additional ``storage_options``
+(e.g., for caching). As the code below shows, you can do so when you call
+:py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver` on the
+:py:class:`~squirrel.catalog.catalog.CatalogSource`. Squirrel ensures that the new
+``storage_options`` passed to :py:func:`~squirrel.catalog.catalog.CatalogSource.get_driver`
+are merged with the pre-existing ``storage_options``.
+
+.. literalinclude:: ../examples/catalog_so_update.py
+    :language: python
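
Note on the merge described in "Storage Options and Catalogs": the included example file is not shown in this diff, so the following is only a minimal sketch of what merging two ``storage_options`` dictionaries could look like. The helper ``merge_storage_options`` is hypothetical and not part of Squirrel, and the precedence shown (newly passed keys override stored ones) is an assumption rather than confirmed behavior.

```python
from typing import Any, Dict, Optional


def merge_storage_options(
    existing: Optional[Dict[str, Any]],
    new: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper (not a Squirrel API): combine two option dicts.

    Assumption: keys in ``new`` take precedence over keys in ``existing``.
    Neither input dictionary is modified.
    """
    merged = dict(existing or {})  # shallow copy so the stored options stay untouched
    merged.update(new or {})       # newly passed options win on key conflicts
    return merged


# Options assumed to be stored in the catalog already.
catalog_so = {"requester_pays": True}

# Additional caching options, taken from the caching example in the docs.
caching_so = {
    "protocol": "simplecache",
    "target_protocol": "gs",
    "cache_storage": "path/to/cache",
}

merged_so = merge_storage_options(catalog_so, caching_so)
```

Under these assumptions, ``merged_so`` carries both the ``requester_pays`` flag and the caching settings, while ``catalog_so`` is left unchanged.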