From 0d6ca138555c6ae924826c233192254e43db4d93 Mon Sep 17 00:00:00 2001
From: Martin Durant <martin.durant@alumni.utoronto.ca>
Date: Mon, 17 Oct 2022 14:24:11 -0400
Subject: [PATCH 1/4] Very preliminary draft

---
 draft/ZEP0003.md | 115 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 draft/ZEP0003.md

diff --git a/draft/ZEP0003.md b/draft/ZEP0003.md
new file mode 100644
index 0000000..3b03a57
--- /dev/null
+++ b/draft/ZEP0003.md
@@ -0,0 +1,115 @@
+---
+layout: default
+title: ZEP0000
+description: Variable chunk sizes
+parent: template
+nav_order: 1
+---
+
+# ZEP 3 — Variable chunking
+
+---
+
+Authors:
+* Martin Durant ([@martindurant](https://github.com/martindurant)), Anaconda, Inc.
+
+Status: Draft
+
+Type: Specification
+
+Created: 2022-10-17
+
+## Abstract
+
+To allow the chunks of a zarr array to be hyperrectangles rather than hypercubes, with the chunk
+lengths along any dimension a list of integers rather than a single chunk size.
+
+## Motivation and Scope
+
+Two specific use cases have motivated this, given below. However, this generalisation of Zarr's storage
+model can be seen as an optional enhancement, and the same data model as currently used by dask.array.
+
+- when producing a [kerchunked](https://github.com/fsspec/kerchunk) dataset, the native chunking of the targets 
+cannot be changed. It is common
+to have non-regular chunking on one dimension, such as a time dimension with one sample per day and chunks 
+of one month or one year. The change would allow these datasets to be read via kerchunk, and/or converted to
+zarr with equivalent chunking to the original
+- [awkward](https://github.com/scikit-hep/awkward) arrays, ragged arrays and sparse data can be represented as
+a set of one-dimensional arrays, with an appropriate metadata description convention. The size of a chunks
+of each component array corresponding to a logical chunk of the overall array will not, in general be equal
+with each other in a single chunk, nor consistent between chunks.
+- in some cases, parts of the overall data array may have very different data distributions, and it can
+be very convenient to partition the data by such characteristics to allow, for example, for more efficient encoding
+schemes.
+
+## Usage and Impact
+
+
+
+## Backward Compatibility
+
+
+This change is fully backward compatible - all old data will remain usable. However, data written with
+variable chunks will not be readable by older versions of Zarr.
+
+## Detailed description
+
+Currently, array metadata looks something like
+```json
+{
+    "chunks": [
+        10, 10
+    ],
+    "shape": [
+        100, 100
+    ],
+    ...
+}
+```
+
+The proposal is to allow metadata of the form
+```json
+{
+    "chunks": [
+        [5, 5, 5, 15, 15, 20, 35], 10
+    ],
+    "shape": [
+        100, 100
+    ],
+    ...
+}
+```
+Each element of `chunks`, corresponding to each dimension of the array, may be a single integer, as before,
+or a list of integers which add up to the size of the array in that dimension. In this example, the single value
+of `10` for the chunks on the second dimension would be identical to `[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]`.
+The number of values in the list is equal to the number of chunks along that dimension.
+
+The data index bounds on a dimension of each hyperrectangle is formed by a cumulative sum of the chunks values, 
+starting at 0.
+```
+bounds_axis0 = [0, 5, 10, 15, 30, 45, 65, 100]
+bounds_axis1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
+```
+such that key "c0/0" contains values for indeces along the first dimension (0, 5] and (0, 10] on the second dimension.
+An array index of (17, 17) would be found in key "c3/1", index (2, 2).
+
+## Related Work
+
+
+## Implementation
+
+It is to be hoped that much code can be adapted from dask.array, which already allows variable chunk sizes
+on each dimension.
+
+## Alternatives
+
+
+## Discussion
+
+
+## References and Footnotes
+
+
+## Copyright
+
+This document has been placed in the public domain.

From 3dcef6e7119646ba1db3c2452d9cfd58ddcc0af6 Mon Sep 17 00:00:00 2001
From: Martin Durant <martin.durant@alumni.utoronto.ca>
Date: Tue, 18 Oct 2022 14:38:45 -0400
Subject: [PATCH 2/4] Add text from hackmd

---
 draft/ZEP0003.md | 100 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 23 deletions(-)

diff --git a/draft/ZEP0003.md b/draft/ZEP0003.md
index 3b03a57..086b00c 100644
--- a/draft/ZEP0003.md
+++ b/draft/ZEP0003.md
@@ -21,7 +21,8 @@ Created: 2022-10-17
 
 ## Abstract
 
-To allow the chunks of a zarr array to be hyperrectangles rather than hypercubes, with the chunk
+To allow the chunks of a zarr array to be rectangular grid rather than a regular grid, 
+with the chunk
 lengths along any dimension a list of integers rather than a single chunk size.
 
 ## Motivation and Scope
@@ -31,58 +32,64 @@ model can be seen as an optional enhancement, and the same data model as current
 
 - when producing a [kerchunked](https://github.com/fsspec/kerchunk) dataset, the native chunking of the targets 
 cannot be changed. It is common
-to have non-regular chunking on one dimension, such as a time dimension with one sample per day and chunks 
+to have non-regular chunking on at least one dimension, such as a time dimension with one sample per day and chunks 
 of one month or one year. The change would allow these datasets to be read via kerchunk, and/or converted to
-zarr with equivalent chunking to the original
+zarr with equivalent chunking to the original. Such data cannot currently be represented in zarr.
 - [awkward](https://github.com/scikit-hep/awkward) arrays, ragged arrays and sparse data can be represented as
 a set of one-dimensional arrays, with an appropriate metadata description convention. The size of a chunks
 of each component array corresponding to a logical chunk of the overall array will not, in general be equal
-with each other in a single chunk, nor consistent between chunks.
+with each other in a single chunk, nor consistent between chunks, as each row in the matrix can have a variable number
+of non-zero values
+- sensor data, may not come in fixed increments; variably chunked storage would be great for parallel writing. 
+With variable chunk sizes, just need to make sure offsets are 
+correct once done. Otherwise, write locations for chunks are dependent on previous chunks.
 - in some cases, parts of the overall data array may have very different data distributions, and it can
 be very convenient to partition the data by such characteristics to allow, for example, for more efficient encoding
 schemes.
+- when filtering regular table data on one column and applying to other columns, you necessarily end up with an unequal
+number of values in each chunk, which zarr does not currently handle.
 
 ## Usage and Impact
 
+### Creation
+
+```python
+zarr.create(1000, chunks=((100, 300, 500, 100),))
+```
 
 
 ## Backward Compatibility
 
 
 This change is fully backward compatible - all old data will remain usable. However, data written with
-variable chunks will not be readable by older versions of Zarr.
+variable chunks will not be readable by older versions of Zarr. It would be reasonable to  wish to backport the
+feature to v2.
 
 ## Detailed description
 
-Currently, array metadata looks something like
+Currently, the array metadata specifies the chunking scheme like
+(see https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid)
 ```json
 {
-    "chunks": [
-        10, 10
-    ],
-    "shape": [
-        100, 100
-    ],
-    ...
+  "type": "regular", 
+  "chunk_shape": [10, 10],
+  "separator":"/"
 }
 ```
 
 The proposal is to allow metadata of the form
 ```json
 {
-    "chunks": [
-        [5, 5, 5, 15, 15, 20, 35], 10
-    ],
-    "shape": [
-        100, 100
-    ],
-    ...
+  "type": "rectangular", 
+  "chunk_shape": [[5, 5, 5, 15, 15, 20, 35], 10],
+  "separator":"/"
 }
 ```
-Each element of `chunks`, corresponding to each dimension of the array, may be a single integer, as before,
+Each element of `chunk_shape`, corresponding to each dimension of the array, may be a single integer, as before,
 or a list of integers which add up to the size of the array in that dimension. In this example, the single value
 of `10` for the chunks on the second dimension would be identical to `[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]`.
-The number of values in the list is equal to the number of chunks along that dimension.
+The number of values in the list is equal to the number of chunks along that dimension. Thus, a "rectangular"
+grid may be fully compatible as a "regular" grid.
 
 The data index bounds on a dimension of each hyperrectangle is formed by a cumulative sum of the chunks values, 
 starting at 0.
@@ -90,11 +97,47 @@ starting at 0.
 bounds_axis0 = [0, 5, 10, 15, 30, 45, 65, 100]
 bounds_axis1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
 ```
-such that key "c0/0" contains values for indeces along the first dimension (0, 5] and (0, 10] on the second dimension.
+such that key "c0/0" contains values for indices along the first dimension (0, 5] and (0, 10] on the second dimension.
 An array index of (17, 17) would be found in key "c3/1", index (2, 2).
 
 ## Related Work
 
+### Dask
+
+`dask.array` uses rectangular chunking internally, and is one of the major consumers of zarr data. Much of the
+code translating logical slices into slices on the individual chunks should be reusable.
+
+### Parquet/ Arrow
+
+Arrow describes tables as a collection of record batches. There is no restriction on the size of these batches. 
+This is not only very flexible, but can be used as an indexing strategy for low cardinality columns within parquet.
+
+```
+dataset_name/
+  year=2007/
+    month=01/
+       0.parq
+       1.parq
+       ...
+    month=02/
+       0.parq
+       1.parq
+       ...
+    month=03/
+    ...
+  year=2008/
+    month=01/
+    ...
+  ...
+```
+
+This feature was cited as one of the reasons parquet was chose over zarr for dask
+dataframes: https://github.com/dask/dask/issues/1599
+
+### awkward array
+
+https://github.com/zarr-developers/zarr-specs/issues/62
+
 
 ## Implementation
 
@@ -103,12 +146,23 @@ on each dimension.
 
 ## Alternatives
 
+### Just tune chunk sizes
+
+https://github.com/zarr-developers/zarr-specs/issues/62#issuecomment-1100806513
+
 
 ## Discussion
 
 
 ## References and Footnotes
 
+* Previous discussion:
+	* [Zarr Dask Table dask/dask#1599](https://github.com/dask/dask/issues/1599)
+	* [Protocol extensions for awkward arrays zarr-developers/zarr-specs#62](https://github.com/zarr-developers/zarr-specs/issues/62) 
+	* [Handling arrays with non-uniform chunking zarr-developers/zarr-specs#40](https://github.com/zarr-developers/zarr-specs/issues/40)
+	* [Chunk spec zarr-developers/zarr-spec#7](https://github.com/zarr-developers/zarr-specs/issues/7#issuecomment-468127219)
+
+
 
 ## Copyright
 

From e738f4c9e282b09ba160860781a3b325e50f76c4 Mon Sep 17 00:00:00 2001
From: Sanket Verma <svsanketverma5@gmail.com>
Date: Fri, 21 Oct 2022 18:03:27 +0530
Subject: [PATCH 3/4] Update ZEP0003.md

---
 draft/ZEP0003.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/draft/ZEP0003.md b/draft/ZEP0003.md
index 086b00c..a976999 100644
--- a/draft/ZEP0003.md
+++ b/draft/ZEP0003.md
@@ -1,15 +1,13 @@
 ---
 layout: default
-title: ZEP0000
+title: ZEP0003
 description: Variable chunk sizes
-parent: template
-nav_order: 1
+parent: draft ZEPs
+nav_order: 3
 ---
 
 # ZEP 3 — Variable chunking
 
----
-
 Authors:
 * Martin Durant ([@martindurant](https://github.com/martindurant)), Anaconda, Inc.
 

From b867afa6e1bb76587015c3c69bc914300247ea8c Mon Sep 17 00:00:00 2001
From: Martin Durant <martindurant@users.noreply.github.com>
Date: Fri, 21 Oct 2022 09:04:07 -0400
Subject: [PATCH 4/4] Update draft/ZEP0003.md

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
---
 draft/ZEP0003.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/draft/ZEP0003.md b/draft/ZEP0003.md
index a976999..bd7b0b3 100644
--- a/draft/ZEP0003.md
+++ b/draft/ZEP0003.md
@@ -10,6 +10,7 @@ nav_order: 3
 
 Authors:
 * Martin Durant ([@martindurant](https://github.com/martindurant)), Anaconda, Inc.
+* Isaac Virshup ([@ivirshup](https://github.com/martindurant)), Helmholtz Munich
 
 Status: Draft