Merge pull request #65 from developmentseed/docs/local-run
Improve documentation, restrict dependency versions, switch to standard AWS S3 auth env variables
emileten authored Apr 18, 2024
2 parents d064cd8 + bb94376 commit 47c80ee
Showing 12 changed files with 236 additions and 262 deletions.
83 changes: 0 additions & 83 deletions CONTRIBUTING.md

This file was deleted.

83 changes: 83 additions & 0 deletions HAZARD.md
@@ -0,0 +1,83 @@
# Hazard
Creation of climate hazard model data sets for OS-Climate.

*Hazard* is a Python library for creating climate hazard model data sets for OS-Climate applications. The data sets may simply be on-boarded from existing data or derived by transforming other data sources.

An important use of *hazard* is the preparation of hazard model data sets for use in the [Physrisk](https://github.com/os-climate/physrisk) physical climate risk analysis tool. In general, the preparation takes the form of a pipeline whereby data is sourced, transformed and stored in optimized form (generally in OS-Climate's Amazon S3). It is desirable to leverage cloud computing services where tasks are memory-, I/O- and/or compute-intensive.

In line with the *'treat your data as code'* approach, and to ensure that the creation of any data set for OS-Climate is *repeatable* and *transparent*, a data set is associated with a particular Git commit of this repository.
A particular data set creation task is a Python script. These can be run in the [OS-Climate JupyterHub](https://jupyterhub-odh-jupyterhub.apps.odh-cl2.apps.os-climate.org) environment (as a script, notebook or pipeline).

## Introduction to data sets for hazard models
Hazards come in two varieties:
1. *Acute hazards*: **events**, for example heat waves, inundations (floods) and hurricanes, and
2. *Chronic hazards*: long-term shifts in climate parameters such as average temperature, sea-level or water stress indices.

See the [methodology document](https://github.com/os-climate/physrisk/tree/main/methodology#:~:text=PhysicalRiskMethodology.pdf) for more details.

Two important types of model used in the assessment of the vulnerability of an asset (natural or financial) to an acute hazard are:

1. *Event-based models*, where the model provides a large number of individual simulated events, actual or plausible, and
2. *Return-period-based models*, where the model instead provides the statistical properties of the ensemble of events.

### Acute hazard model data sets

Return-period-based acute hazard model data sets contain event intensity as a function of return period for different locations. For example, the model might specify that in a certain region flood events with an inundation depth of 50 cm occur with a return period of 10 years (i.e. these are one in 10 year events) and events with an inundation depth of 100 cm occur with a return period of 200 years. In practice, flood models may have a granularity of 10 return periods or more.

An inundation depth of 100 cm for events with a 200 year return period implies that there is a probability of $1/200$ that a flood event occurs in a single year with an inundation depth greater than 100 cm (see [methodology document](https://github.com/os-climate/physrisk/tree/main/methodology#:~:text=PhysicalRiskMethodology.pdf) for discussion of different return period conventions). The probability here is an *exceedance probability*.
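
To make the arithmetic concrete, a short sketch (treating exceedances in different years as independent, a common simplifying assumption):

```python
# Annual exceedance probability for a 200 year return period event
return_period = 200.0
annual_exceedance_prob = 1.0 / return_period  # 0.005

# Probability of at least one exceedance over a 30 year horizon,
# assuming independence between years (a simplifying assumption)
years = 30
prob_any = 1.0 - (1.0 - annual_exceedance_prob) ** years  # ~0.14
```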

The data set therefore has three dimensions (or axes): two spatial and one for return period.

### Chronic hazard model data sets

In contrast, chronic hazard model data sets only have the two spatial dimensions, under the convention that a single climate parameter is provided in each data set.

## Dataset creation/transformation: guidelines

[Xarrays](https://docs.xarray.dev/en/stable/) are the main containers used in creating or transforming data sets and are also used as an intermediate format when on-boarding data sets. [Dask is used](https://docs.xarray.dev/en/stable/user-guide/dask.html) to parallelize the calculations in cases where these are memory- or compute-intensive. Xarrays are chosen for convenience and performance when dealing with multidimensional data sets.
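
As an illustrative sketch of this pattern (the file name, variable name and chunk sizes are hypothetical, not part of this repository):

```python
import xarray as xr

# Open a NetCDF source lazily with Dask-backed chunks; nothing is loaded yet
ds = xr.open_dataset("tas_example.nc", chunks={"lat": 1000, "lon": 1000})

# Build a lazy computation graph for an annual-mean indicator
annual_mean = ds["tas"].groupby("time.year").mean()

# Trigger the (parallel) computation
result = annual_mean.compute()
```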

## Dataset storage format
In common with other types of geospatial data, hazard model data sets may be raster or vector, regions being determined by cells in the former case and, typically, by polygon boundaries in the latter.

### Raster data
For physical risk calculations, fast look-up of data for a large number (millions or more) of latitude and longitude pairs is required. In order to access large multidimensional raster data sets efficiently, the Zarr format is preferred. Zarr is a compressed, chunked format in which, in contrast to NetCDF4/HDF5 data, each chunk is a separate object in cloud object stores. This facilitates parallel read and write access, e.g. from a cluster of CPUs. The Zarr format is also convenient when using [xarrays](https://docs.xarray.dev/en/stable/), especially [with Dask](https://docs.xarray.dev/en/stable/user-guide/dask.html).
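
For example, a Zarr group in S3 can be opened through an [s3fs](https://s3fs.readthedocs.io/) mapping. A sketch follows; the bucket name is hypothetical, and credentials are taken from the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`):

```python
import s3fs
import zarr

# Map the S3 prefix to a key-value store that Zarr can read;
# credentials come from the standard AWS environment variables
fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/hazard/hazard.zarr", s3=fs)
root = zarr.open(store, mode="r")
```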

It is worth noting that use of a chunked format does not preclude federation of the data within a database (see for example the approach of [tileDB](https://tiledb.com/)).

### Raster chunk sizes and dimensions
As a guide to chunk size, [the Zarr team notes](https://zarr.readthedocs.io/en/stable/tutorial.html) that 'chunks of at least 1 megabyte (1M) uncompressed size seem to provide better performance, at least when using the Blosc compression library.' The [Amazon Best Practices for S3](https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.pdf) whitepaper moreover recommends making concurrent requests for byte ranges of an object at a granularity of 8–16 MB, in rough agreement with this.

For return-period-based data sets, the recommended dimensions are ('return period', 'latitude', 'longitude'). Each chunk should contain data for all return periods since this is needed for each latitude and longitude. For more efficient compression under the "C" (row-major) layout, return period is the first dimension.

```python
import zarr

# Create an empty Zarr array of return period data
# with 21600 latitudes and 43200 longitudes.
shape = (10, 21600, 43200)  # ('return period', 'latitude', 'longitude')
store = zarr.storage.MemoryStore()
root = zarr.open(store=store, mode="w")
z = root.create_dataset(
    "example_array_path",
    shape=shape,
    chunks=(shape[0], 1000, 1000),
    dtype="f4",
)
```

Note that each chunk contains all return period data for a spatial region of 1000×1000 cells.
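
To check this against the guidance above, a quick calculation against the array `z` just created (uncompressed sizes; the stored objects will typically be smaller after compression):

```python
import numpy as np

# Uncompressed bytes per chunk: 10 * 1000 * 1000 * 4 = 40 MB
chunk_bytes = np.prod(z.chunks) * z.dtype.itemsize
print(f"{chunk_bytes / 1e6:.0f} MB uncompressed per chunk")

# A point look-up: all return-period intensities for one cell sit in a
# single chunk, so this touches exactly one object in the store
intensities = z[:, 12000, 24000]  # shape (10,), one value per return period
```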

### Affine transforms
An important and common case for raster data sets is that the transform between geospatial coordinates (e.g. latitude and longitude) and raster cell location (e.g. array index) is *affine*. [Affine](https://pypi.org/project/affine/) is a convenient library for handling such transforms. Affine transforms are common in the metadata of GeoTIFF files, as handled by, for example, the [rasterio](https://rasterio.readthedocs.io/en/latest/api/rasterio.transform.html) package.
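
As a sketch (the grid parameters are illustrative): for a global grid of 43200×21600 cells of 1/120°, with the top-left corner at (-180°, 90°), the transform and its inverse can be used as follows:

```python
import math
from affine import Affine

# GeoTIFF-style transform: cell size 1/120 degree, origin at top-left
transform = Affine(1 / 120, 0.0, -180.0, 0.0, -1 / 120, 90.0)

# Geospatial -> raster: apply the inverse transform, then floor to indices
lon, lat = 2.35, 48.86
col, row = ~transform * (lon, lat)
j, i = math.floor(col), math.floor(row)

# Raster -> geospatial: the top-left corner of cell (i, j)
x, y = transform * (j, i)
```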

### Use with xarrays
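
A sketch of the intended usage, wrapping the Zarr array `z` created above as a Dask-backed `DataArray` (the dimension names are illustrative):

```python
import dask.array as da
import xarray as xr

# Wrap the Zarr array without loading it; Dask chunks follow the Zarr chunks
arr = da.from_zarr(z)
data = xr.DataArray(arr, dims=("return_period", "latitude", "longitude"))
```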

### Conventions for storage of Zarr arrays

The root for hazard Zarr arrays in an S3 bucket is 'hazard' or 'hazard_test' (for testing).
Within this, Zarr arrays are stored in the Zarr group hazard/hazard.zarr.

The convention for paths to Zarr arrays is:
hazard/hazard.zarr/`<path to array>`/`<version>`/`<array name>`

Arrays are typically instances of models.
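
A purely illustrative instance of the convention (the path components are hypothetical, not a reference to an actual array):

```
hazard/hazard.zarr/inundation/wri/v2/inundation_river_rcp8p5_2050
```
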
102 changes: 35 additions & 67 deletions README.md
@@ -1,84 +1,52 @@
# Hazard
Creation of climate hazard model data sets for OS-Climate.
# Quick start

## Installation

Clone the repository:

```
git clone git@github.com:os-climate/hazard.git
cd hazard
```

Then use either `pdm` (recommended):

```
pip install pdm
pdm config venv.with_pip True
pdm install
```

Or `virtualenv`:

```
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

## Usage

The package exposes a command line interface. For example, the following commands run a cut-down version of a "days above temperature" indicator and write the output to `$HOME/hazard_example`:

```
source .venv/bin/activate
mkdir -p $HOME/hazard_example
os_climate_hazard days_tas_above_indicator --store $HOME/hazard_example
```
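
The CLI appears to be built with [Python Fire](https://github.com/google/python-fire) (`fire` is among the package dependencies), so other indicator tasks should be discoverable from the generated help:

```
os_climate_hazard --help
```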

# Contributing

Patches may be contributed via pull requests from forks to
https://github.com/os-climate/hazard.

All changes must pass the automated test suite, along with various static checks.

The easiest way to run these is via:
```
pdm run all
```

# Hazard modelling

For more modelling-specific information, see `HAZARD.md`.
2 changes: 1 addition & 1 deletion pdm.lock

Some generated files are not rendered by default.

46 changes: 23 additions & 23 deletions pyproject.toml
@@ -30,29 +30,29 @@ classifiers = [
]

dependencies = [
-    "cftime",
-    "dask[distributed]",
-    "fire",
-    "fsspec",
-    "geopandas",
-    "h5netcdf",
-    "mapbox",
-    "matplotlib",
-    "mercantile",
-    "mkdocs",
-    "numpy",
-    "python-dotenv",
-    "pyproj",
-    "pydantic",
-    "pymdown-extensions",
-    "rasterio",
-    "rioxarray",
-    "seaborn",
-    "shapely",
-    "s3fs",
-    "xarray",
-    "xclim",
-    "zarr",
+    "cftime>=1.6.3,<2.0.0",
+    "dask[distributed]>=2023.5.0,<2023.6.0",
+    "fire>=0.6.0,<1.0.0",
+    "fsspec>=2024.3.1,<2024.4.0",
+    "geopandas>=0.13.2,<1.0.0",
+    "h5netcdf>=1.1.0,<2.0.0",
+    "mapbox>=0.18.1,<1.0.0",
+    "matplotlib>=3.7.5,<4.0.0",
+    "mercantile>=1.2.1,<2.0.0",
+    "mkdocs>=1.5.3,<2.0.0",
+    "numpy>=1.24.4,<2.0.0",
+    "python-dotenv>=1.0.1,<2.0.0",
+    "pyproj>=3.5.0,<4.0.0",
+    "pydantic>=2.6.4,<3.0.0",
+    "pymdown-extensions>=10.7.1,<11.0.0",
+    "rasterio>=1.3.9,<2.0.0",
+    "rioxarray>=0.13.4,<1.0.0",
+    "seaborn>=0.13.2,<1.0.0",
+    "shapely>=2.0.3,<3.0.0",
+    "s3fs>=2024.3.1,<2024.4.0",
+    "xarray>=2023.1.0,<2023.2.0",
+    "xclim>=0.47.0,<1.0.0",
+    "zarr>=2.16.1,<3.0.0",
]

[project.urls]
