Evaluation of Preprocessing Methods of Sentinel-2 Data and their Impact on Traditional Empirical and Modern Machine Learning Based Satellite Derived Bathymetry Methods
The goal of this project is to evaluate the impact of different preprocessing methods for Sentinel-2 data products on traditional algorithms such as the Stumpf log-ratio method (Stumpf et al., 2003) in contrast to modern approaches such as LightGBM (Ke et al., 2017).
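For context, the Stumpf log-ratio method estimates depth from the ratio of log-transformed blue and green water reflectances. The sketch below illustrates the idea only; the scaling constant `n = 1000` and the coefficients `m1`, `m0` are placeholder values, not the ones fitted in this project's notebooks:

```python
import numpy as np

def stumpf_depth(blue, green, m1=1.0, m0=0.0, n=1000.0):
    """Stumpf log-ratio estimate: Z = m1 * ln(n*blue) / ln(n*green) - m0."""
    ratio = np.log(n * np.asarray(blue)) / np.log(n * np.asarray(green))
    return m1 * ratio - m0

# Toy reflectance values; in practice m1 and m0 are fitted against
# ground-truth depths from a bathymetry map.
blue = np.array([0.012, 0.010, 0.008])
green = np.array([0.015, 0.011, 0.007])
ratio_depth = stumpf_depth(blue, green, m1=10.0, m0=8.0)
```

LightGBM, in contrast, learns the reflectance-to-depth mapping directly from the training pixels instead of assuming this fixed functional form.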
As of now, the conference paper has not yet been published; a link will be added to this repository once it is available.
While I have tried to ensure that everything in this repository can be inspected and executed by interested readers, I am aware that newer Python and package versions will eventually break the project's code. The provided environment.yml file documents the exact versions of all dependencies used on my system. Each notebook is also documented so that the intention of every step is clear; even readers who want to migrate the complete analysis, or parts of it, to another language or environment should be able to do so without relying on the provided code.
The analysis covers three different areas:
- A section of shallow ocean water near the north-west corner of the Bahamas, BBox: (25.23467352, -78.43272685, 25.31877266, -78.23940804)
- A section of shallow ocean water off the west coast of Puerto Rico, BBox: (18.14442526, -67.24112119, 18.17335221, -67.18944271)
- Mille Lacs Lake in Minnesota, USA, BBox: (46.099296265601545, -93.83878319721899, 46.377612102131366, -93.44756526336063)
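The tuples above are ordered (lat_min, lon_min, lat_max, lon_max). A small helper like the following (an illustrative sketch, not part of the repository's code) can turn such a tuple into a WKT polygon in the lon/lat order most geospatial tools expect, e.g. for scene searches:

```python
def bbox_to_wkt(bbox):
    """Convert a (lat_min, lon_min, lat_max, lon_max) tuple, as listed
    above, into a closed WKT polygon in lon/lat coordinate order."""
    lat_min, lon_min, lat_max, lon_max = bbox
    corners = [(lon_min, lat_min), (lon_max, lat_min),
               (lon_max, lat_max), (lon_min, lat_max),
               (lon_min, lat_min)]  # repeat first corner to close the ring
    return "POLYGON((" + ", ".join(f"{lon} {lat}" for lon, lat in corners) + "))"

bahamas = (25.23467352, -78.43272685, 25.31877266, -78.23940804)
wkt = bbox_to_wkt(bahamas)
```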
The data used for this project consists of:
- Shapefiles for certain AOIs created in QGIS
- Bathymetry maps from various sources
- Sentinel-2 L1C scenes and derived L2A and Acolite products
The data needed to reproduce this analysis will be shared with the accompanying paper on Zenodo.
The bathymetry sources are:
- Mille Lacs Lake: Lakes Data for Minnesota Bathybase Entry
- Puerto Rico: Grid Export NOAA NCEI Data Viewer
- Bahamas: handed down from a previous project; the source reference is unfortunately lost.
This project was mainly executed on a laptop PC (Lenovo ThinkPad E14 Gen 2, Intel Core(TM) i7-1165G7, 32 GB RAM, Windows 10 21H2).
While the modelling notebooks in particular can make good use of additional CPU resources, a machine with lower specs should still be sufficient to repeat all processing steps. Windows users should be able to recreate the conda environment directly from the environment.yml file in this repository. Linux and macOS users will need to adapt the environment, as some transitive dependencies are currently locked at Windows-specific versions.
In the notebooks directory of this repository you will find numbered Jupyter notebooks, which can be subdivided into the following process steps:
- Bathymetry Map Preprocessing (00 - Puerto Rico, 01 - Bahamas, 02 - Mille Lacs Lake)
- Sentinel-2 Data Preprocessing and Dataset Merge (03 - Puerto Rico, 04 - Bahamas, 05 - Mille Lacs Lake)
- Stumpf Log-Regression Fitting and Evaluation (06 - Puerto Rico, 07 - Bahamas, 08 - Mille Lacs Lake)
- LightGBM Fitting and Evaluation (09 - Puerto Rico, 10 - Bahamas, 11 - Mille Lacs Lake)
Each notebook includes a detailed description of the current context and of each step taken. I have tried to document each notebook so that it can also be read in isolation; in some instances (especially when comparing results) I add references to other notebooks. If you wish to read a more condensed write-up of the project, please feel free to follow the link to my conference paper.
While working on this project I produced a fairly generic eolearn_extras module, containing eo-learn tasks that could be useful to others, as well as a less generic collection of helper code in the notebooks/sdb_utils directory. All the code is freely available under the MIT license. If you find any bugs or need further assistance, please don't hesitate to open an issue.
The general analysis approach can be seen in Figure 1. As both the traditional and the modern model are supervised learning algorithms, we need to provide ground-truth values for training. These values can be extracted from bathymetry maps, which represent the depth profile (or underwater topography) of areas of inland or ocean water. Two possible repositories are Bathybase and the National Oceanic and Atmospheric Administration's (NOAA) National Centers for Environmental Information (NCEI) bathymetry portal.
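Conceptually, pairing the imagery with the ground truth amounts to flattening a co-registered reflectance stack and depth raster into per-pixel training samples. The sketch below illustrates this with plain NumPy; the array layout, the helper name, and the nodata value of -9999.0 are illustrative assumptions, not the repository's actual implementation (which uses eo-learn):

```python
import numpy as np

def make_training_pairs(reflectance, depth, nodata=-9999.0):
    """Flatten a co-registered (bands, H, W) reflectance stack and an
    (H, W) depth raster into per-pixel (X, y) pairs, dropping pixels
    without a valid ground-truth depth."""
    bands, h, w = reflectance.shape
    X = reflectance.reshape(bands, -1).T          # shape (H*W, bands)
    y = depth.ravel()                             # shape (H*W,)
    valid = (y != nodata) & np.all(np.isfinite(X), axis=1)
    return X[valid], y[valid]

# Toy 4-band 2x2 scene with one nodata depth pixel.
refl = np.random.rand(4, 2, 2)
depth = np.array([[2.0, -9999.0], [5.5, 7.1]])
X, y = make_training_pairs(refl, depth)
```

The resulting (X, y) pairs can then feed either the Stumpf regression or a LightGBM model.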
For a given area of interest (AOI), which either covers the extent of the whole bathymetry map or a particular subsection, we search for Sentinel-2 scenes that contain the AOI at a time with no cloud obstruction and, in regions that experience low temperatures, no ice formation. Once a fitting scene is found, we download the complete Standard Archive Format for Europe (SAFE) archive and store it for further preprocessing. It is essential not to use partial downloads (e.g. with the sentinelsat Python package), because the subsequent preprocessing steps assume that the SAFE archives are complete.
In this project, two preprocessing methods for atmospheric correction are evaluated against the top-of-atmosphere (TOA) L1C product. One is the L2A product generated by the Sen2Cor processor (Main-Knorn et al., 2017); the other is the data product produced by the Acolite processor (Vanhellemont and Ruddick, 2016). Table 1 shows the exact version of the operating system (OS) used as well as the versions of the processors.
| Software | Version |
|---|---|
| Windows OS | 21H2 Build 19044.1706 |
| Sen2Cor | 2.10.01-win64 |
| Acolite | Generic Git - Hash dafc2d4bced4864f0bc111b9e0d3348ff16a5336 |

Table 1: Software used for executing the preprocessors
All further processing of the acquired raster images to create analysis-ready data (ARD) is done using the eo-learn framework. You can find a detailed description of all steps for data preprocessing, modelling and model evaluation in the notebooks folder of this repository.
Fig 1: General analysis approach