JDASoftwareGroup/kartothek

Kartothek

License: MIT

Kartothek is a Python library to manage (create, read, update, delete) large amounts of tabular data in a blob store. It stores data as datasets, which it presents to the user as pandas DataFrames. A dataset is a collection of files with the same schema that reside in a blob store. Kartothek uses a metadata definition to handle these datasets efficiently. For distributed access to and manipulation of datasets, Kartothek offers a Dask interface.

Storing data distributed over multiple files in a blob store (S3, ABS, GCS, etc.) allows for a fast, cost-efficient and highly scalable data infrastructure. A downside of storing data solely in an object store is that the store itself gives little to no guarantees beyond the consistency of a single file. In particular, it cannot guarantee the consistency of your dataset as a whole. If we demand a consistent state of our dataset at all times, we need to track that state ourselves. Kartothek frees us from having to do this manually.
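To make the consistency problem concrete, here is a minimal sketch of what tracking dataset state by hand can look like. It is not Kartothek's implementation or metadata format: the helper names (`write_partition`, `commit_manifest`, `read_dataset`), the `manifest.json` layout and the use of a local temporary directory in place of a blob store are all assumptions made for illustration. The idea is that partition files are written first and only become part of the dataset once a single manifest file that lists them is replaced.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition(root: Path, name: str, rows: list) -> str:
    """Write one partition file and return its key relative to the root."""
    key = f"data/{name}.json"
    path = root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows))
    return key


def commit_manifest(root: Path, keys: list) -> None:
    """Publish a new dataset state by swapping in one manifest file."""
    tmp = root / "manifest.json.tmp"
    tmp.write_text(json.dumps({"partitions": sorted(keys)}))
    # os.replace swaps the file in a single step on a local filesystem;
    # a real blob store would rely on its single-object consistency instead.
    os.replace(tmp, root / "manifest.json")


def read_dataset(root: Path) -> list:
    """Read only the partitions that the manifest lists."""
    manifest_path = root / "manifest.json"
    if not manifest_path.exists():
        return []
    manifest = json.loads(manifest_path.read_text())
    rows = []
    for key in manifest["partitions"]:
        rows.extend(json.loads((root / key).read_text()))
    return rows


def demo() -> tuple:
    """Show that an uncommitted partition stays invisible to readers."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first = write_partition(root, "part-0", [{"id": 1}, {"id": 2}])
        commit_manifest(root, [first])

        # The second partition exists as a file but is not yet committed.
        second = write_partition(root, "part-1", [{"id": 3}])
        before = read_dataset(root)

        commit_manifest(root, [first, second])
        after = read_dataset(root)
        return before, after
```

Calling `demo()` returns the rows visible before and after the second commit. Readers that go through the manifest never see a half-written update, which is the kind of bookkeeping the paragraph above says Kartothek takes over.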

The kartothek.io module provides building blocks to create and modify these datasets in data pipelines. Kartothek handles I/O, tracks dataset partitions and selects subsets of data transparently.

Installation

Installers for the latest released version are available from the Python Package Index (PyPI) and from conda.

# Install with pip
pip install kartothek
# Install with conda
conda install -c conda-forge kartothek

What is a (real) Kartothek?

A Kartothek (or, in more modern terms, a Zettelkasten or Katalogkasten) is a tool to organize (high-level) information extracted from a source of information.