Skip to content

Latest commit

 

History

History
166 lines (116 loc) · 5.86 KB

File metadata and controls

166 lines (116 loc) · 5.86 KB

Using CrateDB with turbodbc

About

This section of the documentation describes how to connect to CrateDB with turbodbc, by providing a few example programs.

The examples use the unixODBC implementation of ODBC, and the PostgreSQL ODBC driver, for connecting to the PostgreSQL wire protocol interface of CrateDB.

This folder also contains Dockerfile files providing environments to exercise the examples on different operating systems, like Arch Linux, Red Hat (CentOS), Debian, and SUSE Linux.

Introduction

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. In addition to complying with the Python Database API Specification 2.0, turbodbc offers built-in NumPy and Apache Arrow support for improved performance. Their slogan is:

Don’t wait minutes for your results, just blink.

Note: The description texts have been taken from turbodbc's documentation 1:1.

Description

Its primary target audience are data scientists that use databases for which no efficient native Python drivers are available.

For maximum compatibility, turbodbc complies with the Python Database API Specification 2.0 (PEP 249). For maximum performance, turbodbc internally relies on batched data transfer instead of single-record communication as other popular ODBC modules do.

Why should I use turbodbc instead of other ODBC modules?

  • Short answer: turbodbc is faster.
  • Slightly longer answer: turbodbc is faster, much faster if you want to work with NumPy.
  • Medium-length answer: The author has tested turbodbc and pyodbc (probably the most popular Python ODBC module) with various databases (Exasol, PostgreSQL, MySQL) and corresponding ODBC drivers. He found turbodbc to be consistently faster.

Smooth. What is the trick?

Turbodbc exploits buffering.

  • Turbodbc implements both sending parameters and retrieving result sets using buffers of multiple rows/parameter sets. This avoids round trips to the ODBC driver and (depending how well the ODBC driver is written) to the database.
  • Multiple buffers are used for asynchronous I/O. This allows to interleave Python object conversion and direct database interaction (see performance options below).
  • Buffers contain binary representations of data. NumPy arrays contain binary representations of data. Good thing they are often the same, so instead of converting, the driver can just copy data.

Setup

Install prerequisites

Arch Linux:

# See `dockerfiles/archlinux.Dockerfile`.

CentOS Stream:

dnf install --enablerepo=crb -y boost-devel g++ postgresql-odbc python3 python3-devel python3-pip unixODBC-devel

Debian:

apt-get install --yes build-essential libboost-dev odbc-postgresql unixodbc-dev

macOS/Homebrew:

brew install psqlodbc unixodbc

SUSE Linux Enterprise Server:

# See `dockerfiles/sles.Dockerfile`.

Install Python sandbox

# Create Python virtualenv and install dependency packages.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade --requirement=requirements-prereq.txt
pip install --upgrade --requirement=requirements.txt --verbose

Note

The turbodbc pip installation documentation says: Please pip install numpy before installing turbodbc, because turbodbc will search for the numpy Python package at installation/compile time. If NumPy is not installed, turbodbc will not compile the NumPy support features. Similarly, please pip install pyarrow before installing turbodbc if you would like to use the Apache Arrow support.

Usage

Run CrateDB:

docker run --rm -it --publish=4200:4200 --publish=5432:5432 crate \
    -Cdiscovery.type=single-node -Ccluster.routing.allocation.disk.threshold_enabled=false

Invoke demo program on workstation:

python demo.py

Exercise demo program using Docker, on different operating systems:

docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/archlinux.Dockerfile .
docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/centos.Dockerfile .
docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/debian.Dockerfile .
docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/sles.Dockerfile .

docker run --rm -it --volume=$(pwd):/src --network=host local/python-turbodbc-demo python3 /src/demo.py

Backlog

The patch just contains a basic example within demo.py. Advanced usage examples to be exercised are tracked within the backlog.