ARROW-14892: [Python][C++] GCS Bindings (#12763)
Incorporate the GCS file system into Python, along with other bug fixes.

Bugs/Other changes:
- Add GCS bindings in Python, mostly based on the AWS bindings, with associated unit tests.
- Fix `Tell`, which was incorrect: it double counted when the stream was constructed with an offset.
- Set the define in config.cmake. It had been missed, which meant `FileSystemFromUri` was never tested and didn't compile; this is now fixed.
- Refine the logic for GetFileInfo with a single path to recognize prefixes followed by a slash as a directory. This allows datasets to work as expected with a toy dataset generated on the local filesystem and copied to the cloud (I believe this is typical of how other systems write to GCS as well).
- Switch the convention for creating directories to always end in "/" and use this as another indicator. From testing with a sample Iceberg table, this appears to be the convention used for Hive partitioning, so I assume it is common practice for other Hive-related writers (i.e. what we want to support).
- Fix a bug introduced in a5e45ce that caused failures when a deletion occurred on a bucket (not an object in the bucket).
- Ensure output streams are closed on destruction (this is consistent with S3).
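
The directory rules described in the list above (a prefix followed by a slash counts as a directory, and created directories end in "/") can be illustrated with a minimal sketch. This is a simplified model over a flat set of object names, not Arrow's implementation; the function name `classify_path` and the return labels are hypothetical.

```python
from typing import Iterable


def classify_path(path: str, object_names: Iterable[str]) -> str:
    """Classify `path` against a flat listing of object names (hypothetical helper)."""
    names = set(object_names)
    path = path.rstrip("/")
    # An object stored under exactly this name is treated as a file.
    if path in names:
        return "file"
    # A "path/" marker object, or any object stored under "path/",
    # is taken as evidence that `path` is a directory.
    prefix = path + "/"
    if any(name.startswith(prefix) for name in names):
        return "directory"
    return "not_found"


# Illustrative listing: an explicit marker plus a Hive-style partition.
objects = {"data/", "data/year=2022/part-0.parquet", "readme.txt"}
print(classify_path("data", objects))
print(classify_path("data/year=2022", objects))
print(classify_path("readme.txt", objects))
```

In this model the marker object `data/` and the implicit prefix `data/year=2022/` are both recognized, which is why a dataset copied from a local filesystem can be discovered even when no marker objects were written.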
 


Lead-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
emkornfield and emkornfield authored Jun 12, 2022
1 parent f6c2751 commit 7b5912d
Showing 46 changed files with 772 additions and 103 deletions.
1 change: 1 addition & 0 deletions .github/workflows/python.yml
@@ -122,6 +122,7 @@ jobs:
ARROW_DATASET: ON
ARROW_FLIGHT: ON
ARROW_GANDIVA: ON
ARROW_GCS: OFF
ARROW_HDFS: ON
ARROW_JEMALLOC: ON
ARROW_ORC: ON
7 changes: 3 additions & 4 deletions .travis.yml
@@ -60,7 +60,6 @@ jobs:
DOCKER_RUN_ARGS: >-
"
-e ARROW_BUILD_STATIC=OFF
-e ARROW_GCS=OFF
-e ARROW_ORC=OFF
-e ARROW_USE_GLOG=OFF
-e CMAKE_UNITY_BUILD=ON
@@ -99,11 +98,11 @@ jobs:
-e ARROW_GCS=OFF
-e ARROW_MIMALLOC=OFF
-e ARROW_ORC=OFF
-e ARROW_SUBSTRAIT=OFF
-e ARROW_PARQUET=OFF
-e ARROW_S3=OFF
-e CMAKE_UNITY_BUILD=ON
-e ARROW_SUBSTRAIT=OFF
-e CMAKE_BUILD_PARALLEL_LEVEL=2
-e CMAKE_UNITY_BUILD=ON
-e PARQUET_BUILD_EXAMPLES=OFF
-e PARQUET_BUILD_EXECUTABLES=OFF
-e Protobuf_SOURCE=BUNDLED
@@ -154,8 +153,8 @@ jobs:
-e ARROW_PARQUET=OFF
-e ARROW_PYTHON=ON
-e ARROW_S3=OFF
-e CMAKE_UNITY_BUILD=ON
-e CMAKE_BUILD_PARALLEL_LEVEL=2
-e CMAKE_UNITY_BUILD=ON
-e PARQUET_BUILD_EXAMPLES=OFF
-e PARQUET_BUILD_EXECUTABLES=OFF
-e Protobuf_SOURCE=BUNDLED
1 change: 1 addition & 0 deletions appveyor.yml
@@ -57,6 +57,7 @@ environment:
# (as generated by cmake)
- JOB: "Toolchain"
GENERATOR: Ninja
ARROW_GCS: "ON"
ARROW_S3: "ON"
ARROW_BUILD_FLIGHT: "ON"
ARROW_BUILD_GANDIVA: "ON"
5 changes: 5 additions & 0 deletions ci/docker/conda-python.dockerfile
@@ -32,6 +32,11 @@ RUN mamba install -q \
nomkl && \
mamba clean --all

# XXX The GCS testbench was already installed in conda-cpp.dockerfile,
# but we changed the installed Python version above, so we need to reinstall it.
COPY ci/scripts/install_gcs_testbench.sh /arrow/ci/scripts
RUN /arrow/ci/scripts/install_gcs_testbench.sh default

ENV ARROW_PYTHON=ON \
ARROW_BUILD_STATIC=OFF \
ARROW_BUILD_TESTS=OFF \
2 changes: 2 additions & 0 deletions ci/docker/debian-11-cpp.dockerfile
@@ -83,6 +83,7 @@ ENV ARROW_BUILD_TESTS=ON \
ARROW_DEPENDENCY_SOURCE=SYSTEM \
ARROW_FLIGHT=ON \
ARROW_GANDIVA=ON \
ARROW_GCS=ON \
ARROW_HOME=/usr/local \
ARROW_ORC=ON \
ARROW_PARQUET=ON \
@@ -99,6 +100,7 @@ ENV ARROW_BUILD_TESTS=ON \
AWSSDK_SOURCE=BUNDLED \
CC=gcc \
CXX=g++ \
google_cloud_cpp_storage_SOURCE=BUNDLED \
ORC_SOURCE=BUNDLED \
PATH=/usr/lib/ccache/:$PATH \
Protobuf_SOURCE=BUNDLED
2 changes: 2 additions & 0 deletions ci/docker/fedora-35-cpp.dockerfile
@@ -77,6 +77,7 @@ ENV ARROW_BUILD_TESTS=ON \
ARROW_FLIGHT=ON \
ARROW_GANDIVA_JAVA=ON \
ARROW_GANDIVA=ON \
ARROW_GCS=ON \
ARROW_HOME=/usr/local \
ARROW_ORC=ON \
ARROW_PARQUET=ON \
@@ -92,6 +93,7 @@ ENV ARROW_BUILD_TESTS=ON \
AWSSDK_SOURCE=BUNDLED \
CC=gcc \
CXX=g++ \
google_cloud_cpp_storage_SOURCE=BUNDLED \
ORC_SOURCE=BUNDLED \
PARQUET_BUILD_EXECUTABLES=ON \
PARQUET_BUILD_EXAMPLES=ON \
1 change: 1 addition & 0 deletions ci/docker/linux-apt-docs.dockerfile
@@ -97,6 +97,7 @@ ENV ARROW_BUILD_STATIC=OFF \
ARROW_BUILD_TESTS=OFF \
ARROW_BUILD_UTILITIES=OFF \
ARROW_FLIGHT=ON \
ARROW_GCS=ON \
ARROW_GLIB_VALA=false \
ARROW_PYTHON=ON \
ARROW_S3=ON \
3 changes: 3 additions & 0 deletions ci/docker/python-wheel-manylinux-test.dockerfile
@@ -25,3 +25,6 @@ FROM ${arch}/python:${python}
# test dependencies in a docker image
COPY python/requirements-wheel-test.txt /arrow/python/
RUN pip install -r /arrow/python/requirements-wheel-test.txt

COPY ci/scripts/install_gcs_testbench.sh /arrow/ci/scripts/
RUN PYTHON=python /arrow/ci/scripts/install_gcs_testbench.sh default
1 change: 1 addition & 0 deletions ci/docker/ubuntu-20.04-cpp.dockerfile
@@ -96,6 +96,7 @@ RUN apt-get update -y -q && \
nlohmann-json3-dev \
pkg-config \
protobuf-compiler \
python3-dev \
python3-pip \
python3-rados \
rados-objclass-dev \
2 changes: 2 additions & 0 deletions ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -156,6 +156,7 @@ ENV ARROW_BUILD_TESTS=ON \
ARROW_FLIGHT=ON \
ARROW_FLIGHT_SQL=ON \
ARROW_GANDIVA=ON \
ARROW_GCS=ON \
ARROW_HDFS=ON \
ARROW_HOME=/usr/local \
ARROW_INSTALL_NAME_RPATH=OFF \
@@ -175,6 +176,7 @@ ENV ARROW_BUILD_TESTS=ON \
ARROW_WITH_ZLIB=ON \
ARROW_WITH_ZSTD=ON \
AWSSDK_SOURCE=BUNDLED \
google_cloud_cpp_storage_SOURCE=BUNDLED \
GTest_SOURCE=BUNDLED \
ORC_SOURCE=BUNDLED \
PARQUET_BUILD_EXAMPLES=ON \
22 changes: 18 additions & 4 deletions ci/scripts/install_gcs_testbench.sh
@@ -24,10 +24,24 @@ if [ "$#" -ne 1 ]; then
exit 1
fi

if [ "$(uname -m)" != "x86_64" ]; then
echo "GCS testbench won't install on non-x86 architecture"
exit 0
fi
case "$(uname -m)" in
aarch64|arm64|x86_64)
: # OK
;;
*)
echo "GCS testbench is installed only on x86 or arm architectures: $(uname -m)"
exit 0
;;
esac

case "$(uname -s)-$(uname -m)" in
Darwin-arm64)
# Workaround for https://github.com/grpc/grpc/issues/28387 .
# Build grpcio instead of using wheel.
# storage-testbench 0.16.0 pins grpcio to 1.44.0.
${PYTHON:-python3} -m pip install --no-binary :all: "grpcio==1.44.0"
;;
esac

version=$1
if [[ "${version}" -eq "default" ]]; then
1 change: 1 addition & 0 deletions ci/scripts/python_build.sh
@@ -58,6 +58,7 @@ export PYARROW_WITH_CUDA=${ARROW_CUDA:-OFF}
export PYARROW_WITH_DATASET=${ARROW_DATASET:-ON}
export PYARROW_WITH_FLIGHT=${ARROW_FLIGHT:-OFF}
export PYARROW_WITH_GANDIVA=${ARROW_GANDIVA:-OFF}
export PYARROW_WITH_GCS=${ARROW_GCS:-OFF}
export PYARROW_WITH_HDFS=${ARROW_HDFS:-ON}
export PYARROW_WITH_ORC=${ARROW_ORC:-OFF}
export PYARROW_WITH_PLASMA=${ARROW_PLASMA:-OFF}
2 changes: 2 additions & 0 deletions ci/scripts/python_test.sh
@@ -38,6 +38,7 @@ export ARROW_DEBUG_MEMORY_POOL=trap
: ${PYARROW_TEST_DATASET:=${ARROW_DATASET:-ON}}
: ${PYARROW_TEST_FLIGHT:=${ARROW_FLIGHT:-ON}}
: ${PYARROW_TEST_GANDIVA:=${ARROW_GANDIVA:-ON}}
: ${PYARROW_TEST_GCS:=${ARROW_GCS:-ON}}
: ${PYARROW_TEST_HDFS:=${ARROW_HDFS:-ON}}
: ${PYARROW_TEST_ORC:=${ARROW_ORC:-ON}}
: ${PYARROW_TEST_PARQUET:=${ARROW_PARQUET:-ON}}
@@ -47,6 +48,7 @@ export PYARROW_TEST_CUDA
export PYARROW_TEST_DATASET
export PYARROW_TEST_FLIGHT
export PYARROW_TEST_GANDIVA
export PYARROW_TEST_GCS
export PYARROW_TEST_HDFS
export PYARROW_TEST_ORC
export PYARROW_TEST_PARQUET
3 changes: 2 additions & 1 deletion ci/scripts/python_wheel_macos_build.sh
@@ -64,7 +64,7 @@ echo "=== (${PYTHON_VERSION}) Building Arrow C++ libraries ==="
: ${ARROW_DATASET:=ON}
: ${ARROW_FLIGHT:=ON}
: ${ARROW_GANDIVA:=OFF}
: ${ARROW_GCS:=OFF}
: ${ARROW_GCS:=ON}
: ${ARROW_HDFS:=ON}
: ${ARROW_JEMALLOC:=ON}
: ${ARROW_MIMALLOC:=ON}
@@ -148,6 +148,7 @@ export PYARROW_INSTALL_TESTS=1
export PYARROW_WITH_DATASET=${ARROW_DATASET}
export PYARROW_WITH_FLIGHT=${ARROW_FLIGHT}
export PYARROW_WITH_GANDIVA=${ARROW_GANDIVA}
export PYARROW_WITH_GCS=${ARROW_GCS}
export PYARROW_WITH_HDFS=${ARROW_HDFS}
export PYARROW_WITH_ORC=${ARROW_ORC}
export PYARROW_WITH_PARQUET=${ARROW_PARQUET}
3 changes: 2 additions & 1 deletion ci/scripts/python_wheel_manylinux_build.sh
@@ -51,7 +51,7 @@ echo "=== (${PYTHON_VERSION}) Building Arrow C++ libraries ==="
: ${ARROW_DATASET:=ON}
: ${ARROW_FLIGHT:=ON}
: ${ARROW_GANDIVA:=OFF}
: ${ARROW_GCS:=OFF}
: ${ARROW_GCS:=ON}
: ${ARROW_HDFS:=ON}
: ${ARROW_JEMALLOC:=ON}
: ${ARROW_MIMALLOC:=ON}
@@ -144,6 +144,7 @@ export PYARROW_INSTALL_TESTS=1
export PYARROW_WITH_DATASET=${ARROW_DATASET}
export PYARROW_WITH_FLIGHT=${ARROW_FLIGHT}
export PYARROW_WITH_GANDIVA=${ARROW_GANDIVA}
export PYARROW_WITH_GCS=${ARROW_GCS}
export PYARROW_WITH_HDFS=${ARROW_HDFS}
export PYARROW_WITH_ORC=${ARROW_ORC}
export PYARROW_WITH_PARQUET=${ARROW_PARQUET}
15 changes: 13 additions & 2 deletions ci/scripts/python_wheel_unix_test.sh
@@ -31,6 +31,7 @@ source_dir=${1}
: ${ARROW_FLIGHT:=ON}
: ${ARROW_SUBSTRAIT:=ON}
: ${ARROW_S3:=ON}
: ${ARROW_GCS:=ON}
: ${CHECK_IMPORTS:=ON}
: ${CHECK_UNITTESTS:=ON}
: ${INSTALL_PYARROW:=ON}
@@ -39,6 +40,7 @@ export PYARROW_TEST_CYTHON=OFF
export PYARROW_TEST_DATASET=ON
export PYARROW_TEST_FLIGHT=${ARROW_FLIGHT}
export PYARROW_TEST_GANDIVA=OFF
export PYARROW_TEST_GCS=${ARROW_GCS}
export PYARROW_TEST_HDFS=ON
export PYARROW_TEST_ORC=ON
export PYARROW_TEST_PANDAS=ON
@@ -69,6 +71,9 @@ import pyarrow.orc
import pyarrow.parquet
import pyarrow.plasma
"
if [ "${PYARROW_TEST_GCS}" == "ON" ]; then
python -c "import pyarrow._gcsfs"
fi
if [ "${PYARROW_TEST_S3}" == "ON" ]; then
python -c "import pyarrow._s3fs"
fi
@@ -81,8 +86,14 @@ import pyarrow.plasma
fi

if [ "${CHECK_UNITTESTS}" == "ON" ]; then
# Install testing dependencies
pip install -U -r ${source_dir}/python/requirements-wheel-test.txt
# Generally, we should install testing dependencies here to install
# built wheels without testing dependencies. Testing dependencies are
# installed in ci/docker/python-wheel-manylinux-test.dockerfile to
# reduce test time.
#
# We also need to update dev/tasks/python-wheels/*.yml when we need
# to add more steps to prepare testing dependencies.

# Execute unittest, test dependencies must be installed
python -c 'import pyarrow; pyarrow.create_library_symlinks()'
python -m pytest -r s --pyargs pyarrow
7 changes: 5 additions & 2 deletions cpp/src/arrow/filesystem/api.h
@@ -21,8 +21,11 @@

#include "arrow/filesystem/filesystem.h" // IWYU pragma: export
#include "arrow/filesystem/hdfs.h" // IWYU pragma: export
#include "arrow/filesystem/localfs.h" // IWYU pragma: export
#include "arrow/filesystem/mockfs.h" // IWYU pragma: export
#ifdef ARROW_GCS
#include "arrow/filesystem/gcsfs.h" // IWYU pragma: export
#endif
#include "arrow/filesystem/localfs.h" // IWYU pragma: export
#include "arrow/filesystem/mockfs.h" // IWYU pragma: export
#ifdef ARROW_S3
#include "arrow/filesystem/s3fs.h" // IWYU pragma: export
#endif
3 changes: 1 addition & 2 deletions cpp/src/arrow/filesystem/filesystem.cc
@@ -695,8 +695,7 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(const Uri& uri,
if (scheme == "gs" || scheme == "gcs") {
#ifdef ARROW_GCS
ARROW_ASSIGN_OR_RAISE(auto options, GcsOptions::FromUri(uri, out_path));
ARROW_ASSIGN_OR_RAISE(auto gcsfs, GcsFileSystem::Make(options, io_context));
return gcsfs;
return GcsFileSystem::Make(options, io_context);
#else
return Status::NotImplemented("Got GCS URI but Arrow compiled without GCS support");
#endif
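
For readers following the Python side, the scheme handling in this hunk can be modeled as a small dispatch table. The sketch below is an illustration under stated assumptions, not the Arrow code path: `backend_for_uri`, the backend labels, and the `gcs_enabled` flag (standing in for the `ARROW_GCS` compile-time define) are hypothetical. The scheme list is the one this commit documents for `FileSystemFromUri`, and the error text mirrors the `NotImplemented` message above.

```python
from urllib.parse import urlsplit

# Schemes documented for FileSystemFromUri, mapped to illustrative
# backend labels (the labels are assumptions made for this sketch).
SCHEME_TO_BACKEND = {
    "file": "local",
    "mock": "mock",
    "hdfs": "hdfs",
    "viewfs": "hdfs",
    "s3": "s3",
    "gs": "gcs",
    "gcs": "gcs",
}


def backend_for_uri(uri: str, gcs_enabled: bool = True) -> str:
    """Return the backend label for `uri` (hypothetical helper)."""
    scheme = urlsplit(uri).scheme
    if scheme not in SCHEME_TO_BACKEND:
        raise ValueError(f"Unrecognized filesystem scheme: {scheme!r}")
    backend = SCHEME_TO_BACKEND[scheme]
    # `gcs_enabled` plays the role of the ARROW_GCS define in the C++ code.
    if backend == "gcs" and not gcs_enabled:
        raise NotImplementedError(
            "Got GCS URI but Arrow compiled without GCS support"
        )
    return backend
```

Both `gs://` and `gcs://` resolve to the same backend here, matching the `scheme == "gs" || scheme == "gcs"` check in the hunk.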
6 changes: 4 additions & 2 deletions cpp/src/arrow/filesystem/filesystem.h
@@ -452,7 +452,8 @@ class ARROW_EXPORT SlowFileSystem : public FileSystem {

/// \brief Create a new FileSystem by URI
///
/// Recognized schemes are "file", "mock", "hdfs" and "s3fs".
/// Recognized schemes are "file", "mock", "hdfs", "viewfs", "s3",
/// "gs" and "gcs".
///
/// \param[in] uri a URI-based path, ex: file:///some/local/path
/// \param[out] out_path (optional) Path inside the filesystem.
@@ -463,7 +464,8 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string& uri,

/// \brief Create a new FileSystem by URI with a custom IO context
///
/// Recognized schemes are "file", "mock", "hdfs" and "s3fs".
/// Recognized schemes are "file", "mock", "hdfs", "viewfs", "s3",
/// "gs" and "gcs".
///
/// \param[in] uri a URI-based path, ex: file:///some/local/path
/// \param[in] io_context an IOContext which will be associated with the filesystem