-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14892: [Python][C++] GCS Bindings #12763
Conversation
|
CC @coryan for C++ changes. |
@@ -536,8 +546,7 @@ class GcsFileSystem::Impl { | |||
gcs::ReadFromOffset offset) { | |||
auto stream = client_.ReadObject(path.bucket, path.object, generation, offset); | |||
ARROW_GCS_RETURN_NOT_OK(stream.status()); | |||
return std::make_shared<GcsInputStream>(std::move(stream), path, gcs::Generation(), | |||
offset, client_); | |||
return std::make_shared<GcsInputStream>(std::move(stream), path, generation, client_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume offset
into this function is 1000. Without the offset
parameter passed to GcsInputStream
its Tell()
function will return 0 when you are in fact reading byte 1000. That seems like the wrong semantics to me, but maybe it is the expected behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirically the underlying tell returns 1000. I was observing doubling of expected tell value. The python test that found this wrote N bytes then seeked to N/2 and tried reading. The reading called the FS tell which returned N which caused zeo bytes to be read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the GCS code and it does appear to keep the ReadAt offset to return for tell but I might have missed something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield I'm curious this wasn't caught by the C++ tests. Is it possible to enhance the generic filesystem tests to cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry for the tell case we can certainly add a test case. Filed ARROW-16226 to track this.
when running docker locally the python tests seem to hang forever when attempting to reach GCS test_bench which explains the timeouts, not sure if this could be some sort of config missing for docker? (all tests pass when run without docker). This could also be a test-bench versioning issue. It turns out this was testbench not getting installed properly into the conda env. |
Also CC @jorisvandenbossche if you have time to look |
@github-actions crossbow submit -g python |
@github-actions crossbow submit -g wheel |
FWIW, @emkornfield I had to revert your last commit (608b6ec) in order to get tests to work. It seems if the testbench suite doesn't properly initialize, the GcsFileSystem tests hang until the ctest timeout. (This is using |
You may be running into postmanlabs/httpbin#673 which we worked around in googleapis/storage-testbench#301 . The latest release should have these fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @emkornfield . You'll find a bunch of comments below.
Also, can you ensure you rebase on the latest git master?
.github/workflows/cpp.yml
Outdated
@@ -128,6 +128,8 @@ jobs: | |||
ARROW_GANDIVA: ON | |||
ARROW_HDFS: ON | |||
ARROW_HOME: /usr/local | |||
# TODO(ARROW-16102): Enable this once we can figure out builds. | |||
ARROW_GCS: OFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Did you mean to add this? ARROW-16102 is fixed already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, ARROW-16102 was fixed in between this was posted for review and it got reviewed. this will be removed in the rebase.
ci/scripts/python_test.sh
Outdated
export PYARROW_TEST_HDFS | ||
export PYARROW_TEST_ORC | ||
export PYARROW_TEST_PARQUET | ||
export PYARROW_TEST_S3 | ||
|
||
# DO NOT SUBMIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this there? Is this PR actually ready? Did you perhaps forget to push some followup changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemed like an ugly hack. I was hoping someone could point me to a better solution. I will try removing after the rebase to see if the tests now pass, as the prior install command (install_gcs_testbench.sh) has been updated slightly at HEAD
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
struct GcsCredentials { | ||
explicit GcsCredentials(std::shared_ptr<google::cloud::Credentials> c) | ||
struct GcsCredentialsHolder { | ||
explicit GcsCredentialsHolder(std::shared_ptr<google::cloud::Credentials> c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this constructor is necessary, just let C++ define it implicitly for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needed for std::make_shared
python/pyarrow/tests/conftest.py
Outdated
try: | ||
proc = subprocess.Popen(args, env=env) | ||
except OSError: | ||
pytest.skip('`gcs test bench` command cannot be located') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this message? Is "gcs test bench" an actual command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most copy and past from s3, I rephrased slightly.
@pitrou thanks for the thorough review. I think I addressed most comments with the exception of the cleanup of directory logic, which I will try to address in a little bit (I left TODO place markers there). I've also rebased, and changed the destructor on OutputStream to close the file it is isn't closed. |
@pitrou I addressed the TODO's I left in there for simplification. Please let me know if the code is now easier to read, there was a bunch of superfluous code. I am still not sure why testbench needs to be installed twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @emkornfield !
dev/archery/archery/cli.py
Outdated
@@ -183,6 +183,8 @@ def _apply_options(cmd, options): | |||
@click.option("--with-r", default=None, type=BOOL, | |||
help="Build the Arrow R extensions. This is not a CMake option, " | |||
"it will toggle required options") | |||
@click.option("--with-gcs", default=None, type=BOOL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the option is duplicated now? (I'm curious that click doesn't complain about it)
|
||
/// Options for the GcsFileSystem implementation. | ||
struct ARROW_EXPORT GcsOptions { | ||
std::shared_ptr<GcsCredentials> credentials; | ||
GcsCredentials credentials; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan on solving this TODO here?
|
||
/// Options for the GcsFileSystem implementation. | ||
struct ARROW_EXPORT GcsOptions { | ||
std::shared_ptr<GcsCredentials> credentials; | ||
GcsCredentials credentials; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note a default-constructed credentials could trivially be anonymous...
ci/scripts/python_test.sh
Outdated
export PYARROW_TEST_HDFS | ||
export PYARROW_TEST_ORC | ||
export PYARROW_TEST_PARQUET | ||
export PYARROW_TEST_S3 | ||
|
||
# Without this, install_gcs_test_bench.sh doesn't seem to be put in a | ||
# place that the python env can find it. | ||
python -m pip install "https://github.com/googleapis/storage-testbench/archive/v0.16.0.tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment about this in conftest.py
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to check for the existence of the script and use the existing install comment. I tried multiple approaches with docker but could get the environment variable set.
ci/scripts/python_build.sh
Outdated
export PYARROW_WITH_HDFS=${ARROW_HDFS:-ON} | ||
export PYARROW_WITH_ORC=${ARROW_ORC:-OFF} | ||
export PYARROW_WITH_PLASMA=${ARROW_PLASMA:-OFF} | ||
export PYARROW_WITH_PARQUET=${ARROW_PARQUET:-OFF} | ||
export PYARROW_WITH_PARQUET_ENCRYPTION=${PARQUET_REQUIRE_ENCRYPTION:-ON} | ||
export PYARROW_WITH_GCS=${ARROW_GCS:-OFF} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this one is duplicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
@@ -50,6 +50,7 @@ namespace bp = boost::process; | |||
namespace gc = google::cloud; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for FileSystemFromUri somewhere in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a test to make sure it can instantiate the file system. there are detailed test for the FromUri implementations on GCS already.
.github/workflows/cpp.yml
Outdated
# ARROW_GCS: ON | ||
ARROW_GCS: OFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you revert this change because ARROW_GCS
is OFF
by default in ci/scripts/cpp_build.sh
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -45,6 +45,7 @@ | |||
#cmakedefine ARROW_IPC | |||
#cmakedefine ARROW_JSON | |||
|
|||
#cmakedefine ARROW_GCS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also update document for FileSystemFromUri
like https://github.com/apache/arrow/pull/12932/files#diff-82a2d9a8424dfb436344c0802dc36275cd0fb87e2e9eb6399745fba43799f7e5 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
dev/tasks/tasks.yml
Outdated
@@ -431,7 +431,7 @@ tasks: | |||
{############################## Wheel OSX ####################################} | |||
|
|||
# enable S3 support from macOS 10.13 so we don't need to bundle curl, crypt and ssl | |||
{% for macos_version, macos_codename, arrow_s3 in [("10.9", "mavericks", "OFF"), | |||
{% for macos_version, macos_codename, arrow_s3_gcs in [("10.9", "mavericks", "OFF"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a strong opinion for this but I like adding a new variable for gcs
for readability:
{% for macos_version, macos_codename, arrow_s3, arrow_gcs in [("10.9", "mavericks", "OFF", "OFF"),
("10.13", "high-sierra", "ON", "ON")] %}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
(I had a typo: "I have a strong opinion" -> "I don't have a strong opinion")
@github-actions crossbow submit -g wheel |
@github-actions crossbow submit wheel-macos-big-sur* |
Revision: 922e4ef Submitted crossbow builds: ursacomputing/crossbow @ actions-2176483478
|
Ok, the situation on our macOS wheel builds is a bit horrible, but it's pointless to try to improve it here, so I'll revert the last two commits. |
@github-actions crossbow submit wheel-macos-* |
Revision: 0f1f1da Submitted crossbow builds: ursacomputing/crossbow @ actions-041204a60b |
@github-actions crossbow submit -g nightly |
Revision: 0f1f1da Submitted crossbow builds: ursacomputing/crossbow @ actions-8c3920b06e |
We need to fix https://github.com/ursacomputing/crossbow/runs/6842267198?check_suite_focus=true
|
These seem unrelated to this PR? If so I can open up JIRA to track |
@kou Those seem unrelated to this PR. I need to do a last review pass and then this PR can be merged. |
@github-actions crossbow submit -g wheel--cp37- |
|
@github-actions crossbow submit wheel--cp37- |
Revision: b8336ef Submitted crossbow builds: ursacomputing/crossbow @ actions-d604965942
|
Ok, some JIRAs will probably have to be opened for the wheel test failures. |
Oh, sorry. |
for documentation, should i open a new issue? v excited about this :) the docs source: https://github.com/apache/arrow/blob/master/docs/source/python/filesystems.rst |
Yes, please. |
Incorporate GCS file system into python and other bug fixes.
Bugs/Other changes:
FileSystemFromUri was never tested and didn't compile this is now fixed