
[AIR][Data] Fix nyc_taxi_basic_processing notebook #26983

Merged (13 commits) on Jul 28, 2022

Conversation

jiaodong (Member) commented Jul 25, 2022

Why are these changes needed?

The code itself is fine, but the raw data size is too large. So I downsampled the monthly data with a ratio of 0.1 and the full-year data with a ratio of 0.01, and uploaded the results to AIR's public S3 bucket.

I also slightly changed the full-year data description, since we're no longer dealing with huge data, but kept the same paragraph about lazy reads.
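The PR doesn't show how the downsampling was actually done; a minimal sketch with pandas (the column name, row count, and random seed are purely illustrative) could look like:

```python
import numpy as np
import pandas as pd

# Stand-in for one month of the ursa-labs taxi data (the real frames
# come from parquet files on S3; this one is just for illustration).
df = pd.DataFrame({"fare_amount": np.random.default_rng(0).random(10_000)})

# Monthly data: keep 10% of the rows.
monthly = df.sample(frac=0.1, random_state=0)
# Full-year data: keep 1% of the rows.
yearly = df.sample(frac=0.01, random_state=0)

print(len(monthly), len(yearly))  # 1000 100
```

The sampled frames would then be written back out with `to_parquet` and uploaded to the public bucket.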

Blocked by #27064

Related issue number

#26410

Closes #27064

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Review comment on doc/source/data/examples/nyc_taxi_basic_processing.ipynb:
"name": "stderr",
"output_type": "stream",
"text": [
"(_get_read_tasks pid=9272) 2022-07-25 14:29:30,732\tINFO parquet_datasource.py:323 -- Parquet input size estimation took 28.51 seconds.\n"
Contributor

@ericl @c21 28.51 seconds for size estimation on a small dataset (50 MB on disk, 150 MB in memory) seems really high?

Contributor

Is the notebook expected to run on the user's local machine? I feel this might be due to the network speed of reading S3 files into the local machine.

I tried locally on my laptop, and I found the time spent varied regardless of file size.

Downsampled data used here (sampling took 6.54 seconds):

>>> ds = ray.data.read_parquet([
...     "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
...     "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_02_data.parquet"])
⚠️  The number of blocks in this dataset (2) limits its parallelism to 2 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
>>> (_get_read_tasks pid=70675) 2022-07-25 15:33:27,142	INFO parquet_datasource.py:323 -- Parquet input size estimation took 6.54 seconds.

Full data (sampling took within 5 seconds, so no printing):

>>> ds = ray.data.read_parquet([
...     "s3://ursa-labs-taxi-data/2009/01/data.parquet",
...     "s3://ursa-labs-taxi-data/2009/02/data.parquet"])
⚠️  The blocks of this dataset are estimated to be 2.0x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using `.repartition(n)` to increase the number of dataset blocks.
>>> 
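The `.repartition(n)` suggestion in the warnings above redistributes rows across n blocks so more tasks can run in parallel. A toy sketch of the idea (not Ray's actual implementation, which shuffles distributed blocks):

```python
def repartition(blocks, n):
    """Toy model of Dataset.repartition(n): flatten all rows, then
    re-split them as evenly as possible across n blocks."""
    rows = [row for block in blocks for row in block]
    base, extra = divmod(len(rows), n)
    out, i = [], 0
    for k in range(n):
        # The first `extra` blocks get one additional row each.
        step = base + (1 if k < extra else 0)
        out.append(rows[i:i + step])
        i += step
    return out

# Two uneven blocks become four roughly equal ones.
print([len(b) for b in repartition([[1, 2, 3], [4, 5]], 4)])  # [2, 1, 1, 1]
```

This is why the first warning recommends it: two blocks cap parallelism at two tasks, while four blocks allow four concurrent tasks.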

Contributor

@jiaodong - sorry for the trouble, could you help remove this log? We disable Parquet sampling by default in 2.0 for the sake of stability (#27034). So this log is not printed out now.

Member Author

Sure, I can remove it.

@jiaodong jiaodong added air tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jul 27, 2022
richardliaw (Contributor)

Can we actually add a simple test, just to provide a unit-test scaffold for this component?

# `s3://anonymous@bucket/data.parquet` with known filesystem, pyarrow
# returns `anonymous@bucket/data.parquet` as the resolved path by
# mistake.
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
Contributor

Any reason not to do the more Pythonic instance check here?

Suggested change
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
if isinstance(filesystem, S3FileSystem) and "@" in resolved_path:

jiaodong (Member Author) commented Jul 27, 2022

@clarkzinzow So I looked into pyarrow's implementation a bit: it only lazily imports S3FileSystem, and nothing assumes it's always available, so I fell back to the string check. If Dataset can safely assume the module is always present, I can change it back to an instance check.

Contributor

Ah yeah the S3FileSystem does some expensive things (e.g. network calls) at initialization time, so the string check would probably be best. Thank you for looking into that!

Member Author

Pyarrow's python implementation ... makes me appreciate ray dataset code a lot more now lol

# like `anonymous@bucket/data.parquet`
resolved_path = resolved_path.split("@")[-1]
else:
resolved_path = filesystem.normalize_path(resolved_path)
Contributor

We should add a unit test covering this case to python/ray/data/tests/test_dataset_formats.py.

Member Author

Great! Added.

jiaodong (Member Author) commented Jul 27, 2022

[Screenshot: CI check results, 2022-07-27 5:45 PM]

The latest commit only added unit-test coverage and ensured the tests all pass. The previous commit's CI is green on all Windows and Linux tests; the Mac build overall looks flaky and didn't finish.

=== update ===

The dataset unit tests actually skip if the dependencies aren't installed ... now they're finally running for real.

clarkzinzow (Contributor) left a comment

LGTM!

richardliaw (Contributor) left a comment

LGTM, but someone from Data will need to approve.

@jiaodong jiaodong added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jul 27, 2022
c21 (Contributor) left a comment

LGTM with one minor comment to remove log. Thanks for the work @jiaodong!

jiaodong (Member Author)

The test failure on CI is legit: the dataset tests skip a test if moto is not installed, which is the case in my laptop dev environment, and it's reproducible locally. Fixing these now.
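The skip behavior can be mirrored with a simple guard (a sketch using unittest; the actual Ray test suite uses pytest fixtures, not this exact code):

```python
import importlib.util
import unittest

# Only run S3-mocking tests when moto is importable, mirroring how the
# dataset tests silently skip on machines without moto installed.
HAS_MOTO = importlib.util.find_spec("moto") is not None


@unittest.skipUnless(HAS_MOTO, "moto not installed")
class TestMockedS3(unittest.TestCase):
    def test_read(self):
        # A real test would mock an S3 bucket with moto and call
        # ray.data.read_parquet against it; this is a placeholder.
        self.assertTrue(True)
```

The subtle failure mode described above is exactly this pattern: a skipped test looks like a pass locally, so the real failure only surfaces on CI where moto is installed.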

jiaodong (Member Author)

The diff looks large, but we actually need these changes, since the meat of this notebook is showing end-to-end intermediate results.
Feel free to take a look at the final notebook; I don't think it has much redundant info.

https://github.com/ray-project/ray/blob/9e5f7e63b099c05c975dcbbfab04b67fdcf30acd/doc/source/data/examples/nyc_taxi_basic_processing.ipynb

@jiaodong jiaodong added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jul 28, 2022
@richardliaw richardliaw merged commit 0dbb18a into ray-project:master Jul 28, 2022
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AIR][Dataset] Cannot read two parquet files with anonymous@
5 participants