[AIR][Data] Fix nyc_taxi_basic_processing notebook #26983
Conversation
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"(_get_read_tasks pid=9272) 2022-07-25 14:29:30,732\tINFO parquet_datasource.py:323 -- Parquet input size estimation took 28.51 seconds.\n" |
Is the notebook expected to run on the user's local machine? I feel this might be due to the network speed when reading S3 files onto a local machine.
I tried locally on my laptop, and I found that the time spent varied regardless of how large the files are.
Downsampled data used here (sampling took 6.54 seconds):
>>> ds = ray.data.read_parquet([
... "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
... "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_02_data.parquet"])
⚠️ The number of blocks in this dataset (2) limits its parallelism to 2 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
>>> (_get_read_tasks pid=70675) 2022-07-25 15:33:27,142 INFO parquet_datasource.py:323 -- Parquet input size estimation took 6.54 seconds.
Full data (sampling took within 5 seconds, so no printing):
>>> ds = ray.data.read_parquet([
... "s3://ursa-labs-taxi-data/2009/01/data.parquet",
... "s3://ursa-labs-taxi-data/2009/02/data.parquet"])
⚠️ The blocks of this dataset are estimated to be 2.0x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using `.repartition(n)` to increase the number of dataset blocks.
>>>
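For what it's worth, the small-parallelism warning on the downsampled files can be addressed the way it suggests; a minimal sketch, assuming the same two downsampled files and an arbitrary target of 10 blocks:

```python
import ray

# Read the two downsampled files (which yields only 2 blocks), then
# repartition so more CPU slots can process the dataset concurrently.
# The target of 10 blocks is arbitrary and only for illustration.
ds = ray.data.read_parquet([
    "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
    "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_02_data.parquet",
])
ds = ds.repartition(10)
```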
Sure, I can remove it.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Can we actually add a simple test, just to provide a unit test scaffold for this component?
# `s3://anonymous@bucket/data.parquet` with known filesystem, pyarrow
# returns `anonymous@bucket/data.parquet` as the resolved path by
# mistake.
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
Any reason to not do the more Pythonic instance check here?
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
if isinstance(filesystem, S3FileSystem) and "@" in resolved_path:
@clarkzinzow I looked into pyarrow's implementation a bit: it only lazily imports S3FileSystem, and nothing assumes it's always available, so I fell back to the string check. If Datasets can safely assume this module is always present, I can change it back to an instance check.
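For reference, the isinstance variant being discussed could avoid an eager dependency on the module by importing it lazily inside the check; a rough sketch (this guarded-import helper is my illustration, not code from the PR):

```python
def _is_s3_filesystem(filesystem) -> bool:
    # Import S3FileSystem only when the check actually runs, so environments
    # where pyarrow was built without S3 support still work. Note this still
    # pays S3FileSystem's import/initialization cost on first use, which is
    # why the string-based check was kept in the end.
    try:
        from pyarrow.fs import S3FileSystem
    except ImportError:
        return False
    return isinstance(filesystem, S3FileSystem)
```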
Ah yeah, the S3FileSystem does some expensive things (e.g. network calls) at initialization time, so the string check would probably be best. Thank you for looking into that!
Pyarrow's Python implementation ... makes me appreciate the Ray Dataset code a lot more now lol
# like `anonymous@bucket/data.parquet`
resolved_path = resolved_path.split("@")[-1]
else:
    resolved_path = filesystem.normalize_path(resolved_path)
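Reading the two fragments above together, the workaround amounts to roughly the following sketch (the wrapping function is my own framing; the PR's actual surrounding code may differ):

```python
def _normalize_resolved_path(filesystem, resolved_path: str) -> str:
    # pyarrow can resolve `s3://anonymous@bucket/data.parquet` to
    # `anonymous@bucket/data.parquet` when the filesystem is already known,
    # so strip everything up to and including the `@` in that case, and
    # only fall back to normalize_path() otherwise.
    if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
        resolved_path = resolved_path.split("@")[-1]
    else:
        resolved_path = filesystem.normalize_path(resolved_path)
    return resolved_path
```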
We should add a unit test covering this case to python/ray/data/tests/test_dataset_formats.py.
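A minimal test along those lines might look like the sketch below; the helper name `_resolve_paths_and_filesystem`, its import path, and the explicit region are assumptions on my part, not taken from this PR:

```python
import pyarrow.fs as fs

# Assumed location of the path-resolution helper touched in this PR.
from ray.data.datasource.file_based_datasource import _resolve_paths_and_filesystem


def test_resolve_s3_anonymous_path_with_known_filesystem():
    # With an explicitly provided S3 filesystem, the `anonymous@` credential
    # prefix must be stripped from the resolved path.
    s3_fs = fs.S3FileSystem(anonymous=True, region="us-west-2")
    paths, _ = _resolve_paths_and_filesystem(
        "s3://anonymous@bucket/data.parquet", filesystem=s3_fs
    )
    assert paths == ["bucket/data.parquet"]
```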
Great! Added.
LGTM!
LGTM, but someone from Data will need to approve.
LGTM with one minor comment to remove log. Thanks for the work @jiaodong!
The test failure on CI is legit: the Dataset tests skip a test if moto is not installed, which is the case in my laptop dev environment, and it's reproducible locally. Fixing these now.
The diff looks large, but it looks like we actually need these changes, since the meat of this notebook is to show end-to-end intermediate results.
Why are these changes needed?
The code is actually OK, but the raw data size is too large. So I downsampled the monthly data with a ratio of `0.1` and the full-year data with `0.01`, and uploaded them to AIR's public S3 bucket. Also slightly changed the full-year data description, since we're not dealing with huge data anymore, but kept the same paragraph about lazy reads.
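For reference, the downsampling described here can be reproduced with something like the sketch below; the file names, the pandas route, and the fixed random seed are illustrative assumptions, not the actual preprocessing script:

```python
import pandas as pd

# Downsample one monthly taxi file at a 0.1 ratio and write it back out as a
# smaller parquet file (illustrative only; reading from S3 requires an
# s3fs/pyarrow-backed pandas installation).
df = pd.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
df.sample(frac=0.1, random_state=0).to_parquet(
    "downsampled_2009_01_data.parquet", index=False
)
```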
Blocked by #27064
Related issue number
#26410
Closes #27064
Checks
I've run scripts/format.sh to lint the changes in this PR.