[AIR][Data] Fix nyc_taxi_basic_processing notebook #26983
Conversation
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"(_get_read_tasks pid=9272) 2022-07-25 14:29:30,732\tINFO parquet_datasource.py:323 -- Parquet input size estimation took 28.51 seconds.\n" |
Is the notebook expected to run on the user's local machine? I feel this might be due to the network speed when reading S3 files onto a local machine.
I tried locally on my laptop, and I found that the time spent varied regardless of how large the files are.
Downsampled data used here (sampling took 6.54 seconds):
>>> ds = ray.data.read_parquet([
... "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
... "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_02_data.parquet"])
⚠️ The number of blocks in this dataset (2) limits its parallelism to 2 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
>>> (_get_read_tasks pid=70675) 2022-07-25 15:33:27,142 INFO parquet_datasource.py:323 -- Parquet input size estimation took 6.54 seconds.
Full data (sampling took within 5 seconds, so no printing):
>>> ds = ray.data.read_parquet([
... "s3://ursa-labs-taxi-data/2009/01/data.parquet",
... "s3://ursa-labs-taxi-data/2009/02/data.parquet"])
⚠️ The blocks of this dataset are estimated to be 2.0x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using `.repartition(n)` to increase the number of dataset blocks.
>>>
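For what it's worth, the small-parallelism warning on the downsampled files can be addressed the way it suggests; a minimal sketch, assuming the same two downsampled files and an arbitrary target of 10 blocks:

```python
import ray

# Read the two downsampled files (which yields only 2 blocks), then
# repartition so more CPU slots can process the dataset concurrently.
# The target of 10 blocks is arbitrary and only for illustration.
ds = ray.data.read_parquet([
    "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
    "s3://air-example-data-2/ursa-labs-taxi-data/downsampled_2009_02_data.parquet",
])
ds = ds.repartition(10)
```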
Sure, I can remove it.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Can we actually add a simple test, just to provide a unit test scaffold for this component?
# `s3://anonymous@bucket/data.parquet` with known filesystem, pyarrow
# returns `anonymous@bucket/data.parquet` as the resolved path by
# mistake.
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
Any reason to not do the more Pythonic instance check here?
if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
if isinstance(filesystem, S3FileSystem) and "@" in resolved_path:
@clarkzinzow I looked into pyarrow's implementation a bit: it only lazily imports S3FileSystem, and nothing assumes it's always available, so I fell back to the string check. If Datasets can safely assume this module is always present, I can change it back to an instance check.
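For reference, the isinstance variant being discussed could avoid an eager dependency on the module by importing it lazily inside the check; a rough sketch (this guarded-import helper is my illustration, not code from the PR):

```python
def _is_s3_filesystem(filesystem) -> bool:
    # Import S3FileSystem only when the check actually runs, so environments
    # where pyarrow was built without S3 support still work. Note this still
    # pays S3FileSystem's import/initialization cost on first use, which is
    # why the string-based check was kept in the end.
    try:
        from pyarrow.fs import S3FileSystem
    except ImportError:
        return False
    return isinstance(filesystem, S3FileSystem)
```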
Ah yeah, the S3FileSystem does some expensive things (e.g. network calls) at initialization time, so the string check would probably be best. Thank you for looking into that!
Pyarrow's Python implementation ... makes me appreciate the Ray Dataset code a lot more now lol
# like `anonymous@bucket/data.parquet`
resolved_path = resolved_path.split("@")[-1]
else:
    resolved_path = filesystem.normalize_path(resolved_path)
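Reading the two fragments above together, the workaround amounts to roughly the following sketch (the wrapping function is my own framing; the PR's actual surrounding code may differ):

```python
def _normalize_resolved_path(filesystem, resolved_path: str) -> str:
    # pyarrow can resolve `s3://anonymous@bucket/data.parquet` to
    # `anonymous@bucket/data.parquet` when the filesystem is already known,
    # so strip everything up to and including the `@` in that case, and
    # only fall back to normalize_path() otherwise.
    if type(filesystem).__name__ == "S3FileSystem" and "@" in resolved_path:
        resolved_path = resolved_path.split("@")[-1]
    else:
        resolved_path = filesystem.normalize_path(resolved_path)
    return resolved_path
```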
We should add a unit test covering this case to python/ray/data/tests/test_dataset_formats.py.
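A minimal test along those lines might look like the sketch below; the helper name `_resolve_paths_and_filesystem`, its import path, and the explicit region are assumptions on my part, not taken from this PR:

```python
import pyarrow.fs as fs

# Assumed location of the path-resolution helper touched in this PR.
from ray.data.datasource.file_based_datasource import _resolve_paths_and_filesystem


def test_resolve_s3_anonymous_path_with_known_filesystem():
    # With an explicitly provided S3 filesystem, the `anonymous@` credential
    # prefix must be stripped from the resolved path.
    s3_fs = fs.S3FileSystem(anonymous=True, region="us-west-2")
    paths, _ = _resolve_paths_and_filesystem(
        "s3://anonymous@bucket/data.parquet", filesystem=s3_fs
    )
    assert paths == ["bucket/data.parquet"]
```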
Great! Added.
LGTM!
LGTM, but someone from Data will need to approve.
LGTM with one minor comment to remove log. Thanks for the work @jiaodong!
The test failure on CI is legit: the Dataset tests skip a test if moto is not installed, which is the case in my laptop dev environment, and it's reproducible locally. Fixing these now.
The diff looks large, but it looks like we actually need these changes, since the meat of this notebook is to show end-to-end intermediate results.
Why are these changes needed?
The code is actually OK, but the raw data size is too large. So I downsampled the monthly data with a ratio of `0.1` and the full-year data with `0.01`, and uploaded them to AIR's public S3 bucket. Also slightly changed the full-year data description, since we're not dealing with huge data anymore, but kept the same paragraph about lazy reads.
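For reference, the downsampling described here can be reproduced with something like the sketch below; the file names, the pandas route, and the fixed random seed are illustrative assumptions, not the actual preprocessing script:

```python
import pandas as pd

# Downsample one monthly taxi file at a 0.1 ratio and write it back out as a
# smaller parquet file (illustrative only; reading from S3 requires an
# s3fs/pyarrow-backed pandas installation).
df = pd.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
df.sample(frac=0.1, random_state=0).to_parquet(
    "downsampled_2009_01_data.parquet", index=False
)
```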
Blocked by #27064
Related issue number
#26410
Closes #27064
Checks
I've run scripts/format.sh to lint the changes in this PR.