S3 errors in Pipeline examples for reading training data and artifact storage #596
Comments
/cc @jlewi can you page in the right folks here? We're blocked on using this until it's solved.
ack, I'll loop in some folks; but it sounds like the issue is actually outside Kubeflow and is in Apache Beam. You might want to repost the issue in the Apache Beam tracker, or in the TF Data Validation repo. If your goal is to use pipelines, did you consider trying some other example? Or creating a new one that doesn't use TFX.
It looks like this is a known issue with Apache Beam and has been open for a long time.
Resolving since this is an issue in Beam.
I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP, so most deployments will now have a problem.
At LEAST we should file it as a bug over there. Is the TFX team aware of the issue? @mameshini - would you mind filing a bug?
@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using an s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries can't handle S3 or Minio either. We can mount S3/GCS/Minio buckets as volumes and access them as a Posix file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.
+1! @vicaire I would suggest we reopen - we need a solution here.
Got it. Apologies. Looks like I misunderstood the issue. Reopening. Having things work with S3 is a high priority for us.
Note: Related to volume support: #801
@mameshini, we are looking forward to your example. Thanks!
I am following up on JIRA ticket https://issues.apache.org/jira/browse/BEAM-2572 and pushing progress on native S3 filesystem support in the Apache Beam Python SDK. I will give a hand if necessary, since Beam Python will use boto3, which will significantly simplify deployment. As an alternative, I use NFS as the shared storage and add a few arguments to the pipeline. Please check the example here: https://gist.github.com/Jeffwan/5ee66343e48cf52c08c4de98be98cc1d
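For readers who want to try the NFS route described above, a shared ReadWriteMany claim is the usual building block. This is only a sketch; the claim name, storage class, and size are placeholders that depend on the NFS provisioner available in your cluster:

```yaml
# Sketch only: name, storageClassName, and size are placeholders that
# depend on the NFS provisioner available in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pipeline-shared-nfs
spec:
  accessModes:
    - ReadWriteMany   # shared across all pipeline step pods
  storageClassName: nfs
  resources:
    requests:
      storage: 50Gi
```

Each pipeline step then mounts the claim at a common path and reads/writes ordinary local files instead of s3:// URIs.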
Thank you for the update @Jeffwan. NFS storage can be acceptable for demos, or even for real use cases where data sets are relatively small. But NFS has limitations in cost, performance, and scalability. When data sets exceed 100 GB, NFS becomes too expensive, slow, and hard to manage. Even when using managed NFS from cloud providers, we experienced very long delays when simply reading directories with a large number of files. The preferred approach would be to store data in object storage. We have explored several ways to mount object storage as a Posix-like file system, with support for S3, GCS, Minio, or Ceph. Based on our testing, Goofys had the best reliability and performance, and we decided to use it for creating PVs and PVCs for pipeline steps. Preparing to publish a working example soon.
@mameshini Nice, thanks for sharing. Indeed, NFS is not a very good option in the HPC/DL area. We internally use Lustre for deep learning training; it can be used as a data caching layer backed by S3 as the data repository, and it provides a Posix file system interface. Looking forward to your example!
Any recent progress on this? :-)
@rummens Please track status in the Apache Beam community: https://issues.apache.org/jira/browse/BEAM-2572 . There's a Python SDK dependency there to make the demo work with S3 natively.
@rummens At the same time, work is in progress on a library that allows mounting any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on the Goofys file system, and it will be ready as soon as next week. It allows various frameworks such as Beam and Keras to access data in S3 buckets via a POSIX interface. It supports S3, Minio, GCS, and Ceph.
Awesome, looking forward to it! Basically it will be a PV mounted in each pod that is backed by object storage under the hood?
@mameshini This looks very interesting. Do you have an example to share?
Can you point me to the documentation for how to create the PVC in the first place? I was only able to find the mounting of it.
@rummens You need to deploy the Flex plugin to enable goofys support for Kubernetes.
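As a rough illustration of what a goofys-backed volume might look like once the FlexVolume plugin is deployed: the driver name and option keys below are hypothetical and depend entirely on the specific plugin you install, so treat this as a shape sketch, not a working manifest.

```yaml
# Sketch only: the flexVolume driver name and option keys are hypothetical
# and must match the goofys FlexVolume plugin you actually deploy.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  flexVolume:
    driver: "example.com/goofys"              # hypothetical driver name
    options:
      bucket: "demo04kubeflow"                # bucket to mount
      endpoint: "http://minio-service:9000"   # in-cluster Minio endpoint
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-data-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  volumeName: s3-data
```

Pipeline steps would then mount `s3-data-claim` and see the bucket contents as ordinary files.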
Thanks, I will be looking out for the install instructions and see if it makes sense for us.
@mameshini Hi Mameshini, I am trying to store data from Kubeflow in MinIO, but I've run into a mounting issue. It seems that I need to modify the YAML file, but I do not know how. Do you have any example, or any hint on how to modify it, please?
I am experiencing the same issues as @AnnieWei58
@IronPan @mameshini Can you work together to check the possibility of enabling gcsfuse, goofys, or another S3 mounting system as part of the KFP deployment? The CUJ is:
S3 support was merged into Beam and it will be released in 2.19.0. Let's wait on the release and then we can update all the examples.
Apache Beam has been released with S3 support. The latest version is 2.20.0.
@gautamkmr Check #3185. The blocker is now on the TFX side.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
TFX v0.22.0 supports Apache Beam 2.21.0 with S3 support. https://github.com/tensorflow/tfx#compatible-versions
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-frozen
Is this issue still relevant? There are similar discussions going on elsewhere, such as here: #3405 This is a pretty important issue for us. We want to move all in-cluster persistent storage (MySQL DBs, MinIO) off to managed AWS services, as managing them in-cluster is becoming quite a pain.
+1, I think we should focus discussion on a single thread: #3405, where the latest discussions exist.
I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3-compatible) for storing training data and eval data, and for passing data from step to step in Argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"
The issue with s3:// protocol support seems to be specific to the TFDV/Apache Beam step. Beam does not seem to support S3 in its Python SDK. We are looking for a way to change the TFDV step to use local/attached storage.
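One workaround while Beam's Python SDK lacks native S3 support is to stage the object to local disk before handing TFDV a plain path. A minimal sketch of the path handling is below; the boto3/tfdv calls in the comment are illustrative only and are not taken from the notebook.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key) for use with a plain S3 client."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// URI: %s" % uri)
    return parsed.netloc, parsed.path.lstrip("/")

# With the pieces in hand, a step could fetch the object to local disk
# (for example with boto3 pointed at the Minio endpoint) and give TFDV a
# local path instead of the s3:// URI. Illustrative only:
#
#   bucket, key = split_s3_uri("s3://demo04kubeflow/data/train.csv")
#   s3 = boto3.client("s3", endpoint_url="http://minio-service:9000")
#   s3.download_file(bucket, key, "/tmp/train.csv")
#   stats = tfdv.generate_statistics_from_csv("/tmp/train.csv")
```

This sidesteps the Beam filesystem layer entirely, at the cost of an extra copy per step.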
Minio access parameters seem to be properly configured - the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation
The error occurs on reading or writing any files from the Minio buckets, and it's coming from TensorFlow/Beam tfdv.generate_statistics_from_csv():
Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'
PipelineTFX4.ipynb.txt
This same step worked fine when train.csv was stored in a GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
Minio credentials were provided as env variables to ContainerOp:
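The specific variables were not included above, so as an illustration only: the names below follow the common AWS credential conventions plus the `S3_*` variables that TensorFlow's S3 filesystem reads, and every value is a placeholder. In a real pipeline these would be attached to each ContainerOp rather than set in-process.

```python
import os

# Illustrative placeholders only: conventional AWS credential variables plus
# the S3_* variables read by TensorFlow's S3 filesystem. In a pipeline these
# would be set on each ContainerOp, not in-process like this.
minio_env = {
    "AWS_ACCESS_KEY_ID": "minio",                   # placeholder access key
    "AWS_SECRET_ACCESS_KEY": "minio123",            # placeholder secret key
    "AWS_REGION": "us-east-1",
    "S3_ENDPOINT": "minio-service.kubeflow:9000",   # in-cluster Minio service
    "S3_USE_HTTPS": "0",    # Minio demo setups usually serve plain HTTP
    "S3_VERIFY_SSL": "0",
}
os.environ.update(minio_env)
```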
This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and two log files from the pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log
tfx-taxi-cab-classification-pipeline-example-main.log
PipelineTFX4.ipynb.zip