
S3 errors in Pipeline examples for reading training data and artifact storage #596

Closed

mameshini opened this issue Dec 27, 2018 · 35 comments

@mameshini

mameshini commented Dec 27, 2018

I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3 compatible) for storing training data, eval data, and to pass data from step to step in argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"

The issue with s3:// protocol support seems to be specific to the TFDV/Apache Beam step; Beam does not appear to support S3 in its Python SDK. We are looking for a way to change the TFDV step to use local/attached storage instead (a sketch of what we have in mind is at the end of this comment).

Minio access parameters seem to be properly configured - the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation

The error occurs on reading or writing any files in the Minio buckets, and it comes from TensorFlow/Beam tfdv.generate_statistics_from_csv():

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'

This same step worked fine when train.csv was stored in a GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio credentials were provided as environment variables to the ContainerOp (this snippet is from the notebook's step-builder helper and assumes the notebook's kfp.dsl and kubernetes.client imports):

return dsl.ContainerOp(
        name = step_name,
        image = DATAFLOW_TFDV_IMAGE,
        arguments = [
            '--csv-data-for-inference', inference_data,
            '--csv-data-to-validate', validation_data,
            '--column-names', column_names,
            '--key-columns', key_columns,
            '--project', project,
            '--mode', mode,
            '--output', validation_output,
        ],
        file_outputs = {
            'schema': '/schema.txt',
        }
    ).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_ENDPOINT', 
            value=S3_ENDPOINT, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ENDPOINT_URL', 
            value='https://{}'.format(S3_ENDPOINT), 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ACCESS_KEY_ID', 
            value=S3_ACCESS_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_SECRET_ACCESS_KEY', 
            value=S3_SECRET_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_REGION', 
            value='us-east-1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='BUCKET_NAME', 
            value='demo04kubeflow', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_USE_HTTPS', 
            value='1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_VERIFY_SSL', 
            value='1'
    ))

This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and the two log files from the pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log

tfx-taxi-cab-classification-pipeline-example-main.log

PipelineTFX4.ipynb.zip
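
For reference, a minimal sketch of the local-storage workaround we are considering for the TFDV step. It assumes boto3 is available in the TFDV image and that the same Minio endpoint and credentials passed to the ContainerOp above are present as environment variables; the bucket and key are the ones from the error above:

import os
import boto3
import tensorflow_data_validation as tfdv

# Connect to Minio using the same endpoint/credentials the ContainerOp receives.
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['AWS_ENDPOINT_URL'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# Copy the training CSV to local disk so Beam only ever sees a local path.
local_csv = '/tmp/train.csv'
s3.download_file('ml-pipeline-playground',
                 'tfx/taxi-cab-classification/train.csv',
                 local_csv)

# TFDV (and the Beam pipeline underneath it) now reads from the local filesystem.
stats = tfdv.generate_statistics_from_csv(data_location=local_csv)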

@mameshini mameshini changed the title Use Minio in Pipeline examples for reading training data and artifact storage S3 errors in Pipeline examples for reading training data and artifact storage Jan 1, 2019
@aronchick

/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.

@jlewi
Contributor

jlewi commented Jan 10, 2019

Ack, I'll loop in some folks; but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.

You might want to repost the issue in the Apache Beam JIRA:
https://issues.apache.org/jira/browse/BEAM-2500

Or in the TF Data Validation repo:
https://github.com/tensorflow/data-validation

If your goal is to use Pipelines, did you consider trying some other example, or creating a new one that doesn't use TFX?

@jlewi
Contributor

jlewi commented Jan 10, 2019

It looks like this is a known issue with Apache Beam and has been open for a long time.
https://issues.apache.org/jira/browse/BEAM-2572

@vicaire
Contributor

vicaire commented Mar 26, 2019

Resolving since this is an issue in Beam.

@vicaire vicaire closed this as completed Mar 26, 2019
@aronchick

I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP storage, so most deployments will now hit this problem.

@aronchick

At the very least we should file this as a bug over there. Are the TFX folks aware of the issue?

@mameshini - would you mind filing a bug?

@mameshini
Author

@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using the s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with. A rough sketch of the PVC side of this idea is below.
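
This is only a sketch, under the assumption that the s3fs/goofys provisioner exposes a storage class; the class name ("s3fs"), claim name, size, and namespace are placeholders rather than the final API:

from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() when running outside the cluster

# Claim a volume from the (hypothetical) "s3fs" storage class; the provisioner
# mounts the backing bucket and exposes it as a POSIX file system.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name='training-data'),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=['ReadWriteMany'],
        storage_class_name='s3fs',
        resources=client.V1ResourceRequirements(requests={'storage': '100Gi'}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace='kubeflow', body=pvc)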

@aronchick

+1!

@vicaire I would suggest we reopen - we need a solution here.

@vicaire
Contributor

vicaire commented Mar 27, 2019

Got it.

Apologies. Looks like I misunderstood the issue. Reopening. Having things working with S3 is a high priority for us.

@vicaire
Contributor

vicaire commented Mar 27, 2019

Note: Related to volume support: #801

@vicaire
Contributor

vicaire commented Mar 27, 2019

@mameshini, we are looking forward to your example. Thanks!

@Jeffwan
Member

Jeffwan commented Apr 12, 2019

I am following up on the JIRA ticket https://issues.apache.org/jira/browse/BEAM-2572 and pushing for native S3 filesystem support in the Apache Beam Python SDK. I will give a hand if necessary, since Beam Python will use boto3, which will significantly simplify deployment.

As an alternative, I use NFS as the shared storage and add a few arguments to make sure the tf-serving deployer works for non-GKE clusters.

Please check the example here: https://gist.github.com/Jeffwan/5ee66343e48cf52c08c4de98be98cc1d

Change list:

  1. Prepare NFS storage and copy all the files there to /taxi; create a PV and a PVC called efs-claim.
  2. Add the efs-claim storage volume to every container and remove the GCP secret env (see the sketch after this list).
  3. The deployer calls the Google metadata service to get the cluster name; to skip this step, pass --cluster-name so the model is persisted. --pvc-name is also necessary.
  4. The deployer uses the pod's service account to access the Kubernetes API server. Make sure this service account has the right RBAC setup; I assume it's kubeflow:default.
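
For change 2, a sketch of how each step could get the volume, assuming the KFP SDK version from the notebook where ContainerOp still exposes add_volume and add_volume_mount directly; the helper name and mount path are mine, not from the gist:

from kubernetes import client as k8sc

def attach_nfs(op):
    # Mount the pre-created 'efs-claim' PVC into a pipeline step.
    return op.add_volume(
        k8sc.V1Volume(
            name='efs-claim',
            persistent_volume_claim=k8sc.V1PersistentVolumeClaimVolumeSource(
                claim_name='efs-claim'),
        )
    ).add_volume_mount(
        k8sc.V1VolumeMount(mount_path='/mnt/nfs', name='efs-claim')
    )

# Applied to every step in place of the GCP secret, e.g. step = attach_nfs(step)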

@mameshini
Author

Thank you for the update @Jeffwan. NFS storage can be acceptable for demos, or even for real use cases where data sets are relatively small. But NFS has limitations in cost, performance, and scalability. When data sets exceed 100 GB, NFS becomes too expensive, slow, and hard to manage. Even when using managed NFS from cloud providers, we experienced very long delays when simply reading directories with a large number of files. The preferred approach would be to store data in object storage.

We have explored several ways to mount object storage as a POSIX-like file system, with support for S3, GCS, Minio, or Ceph. Based on our testing, Goofys had the best reliability and performance, and we decided to use it for creating PVs and PVCs for pipeline steps. We are preparing to publish a working example soon.

@Jeffwan
Member

Jeffwan commented Apr 12, 2019

@mameshini Nice, thanks for sharing. Indeed, NFS is not a very good option in the HPC/DL area. We internally use Lustre for deep learning training; it serves as a data caching layer, backed by S3 as the data repository, and provides a POSIX file system interface. Looking forward to your example!

@rummens

rummens commented Jun 6, 2019

Any recent progress on this? :-)

@Jeffwan
Member

Jeffwan commented Jun 6, 2019

@rummens Please track the status in the Apache Beam community: https://issues.apache.org/jira/browse/BEAM-2572. There is a Python SDK dependency there to make the demo work with S3 natively.

@mameshini
Author

@rummens In the meantime, work is in progress on a library that allows mounting any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on the Goofys file system and should be ready as soon as next week. It lets frameworks such as Beam and Keras access data in S3 buckets via a POSIX interface. It supports S3, Minio, GCS, and Ceph:
https://github.com/kahing/goofys
We are currently working on documentation and examples.

@rummens

rummens commented Jun 8, 2019

Awesome, looking forward to it! Basically, it will be a PV mounted in each pod that is backed by object storage under the hood?

@IronPan
Member

IronPan commented Jul 9, 2019

@mameshini This looks very interesting. Do you have an example to share?

@rummens

rummens commented Jul 10, 2019

Can you point me to the documentation on how to create the PVC in the first place? I was only able to find the mounting of it.
Thanks

@mameshini
Author

@rummens You need to deploy the Flex plugin to enable Goofys support in Kubernetes.
Then Goofys can be used to mount S3, GCS, or Minio buckets as PVs. Goofys only mounts a specific bucket, so you must provide the bucket option and pre-create the bucket for each volume. A PVC example is provided in kubeflow-extensions/storage/s3fs/test.yaml; the bucket name is "default". A rough sketch of what such a PV looks like is below.
Installation instructions for manual deployment of the Flex plugin will be added in a few days; automatic install already works on Agile Stacks.
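
This is only a sketch: the Flex driver name below is a placeholder for whatever the deployed Goofys Flex plugin actually registers, and the 'bucket' option must name a pre-created bucket (here the "default" bucket from test.yaml):

from kubernetes import client, config

config.load_kube_config()

# A PV served by the (placeholder-named) goofys Flex driver, mounting one bucket.
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name='goofys-default'),
    spec=client.V1PersistentVolumeSpec(
        capacity={'storage': '100Gi'},
        access_modes=['ReadWriteMany'],
        flex_volume=client.V1FlexVolumeSource(
            driver='example.com/goofys',      # placeholder driver name
            options={'bucket': 'default'},    # bucket must already exist
        ),
    ),
)
client.CoreV1Api().create_persistent_volume(body=pv)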

@rummens

rummens commented Jul 12, 2019

Thanks, I will be looking out for the install instructions and see if it makes sense for us.

@vicaire vicaire removed their assignment Jul 16, 2019
@AnnieWei58

AnnieWei58 commented Aug 19, 2019

@mameshini Hi, I am trying to store data from Kubeflow in MinIO, but I've run into a mounting issue. It seems that I need to modify the YAML file, but I do not know how. Do you have an example, or any hint on how to modify it, please?

@wdhorton

wdhorton commented Oct 9, 2019

I am experiencing the same issues as @AnnieWei58

@Ark-kun
Contributor

Ark-kun commented Oct 19, 2019

@IronPan @mameshini Can you work together to check the possibility of enabling gcsfuse, goofys, or another S3 mounting system as part of the KFP deployment?

The CUJ is:
When a user adds a volumeMount for a specific volume to a Pod, the volume mounts successfully and the files written to that volume appear in a GCS bucket configured by the cluster admin.

@Jeffwan
Member

Jeffwan commented Jan 21, 2020

S3 support was merged into Beam and will be released in 2.19.0. Let's wait for the release, and then we can update all the examples.
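
Once 2.19.0 ships, a read like the following should work directly against s3:// paths. This is a sketch; it assumes apache-beam is installed with its AWS extra and that credentials come from the standard AWS environment variables (a custom Minio endpoint may need extra configuration):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Read the training CSV straight from S3 and count its lines as a smoke test.
with beam.Pipeline(options=PipelineOptions()) as p:
    _ = (
        p
        | 'ReadTrainCsv' >> beam.io.ReadFromText(
            's3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')
        | 'CountLines' >> beam.combiners.Count.Globally()
        | 'Print' >> beam.Map(print)
    )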

@goswamig
Contributor

Apache Beam has been released with S3 support. The latest version is 2.20.0.

@Jeffwan
Member

Jeffwan commented Apr 15, 2020

@gautamkmr Check #3185. The blocker is now on the TFX side.

@stale

stale bot commented Jul 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 15, 2020
@PatrickXYS
Member

TFX v0.22.0 supports Apache Beam 2.21.0, which includes S3 support: https://github.com/tensorflow/tfx#compatible-versions
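
If that holds, the call that originally failed should, in principle, work unchanged against S3 once the sample images pick up TFDV >= 0.22. A sketch, assuming credentials are provided through the standard AWS environment variables:

import tensorflow_data_validation as tfdv

# With Beam >= 2.21 underneath, the s3:// filesystem should now resolve.
stats = tfdv.generate_statistics_from_csv(
    data_location='s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')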

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 19, 2020
@stale

stale bot commented Oct 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 18, 2020
@PatrickXYS
Member

/remove-frozen

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 18, 2020
@karlschriek

Is this issue still relevant? There are similar discussions going on elsewhere, such as #3405.

This is a pretty important issue for us. We want to move all in-cluster persistent storage (MySQL DBs, MinIO) to managed AWS services, as managing them in-cluster is becoming quite a pain.

@Bobgy
Contributor

Bobgy commented Jun 29, 2021

+1, I think we should focus the discussion on a single thread: #3405, where the latest discussion is happening.

@Bobgy Bobgy closed this as completed Jun 29, 2021
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
…low#596)

* This is a replacement for
  https://github.com/kubeflow/manifests/blob/master/hack/update-instance-labels.sh

* We don't want to assume all applications are at the same version
* This script makes it easier to set a different version for a specific
  application.
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024