
S3 errors in Pipeline examples for reading training data and artifact storage #596

Closed

mameshini opened this issue Dec 27, 2018 · 35 comments

@mameshini

mameshini commented Dec 27, 2018

I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3 compatible) for storing training data, eval data, and to pass data from step to step in argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"

The issue with s3:// protocol support seems to be specific to the TFDV/Apache Beam step; Beam does not appear to support S3 in its Python SDK. We are looking for a way to change the TFDV step to use local/attached storage instead (a sketch of what we have in mind is at the end of this comment).

Minio access parameters seem to be properly configured - the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation

The error occurs on reading or writing any files in the Minio buckets, and it comes from TensorFlow/Beam tfdv.generate_statistics_from_csv():

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'

This same step worked fine when train.csv was stored in a GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio credentials were provided as environment variables to the ContainerOp (this snippet is from the notebook's step-builder helper and assumes the notebook's kfp.dsl and kubernetes.client imports):

return dsl.ContainerOp(
        name = step_name,
        image = DATAFLOW_TFDV_IMAGE,
        arguments = [
            '--csv-data-for-inference', inference_data,
            '--csv-data-to-validate', validation_data,
            '--column-names', column_names,
            '--key-columns', key_columns,
            '--project', project,
            '--mode', mode,
            '--output', validation_output,
        ],
        file_outputs = {
            'schema': '/schema.txt',
        }
    ).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_ENDPOINT', 
            value=S3_ENDPOINT, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ENDPOINT_URL', 
            value='https://{}'.format(S3_ENDPOINT), 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ACCESS_KEY_ID', 
            value=S3_ACCESS_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_SECRET_ACCESS_KEY', 
            value=S3_SECRET_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_REGION', 
            value='us-east-1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='BUCKET_NAME', 
            value='demo04kubeflow', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_USE_HTTPS', 
            value='1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_VERIFY_SSL', 
            value='1'
    ))

This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and the two log files from the pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log

tfx-taxi-cab-classification-pipeline-example-main.log

PipelineTFX4.ipynb.zip
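
For reference, a minimal sketch of the local-storage workaround we are considering for the TFDV step. It assumes boto3 is available in the TFDV image and that the same Minio endpoint and credentials passed to the ContainerOp above are present as environment variables; the bucket and key are the ones from the error above:

import os
import boto3
import tensorflow_data_validation as tfdv

# Connect to Minio using the same endpoint/credentials the ContainerOp receives.
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['AWS_ENDPOINT_URL'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# Copy the training CSV to local disk so Beam only ever sees a local path.
local_csv = '/tmp/train.csv'
s3.download_file('ml-pipeline-playground',
                 'tfx/taxi-cab-classification/train.csv',
                 local_csv)

# TFDV (and the Beam pipeline underneath it) now reads from the local filesystem.
stats = tfdv.generate_statistics_from_csv(data_location=local_csv)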

@mameshini mameshini changed the title Use Minio in Pipeline examples for reading training data and artifact storage S3 errors in Pipeline examples for reading training data and artifact storage Jan 1, 2019
@aronchick

/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.

@jlewi
Contributor

jlewi commented Jan 10, 2019

Ack, I'll loop in some folks; but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.

You might want to repost the issue in the Apache Beam JIRA:
https://issues.apache.org/jira/browse/BEAM-2500

Or in the TF Data Validation repo:
https://github.com/tensorflow/data-validation

If your goal is to use Pipelines, did you consider trying some other example, or creating a new one that doesn't use TFX?

@jlewi
Contributor

jlewi commented Jan 10, 2019

It looks like this is a known issue with Apache Beam and has been open for a long time.
https://issues.apache.org/jira/browse/BEAM-2572

@vicaire
Contributor

vicaire commented Mar 26, 2019

Resolving since this is an issue in Beam.

@vicaire vicaire closed this as completed Mar 26, 2019
@aronchick

I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP storage, so most deployments will now hit this problem.

@aronchick

At the very least we should file this as a bug over there. Are the TFX folks aware of the issue?

@mameshini - would you mind filing a bug?

@mameshini
Author

@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using the s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with. A rough sketch of the PVC side of this idea is below.
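
This is only a sketch, under the assumption that the s3fs/goofys provisioner exposes a storage class; the class name ("s3fs"), claim name, size, and namespace are placeholders rather than the final API:

from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() when running outside the cluster

# Claim a volume from the (hypothetical) "s3fs" storage class; the provisioner
# mounts the backing bucket and exposes it as a POSIX file system.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name='training-data'),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=['ReadWriteMany'],
        storage_class_name='s3fs',
        resources=client.V1ResourceRequirements(requests={'storage': '100Gi'}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace='kubeflow', body=pvc)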

@aronchick

+1!

@vicaire I would suggest we reopen - we need a solution here.

@vicaire
Contributor

vicaire commented Mar 27, 2019

Got it.

Apologies. Looks like I misunderstood the issue. Reopening. Having things working with S3 is a high priority for us.

@vicaire
Contributor

vicaire commented Mar 27, 2019

Note: Related to volume support: #801

@vicaire
Contributor

vicaire commented Mar 27, 2019

@mameshini, we are looking forward to your example. Thanks!

@Jeffwan
Member

Jeffwan commented Apr 12, 2019

I am following up on the JIRA ticket https://issues.apache.org/jira/browse/BEAM-2572 and pushing for native S3 filesystem support in the Apache Beam Python SDK. I will give a hand if necessary, since Beam Python will use boto3, which will significantly simplify deployment.

As an alternative, I use NFS as the shared storage and add a few arguments to make sure the tf-serving deployer works for non-GKE clusters.

Please check the example here: https://gist.github.com/Jeffwan/5ee66343e48cf52c08c4de98be98cc1d

Change list:

  1. Prepare NFS storage and copy all the files there to /taxi; create a PV and a PVC called efs-claim.
  2. Add the efs-claim storage volume to every container and remove the GCP secret env (see the sketch after this list).
  3. The deployer calls the Google metadata service to get the cluster name; to skip this step, pass --cluster-name so the model is persisted. --pvc-name is also necessary.
  4. The deployer uses the pod's service account to access the Kubernetes API server. Make sure this service account has the right RBAC setup; I assume it's kubeflow:default.
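
For change 2, a sketch of how each step could get the volume, assuming the KFP SDK version from the notebook where ContainerOp still exposes add_volume and add_volume_mount directly; the helper name and mount path are mine, not from the gist:

from kubernetes import client as k8sc

def attach_nfs(op):
    # Mount the pre-created 'efs-claim' PVC into a pipeline step.
    return op.add_volume(
        k8sc.V1Volume(
            name='efs-claim',
            persistent_volume_claim=k8sc.V1PersistentVolumeClaimVolumeSource(
                claim_name='efs-claim'),
        )
    ).add_volume_mount(
        k8sc.V1VolumeMount(mount_path='/mnt/nfs', name='efs-claim')
    )

# Applied to every step in place of the GCP secret, e.g. step = attach_nfs(step)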

@mameshini
Author

Thank you for the update @Jeffwan. NFS storage can be acceptable for demos, or even for real use cases where data sets are relatively small. But NFS has limitations in cost, performance, and scalability. When data sets exceed 100 GB, NFS becomes too expensive, slow, and hard to manage. Even when using managed NFS from cloud providers, we experienced very long delays when simply reading directories with a large number of files. The preferred approach would be to store data in object storage.

We have explored several ways to mount object storage as a POSIX-like file system, with support for S3, GCS, Minio, or Ceph. Based on our testing, Goofys had the best reliability and performance, and we decided to use it for creating PVs and PVCs for pipeline steps. We are preparing to publish a working example soon.

@Jeffwan
Member

Jeffwan commented Apr 12, 2019

@mameshini Nice, thanks for sharing. Indeed, NFS is not a very good option in the HPC/DL area. We internally use Lustre for deep learning training; it serves as a data caching layer, backed by S3 as the data repository, and provides a POSIX file system interface. Looking forward to your example!

@rummens

rummens commented Jun 6, 2019

Any recent progress on this? :-)

@Jeffwan
Member

Jeffwan commented Jun 6, 2019

@rummens Please track the status in the Apache Beam community: https://issues.apache.org/jira/browse/BEAM-2572. There is a Python SDK dependency there to make the demo work with S3 natively.

@mameshini
Author

@rummens In the meantime, work is in progress on a library that allows mounting any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on the Goofys file system and should be ready as soon as next week. It lets frameworks such as Beam and Keras access data in S3 buckets via a POSIX interface. It supports S3, Minio, GCS, and Ceph:
https://github.com/kahing/goofys
We are currently working on documentation and examples.

@rummens

rummens commented Jun 8, 2019

Awesome, looking forward to it! Basically, it will be a PV mounted in each pod that is backed by object storage under the hood?

@IronPan
Member

IronPan commented Jul 9, 2019

@mameshini This looks very interesting. Do you have an example to share?

@rummens

rummens commented Jul 10, 2019

Can you point me to the documentation on how to create the PVC in the first place? I was only able to find the mounting of it.
Thanks

@mameshini
Author

@rummens You need to deploy the Flex plugin to enable Goofys support in Kubernetes.
Then Goofys can be used to mount S3, GCS, or Minio buckets as PVs. Goofys only mounts a specific bucket, so you must provide the bucket option and pre-create the bucket for each volume. A PVC example is provided in kubeflow-extensions/storage/s3fs/test.yaml; the bucket name is "default". A rough sketch of what such a PV looks like is below.
Installation instructions for manual deployment of the Flex plugin will be added in a few days; automatic install already works on Agile Stacks.
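
This is only a sketch: the Flex driver name below is a placeholder for whatever the deployed Goofys Flex plugin actually registers, and the 'bucket' option must name a pre-created bucket (here the "default" bucket from test.yaml):

from kubernetes import client, config

config.load_kube_config()

# A PV served by the (placeholder-named) goofys Flex driver, mounting one bucket.
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name='goofys-default'),
    spec=client.V1PersistentVolumeSpec(
        capacity={'storage': '100Gi'},
        access_modes=['ReadWriteMany'],
        flex_volume=client.V1FlexVolumeSource(
            driver='example.com/goofys',      # placeholder driver name
            options={'bucket': 'default'},    # bucket must already exist
        ),
    ),
)
client.CoreV1Api().create_persistent_volume(body=pv)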

@rummens

rummens commented Jul 12, 2019

Thanks, I will be looking out for the install instructions and see if it makes sense for us.

@vicaire vicaire removed their assignment Jul 16, 2019
@AnnieWei58

AnnieWei58 commented Aug 19, 2019

@mameshini Hi, I am trying to store data from Kubeflow in MinIO, but I've run into a mounting issue. It seems that I need to modify the YAML file, but I do not know how. Do you have an example, or any hint on how to modify it, please?

@wdhorton

wdhorton commented Oct 9, 2019

I am experiencing the same issues as @AnnieWei58

@Ark-kun
Contributor

Ark-kun commented Oct 19, 2019

@IronPan @mameshini Can you work together to check the possibility of enabling gcsfuse, goofys, or another S3 mounting system as part of the KFP deployment?

The CUJ is:
When a user adds a volumeMount for a specific volume to a Pod, the volume mounts successfully and the files written to that volume appear in a GCS bucket configured by the cluster admin.

@Jeffwan
Member

Jeffwan commented Jan 21, 2020

S3 support was merged into Beam and will be released in 2.19.0. Let's wait for the release, and then we can update all the examples.
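
Once 2.19.0 ships, a read like the following should work directly against s3:// paths. This is a sketch; it assumes apache-beam is installed with its AWS extra and that credentials come from the standard AWS environment variables (a custom Minio endpoint may need extra configuration):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Read the training CSV straight from S3 and count its lines as a smoke test.
with beam.Pipeline(options=PipelineOptions()) as p:
    _ = (
        p
        | 'ReadTrainCsv' >> beam.io.ReadFromText(
            's3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')
        | 'CountLines' >> beam.combiners.Count.Globally()
        | 'Print' >> beam.Map(print)
    )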

@goswamig
Contributor

Apache Beam has been released with S3 support. The latest version is 2.20.0.

@Jeffwan
Member

Jeffwan commented Apr 15, 2020

@gautamkmr Check #3185. The blocker is now on the TFX side.

@stale

stale bot commented Jul 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 15, 2020
@PatrickXYS
Member

TFX v0.22.0 supports Apache Beam 2.21.0, which includes S3 support: https://github.com/tensorflow/tfx#compatible-versions
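
If that holds, the call that originally failed should, in principle, work unchanged against S3 once the sample images pick up TFDV >= 0.22. A sketch, assuming credentials are provided through the standard AWS environment variables:

import tensorflow_data_validation as tfdv

# With Beam >= 2.21 underneath, the s3:// filesystem should now resolve.
stats = tfdv.generate_statistics_from_csv(
    data_location='s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')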

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 19, 2020
@stale

stale bot commented Oct 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 18, 2020
@PatrickXYS
Member

/remove-frozen

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 18, 2020
@karlschriek

Is this issue still relevant? There are similar discussions going on elsewhere, such as #3405.

This is a pretty important issue for us. We want to move all in-cluster persistent storage (MySQL DBs, MinIO) to managed AWS services, as managing them in-cluster is becoming quite a pain.

@Bobgy
Contributor

Bobgy commented Jun 29, 2021

+1, I think we should focus the discussion on a single thread: #3405, where the latest discussion is happening.

@Bobgy Bobgy closed this as completed Jun 29, 2021
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
…low#596)

* This is a replacement for
  https://github.com/kubeflow/manifests/blob/master/hack/update-instance-labels.sh

* We don't want to assume all applications are at the same version
* This script makes it easier to set a different version for a specific
  application.
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024