Transform fails with S3 as backend storage #3189
@arghyaganguly I think this issue would make sense to be in the TFX repo, since the tf.transform library doesn't explicitly support S3. Could you please move it there?
@ConverJens, as suggested by jkinkead in tensorflow/tensorflow#13844, adjusting S3_REQUEST_TIMEOUT_MSEC might help with the default timeout that S3 clients hit when encountering large graphs.
@zoyahav I'll post it on the TFX page instead. @arghyaganguly How and where do I set this? As an environment variable in the pod?
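For anyone landing here with the same question: one possible way to set it (a sketch only, not verified against this deployment; the timeout value is an arbitrary example) is as a process environment variable before TensorFlow is imported, since TF's S3 filesystem reads it at initialization:

```python
import os

# Hypothetical value in milliseconds; tune for your graph size.
# Must be set BEFORE TensorFlow is imported, because TF's S3
# filesystem reads the variable when it initializes.
os.environ["S3_REQUEST_TIMEOUT_MSEC"] = "600000"

# import tensorflow as tf  # import only after the variable is set
```

In Kubernetes the same variable can instead be injected through the pod spec's `env:` section, which avoids ordering concerns entirely.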
@ConverJens, I have transferred the issue to tfx.
@arghyaganguly Thank you! Did you create an issue on the TFX issue page as well, or are we sticking with this? Regarding your proposed solution, I don't think that's the issue, because TFX manages to upload the graph to a tmp dir in S3 but then fails in copying it to the persistent location. To me, it seems like a bug or strange behaviour in a tensorflow method: tf.io.gfile.copy(source, destination).
Hi @PatrickXYS, the issue is related to running TFX via Kubeflow with S3. Is this something you can help with?
Thanks @Bobgy. We have an issue tracking it, kubeflow/pipelines#596. I'll check to see if there is any progress.
@arghyaganguly I'm using TFX 0.26.1 with its dependencies. @Bobgy @PatrickXYS TFX now supports a recent enough Beam version to be able to run on S3, and the other components, e.g. ExampleGen, SchemaGen, StatisticsGen, all work as expected. The issue is solely with Transform, hence it is likely a TFX issue and not Kubeflow or KFP. The actual call failing is
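For reference, a minimal standalone repro of the failing `tf.io.gfile.copy` call might look like the sketch below. The endpoint, credentials, bucket, and keys are hypothetical placeholders; the environment variable names are the ones TensorFlow's S3 filesystem reads:

```python
import os

# Hypothetical Minio endpoint and credentials; adjust to your cluster.
os.environ.setdefault("S3_ENDPOINT", "minio-service.kubeflow:9000")
os.environ.setdefault("S3_USE_HTTPS", "0")
os.environ.setdefault("AWS_ACCESS_KEY_ID", "minio")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "minio123")

def copy_graph(src: str, dst: str) -> None:
    # Import after the env vars are set so TF's S3 filesystem picks them up.
    import tensorflow as tf
    tf.io.gfile.copy(src, dst, overwrite=True)

# Example invocation (hypothetical bucket/keys):
# copy_graph("s3://tfx/Transform/transform_tmp/abc/saved_model.pb",
#            "s3://tfx/Transform/transform_graph/saved_model.pb")
```

Running something like this directly against the same bucket would isolate whether the failure lives in TensorFlow's S3 layer rather than in the Transform component itself.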
@ConverJens
But the object exists in Minio storage and is an empty directory. Because an error occurred, that function returns
@arghyaganguly I have the following under Transform/transform_graph// : The latter contains two vocab_asset files and three dirs with guid names, each containing an empty variables dir and a saved_model.pb. transform_tmp contains a dir with a guid name which contains an assets dir with the correct vocab files and a variables file of zero bytes. I'm currently using force_tf_compat_v1=True, but it seems the error is similar to (or the same as) what @ferryvg has.
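The empty-directory behaviour described above can be illustrated with a toy in-memory model of S3-style storage (pure Python, no real object store; all key names are made up). S3 has no true directories, only keys, so an "empty directory" is just a zero-byte marker key, and whether a client reports it as existing depends on how it interprets that marker:

```python
# Toy model of S3-style object storage: there are no real directories,
# only keys; an "empty directory" is a zero-byte marker key ending in "/".
store = {
    "Transform/transform_tmp/3f2a/saved_model.pb": b"...",
    "Transform/transform_graph/": b"",  # zero-byte "empty dir" marker
}

def exists(path: str) -> bool:
    # True if there is an object at this exact key, or any key under it.
    prefix = path.rstrip("/") + "/"
    return path in store or any(k.startswith(prefix) for k in store)

def is_empty_dir(path: str) -> bool:
    # True if only the zero-byte marker exists, with no keys beneath it.
    prefix = path.rstrip("/") + "/"
    has_children = any(k != prefix and k.startswith(prefix) for k in store)
    return store.get(prefix) == b"" and not has_children
```

With this model, `exists("Transform/transform_graph")` is true even though the "directory" holds nothing, which is exactly the situation where a copy into it can behave differently across S3 implementations.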
@arghyaganguly @ferryvg
Before the failure the last log lines verify that all the artifacts are written to appropriate tmp dirs in S3:
My Minio trace:
So, I think this is caused by a wrong response from Minio, because
I was trying to reproduce my case using AWS S3 and did not receive the error. So, the problem is caused only by Minio.
@ferryvg Great troubleshooting! How do we proceed then? What version of Minio did you use?
@ConverJens
@ferryvg Correct me if I'm wrong, but doesn't this mean that Minio isn't perfectly S3 compatible, since this works on AWS but not using Minio? And if that is the case, it should be re-raised with Minio.
@ConverJens Yeah, but they already replied about that issue in minio/minio#4434.
@ferryvg That is true. However, that was more than one year ago, and I believe that the case of the world's largest production ML framework not working on Minio while actually working on S3 should warrant some attention. I'll post a new issue on Minio and refer to this one.
@Bobgy @PatrickXYS @arghyaganguly This issue was not related to TFX but rather to how Minio was deployed, i.e. not running Minio in erasure mode. Vanilla KubeFlow does not use erasure mode, so perhaps this should be addressed for future releases? See the above Minio issue.
Another way to reproduce this case: make a custom component which uses Or just create two custom components. The executor of the first component will just create an empty "dir" in Minio/S3 and set it to
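The two-step repro described above can be sketched without full TFX components, using plain `tf.io.gfile` calls against a reachable bucket. Everything here is a hypothetical sketch: endpoint, bucket URI, and file names are placeholders, and the comment about where it fails reflects the Minio-without-erasure-mode finding from this thread:

```python
import os

# Hypothetical Minio endpoint; adjust to your deployment.
os.environ.setdefault("S3_ENDPOINT", "minio-service.kubeflow:9000")
os.environ.setdefault("S3_USE_HTTPS", "0")

def empty_dir_then_copy(bucket_uri: str) -> None:
    """Mimic the two-component repro: create an empty 'dir' marker,
    then copy a file into it, as Transform does with its graph output."""
    import tensorflow as tf  # lazy import so the env vars above take effect
    tmp = bucket_uri + "/transform_tmp/payload.txt"
    dst = bucket_uri + "/transform_graph/payload.txt"
    tf.io.gfile.makedirs(bucket_uri + "/transform_graph")  # empty "dir"
    with tf.io.gfile.GFile(tmp, "w") as f:
        f.write("dummy payload")
    # On Minio without erasure mode, this copy is reportedly where it fails.
    tf.io.gfile.copy(tmp, dst, overwrite=True)

# Example (hypothetical bucket):
# empty_dir_then_copy("s3://tfx-pipeline-output")
```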
Closing this since this is not a TFX-specific issue. Please raise an issue in Kubeflow: https://github.com/kubeflow/kubeflow/issues
I'm running TFX in KubeFlow and now I'm trying to use an S3 backend, e.g. Minio.
ExampleGen, StatisticsGen and SchemaGen complete successfully, but Transform fails. It seems to have finished all computations and has written the graph to a tmp dir in S3, but then it tries to copy it to the actual path and fails.
In Minio I can see the output of the analyzer cache and the transform_tmp dir. transformed_examples is empty.
Below is the log output with the error: