Transform fails with S3 as backend storage #3189

Closed
ConverJens opened this issue Feb 2, 2021 · 22 comments

Comments

@ConverJens
Contributor

I'm running TFX in Kubeflow and am now trying to use an S3 backend, e.g. Minio.

ExampleGen, StatisticsGen and SchemaGen complete successfully, but Transform fails. It seems to have finished all computations and written the graph to a tmp dir in S3, but then it tries to copy it to the actual path and fails.

In Minio I can see the output of the analyzer cache and the transform_tmp dir. transformed_examples is empty.

Below is the log output with the error:

INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267047   nanos: 216625690 } message: "Assets added to graph." instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:666" thread: "Thread-12" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267047   nanos: 279367208 } message: "Assets written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/1736/.temp_path/tftransform_tmp/3d458c0a7a00464f9632b2d467cdd963/assets" instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:775" thread: "Thread-12" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267047   nanos: 331916809 } message: "SavedModel written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/1736/.temp_path/tftransform_tmp/3d458c0a7a00464f9632b2d467cdd963/saved_model.pb" instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:426" thread: "Thread-12" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: ERROR timestamp {   seconds: 1612267136   nanos: 111515760 } message: "Error processing instruction bundle_170. Original traceback is\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n    self.process_method(*args_for_process),\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n    _copy_tree(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n    os.path.join(source, filename), os.path.join(destination, filename))\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n    tf.io.gfile.copy(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n    compat.as_bytes(src), compat.as_bytes(dst), overwrite)\ntensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. 
The last failure: Unknown: : No response body.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 289, in _execute\n    response = task()\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 362, in <lambda>\n    lambda: self.create_worker().do_instruction(request), request)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 607, in do_instruction\n    getattr(request, request_type), request.instruction_id)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 644, in process_bundle\n    bundle_processor.process_bundle(instruction_id))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 1000, in process_bundle\n    element.data)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 228, in process_encoded\n    self.output(decoded_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 359, in output\n    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File 
\"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n    windowed_value, self.process_method(windowed_value.value))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n    windowed_value, self.process_method(windowed_value.value))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File 
\"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 894, in _invoke_process_per_window\n    self.threadsafe_watermark_estimator)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 158, in receive\n    cython.cast(Operation, consumer).process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1321, in _reraise_augmented\n    raise_with_traceback(new_exn)\n  File \"/usr/local/lib/python3.7/dist-packages/future/utils/__init__.py\", line 446, in raise_with_traceback\n    raise exc.with_traceback(traceback)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n    self.process_method(*args_for_process),\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n  File 
\"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n    _copy_tree(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n    os.path.join(source, filename), os.path.join(destination, filename))\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n    tf.io.gfile.copy(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n    compat.as_bytes(src), compat.as_bytes(dst), overwrite)\nRuntimeError: tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body. [while running \'WriteTransformFn/WriteTransformFnToTemp\']\n\n" instruction_id: "bundle_170" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:296" thread: "Thread-12" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 21243095 } message: "No more requests from control plane" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:266" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 21665096 } message: "SDK Harness waiting for in-flight requests to complete" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:267" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 21919965 } message: "Closing all cached grpc data channels." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/data_plane.py:721" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 22111654 } message: "Closing all cached gRPC state handlers." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:891" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 31976699 } message: "Done consuming work." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:279" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 32243013 } message: "Python sdk harness exiting." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py:166" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 851266622 } message: "No more requests from control plane" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:266" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 851534366 } message: "SDK Harness waiting for in-flight requests to complete" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:267" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 851665258 } message: "Closing all cached grpc data channels." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/data_plane.py:721" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 851786136 } message: "Closing all cached gRPC state handlers." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:891" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 859477281 } message: "Done consuming work." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:279" thread: "MainThread" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612267137   nanos: 859626293 } message: "Python sdk harness exiting." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py:166" thread: "MainThread" 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 1239, in process
    return self.do_fn_invoker.invoke_process(windowed_value)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 769, in invoke_process
    windowed_value, additional_args, additional_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 893, in _invoke_process_per_window
    self.process_method(*args_for_process),
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py", line 1590, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 37, in _copy_tree_to_unique_temp_dir
    _copy_tree(source, destination)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 51, in _copy_tree
    os.path.join(source, filename), os.path.join(destination, filename))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 53, in _copy_tree
    tf.io.gfile.copy(source, destination)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 513, in copy_v2
    compat.as_bytes(src), compat.as_bytes(dst), overwrite)
tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body.
@arghyaganguly arghyaganguly self-assigned this Feb 3, 2021
@zoyahav
Member

zoyahav commented Feb 3, 2021

@arghyaganguly I think it would make sense for this issue to be in the TFX repo, since tf.transform, the library, doesn't explicitly support S3. Could you please move it there?

@arghyaganguly arghyaganguly transferred this issue from tensorflow/transform Feb 3, 2021
@arghyaganguly
Contributor

arghyaganguly commented Feb 3, 2021

@ConverJens, as suggested by jkinkead in tensorflow/tensorflow#13844, adjusting S3_REQUEST_TIMEOUT_MSEC might help with the default timeout of S3 clients when they encounter large graphs.
Please mention the versions of tensorflow, tfx and tensorflow-transform used in your use case.

@ConverJens
Contributor Author

@zoyahav I'll post it on the TFX page instead.

@arghyaganguly How and where do I set this? As an environment variable in the pod?

@arghyaganguly
Contributor

arghyaganguly commented Feb 3, 2021

@ConverJens, I have transferred the issue to TFX.

@ConverJens
Contributor Author

@arghyaganguly Thank you! Did you create an issue on the TFX issue page as well or are we sticking with this?

Regarding your proposed solution, I don't think that's the issue, because TFX manages to upload the graph to a tmp dir in S3 but then fails when copying it to the persistent location. To me, it seems like a bug or strange behaviour in a TensorFlow method: tf.io.gfile.copy(source, destination).

@arghyaganguly arghyaganguly removed their assignment Feb 3, 2021
@Bobgy

Bobgy commented Feb 5, 2021

Hi @PatrickXYS, the issue is related to running TFX via Kubeflow with S3. Is this something you can help with?

@PatrickXYS

Thanks @Bobgy

We have an issue tracking it: kubeflow/pipelines#596

I'll check to see if there is any progress.

@ConverJens
Contributor Author

@arghyaganguly I'm using TFX 0.26.1 with its dependencies.

@Bobgy @PatrickXYS TFX now supports a recent enough Beam version to run on S3, and the other components, e.g. ExampleGen, SchemaGen and StatisticsGen, all work as expected. The issue is solely with Transform, hence it is likely a TFX issue and not a Kubeflow or KFP one.

The actual call failing is tf.io.gfile.copy(source, destination). I think someone from TFX should look at this.

@ferryvg

ferryvg commented Feb 5, 2021

@ConverJens
In my case, I got an error in tensorflow.python.lib.io.file_io.is_directory_v2:

Object s3://.../Transform/transform_graph/10/.temp_path/tftransform_tmp/eba5999cfda140d98e2fbf8f78111d0d/variables does not exist

But the object exists in Minio storage and is an empty directory.

Because of that error, the function returns false, and tensorflow_transform.beam.tft_beam_io.transform_fn_io._copy_tree then tries to copy that directory as a file.
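
The failure mode described above can be reproduced without S3. The sketch below is a simplified recursive copy shaped like the frames in the traceback (a directory branch that recurses, a file branch that copies); it is not the actual tensorflow_transform source. `copy_tree`, `FakeStore` and `blind_dirs` are hypothetical names used only for illustration, and `blind_dirs` models a directory that the is-directory check fails to recognise.

```python
import posixpath


def copy_tree(fs, source, destination):
    """Recursively copy `source` to `destination` using the given filesystem object."""
    if fs.isdir(source):
        fs.makedirs(destination)
        for name in fs.listdir(source):
            copy_tree(fs, posixpath.join(source, name),
                      posixpath.join(destination, name))
    else:
        # Anything not reported as a directory is copied as a single file.
        fs.copy(source, destination)


class FakeStore:
    """In-memory stand-in for an object store (illustrative only)."""

    def __init__(self, dirs, files, blind_dirs=()):
        self.dirs = set(dirs)
        self.files = dict(files)
        self.blind_dirs = set(blind_dirs)  # directories that isdir() does not see
        self.copy_calls = []

    def isdir(self, path):
        return path in self.dirs and path not in self.blind_dirs

    def makedirs(self, path):
        self.dirs.add(path)

    def listdir(self, path):
        prefix = path.rstrip("/") + "/"
        entries = list(self.dirs) + list(self.files)
        return sorted({p[len(prefix):].split("/")[0]
                       for p in entries if p.startswith(prefix)})

    def copy(self, source, destination):
        self.copy_calls.append(source)
        if source not in self.files:
            raise FileNotFoundError(source)
        self.files[destination] = self.files[source]


store = FakeStore(
    dirs={"tmp/model", "tmp/model/variables"},
    files={"tmp/model/saved_model.pb": b"graph"},
    blind_dirs={"tmp/model/variables"},
)
try:
    copy_tree(store, "tmp/model", "out/model")
except FileNotFoundError as err:
    print("directory was copied as a file:", err)
```

With `blind_dirs` left empty, the same call completes and creates `out/model/variables` as a directory, which suggests the copy logic itself is fine once the directory check answers correctly.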

@ConverJens
Contributor Author

@arghyaganguly
I tried setting S3_REQUEST_TIMEOUT_MSEC=600000 as an env variable in my pipeline and the problem still persists.
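
For readers wondering how such a value reaches the component: Kubernetes container specs take environment variables as name/value pairs with string values. A minimal sketch, assuming the setting is injected through the `env` list of the container that runs Transform. The helper `to_container_env` is hypothetical; it only builds the plain dict form of those entries and does not talk to Kubeflow or Kubernetes.

```python
def to_container_env(settings):
    """Return Kubernetes-style env entries ({"name": ..., "value": ...})."""
    # Kubernetes expects env values to be strings, so numbers are stringified.
    return [{"name": name, "value": str(value)}
            for name, value in sorted(settings.items())]


# Value taken from the comment above; 600000 ms is ten minutes.
env_entries = to_container_env({"S3_REQUEST_TIMEOUT_MSEC": 600000})
print(env_entries)
```

The variable presumably has to be present in the process environment before TensorFlow first touches an `s3://` path; that ordering is an assumption here, not something the thread confirms.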

I have the following under Transform/transform_graph// :
transform_tmp
and
.temp_path/tftransform_tmp

The latter contains two vocab_asset files and three directories with GUID names, each containing an empty variables dir and a saved_model.pb.

transform_tmp contains a dir with a GUID name, which contains an asset dir with the correct vocab files and a variables file of zero bytes.

I'm currently using force_tf_compat_v1=True, but it seems the error is similar to (or the same as) what @ferryvg has.

@ConverJens
Contributor Author

@arghyaganguly @ferryvg
BTW, the graph I'm trying to save is 38 kB on disk and would hardly count as large.

@ConverJens
Contributor Author

Before the failure, the last log lines verify that all the artifacts are written to the appropriate tmp dirs in S3:

INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612529043   nanos: 194838523 } message: "Assets added to graph." instruction_id: "bundle_157" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:666" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612529043   nanos: 256393671 } message: "Assets written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/2092/.temp_path/tftransform_tmp/295976ed1dc94530a214952b7b0c8eb5/assets" instruction_id: "bundle_157" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:775" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612529043   nanos: 316509008 } message: "SavedModel written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/2092/.temp_path/tftransform_tmp/295976ed1dc94530a214952b7b0c8eb5/saved_model.pb" instruction_id: "bundle_157" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:426" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1612529043   nanos: 392094373 } message: "Expected binary or unicode string, got type_url: \"type.googleapis.com/tensorflow.AssetFileDef\"\nvalue: \"\\n\\013\\n\\tConst_3:0\\022-vocab_compute_and_apply_vocabulary_vocabulary\"\n" instruction_id: "bundle_157" transform_id: "Analyze/ComputeDeferredMetadata[compat_v1=True]" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:6689" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1612529043   nanos: 392244577 } message: "Expected binary or unicode string, got type_url: \"type.googleapis.com/tensorflow.AssetFileDef\"\nvalue: \"\\n\\013\\n\\tConst_5:0\\022/vocab_compute_and_apply_vocabulary_1_vocabulary\"\n" instruction_id: "bundle_157" transform_id: "Analyze/ComputeDeferredMetadata[compat_v1=True]" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:6689" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp {   seconds: 1612529043   nanos: 392589569 } message: "Saver not created because there are no variables in the graph to restore" instruction_id: "bundle_157" transform_id: "Analyze/ComputeDeferredMetadata[compat_v1=True]" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py:1512" thread: "Thread-14" 
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: ERROR timestamp {   seconds: 1612529132   nanos: 790544748 } message: "Error processing instruction bundle_157. Original traceback is\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n    self.process_method(*args_for_process),\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n    _copy_tree(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n    os.path.join(source, filename), os.path.join(destination, filename))\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n    tf.io.gfile.copy(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n    compat.as_bytes(src), compat.as_bytes(dst), overwrite)\ntensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. 
The last failure: Unknown: : No response body.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 289, in _execute\n    response = task()\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 362, in <lambda>\n    lambda: self.create_worker().do_instruction(request), request)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 607, in do_instruction\n    getattr(request, request_type), request.instruction_id)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 644, in process_bundle\n    bundle_processor.process_bundle(instruction_id))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 1000, in process_bundle\n    element.data)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 228, in process_encoded\n    self.output(decoded_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 359, in output\n    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File 
\"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n    windowed_value, self.process_method(windowed_value.value))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n    windowed_value, self.process_method(windowed_value.value))\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n    self.consumer.process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File 
\"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 894, in _invoke_process_per_window\n    self.threadsafe_watermark_estimator)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n    self.main_receivers.receive(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 158, in receive\n    cython.cast(Operation, consumer).process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n    delayed_application = self.dofn_runner.process(o)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n    self._reraise_augmented(exn)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1321, in _reraise_augmented\n    raise_with_traceback(new_exn)\n  File \"/usr/local/lib/python3.7/dist-packages/future/utils/__init__.py\", line 446, in raise_with_traceback\n    raise exc.with_traceback(traceback)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n    return self.do_fn_invoker.invoke_process(windowed_value)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n    windowed_value, additional_args, additional_kwargs)\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n    self.process_method(*args_for_process),\n  File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n  File 
\"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n    _copy_tree(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n    os.path.join(source, filename), os.path.join(destination, filename))\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n    tf.io.gfile.copy(source, destination)\n  File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n    compat.as_bytes(src), compat.as_bytes(dst), overwrite)\nRuntimeError: tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body. [while running \'WriteTransformFn/WriteTransformFnToTemp\']\n\n" instruction_id: "bundle_157" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:296" thread: "Thread-14" 

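The timestamps in the log above show how long the copy step kept retrying before it gave up. A small sketch using the `seconds` and `nanos` values of two log entries from this run; the helper name `to_nanos` is made up for the example.

```python
def to_nanos(seconds, nanos):
    """Combine the log's `seconds` and `nanos` fields into integer nanoseconds."""
    return seconds * 1_000_000_000 + nanos


# "SavedModel written to: ..." entry for bundle_157
saved_model_written = to_nanos(1612529043, 316509008)
# "Error processing instruction bundle_157" entry
copy_failed = to_nanos(1612529132, 790544748)

gap_seconds = (copy_failed - saved_model_written) / 1e9
print(f"{gap_seconds:.3f} s between the SavedModel write and the copy error")
```

The first log in this thread gives a similar gap of just under 89 s. That would be consistent with a fixed retry schedule ("All 10 retry attempts failed") rather than a size-dependent timeout, though the logs alone do not prove it.
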
@ferryvg

ferryvg commented Feb 5, 2021

My Minio trace:

13:05:48.934 [200 OK] s3.PutObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/transform_tmp/dea87325a915411ab80be15bd53da5f5/ 192.168.0.130     347µs       ↑ 135 B ↓ 261 B
13:05:48.935 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674 192.168.0.130     178µs       ↑ 153 B ↓ 227 B
13:05:48.937 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?max-keys=1&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2F.temp_path%2Ftftransform_tmp%2F6850e091c2574530b19e09097ccc9674%2F  192.168.0.130     330µs       ↑ 153 B ↓ 1.2 KiB
13:05:48.938 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674 192.168.0.130     175µs       ↑ 153 B ↓ 227 B
13:05:48.940 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?max-keys=1&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2F.temp_path%2Ftftransform_tmp%2F6850e091c2574530b19e09097ccc9674%2F  192.168.0.130     307µs       ↑ 153 B ↓ 1.2 KiB
13:05:48.941 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?delimiter=%2F&max-keys=100&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2F.temp_path%2Ftftransform_tmp%2F6850e091c2574530b19e09097ccc9674%2F  192.168.0.130     304µs       ↑ 153 B ↓ 1.2 KiB
13:05:48.943 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     181µs       ↑ 153 B ↓ 227 B
13:05:48.944 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?max-keys=1&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2F.temp_path%2Ftftransform_tmp%2F6850e091c2574530b19e09097ccc9674%2Fvariables%2F  192.168.0.130     231µs       ↑ 153 B ↓ 644 B
13:05:53.370 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/transform_tmp/dea87325a915411ab80be15bd53da5f5/variables 192.168.0.130     242µs       ↑ 153 B ↓ 227 B
13:05:53.371 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?max-keys=1&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2Ftransform_tmp%2Fdea87325a915411ab80be15bd53da5f5%2Fvariables%2F  192.168.0.130     262µs       ↑ 153 B ↓ 631 B
13:05:53.372 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/transform_tmp/dea87325a915411ab80be15bd53da5f5/variables 192.168.0.130     185µs       ↑ 153 B ↓ 227 B
13:05:53.373 [200 OK] s3.ListObjectsV1 **.**.**.**:49000/bucketname?max-keys=1&prefix=pipe-out-path%2FTransform%2Ftransform_graph%2F5%2Ftransform_tmp%2Fdea87325a915411ab80be15bd53da5f5%2Fvariables%2F  192.168.0.130     240µs       ↑ 153 B ↓ 631 B
13:05:53.374 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     228µs       ↑ 159 B ↓ 227 B
13:05:53.376 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     208µs       ↑ 159 B ↓ 227 B
13:05:53.377 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     153µs       ↑ 159 B ↓ 227 B
13:05:53.378 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     161µs       ↑ 159 B ↓ 227 B
13:05:53.857 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     182µs       ↑ 159 B ↓ 227 B
13:05:53.858 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     166µs       ↑ 159 B ↓ 227 B
13:05:53.860 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     171µs       ↑ 159 B ↓ 227 B
13:05:53.861 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     159µs       ↑ 159 B ↓ 227 B
13:05:54.961 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     169µs       ↑ 159 B ↓ 227 B
13:05:54.962 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     184µs       ↑ 159 B ↓ 227 B
13:05:54.964 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     166µs       ↑ 159 B ↓ 227 B
13:05:54.965 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     167µs       ↑ 159 B ↓ 227 B
13:05:55.797 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     228µs       ↑ 159 B ↓ 227 B
13:05:55.799 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     154µs       ↑ 159 B ↓ 227 B
13:05:55.800 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     163µs       ↑ 159 B ↓ 227 B
13:05:55.801 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     182µs       ↑ 159 B ↓ 227 B
13:05:57.516 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     206µs       ↑ 159 B ↓ 227 B
13:05:57.518 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     162µs       ↑ 159 B ↓ 227 B
13:05:57.519 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     190µs       ↑ 159 B ↓ 227 B
13:05:57.521 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     208µs       ↑ 159 B ↓ 227 B
13:06:00.038 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     187µs       ↑ 159 B ↓ 227 B
13:06:00.040 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     155µs       ↑ 159 B ↓ 227 B
13:06:00.041 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     189µs       ↑ 159 B ↓ 227 B
13:06:00.043 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     175µs       ↑ 159 B ↓ 227 B
13:06:03.444 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     193µs       ↑ 159 B ↓ 227 B
13:06:03.445 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     187µs       ↑ 159 B ↓ 227 B
13:06:03.447 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     161µs       ↑ 159 B ↓ 227 B
13:06:03.448 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     172µs       ↑ 159 B ↓ 227 B
13:06:10.490 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     198µs       ↑ 159 B ↓ 227 B
13:06:10.491 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     170µs       ↑ 159 B ↓ 227 B
13:06:10.492 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     160µs       ↑ 159 B ↓ 227 B
13:06:10.494 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     173µs       ↑ 159 B ↓ 227 B
13:06:23.474 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     225µs       ↑ 159 B ↓ 227 B
13:06:23.476 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     208µs       ↑ 159 B ↓ 227 B
13:06:23.478 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     179µs       ↑ 159 B ↓ 227 B
13:06:23.479 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     154µs       ↑ 159 B ↓ 227 B
13:06:49.933 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     239µs       ↑ 159 B ↓ 227 B
13:06:49.935 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     171µs       ↑ 159 B ↓ 227 B
13:06:49.936 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     179µs       ↑ 159 B ↓ 227 B
13:06:49.938 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     167µs       ↑ 159 B ↓ 227 B
13:07:22.299 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     234µs       ↑ 159 B ↓ 227 B
13:07:22.300 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     226µs       ↑ 159 B ↓ 227 B
13:07:22.302 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     206µs       ↑ 159 B ↓ 227 B
13:07:22.303 [404 Not Found] s3.HeadObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/.temp_path/tftransform_tmp/6850e091c2574530b19e09097ccc9674/variables 192.168.0.130     187µs       ↑ 159 B ↓ 227 B
13:07:22.304 [200 OK] s3.PutObject **.**.**.**:49000/bucketname/pipe-out-path/Transform/transform_graph/5/transform_tmp/dea87325a915411ab80be15bd53da5f5/variables 192.168.0.130     82.412ms     ↑ 135 B ↓ 261 B

So I think this is caused by a wrong response from Minio, because the **/variables object really does exist:
minio/minio#4434
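
To make the pattern in the trace easier to read, here is a minimal sketch of a flat S3-style key space. It is a toy model, not Minio's or TensorFlow's code: `ToyBucket`, `looks_like_directory` and the key names are hypothetical, and it assumes the empty "directory" is stored as a marker key with a trailing slash. It only shows why a HeadObject on `.../variables` can return 404 while a ListObjects on the prefix `.../variables/` still returns results.

```python
# Toy in-memory model of a flat S3-style key space.
# Assumption: an empty "directory" is a marker key ending in "/".
class ToyBucket:
    def __init__(self):
        self.keys = set()

    def put_object(self, key):
        self.keys.add(key)

    def head_object(self, key):
        # HeadObject matches one exact key only.
        return 200 if key in self.keys else 404

    def list_objects(self, prefix, max_keys=1):
        # ListObjects returns up to max_keys keys that start with the prefix.
        return sorted(k for k in self.keys if k.startswith(prefix))[:max_keys]


def looks_like_directory(bucket, path):
    """Probe a path as in the trace: HeadObject first, then a prefix listing."""
    if bucket.head_object(path) == 200:
        return True
    return bool(bucket.list_objects(path.rstrip("/") + "/"))


bucket = ToyBucket()
bucket.put_object("graph/tmp/variables/")      # empty "directory" marker
bucket.put_object("graph/tmp/saved_model.pb")  # ordinary object

head_status = bucket.head_object("graph/tmp/variables")
dir_found = looks_like_directory(bucket, "graph/tmp/variables")
missing_found = looks_like_directory(bucket, "graph/tmp/missing")
print(head_status, dir_found, missing_found)
```

In this model the client only concludes that the directory exists if the prefix listing returns the marker key. How a given Minio deployment answers that listing is what the linked Minio issue is about and is not modelled here.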

@ferryvg

ferryvg commented Feb 5, 2021

I tried to reproduce my case using AWS S3 and did not get the error. So the problem is caused only by Minio.

@ConverJens
Contributor Author

@ferryvg Great troubleshooting! How do we proceed then? What version of Minio did you use?

@ferryvg

ferryvg commented Feb 8, 2021

@ConverJens
Latest version of Minio.

@ConverJens
Contributor Author

@ferryvg Correct me if I'm wrong, but doesn't this mean that Minio isn't perfectly S3 compatible, since this works on AWS but not on Minio? If that is the case, it should be re-raised with Minio.

@ferryvg

ferryvg commented Feb 8, 2021

@ConverJens Yeah, but they already replied about that issue in minio/minio#4434

@ConverJens
Contributor Author

ConverJens commented Feb 8, 2021

@ferryvg That is true. However, that was more than one year ago, and I believe that the world's largest production ML framework not working on Minio while working on S3 should warrant some attention. I'll post a new issue on Minio and refer to this one.

@ConverJens
Contributor Author

@Bobgy @PatrickXYS @arghyaganguly This issue was not related to TFX but rather to how Minio was deployed, i.e. Minio was not running in erasure mode. Vanilla KubeFlow does not use erasure mode, so perhaps this should be addressed for future releases? See the Minio issue above.

@ferryvg

ferryvg commented Feb 8, 2021

Another way to reproduce this case: make a custom component that uses the PushedModel artifact from the Pusher component, and configure the Evaluator rules so that Pusher does not push a new model. You will get an exception when the artifact URIs are checked.

Or just create two custom components. The executor of the first component just creates an empty "dir" in Minio/S3 and sets it as the artifact uri in output_dict. The second component takes that artifact from the first component's output_dict in its input_dict. You will get an error before the executor of the second component runs.
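
A rough sketch of that two-component pattern, reduced to plain functions so it runs without TFX. The function names and the URI layout are hypothetical, and it is an assumption that the check before the second executor amounts to an "exists" call on the artifact uri. By default it uses the local filesystem, where the empty directory is found.

```python
import os
import tempfile


def first_component(output_uri, makedirs=os.makedirs):
    """Stand-in for the first executor: only creates an empty output 'dir'."""
    makedirs(output_uri)
    return output_uri


def second_component_precheck(input_uri, exists=os.path.exists):
    """Stand-in for the check that runs before the second executor starts."""
    if not exists(input_uri):
        raise FileNotFoundError("artifact uri not found: " + input_uri)
    return True


# Local-disk run, where the empty directory is visible as expected.
with tempfile.TemporaryDirectory() as root:
    uri = first_component(os.path.join(root, "component_a", "output", "1"))
    precheck_ok = second_component_precheck(uri)
print(precheck_ok)
```

To try the same steps against an object store, one could pass `tf.io.gfile.makedirs` and `tf.io.gfile.exists` as the `makedirs` and `exists` arguments together with an `s3://` URI, assuming a TensorFlow build with S3 filesystem support. Whether the precheck then fails depends on the Minio deployment, as discussed above.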

@arghyaganguly
Contributor

Closing this since this is not a TFX-specific issue. Please raise an issue in Kubeflow: https://github.com/kubeflow/kubeflow/issues
