
Parallel push for multi-module project sometimes fails with MANIFEST_INVALID #2230

Closed
kan-izh opened this issue Jan 14, 2020 · 11 comments


kan-izh commented Jan 14, 2020

Environment:

  • Jib version: 1.8.0
  • Build tool: Maven
  • OS: "linux", version: "3.10.0-957.21.3.el7.x86_64", arch: "amd64", family: "unix"

Description of the issue:

Sometimes jib:build fails when running a parallel build of a multi-module project. I use "mvn -T4" to run four build threads.

Expected behavior:

Images are pushed in parallel without errors.

Steps to reproduce:

That's hard... It happens sporadically, as any multithreading issue does. A non-parallel build apparently works fine.

Log output:

[2020-01-14T18:31:18.250Z] [ERROR] Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.8.0:build (default-cli) on project some-service: Tried to push image manifest for artifactory.xxxxx.com/dckr/some-service:17.1.0-SNAPSHOT but failed because: Registry may not support pushing OCI Manifest or Docker Image Manifest Version 2, Schema 2 | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new: 400 Bad Request
[2020-01-14T18:31:18.250Z] [ERROR] {"errors":[{"code":"MANIFEST_INVALID","message":"manifest invalid","detail":{"description":"Could not find item at docker-sc-release_local:dckr/some-service/_uploads/sha256__8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908"}}]}
[2020-01-14T18:31:18.250Z] [ERROR] -> [Help 1]
18:31:18  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.8.0:build (default-cli) on project some-service: Tried to push image manifest for artifactory.xxxxx.com/dckr/some-service:17.1.0-SNAPSHOT but failed because: Registry may not support pushing OCI Manifest or Docker Image Manifest Version 2, Schema 2 | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
18:31:18      at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
18:31:18      at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
18:31:18      at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
18:31:18      at java.util.concurrent.FutureTask.run (FutureTask.java:266)
18:31:18      at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
18:31:18      at java.util.concurrent.FutureTask.run (FutureTask.java:266)
18:31:18      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
18:31:18      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
18:31:18      at java.lang.Thread.run (Thread.java:748)
18:31:18  Caused by: org.apache.maven.plugin.MojoExecutionException: Tried to push image manifest for artifactory.xxxxx.com/dckr/some-service:17.1.0-SNAPSHOT but failed because: Registry may not support pushing OCI Manifest or Docker Image Manifest Version 2, Schema 2 | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new
18:31:18      at com.google.cloud.tools.jib.maven.BuildImageMojo.execute (BuildImageMojo.java:149)
18:31:18      at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
18:31:18      at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
18:31:18      at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
18:31:18      at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
18:31:18      at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
18:31:18      at java.util.concurrent.FutureTask.run (FutureTask.java:266)
18:31:18      at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
18:31:18      at java.util.concurrent.FutureTask.run (FutureTask.java:266)
18:31:18      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
18:31:18      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
18:31:18      at java.lang.Thread.run (Thread.java:748)
18:31:18  Caused by: com.google.cloud.tools.jib.registry.RegistryErrorException: Tried to push image manifest for artifactory.xxxxx.com/dckr/some-service:17.1.0-SNAPSHOT but failed because: Registry may not support pushing OCI Manifest or Docker Image Manifest Version 2, Schema 2 | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new
18:31:18      at com.google.cloud.tools.jib.registry.RegistryErrorExceptionBuilder.build (RegistryErrorExceptionBuilder.java:105)
18:31:18      at com.google.cloud.tools.jib.registry.ManifestPusher.handleHttpResponseException (ManifestPusher.java:119)
18:31:18      at com.google.cloud.tools.jib.registry.ManifestPusher.handleHttpResponseException (ManifestPusher.java:39)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:146)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:115)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.callRegistryEndpoint (RegistryClient.java:473)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.pushManifest (RegistryClient.java:321)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:125)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:43)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly (TrustedListenableFutureTask.java:125)
18:31:18      at com.google.common.util.concurrent.InterruptibleTask.run (InterruptibleTask.java:69)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask.run (TrustedListenableFutureTask.java:78)
18:31:18      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
18:31:18      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
18:31:18      at java.lang.Thread.run (Thread.java:748)
18:31:18  Caused by: com.google.cloud.tools.jib.http.ResponseException: 400 Bad Request
18:31:18  {"errors":[{"code":"MANIFEST_INVALID","message":"manifest invalid","detail":{"description":"Could not find item at docker-sc-release_local:dckr/some-service/_uploads/sha256__8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908"}}]}
18:31:18      at com.google.cloud.tools.jib.http.FailoverHttpClient.call (FailoverHttpClient.java:315)
18:31:18      at com.google.cloud.tools.jib.http.FailoverHttpClient.call (FailoverHttpClient.java:237)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:139)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:115)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.callRegistryEndpoint (RegistryClient.java:473)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.pushManifest (RegistryClient.java:321)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:125)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:43)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly (TrustedListenableFutureTask.java:125)
18:31:18      at com.google.common.util.concurrent.InterruptibleTask.run (InterruptibleTask.java:69)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask.run (TrustedListenableFutureTask.java:78)
18:31:18      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
18:31:18      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
18:31:18      at java.lang.Thread.run (Thread.java:748)
18:31:18  Caused by: com.google.api.client.http.HttpResponseException: 400 Bad Request
18:31:18  {"errors":[{"code":"MANIFEST_INVALID","message":"manifest invalid","detail":{"description":"Could not find item at docker-sc-release_local:dckr/some-service/_uploads/sha256__8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908"}}]}
18:31:18      at com.google.api.client.http.HttpRequest.execute (HttpRequest.java:1097)
18:31:18      at com.google.cloud.tools.jib.http.FailoverHttpClient.call (FailoverHttpClient.java:309)
18:31:18      at com.google.cloud.tools.jib.http.FailoverHttpClient.call (FailoverHttpClient.java:237)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:139)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryEndpointCaller.call (RegistryEndpointCaller.java:115)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.callRegistryEndpoint (RegistryClient.java:473)
18:31:18      at com.google.cloud.tools.jib.registry.RegistryClient.pushManifest (RegistryClient.java:321)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:125)
18:31:18      at com.google.cloud.tools.jib.builder.steps.PushImageStep.call (PushImageStep.java:43)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly (TrustedListenableFutureTask.java:125)
18:31:18      at com.google.common.util.concurrent.InterruptibleTask.run (InterruptibleTask.java:69)
18:31:18      at com.google.common.util.concurrent.TrustedListenableFutureTask.run (TrustedListenableFutureTask.java:78)
18:31:18      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
18:31:18      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
18:31:18      at java.lang.Thread.run (Thread.java:748)

Additional Information:

I don't use OCI anywhere in the configuration, so the error text looks misleading.

I'm not sure whether it's our Artifactory failing or whether the Jib plugin may not handle pushing layers shared between projects.

chanseokoh (Member) commented Jan 14, 2020

Hi @kan-izh,

Does your multi-module setup push multiple images with Jib (i.e., do you have multiple sub-modules that configure Jib)? If there are multiple Jib sub-modules, do they depend on one another somehow?

kan-izh (Author) commented Jan 14, 2020

@chanseokoh, there is one common jar and three Jib modules depending on that jar. So... I guess the Jib modules share the same dependency layers, and they may push them concurrently.
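
(Illustrative aside: registry blobs are content-addressed by the SHA-256 of their bytes, so byte-identical dependency layers built by different modules resolve to the same blob, and a parallel build can race to upload that one blob from several module builds. A minimal sketch, with hypothetical placeholder bytes standing in for the layer tarballs:)

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class SameLayerSameDigest {
  public static void main(String[] args) throws Exception {
    // Hypothetical stand-ins for the dependency-layer tarballs Jib builds in
    // two different modules; byte-identical content is the key assumption.
    byte[] layerFromModuleA = "identical layer tarball bytes".getBytes(StandardCharsets.UTF_8);
    byte[] layerFromModuleB = "identical layer tarball bytes".getBytes(StandardCharsets.UTF_8);

    byte[] digestA = MessageDigest.getInstance("SHA-256").digest(layerFromModuleA);
    byte[] digestB = MessageDigest.getInstance("SHA-256").digest(layerFromModuleB);

    // Same bytes => same digest => both modules address the same registry blob,
    // so a parallel build may try to upload that one blob from several threads.
    System.out.println(Arrays.equals(digestA, digestB)); // prints: true
  }
}
```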

chanseokoh (Member) commented Jan 14, 2020

About docker-sc-release_local:dckr/some-service/_uploads/sha256__8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908 in your log, I'd like to see whether the blob with digest 8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908 is the layer containing the common jar.

Assuming that the blob was eventually uploaded, can you download the blob and check its contents? The blob will most likely be a .tar.gz if it's a layer; it could be plain JSON if it's not a layer. I'm trying to see whether it is the layer with the common jar you mentioned. If the three Jib modules build and push the same layer concurrently, I think it could theoretically be problematic if the registry does not handle the upload requests correctly.

One way I can think of to download the blob is to use Crane. For example, crane pull <your image> /dev/stdout | tar -xvf - will extract all the layers of the image, plus its manifest JSON and container configuration JSON, into the current directory. (Note that you should pull an image built by a direct Jib push using jib:build. An image pushed with docker push after jib:dockerBuild will have different layer digests.)

kan-izh (Author) commented Jan 15, 2020

I've reproduced it once again and found the layer in the target/jib-cache folder. It looks like the failing sha256 digest belonged to the layer containing the Jib module's own JAR... so it must be unique to that module... weird. Does that help you?
Now I'm trying to reproduce it again with DEBUG logs... maybe they will show something more useful.

chanseokoh (Member) commented:

That helps, thanks. In addition to getting debug logs (-X), can you follow these instructions to capture network traffic? And once this happens, run Jib again and capture a second log without touching the registry or your project.

And do you own your Artifactory repo? I wonder if it is easy to check via its admin UI if a blob exists or not.

chanseokoh (Member) commented:

Oh, sorry. I forgot to say that, regarding the instructions to capture network traffic, you should remove -Djib.serialize=true. Using that flag will make everything run serially.

kan-izh (Author) commented Jan 20, 2020

Unfortunately I have had no luck reproducing it since... I suspect it may be linked to a recent change in our code. There were some conflicting transitive dependencies (i.e., a library was present in two different versions and Maven may pick one at random). Could that potentially affect Docker layer generation or something?

chanseokoh (Member) commented Jan 23, 2020

Honestly, I have been thinking this could be a server issue. For one, we've observed a Quay server malfunctioning and returning "500 Internal Server Error" under a high level of concurrency: #1986 (comment) and #2013. At the time, I had my own theory about the Quay server malfunction. Another example is a concurrency issue (#2372 (comment)) in Sonatype Nexus.

Our code is designed so that the thread that pushes a manifest can run only after all the threads uploading individual blobs get concrete confirmation from the server that every blob was uploaded successfully or already exists. (The manifest upload is the final network interaction with the server.) The fact that Jib started to upload a manifest means that there was no error or exception in the previous operations and everything seemed normal. If there had been any anomaly, the Jib execution would simply have crashed and the manifest upload could not have started at all.
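
A minimal sketch of that ordering, assuming a CompletableFuture-based chain (illustrative only; per the stack trace above, Jib actually uses Guava's listenable futures, and all names here are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class PushOrderingSketch {
  public static void main(String[] args) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    List<String> blobDigests = Arrays.asList("sha256:aaa", "sha256:bbb", "sha256:ccc");

    // One upload task per blob (layers, container config), all running in parallel.
    List<CompletableFuture<Void>> blobUploads = blobDigests.stream()
        .map(digest -> CompletableFuture.runAsync(() -> pushBlob(digest), executor))
        .collect(Collectors.toList());

    // The manifest push is chained after ALL blob uploads: it starts only once
    // every upload has completed successfully. If any upload fails, the chain
    // completes exceptionally and the manifest push never runs.
    CompletableFuture.allOf(blobUploads.toArray(new CompletableFuture[0]))
        .thenRun(PushOrderingSketch::pushManifest)
        .join();

    executor.shutdown();
  }

  static void pushBlob(String digest) {
    System.out.println("uploaded blob " + digest); // placeholder for the real blob upload
  }

  static void pushManifest() {
    System.out.println("uploaded manifest"); // placeholder for the real manifest push
  }
}
```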

And yet, the server seems to say that some layer was absent or not found when Jib tried to upload the final manifest. That is, the server seems to think a blob we specified in the manifest doesn't exist (hence MANIFEST_INVALID).

"code":"MANIFEST_INVALID",
"message":"manifest invalid",
"detail":{"description":"Could not find item at docker-sc-release_local:dckr/some-service/_uploads/sha256__8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908"}

So I was trying to see whether, when this happened, the layer did or did not exist in the registry. Immediately running a second Jib push with debug logging would have revealed how the server responds in the second run when Jib asks whether the blob already exists on the server.
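
(Illustrative aside: this existence check can also be done by hand with the Docker Registry HTTP API V2, where HEAD /v2/<name>/blobs/<digest> returns 200 if the blob exists and 404 if not. A minimal sketch using Java 11's HttpClient; the registry host below is a placeholder, and a real Artifactory instance would typically also need an Authorization header:)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BlobExistsCheck {
  public static void main(String[] args) throws Exception {
    // Digest taken from the error log above; the host is a placeholder.
    String digest = "sha256:8e3d9783c8401812e442d0302c2b2f75456c9fc4a689d684bce2961b97ecf908";
    URI blobUri = URI.create(
        "https://artifactory.example.com/v2/dckr/some-service/blobs/" + digest);

    HttpRequest head = HttpRequest.newBuilder(blobUri)
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<Void> response =
        HttpClient.newHttpClient().send(head, HttpResponse.BodyHandlers.discarding());

    // Per the Registry V2 API: 200 means the blob exists; 404 means it does not,
    // which is what a MANIFEST_INVALID on the manifest push would suggest.
    System.out.println("HTTP " + response.statusCode());
  }
}
```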

chanseokoh (Member) commented:

So I think changing your application or custom base image so that there are new layers to push may increase the chance of hitting this. Emptying the registry repo may help, but some registries retain blobs internally for a while, so I think creating new layers is the better approach.

chanseokoh changed the title from "Parallel push fails for multi-module project sometimes" to "Parallel push for multi-module project sometimes fails with MANIFEST_INVALID" on Apr 1, 2020
chanseokoh (Member) commented:

Closing as we haven't heard back from the user. Also, the user doesn't seem to be able to reproduce this after a change on their side. But I suspect this is a bug in the registry (Artifactory).

kan-izh (Author) commented Mar 5, 2021

Yes, it seems it was related to a Nexus bug, maybe this one: https://issues.sonatype.org/browse/NEXUS-20640

I have not seen it since we upgraded Nexus.
