Gateway Timeout (504) error when running with SAM #6041
Comments
I've got exactly the same issue. The weird thing is that I got it to work on CPU on one computer, then followed the exact same steps on another, and there it doesn't work. In the docker logs I've found this:
I found no further errors in the nuclio container and the sam container. I've tested with another serverless function and there I got the same issue. Currently I run with the env CVAT_HOST, but I got the same behaviour without. So I suspect that there might be some communication issues between the cvat containers, but I don't really see a clear way to debug this. |
In my case I was able to work around this by increasing Also, consider decreasing And increasing SAM started to work, but it is very slow, and the CPU is not fully utilized while I am waiting for the results in the UI, so it looks like it is due to I/O (which is also strange, as the instance has an SSD and all the network connectivity is on localhost). |
The weird thing is that it did work on my laptop (RTX3000, 32Gb and 6-core CPU), but it doesn't work on our much more powerful PC (64Gb, 16 cores and RTX3090). So I'm not 100% sure it is due to limited resources. |
Hi guys, could you please confirm that it is running on GPU? If you deploy with Could you provide some logs? |
These are the exact steps that I follow:
|
In my case SAM is working on CPU. To make it work from the first request, which takes 1.5 minutes in my case, in addition to increasing
The second and following requests for the same image were slow not because it was running on CPU, as I worked around this by removing If I may suggest, masks should not be sent by CVAT when they are not needed. Also, in my case the size of a mask was over 100MB, while the size of the image was just 35MB, so some kind of RLE encoding or other compression should be used, I believe. |
Hi @OutSorcerer Actually CVAT operates on RLE-encoded masks, but serverless functions return just bitmaps for now. Moreover, a mask usually relates to only a fragment of an image, so it is better to send only this fragment together with some additional [top, left, width, height] data. @GillesBallegeerVintecc , @OutSorcerer |
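To illustrate why RLE helps with mostly-empty bitmaps, here is a minimal sketch of run-length encoding for a flat 0/1 mask. The function names and the "first run counts zeros" convention are assumptions made for this example; this is not CVAT's actual encoder.

```python
from typing import List


def rle_encode(bits: List[int]) -> List[int]:
    """Encode a flat 0/1 sequence as run lengths.

    Assumed convention for this sketch: the first run counts zeros,
    so it is 0 when the sequence starts with a 1.
    """
    runs: List[int] = []
    current = 0  # value of the run being counted
    count = 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current = bit
            count = 1
    runs.append(count)
    return runs


def rle_decode(runs: List[int]) -> List[int]:
    """Inverse of rle_encode: expand alternating runs of 0s and 1s."""
    bits: List[int] = []
    value = 0
    for run in runs:
        bits.extend([value] * run)
        value = 1 - value
    return bits
```

A row of a million zeros with a short run of ones in the middle collapses to three integers, which is why a bitmap response is so much larger than an RLE one.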
@bsekachev, in my case images are quite huge, 4096 by 9000 pixels, with a bit depth of 8 a single image in BMP format is ~35MB, the mask size from SAM was over 100MB (from the response size on the network tab of a Chrome debugger). |
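The ~35MB figure can be checked with a quick calculation. The float32 line is only a hypothetical comparison: the thread does not say how the mask was serialized, so it should not be read as the explanation for the 100MB response.

```python
WIDTH, HEIGHT = 4096, 9000  # image size quoted in the comment above


def raw_size_mib(width: int, height: int, bytes_per_pixel: int) -> float:
    """Uncompressed size of a width x height raster in MiB."""
    return width * height * bytes_per_pixel / (1024 * 1024)


# 8-bit, single channel: one byte per pixel.
image_mib = raw_size_mib(WIDTH, HEIGHT, 1)

# Hypothetical: the same raster stored as 4-byte values (assumption, not from the thread).
four_byte_mib = raw_size_mib(WIDTH, HEIGHT, 4)

print(f"8-bit image: {image_mib:.1f} MiB, 4 bytes per pixel: {four_byte_mib:.1f} MiB")
```

Either way, a full-resolution bitmap scales with the image area rather than with the size of the selected object.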
The issue with huge masks is gonna be fixed here #6019 |
@bsekachev The images are 3024x4032 JPEG encoded (4.3MB). I just tested with a simple image (1280x1280) and there I got the same result. |
Same issue with same logs... Not working with GPU or CPU. |
I'm also having the same issue. No "AI Tool", neither the SAM interactor nor the YOLO detectors, works on GPU or CPU on my desktop. However, they work fine on my laptop, where I used identical installation steps. The laptop has lesser hardware in every respect than my desktop. My desktop has openvpn and ssh servers installed. They aren't running in docker. I don't think this should affect anything, but I can't think of why else my computer would differ significantly from any other. The operation simply times out, with the error message shown in the following image. Here are logs from Desktop:
Laptop
Supplementary Info
To install, I followed these steps.
|
I'm having the exact same issue on cvat 2.4.3, both with SAM and YOLOv5. Logs seem to indicate that the cvat server is not able to communicate with the serverless container (example for YOLOv5):
Logs of the YOLO container: (SAM is similar to previous comment ) :
|
One update on this topic. I noticed that all the docker containers cvat launches when using serverless functions by default on boot were interfering with my ability to perform an Ubuntu desktop share. I first noticed that I couldn't do this over my openvpn connection. But it seemed like the openvpn connection was being made successfully. So when I killed the openvpn services on both server and client I noticed that even on my home network, I couldn't use a desktop share. When I killed the docker containers, deleted my cvat directory, and rebooted, I was able to use a desktop share just fine. I don't know if this holds any kind of clue to the issue at hand... Though even if it doesn't, if anyone can explain to me why what was happening might happen that would be most welcome! |
<!-- Raise an issue to propose your change (https://github.com/opencv/cvat/issues).
It helps to avoid duplication of efforts from multiple independent contributors.
Discuss your ideas with maintainers to be sure that changes will be approved and merged.
Read the [Contribution guide](https://opencv.github.io/cvat/docs/contributing/). -->

<!-- Provide a general summary of your changes in the Title above -->

### Motivation and context
Resolved #5984
Resolved #6049
Resolved #6041

- Compatible only with ``sam_vit_h_4b8939.pth`` weights. Need to re-export ONNX mask decoder with some custom model changes (see below) to support other weights (or just download them using links below)
- Need to redeploy the serverless function because its interface has been changed.

Decoders for other weights:

sam_vit_l_0b3195.pth: [Download](https://drive.google.com/file/d/1Nb5CJKQm_6s1n3xLSZYso6VNgljjfR-6/view?usp=sharing)
sam_vit_b_01ec64.pth: [Download](https://drive.google.com/file/d/17cZAXBPaOABS170c9bcj9PdQsMziiBHw/view?usp=sharing)

Changes done in ONNX part:
```
git diff scripts/export_onnx_model.py
diff --git a/scripts/export_onnx_model.py b/scripts/export_onnx_model.py
index 8441258..18d5be7 100644
--- a/scripts/export_onnx_model.py
+++ b/scripts/export_onnx_model.py
@@ -138,7 +138,7 @@ def run_export(

     _ = onnx_model(**dummy_inputs)

-    output_names = ["masks", "iou_predictions", "low_res_masks"]
+    output_names = ["masks", "iou_predictions", "low_res_masks", "xtl", "ytl", "xbr", "ybr"]

     with warnings.catch_warnings():
         warnings.filterwarnings("ignore", category=torch.jit.TracerWarning)
bsekachev@DESKTOP-OTBLK26:~/sam$ git diff segment_anything/utils/onnx.py
diff --git a/segment_anything/utils/onnx.py b/segment_anything/utils/onnx.py
index 3196bdf..85729c1 100644
--- a/segment_anything/utils/onnx.py
+++ b/segment_anything/utils/onnx.py
@@ -87,7 +87,15 @@ class SamOnnxModel(nn.Module):
         orig_im_size = orig_im_size.to(torch.int64)
         h, w = orig_im_size[0], orig_im_size[1]
         masks = F.interpolate(masks, size=(h, w), mode="bilinear", align_corners=False)
-        return masks
+        masks = torch.gt(masks, 0).to(torch.uint8)
+        nonzero = torch.nonzero(masks)
+        xindices = nonzero[:, 3:4]
+        yindices = nonzero[:, 2:3]
+        ytl = torch.min(yindices).to(torch.int64)
+        ybr = torch.max(yindices).to(torch.int64)
+        xtl = torch.min(xindices).to(torch.int64)
+        xbr = torch.max(xindices).to(torch.int64)
+        return masks[:, :, ytl:ybr + 1, xtl:xbr + 1], xtl, ytl, xbr, ybr

     def select_masks(
         self, masks: torch.Tensor, iou_preds: torch.Tensor, num_points: int
@@ -132,7 +140,7 @@ class SamOnnxModel(nn.Module):
         if self.return_single_mask:
             masks, scores = self.select_masks(masks, scores, point_coords.shape[1])

-        upscaled_masks = self.mask_postprocessing(masks, orig_im_size)
+        upscaled_masks, xtl, ytl, xbr, ybr = self.mask_postprocessing(masks, orig_im_size)

         if self.return_extra_metrics:
             stability_scores = calculate_stability_score(
@@ -141,4 +149,4 @@ class SamOnnxModel(nn.Module):
             areas = (upscaled_masks > self.model.mask_threshold).sum(-1).sum(-1)
             return upscaled_masks, scores, stability_scores, areas, masks

-        return upscaled_masks, scores, masks
+        return upscaled_masks, scores, masks, xtl, ytl, xbr, ybr
```

### How has this been tested?
<!-- Please describe in detail how you tested your changes.
Include details of your testing environment, and the tests you ran to
see how your change affects other areas of the code, etc. -->

### Checklist
<!-- Go over all the following points, and put an `x` in all the boxes that apply.
If an item isn't applicable for some reason, then ~~explicitly strikethrough~~ the whole
line. If you don't do that, GitHub will show incorrect progress for the pull request.
If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [x] I submit my changes into the `develop` branch
- [x] I have added a description of my changes into the [CHANGELOG](https://github.com/opencv/cvat/blob/develop/CHANGELOG.md) file
- [ ] I have updated the documentation accordingly
- [ ] I have added tests to cover my changes
- [x] I have linked related issues (see [GitHub docs](https://help.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword))
- [x] I have increased versions of npm packages if it is necessary ([cvat-canvas](https://github.com/opencv/cvat/tree/develop/cvat-canvas#versioning), [cvat-core](https://github.com/opencv/cvat/tree/develop/cvat-core#versioning), [cvat-data](https://github.com/opencv/cvat/tree/develop/cvat-data#versioning) and [cvat-ui](https://github.com/opencv/cvat/tree/develop/cvat-ui#versioning))

### License
- [x] I submit _my code changes_ under the same [MIT License](https://github.com/opencv/cvat/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
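The cropping step in the `mask_postprocessing` diff above can be mirrored in plain Python to make the idea easier to follow. This is a sketch on nested lists, not the PR's torch code; `crop_to_bbox`, `paste_back` and the `ValueError` on an empty mask are choices made for this example only.

```python
from typing import List, Tuple

Mask = List[List[int]]


def crop_to_bbox(logits: List[List[float]]) -> Tuple[Mask, int, int, int, int]:
    """Threshold at 0, then crop to the tight bounding box of positive pixels."""
    binary = [[1 if value > 0 else 0 for value in row] for row in logits]
    ys = [y for y, row in enumerate(binary) if any(row)]
    xs = [x for row in binary for x, value in enumerate(row) if value]
    if not ys:
        # Example-only behaviour: an empty mask has no bounding box.
        raise ValueError("mask has no positive pixels")
    xtl, ytl, xbr, ybr = min(xs), ys[0], max(xs), ys[-1]
    # Both corners are inclusive, hence the "+ 1" in the slices.
    cropped = [row[xtl:xbr + 1] for row in binary[ytl:ybr + 1]]
    return cropped, xtl, ytl, xbr, ybr


def paste_back(cropped: Mask, xtl: int, ytl: int, width: int, height: int) -> Mask:
    """Rebuild the full-size binary mask from the fragment and its top-left corner."""
    full = [[0] * width for _ in range(height)]
    for dy, row in enumerate(cropped):
        full[ytl + dy][xtl:xtl + len(row)] = row
    return full
```

Only the fragment and four integers need to be transferred; the receiver can restore the full mask with `paste_back` when it knows the image size.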
In the new implementation the mask isn't sent from the server to the client anymore, only image embeddings. Please also read the required actions in the PR's description if you wish to upgrade SAM. |
For those still having "ERROR django.request: Service Unavailable" issues, I found yaochenglouis's answer at #2641 worked for me. Was just a firewall issue - do "ufw allow" on the port the Nuclio function is listening on. |
@whom-da dawg, thanks for this comment. I finally got things working after struggling for weeks! Because I have openvpn and ssh servers installed on my machine, I enabled the firewall which is disabled in Ubuntu by default. And so I had to issue |
@bsekachev Hello, I am having a problem when using sam. When I click with ai tools, sam loads and then the following error pops up:
And example docker log sam container:
|
You probably did not upgrade CVAT to the latest version (2.4.4): https://opencv.github.io/cvat/docs/administration/advanced/upgrade_guide/ You can also build from the current source code using the following command:
|
I also got SAM working by opening the nuclio port on ufw (though my port was different from the one above, so make sure to use `nuctl get functions` to verify what port you need to open). @bsekachev, maybe this could be added to the docs? |
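Before and after changing firewall rules, it can help to check whether the function's port accepts TCP connections at all. The sketch below uses only the Python standard library; `port_is_reachable` is a name made up for this example, and the host and port you pass in are whatever `nuctl get functions` reports on your machine.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Note that a check run on the host itself may succeed while the CVAT server container is still blocked, so running the same check from inside that container is the more telling test.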
See these several issues. They are connected to each other in some way. The thing is that nuclio has a default timeout of 1 minute. With this change we can force nuclio dashboard not to terminate the connection. nuclio/nuclio#3016 #3301 #6041
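One way to tell whether a 504 comes from this one-minute cutoff rather than from a crashed function is to look at how long the request ran before failing. The helper below is only a heuristic sketch; the names, the 5-second tolerance and the injected clock are assumptions made for this example.

```python
import time
from typing import Callable, Tuple

PROXY_TIMEOUT_S = 60.0  # the one-minute default mentioned above


def timed(call: Callable[[], int],
          clock: Callable[[], float] = time.monotonic) -> Tuple[int, float]:
    """Run call(), which returns an HTTP status code, and measure its duration."""
    start = clock()
    status = call()
    return status, clock() - start


def looks_like_gateway_cutoff(status: int, elapsed_s: float,
                              limit_s: float = PROXY_TIMEOUT_S,
                              tolerance_s: float = 5.0) -> bool:
    """Heuristic: a 504 that arrives close to the timeout limit."""
    return status == 504 and abs(elapsed_s - limit_s) <= tolerance_s
```

A first SAM request that needs about 1.5 minutes on CPU, as reported earlier in the thread, would exceed such a limit, which matches a 504 arriving after roughly a minute.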
For Ubuntu, |
That didn't work for me. I use the serverless Traefik implementation. How do I handle the 60-second timeout here? |
My actions before raising this issue
Steps to Reproduce (for bugs)
1. Downloaded cvat, am on commit ad534b2ac32f57.
2. Installed NVIDIA container toolkit.
3. Followed Serverless Setup steps.
4. Installed nuctl by following guide here. Verified that nuclio is version 1.8.14.
5. Ran command to launch SAM nuctl function as described here:
   cd serverless && ./deploy_gpu.sh pytorch/facebookresearch/sam/nuclio/
6. Checked that nuclio function is running properly: nuctl get function returns that SAM function is in STATE ready.
7. Launched CVAT in serverless mode using:
   docker-compose -f docker-compose.yml -f components/serverless/docker-compose.serverless.yml up -d
8. Open CVAT task, select "Segment Anything" from AI tools, click on image. Get a "Waiting a response from Segment Anything." After a while I get a 504 timeout error:
   Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
   Clicking on the link in the browser console shows me REST api call (image below).

Current Behaviour
It seems that CVAT instance is unable to communicate with the nuclio SAM function. I have verified that SAM function is running in nuclio dashboard.
Your Environment
ad534b2a
:23.0.4