Refer to these links to understand the code:
https://github.com/facebookresearch/sam2
https://pytorch.org/hub/intelisl_midas_v2/
-
Depth Estimation
• Objective: Generate a depth map for the input image to estimate the relative distances of objects.
• Process:
- Load a pre-trained depth estimation model (e.g., MiDaS Small for faster inference).
- Preprocess the image:
  • Resize to match the model’s input dimensions.
  • Scale pixel values to [0, 1] and apply the normalization the model expects (the MiDaS hub transforms handle both).
- Run the depth estimation model to generate the depth map.
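- Optionally invert the depth values. MiDaS predicts relative inverse depth, so in the raw output larger values (brighter pixels) already correspond to closer objects; invert only if later steps expect smaller values to mean nearer.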
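
The steps above can be sketched as follows. This is a minimal sketch, assuming the torch.hub entry points described on the MiDaS hub page linked at the top ("intel-isl/MiDaS", "MiDaS_small", small_transform). The helper names estimate_depth and normalize_depth are illustrative and not taken from the original code. torch is imported lazily so the NumPy helper can be used on its own.

```python
import numpy as np


def estimate_depth(image_rgb):
    """Run MiDaS Small on an RGB uint8 image (H, W, 3); return an (H, W) float array."""
    import torch  # lazy import: only needed when the model is actually run

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

    batch = transform(image_rgb).to(device)  # resizing and normalization
    with torch.no_grad():
        pred = midas(batch)
        # Resize the prediction back to the input resolution.
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1),
            size=image_rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    return pred.cpu().numpy()


def normalize_depth(depth, invert=False):
    """Rescale a depth map to [0, 1]; optionally flip it so large becomes small."""
    depth = np.asarray(depth, dtype=np.float32)
    span = float(depth.max() - depth.min())
    if span == 0.0:
        # Constant map: nothing to rescale.
        return np.zeros_like(depth)
    scaled = (depth - depth.min()) / span
    return 1.0 - scaled if invert else scaled
```

The rescaling only matters for visualization and for choosing a consistent ordering; relative comparisons between regions are unchanged by it, although inversion reverses them.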
-
SAM Mask Generation
• Objective: Segment the image into object masks using the Segment Anything Model (SAM).
• Process:
- Load the SAM model and configure it for automatic mask generation.
- Feed the input image into the model to generate segmentation masks.
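- Each mask is a record with attributes such as segmentation (a binary mask array), bbox (the box around the mask, in XYWH format), and area (the mask size in pixels).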
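
A minimal sketch of this step, assuming the sam2 package from the repository linked at the top (build_sam2 and SAM2AutomaticMaskGenerator). The checkpoint and config arguments are placeholders to be filled with whatever files were downloaded, and summarize_mask is a hypothetical helper added only to show the record layout.

```python
import numpy as np


def generate_masks(image_rgb, checkpoint, model_cfg, device="cpu"):
    """Run automatic mask generation on an RGB uint8 image (H, W, 3)."""
    # Lazy imports: these require the sam2 package to be installed.
    from sam2.build_sam import build_sam2
    from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

    model = build_sam2(model_cfg, checkpoint, device=device)
    generator = SAM2AutomaticMaskGenerator(model)
    return generator.generate(image_rgb)  # list of dicts, one per mask


def summarize_mask(mask):
    """Collect the fields of one mask record that later steps rely on."""
    seg = np.asarray(mask["segmentation"], dtype=bool)
    return {
        "shape": seg.shape,                              # (H, W) of the image
        "bbox_xywh": [float(v) for v in mask["bbox"]],   # x, y, width, height
        "area": int(mask["area"]),                       # reported pixel count
        "pixels": int(seg.sum()),                        # recomputed pixel count
    }
```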
-
YOLO Object Detection
• Objective: Detect apples in the image and validate SAM-generated masks.
• Process:
- Load a pre-trained YOLO model.
- Run YOLO on the input image to detect objects.
- Extract bounding boxes (bbox) and confidence scores for each detected apple.
- Filter detections using a confidence threshold (e.g., conf=0.2).
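
A minimal sketch of the detection step. The text does not say which YOLO implementation is used; this assumes the Ultralytics package and its COCO-pretrained "yolov8n.pt" weights (COCO includes an "apple" class). The helper names detect_objects and keep_label are illustrative.

```python
def detect_objects(image, weights="yolov8n.pt", conf=0.2):
    """Run YOLO and return a list of {"bbox", "conf", "label"} dicts (bbox in XYXY)."""
    from ultralytics import YOLO  # lazy import; assumed YOLO implementation

    model = YOLO(weights)
    result = model(image, conf=conf)[0]  # one result per input image
    detections = []
    for box in result.boxes:
        detections.append({
            "bbox": [float(v) for v in box.xyxy[0].tolist()],
            "conf": float(box.conf[0]),
            "label": result.names[int(box.cls[0])],
        })
    return detections


def keep_label(detections, label="apple", conf=0.2):
    """Keep detections of one class at or above a confidence threshold."""
    return [d for d in detections if d["label"] == label and d["conf"] >= conf]
```

Passing conf to the model already drops low-confidence boxes; keep_label is useful when the threshold is tuned afterwards without rerunning inference.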
-
Filter SAM Masks with YOLO Results
• Objective: Retain only the SAM masks that overlap with YOLO-detected apple bounding boxes.
• Process:
- Convert SAM masks to bounding boxes.
- Scale YOLO bounding boxes if necessary to match SAM mask resolution.
- Calculate the Intersection over Union (IoU) between each SAM mask's bounding box and each YOLO bounding box.
- Retain SAM masks with IoU ≥ threshold (e.g., 0.5).
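
The filtering steps can be sketched in plain Python. This sketch assumes SAM boxes arrive in XYWH format and YOLO boxes in XYXY format, and that a mask is kept when it matches at least one YOLO box; the function names are illustrative.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def scale_box(box, sx, sy):
    """Scale an XYXY box when YOLO and SAM ran at different resolutions."""
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def box_iou(a, b):
    """Intersection over Union of two XYXY boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_masks(masks, yolo_boxes, iou_threshold=0.5):
    """Keep SAM masks whose box overlaps any YOLO box with IoU >= threshold."""
    kept = []
    for mask in masks:
        mask_box = xywh_to_xyxy(mask["bbox"])
        if any(box_iou(mask_box, yb) >= iou_threshold for yb in yolo_boxes):
            kept.append(mask)
    return kept
```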
-
Median Depth Comparison
• Objective: Determine the relative distances of the retained SAM masks based on depth values.
• Process:
- For each filtered SAM mask:
  • Apply the mask to the depth map.
  • Extract depth values corresponding to the mask area.
  • Calculate the median depth for the mask.
- Compare the median depths:
  • Identify masks with the smallest or largest median depth. If the depth map has been inverted so that smaller values mean closer, the smallest median is the nearest mask and the largest is the farthest; with raw MiDaS output (inverse depth) the order is reversed.
- Highlight the selected masks:
  • Use color overlays to visualize the selected masks on the original image.
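
A minimal NumPy sketch of these steps, assuming each mask record carries a boolean "segmentation" array with the same height and width as the depth map. The function names are illustrative. Whether the first or last entry of the ranking is the nearest mask depends on the depth convention in use (raw MiDaS output is inverse depth, so larger means closer unless the map was inverted).

```python
import numpy as np


def median_depth(depth_map, segmentation):
    """Median of the depth values that fall inside the mask."""
    seg = np.asarray(segmentation, dtype=bool)
    if not seg.any():
        return float("nan")  # empty mask: no depth values to summarize
    return float(np.median(np.asarray(depth_map)[seg]))


def rank_by_median_depth(depth_map, masks):
    """Return (median, mask) pairs sorted by ascending median depth."""
    pairs = [(median_depth(depth_map, m["segmentation"]), m) for m in masks]
    return sorted(pairs, key=lambda p: p[0])


def overlay_mask(image_rgb, segmentation, color=(255, 0, 0), alpha=0.5):
    """Blend a solid color over the masked pixels of an RGB uint8 image."""
    out = np.asarray(image_rgb, dtype=np.float32).copy()
    seg = np.asarray(segmentation, dtype=bool)
    out[seg] = (1.0 - alpha) * out[seg] + alpha * np.asarray(color, dtype=np.float32)
    return out.astype(np.uint8)
```

The median is used rather than the mean because it is less sensitive to a few background pixels leaking into a mask.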
Pipeline Summary
1. Input Image: The original image is used as input for MiDaS, SAM, and YOLO.
2. Depth Map: Generated using MiDaS to estimate distances.
3. SAM Masks: Automatically segmented using SAM.
4. YOLO Filtering: YOLO bounding boxes are used to validate SAM masks.
5. Median Depth: Filtered masks are compared using their median depth values.
Key Parameters to Tune
• YOLO Confidence Threshold (conf): Lower values keep more detections, including more false positives; higher values keep only confident detections.
• IoU Threshold: Controls how closely SAM masks must match YOLO detections.
• Depth Map Inversion: Invert depth values for visualization if needed; note that inversion also flips which median depth counts as nearest.