ground truth depth is actually distance #9
Comments
Hi @sniklaus, thanks for this great visualization. You're absolutely correct that our `depth_meters` images actually store the Euclidean distance from the camera center to each point, rather than planar depth. But this is a great reminder to be careful when interpreting our depth data. @sniklaus, if you have a self-contained code snippet to convert our distance images into planar depth images, please post it here.
Thanks for chiming in and for the clarifications! I used the following to convert the distance to depth; it expects the following variables to be defined: `intWidth` and `intHeight` (image dimensions in pixels), `fltFocal` (focal length in pixels), and `npyDistance` (the distance image in meters).
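A minimal sketch of that conversion, assuming a principal point at the image center and the variable names referenced later in this thread (`intWidth`, `intHeight`, `fltFocal`, `npyDistance`); the concrete values below are placeholders, not intrinsics taken from any particular scene:

```python
import numpy as np

# Placeholder intrinsics; use the values that ship with each scene.
intWidth = 1024    # image width in pixels
intHeight = 768    # image height in pixels
fltFocal = 886.81  # focal length in pixels

# npyDistance: HxW array of Euclidean distances in meters from the camera
# center to each surface point (dummy data here for illustration).
npyDistance = np.ones([intHeight, intWidth], np.float32)

# For every pixel, build the vector from the camera center to that pixel's
# location on the image plane, measured in pixels, with the principal point
# at the image center.
npyImageplaneX = np.linspace((-0.5 * intWidth) + 0.5, (0.5 * intWidth) - 0.5, intWidth) \
    .reshape(1, intWidth).repeat(intHeight, 0).astype(np.float32)[:, :, None]
npyImageplaneY = np.linspace((-0.5 * intHeight) + 0.5, (0.5 * intHeight) - 0.5, intHeight) \
    .reshape(intHeight, 1).repeat(intWidth, 1).astype(np.float32)[:, :, None]
npyImageplaneZ = np.full([intHeight, intWidth, 1], fltFocal, np.float32)
npyImageplane = np.concatenate([npyImageplaneX, npyImageplaneY, npyImageplaneZ], 2)

# Similar triangles: planar depth = distance * focal / ||(x, y, focal)||.
npyDepth = npyDistance / np.linalg.norm(npyImageplane, 2, 2) * fltFocal
```

In practice, `npyDistance` would be loaded from the released HDF5 files, and `fltFocal` set to the focal length actually used for the scene rather than hard-coded.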
Sweet! 😀
Hi @sniklaus, is the focal length the same for all images in this dataset?
@Tord-Zhang When computing planar depth images for typical downstream learning applications, it is a reasonable approximation to assume that all images have the same focal length. However, if you want exactly perfect planar depth data, you need to account for the fact that our camera intrinsics can vary in minor ways for each scene. More specifically, due to minor tilt-shift photography effects that can vary per-scene, the image plane is not guaranteed to be exactly orthogonal to the camera-space z-axis.

So what does it mean to compute a "planar" depth image in these cases? What is the exact quantity that you want to store at each pixel in your "planar" depth image? In these cases, the solution that makes the most sense to me is to warp the scene geometry in a way that exactly inverts the tilt-shift photography effects. If you do this correctly, the warped scene geometry viewed through a typical pinhole camera will produce an identical image to the non-warped scene geometry viewed through a tilt-shift camera. At this point, you can compute the planar depth image as usual using the warped scene geometry.
@sniklaus, how did you derive the formula? I understand that `intWidth`, `intHeight`, and the focal length `fltFocal` are all measured in pixels, and that `npyDistance` is the metric distance in meters from the camera center to the 3D point. What exactly is the depth that you compute here? If it is the distance from the image plane to the point, I would have expected a different formula.
@lholzherr The code snippet you're referring to does not attempt to compute the Euclidean distance to the image plane. The quantity our released `depth_meters` images already store is the Euclidean distance from the camera center to each point, and the snippet converts that distance image into a planar depth image, where each pixel stores the distance along the camera-space z-axis.

To illustrate the difference between these two representations, suppose you are 1 meter away from a flat wall, and you capture an image looking directly at the wall. If you capture a planar depth image, it will contain a value of 1 meter at every pixel. If you capture a Euclidean distance image, it will contain a value of 1 meter at the center pixel, but will have different (slightly larger) values at every other pixel.

As an aside, I suspect that planar depth images are better-behaved inputs to convolutional neural networks, as compared to distance images. This is because CNNs implicitly assume that the statistics of image patches are stationary as you move across an image, and the statistics of planar depth images are more stationary than those of distance images.

To convert a distance image to a planar depth image, we use the following reasoning: the camera center, a pixel's location on the image plane, and the 3D point it observes all lie on one ray, so by similar triangles the ratio of planar depth to Euclidean distance equals the ratio of the focal length to the length of that pixel's image-plane vector (x, y, focal).

Using this reasoning, you should try to derive @sniklaus's code snippet above, and either convince yourself it is correct, or post here if you think it is incorrect.
Thanks @mikeroberts3000, I was able to derive the formula. If anyone else is wondering, this is the derivation:
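A sketch of that derivation, following the similar-triangles reasoning above (here $d$ is the stored distance `npyDistance`, $f$ is the focal length `fltFocal` in pixels, $(x, y)$ are a pixel's offsets from the principal point in pixels, and $z$ is the planar depth we want):

$$
\frac{z}{d} = \frac{f}{\lVert (x,\; y,\; f) \rVert}
\qquad\Longrightarrow\qquad
z = d \cdot \frac{f}{\sqrt{x^{2} + y^{2} + f^{2}}}
$$

which matches the snippet above, `npyDepth = npyDistance / np.linalg.norm(npyImageplane, 2, 2) * fltFocal`, since each pixel of `npyImageplane` holds the vector $(x, y, f)$.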
The following line doesn't compute the depth, but the distance from a point to the camera. As such, the provided `depth_meters` files are actually `distance_meters` instead. Not a big deal as long as you are aware of it, since one can convert one to the other using the focal length. But if you aren't aware of it, you may get severely wrong results, as shown in the screenshots below.

ml-hypersim/code/python/tools/generate_hdf5_from_vrimg.py, line 302 (at commit 9c9be19)
If you use the provided depth (which is actually distance) to render the image as a point cloud, you will get distortions:

If you instead convert the provided depth to the actual depth and then render the image as a point cloud from that, the distortions disappear:
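For completeness, a minimal sketch (not the code used for the screenshots; all names are placeholders) of how one might back-project the released distance images directly into a camera-space point cloud, assuming a principal point at the image center and a camera frame with +z pointing into the scene:

```python
import numpy as np

def distance_to_pointcloud(npyDistance, fltFocal):
    """Back-project an HxW distance image (meters from the camera center)
    into camera-space 3D points, given the focal length in pixels."""
    intHeight, intWidth = npyDistance.shape

    # Pixel-center offsets from the principal point, in pixels.
    npyX, npyY = np.meshgrid(
        np.linspace((-0.5 * intWidth) + 0.5, (0.5 * intWidth) - 0.5, intWidth),
        np.linspace((-0.5 * intHeight) + 0.5, (0.5 * intHeight) - 0.5, intHeight))

    # Unit-length viewing ray through every pixel.
    npyRays = np.stack([npyX, npyY, np.full_like(npyX, fltFocal)], axis=2)
    npyRays = npyRays / np.linalg.norm(npyRays, axis=2, keepdims=True)

    # Each stored value is a Euclidean distance, so scaling the unit ray by it
    # gives the correct 3D point. Treating the stored value as planar depth
    # (i.e. as a camera-space z-coordinate) is what produces the distorted
    # point cloud shown above.
    return npyRays * npyDistance[:, :, None]
```

The z-channel of the returned points is exactly the planar depth produced by the conversion snippet earlier in the thread.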