ground truth depth is actually distance #9

Closed
sniklaus opened this issue Jan 5, 2021 · 8 comments

@sniklaus
Contributor

sniklaus commented Jan 5, 2021

The following line doesn't compute the depth, but the distance from a point to the camera. As such, the provided depth_meters files are actually distance_meters instead. This is not a big deal as long as you are aware of it, since one can convert one to the other using the focal length. But if you aren't aware of it, you may get severely wrong results, as shown in the screenshots below.

depth = linalg.norm(position - camera_position[newaxis,newaxis,:], axis=2)

If you use the provided depth (which is actually distance) to render the image as a point cloud you will get distortions:
image

If you instead convert the provided depth to the actual depth and then render the image as a point cloud from that:
image
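For readers without the screenshots, here is a minimal sketch of the two point-cloud constructions being compared. It assumes an ideal pinhole camera with the principal point at the image center; the function names and the synthetic flat wall are made up for illustration and are not part of the dataset tooling.

```python
import numpy as np

def pixel_rays(width, height, focal):
    """Camera-space ray (x, y, focal) through each pixel center, in pixels."""
    xs = np.arange(width, dtype=np.float64) - 0.5 * (width - 1)
    ys = np.arange(height, dtype=np.float64) - 0.5 * (height - 1)
    x, y = np.meshgrid(xs, ys)  # both have shape (height, width)
    return np.stack([x, y, np.full_like(x, focal)], axis=-1)

def points_from_distance(distance, focal):
    """Correct unprojection: unit-length ray scaled by the Euclidean distance."""
    height, width = distance.shape
    rays = pixel_rays(width, height, focal)
    unit_rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    return unit_rays * distance[..., None]

def points_treating_distance_as_depth(distance, focal):
    """Incorrect unprojection: pretends the stored value is the z-coordinate."""
    height, width = distance.shape
    rays = pixel_rays(width, height, focal)
    return rays / focal * distance[..., None]

# Synthetic flat wall 2 m in front of the camera, stored as distances.
focal = 886.81
rays = pixel_rays(1024, 768, focal)
wall_distance = 2.0 * np.linalg.norm(rays, axis=-1) / focal

good = points_from_distance(wall_distance, focal)
bad = points_treating_distance_as_depth(wall_distance, focal)
print("z range, correct:  ", good[..., 2].min(), good[..., 2].max())
print("z range, incorrect:", bad[..., 2].min(), bad[..., 2].max())
```

With the correct unprojection the wall stays flat (constant z), while the incorrect one pushes points away from the camera toward the image borders, which is the kind of distortion described above.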

@mikeroberts3000
Collaborator

mikeroberts3000 commented Jan 5, 2021

Hi @sniklaus, thanks for this great visualization. You're absolutely correct that depth_meters should really be called distance_meters. I apologize for this unclear naming convention 😅 In our defense, we document the correct interpretation for this data in our README.

frame.IIII.depth_meters.hdf5    # Euclidean distances in meters to the optical center of the camera

But this is a great reminder to be careful when interpreting our depth_meters images. We also provide position images, where the value at each pixel is a position in world-space, which can be projected into image-space using whatever convention you prefer. Note that our position images are given in asset units, i.e., not meters.
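As a rough sketch of how a position image could be turned into planar depth: subtract the camera position, rotate into camera space, and take the component along the viewing direction. Everything named here is an assumption for illustration: `R_world_from_cam` is a hypothetical 3x3 rotation whose columns are the camera axes in world space, `meters_per_asset_unit` is a hypothetical per-scene scale factor, and the default viewing direction (camera looks down its negative z-axis) is a common graphics convention that should be checked against the dataset documentation.

```python
import numpy as np

def planar_depth_from_positions(position, camera_position, R_world_from_cam,
                                meters_per_asset_unit=1.0,
                                forward_axis_cam=(0.0, 0.0, -1.0)):
    """Planar depth in meters from an (H, W, 3) world-space position image.

    All parameter names and the forward-axis default are illustrative
    assumptions, not the dataset's documented API.
    """
    # Offset of each surface point from the camera, still in world space.
    offset = position - camera_position[np.newaxis, np.newaxis, :]
    # p_cam = R^T (p_world - c); in row-vector form this is offset @ R.
    offset_cam = offset @ R_world_from_cam
    # Component along the viewing direction, in asset units.
    forward = np.asarray(forward_axis_cam, dtype=offset_cam.dtype)
    depth_asset_units = offset_cam @ forward
    return depth_asset_units * meters_per_asset_unit

# Toy usage: camera at the origin, identity rotation, every point 3 units ahead.
position = np.zeros((2, 3, 3))
position[..., 2] = -3.0
depth = planar_depth_from_positions(position, np.zeros(3), np.eye(3),
                                    meters_per_asset_unit=0.5)
print(depth)
```

Passing a different `forward_axis_cam` adapts the sketch to other camera conventions without changing the rest of the computation.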

@sniklaus, if you have a self-contained code snippet to convert our depth_meters or position images into planar depth images that are more useful in your downstream application, please feel free to post it here 😀

@sniklaus
Contributor Author

sniklaus commented Jan 5, 2021

Thanks for chiming in and for the clarifications! I used the following to convert the distance to depth. It expects the following variables to be defined: intWidth (1024), intHeight (768), fltFocal (886.81), and npyDistance (from the depth_meters.hdf5 file).

import numpy

# Camera-space ray through each pixel center: (x, y, focal), all in pixels.
npyImageplaneX = numpy.linspace((-0.5 * intWidth) + 0.5, (0.5 * intWidth) - 0.5, intWidth).reshape(1, intWidth).repeat(intHeight, 0).astype(numpy.float32)[:, :, None]
npyImageplaneY = numpy.linspace((-0.5 * intHeight) + 0.5, (0.5 * intHeight) - 0.5, intHeight).reshape(intHeight, 1).repeat(intWidth, 1).astype(numpy.float32)[:, :, None]
npyImageplaneZ = numpy.full([intHeight, intWidth, 1], fltFocal, numpy.float32)
npyImageplane = numpy.concatenate([npyImageplaneX, npyImageplaneY, npyImageplaneZ], 2)

# Planar depth = distance * focal / length of the pixel's ray.
npyDepth = npyDistance / numpy.linalg.norm(npyImageplane, 2, 2) * fltFocal
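For convenience, the same conversion can be wrapped in a pair of functions that go in both directions. This is only a sketch under the same assumptions as the snippet above (pinhole camera, principal point at the image center, focal length in pixels); the function names are made up for illustration, and random data stands in for a real depth_meters image.

```python
import numpy as np

def ray_length_over_focal(height, width, focal):
    """Per-pixel ratio |(x, y, focal)| / focal for a centered principal point."""
    x = np.arange(width, dtype=np.float64) - 0.5 * (width - 1)
    y = np.arange(height, dtype=np.float64) - 0.5 * (height - 1)
    # Broadcasting builds the (height, width) grid without repeating arrays.
    return np.sqrt(x[None, :] ** 2 + y[:, None] ** 2 + focal ** 2) / focal

def distance_to_depth(distance, focal):
    """Euclidean distance to the optical center -> planar depth."""
    return distance / ray_length_over_focal(*distance.shape, focal)

def depth_to_distance(depth, focal):
    """Planar depth -> Euclidean distance to the optical center."""
    return depth * ray_length_over_focal(*depth.shape, focal)

# Round trip on random stand-in data with the image size quoted above.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=(768, 1024))
depth = distance_to_depth(distance, 886.81)
assert np.allclose(depth_to_distance(depth, 886.81), distance)
```

Planar depth is never larger than the distance, so `depth <= distance` at every pixel is a cheap sanity check after converting real data.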

@mikeroberts3000
Collaborator

Sweet! 😀

@Tord-Zhang

@sniklaus Hi, is the focal length the same for all images in this dataset?

@mikeroberts3000
Collaborator

mikeroberts3000 commented Apr 24, 2022

@Tord-Zhang When computing planar depth images for typical downstream learning applications, it is a reasonable approximation to assume that all images have the same focal length.

However, if you want exactly perfect planar depth data, you need to account for the fact that our camera intrinsics can vary in minor ways for each scene. More specifically, due to minor tilt-shift photography effects that can vary per-scene, the image plane is not guaranteed to be exactly orthogonal to the camera-space z-axis. So what does it mean to compute a "planar" depth image in these cases? What is the exact quantity that you want to store at each pixel in your "planar" depth image?

In these cases, the solution that makes the most sense to me is to warp the scene geometry in a way that exactly inverts the tilt-shift photography effects. If you do this correctly, the warped scene geometry viewed through a typical pinhole camera will produce an identical image to the non-warped scene geometry viewed through a tilt-shift camera. At this point, you can compute the planar depth image as usual using the warped scene geometry.

See here and here for a more detailed discussion.

@lholzherr

@sniklaus, how did you derive the formula? I understand that intWidth, intHeight, and the focal length fltFocal are all measured in pixels, and that npyDistance is the metric distance in meters from the camera center to the 3D point. What exactly is the depth that you compute here? If it is the distance from the image plane to the point, I would have expected a formula like:
npyDistance - numpy.linalg.norm(npyImageplane, 2, 2) * ratio_meters_per_pixels

@mikeroberts3000
Collaborator

@lholzherr The code snippet you're referring to does not attempt to compute a Euclidean distance. Our depth images already store Euclidean distances, measured to the optical center of the camera. Instead, the code snippet is attempting to compute planar depth, i.e., the distance along the axis in camera-space that is orthogonal to the image plane.

To illustrate the difference between these two representations, suppose you are 1 meter away from a flat wall, and you capture an image looking directly at the wall. If you capture a planar depth image, it will contain a value of 1 meter at every pixel. If you capture a Euclidean distance image, it will contain a value of 1 meter at the center pixel, but will have different (slightly larger) values at every other pixel.
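The wall example can be reproduced numerically with a short sketch. It assumes an ideal pinhole camera with a centered principal point and reuses the image size and focal length quoted earlier in this thread; the variable names are illustrative.

```python
import numpy as np

width, height, focal = 1024, 768, 886.81  # values quoted earlier in this thread
wall_depth_m = 1.0  # wall is 1 m away and parallel to the image plane

# Pixel-center offsets from the principal point, in pixels.
x = np.arange(width) - 0.5 * (width - 1)
y = np.arange(height) - 0.5 * (height - 1)
ray_length = np.sqrt(x[None, :] ** 2 + y[:, None] ** 2 + focal ** 2)

# Planar depth is constant; distance grows with the angle from the optical axis.
planar_depth = np.full((height, width), wall_depth_m)
distance = planar_depth * ray_length / focal

print("planar depth, center / corner:",
      planar_depth[height // 2, width // 2], planar_depth[0, 0])
print("distance,     center / corner:",
      distance[height // 2, width // 2], distance[0, 0])
```

With these intrinsics the corner distance comes out roughly 23% larger than the center value, so the gap between the two representations is not negligible for wide fields of view.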

As an aside, I suspect that planar depth images are better-behaved inputs to convolutional neural networks, as compared to distance images. This is because CNNs implicitly assume that the statistics of image patches are stationary as you move across an image, and the statistics of planar depth images are more stationary than distance images.

To convert a distance image to a planar depth image, we use the following reasoning.

  • In the Hypersim data, we know the ray in camera-space that corresponds to each pixel.
  • For each pixel, we also know the Euclidean distance to its observed surface because this is the quantity stored in our depth images. A better name for these images would be distance images.
  • Therefore, we know the camera-space position of the observed surface at each pixel. It is simply the pixel's camera-space ray, which we know, normalized and scaled by the distance to the surface, which we also know.
  • Given a camera-space position, the planar depth of that position is simply its z-coordinate (or whatever coordinate is orthogonal to the image plane).
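The four steps above can be sketched directly in NumPy. This is an illustration under stated assumptions (ideal pinhole camera, principal point at the image center, camera-space z-axis orthogonal to the image plane), and the function names are hypothetical.

```python
import numpy as np

def camera_space_positions(distance, focal):
    """Camera-space position of the observed surface at each pixel."""
    height, width = distance.shape
    # Step 1: the camera-space ray for each pixel, (u, v, focal) in pixels.
    u = np.arange(width, dtype=np.float64) - 0.5 * (width - 1)
    v = np.arange(height, dtype=np.float64) - 0.5 * (height - 1)
    rays = np.empty((height, width, 3))
    rays[..., 0] = u[None, :]
    rays[..., 1] = v[:, None]
    rays[..., 2] = focal
    # Step 2: the Euclidean distance per pixel is the `distance` argument.
    # Step 3: normalize each ray and scale it by the distance.
    unit_rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    return unit_rays * distance[..., None]

def planar_depth(distance, focal):
    """Step 4: planar depth is the z-coordinate of the camera-space position."""
    return camera_space_positions(distance, focal)[..., 2]

# Toy usage on a small constant-distance image.
distance = np.full((4, 6), 2.0)
positions = camera_space_positions(distance, focal=5.0)
depth = planar_depth(distance, focal=5.0)
print(depth)
```

Because the ray is normalized before scaling, the pixel units of the ray cancel and the resulting positions carry the units of the distance image.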

Using this reasoning, you should try to derive @sniklaus's code snippet above, and either convince yourself it is correct, or post here if you think it is incorrect.

@lholzherr

Thanks @mikeroberts3000, I was able to derive the formula. If anyone else is wondering, this is the derivation:
(1) The orthogonal distance is the dot product of the 3D point in the camera frame with the unit vector in the camera z-direction.
(2) The 3D point in the camera frame is given by the distance from the distance image multiplied by the normalized (u, v, f) vector. Note that even though u, v, and f are in pixels and the distance is in meters, the equation is still correct because the vector is normalized and therefore unitless.
(3) Combining (1) and (2) gives the equation shown in the post of @sniklaus.
Note: everything is measured from the camera center and not from the image plane.
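Written out as equations, the derivation might look as follows. The symbols are introduced here for illustration: $(u, v)$ is the pixel offset from the image center, $f$ is the focal length in pixels, $d$ is the value from the distance image, and $z$ is the planar depth. The principal point is assumed to be at the image center, as in the snippet.

```latex
\begin{align}
  \mathbf{r} &= (u,\ v,\ f)^\top
    && \text{camera-space ray through the pixel} \\
  \mathbf{p} &= d \, \frac{\mathbf{r}}{\lVert \mathbf{r} \rVert}
    && \text{3D point at distance } d \text{ along the normalized ray} \\
  z &= \mathbf{e}_z \cdot \mathbf{p}
     = \frac{d \, f}{\sqrt{u^2 + v^2 + f^2}}
    && \text{with } \mathbf{e}_z = (0,\ 0,\ 1)^\top
\end{align}
```

The last line corresponds term by term to `npyDistance / numpy.linalg.norm(npyImageplane, 2, 2) * fltFocal`.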
image
