
Add support for Depth Anything (#534)
* Add support for `DPTImageProcessor`

* Add support for depth anything model

* Update list of `depth_anything` models

* Update processor test model id
xenova authored Jan 25, 2024
1 parent 4fb23f2 commit 587adfc
Showing 6 changed files with 92 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -287,6 +287,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -22,6 +22,7 @@
1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
9 changes: 9 additions & 0 deletions scripts/supported_models.py
@@ -408,6 +408,15 @@
'Intel/dpt-large',
],
},
'depth_anything': {
# Depth estimation
# NOTE: requires --task depth-estimation
'depth-estimation': [
'LiheYoung/depth-anything-small-hf',
'LiheYoung/depth-anything-base-hf',
'LiheYoung/depth-anything-large-hf',
],
},
'electra': {
# Feature extraction
'feature-extraction': [
11 changes: 11 additions & 0 deletions src/models.js
@@ -4027,6 +4027,16 @@ export class DPTModel extends DPTPreTrainedModel { }
export class DPTForDepthEstimation extends DPTPreTrainedModel { }
//////////////////////////////////////////////////

//////////////////////////////////////////////////
export class DepthAnythingPreTrainedModel extends PreTrainedModel { }

/**
* Depth Anything Model with a depth estimation head on top (consisting of 3 convolutional layers) e.g. for KITTI, NYUv2.
*/
export class DepthAnythingForDepthEstimation extends DepthAnythingPreTrainedModel { }
//////////////////////////////////////////////////


//////////////////////////////////////////////////
export class GLPNPreTrainedModel extends PreTrainedModel { }

@@ -5391,6 +5401,7 @@ const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([

const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
])

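The new `depth_anything` entry in `MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES` follows the library's type-string dispatch: the `model_type` from a checkpoint's config selects which model class to instantiate. A minimal, self-contained sketch of that lookup pattern; the empty classes and the `modelClassFor` helper are illustrative stand-ins, not the library's actual implementation.

```javascript
// Stand-in classes (the real ones live in src/models.js).
class DPTForDepthEstimation { }
class DepthAnythingForDepthEstimation { }
class GLPNForDepthEstimation { }

// Mirrors the mapping added in this commit: model_type -> [class name, class].
const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
    ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
    ['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
    ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
]);

// Hypothetical helper: resolve the model class for a config's model_type.
function modelClassFor(modelType) {
    const entry = MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES.get(modelType);
    if (!entry) throw new Error(`Unsupported model type: ${modelType}`);
    return entry[1];
}

console.log(modelClassFor('depth_anything').name); // "DepthAnythingForDepthEstimation"
```

Because the mapping is keyed by `model_type`, adding Depth Anything support only requires the one-line `Map` entry shown in the diff above plus the class definitions.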
51 changes: 50 additions & 1 deletion src/processors.js
@@ -164,6 +164,29 @@ function validate_audio_inputs(audio, feature_extractor) {
}
}

/**
* Helper function to constrain a value to be a multiple of a number.
* @param {number} val The value to constrain.
* @param {number} multiple The multiple to round to.
* @param {number} [minVal=0] The minimum allowed value.
* @param {number} [maxVal=null] The maximum allowed value.
* @returns {number} The constrained value.
* @private
*/
function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
let x = Math.round(val / multiple) * multiple;

if (maxVal !== null && x > maxVal) {
x = Math.floor(val / multiple) * multiple;
}

if (x < minVal) {
x = Math.ceil(val / multiple) * multiple;
}

return x;
}

/**
* Base class for feature extractors.
*
@@ -465,7 +488,31 @@ export class ImageFeatureExtractor extends FeatureExtractor {

} else if (size !== undefined && size.width !== undefined && size.height !== undefined) {
// If `width` and `height` are set, resize to those dimensions
return [size.width, size.height];

let newWidth = size.width;
let newHeight = size.height;

// Custom for DPT models
if (this.config.keep_aspect_ratio && this.config.ensure_multiple_of) {

// determine new height and width
let scale_height = size.height / srcHeight;
let scale_width = size.width / srcWidth;

// scale as little as possible
if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
// fit width
scale_height = scale_width;
} else {
// fit height
scale_width = scale_height;
}

newHeight = constraint_to_multiple_of(scale_height * srcHeight, this.config.ensure_multiple_of);
newWidth = constraint_to_multiple_of(scale_width * srcWidth, this.config.ensure_multiple_of);
}

return [newWidth, newHeight];

} else if (this.size_divisibility !== undefined) {
// Rounds the height and width down to the closest multiple of size_divisibility
@@ -699,6 +746,7 @@ export class SegformerFeatureExtractor extends ImageFeatureExtractor {
return toReturn;
}
}
export class DPTImageProcessor extends ImageFeatureExtractor { }
export class BitImageProcessor extends ImageFeatureExtractor { }
export class DPTFeatureExtractor extends ImageFeatureExtractor { }
export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
@@ -1881,6 +1929,7 @@ export class AutoProcessor {
ConvNextImageProcessor,
SegformerFeatureExtractor,
BitImageProcessor,
DPTImageProcessor,
DPTFeatureExtractor,
GLPNFeatureExtractor,
BeitFeatureExtractor,
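The resize logic added to `ImageFeatureExtractor` picks whichever scale factor is closer to 1 (distorting the requested size as little as possible), applies it to both axes to preserve aspect ratio, and then snaps each dimension to a multiple of `ensure_multiple_of`. A self-contained sketch of that computation, assuming a Depth Anything-style config with a 518×518 target size and a multiple of 14; for a 640×480 input this reproduces the 518×686 shape checked in the new processor test.

```javascript
// Snap a value to the nearest multiple, staying within [minVal, maxVal].
function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
    let x = Math.round(val / multiple) * multiple;
    if (maxVal !== null && x > maxVal) x = Math.floor(val / multiple) * multiple;
    if (x < minVal) x = Math.ceil(val / multiple) * multiple;
    return x;
}

// Compute the output [width, height] for keep_aspect_ratio + ensure_multiple_of.
function getResizeOutputSize(srcWidth, srcHeight, size, ensureMultipleOf) {
    let scale_height = size.height / srcHeight;
    let scale_width = size.width / srcWidth;

    // Scale as little as possible: use the factor closer to 1 for both axes.
    if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
        scale_height = scale_width; // fit width
    } else {
        scale_width = scale_height; // fit height
    }

    return [
        constraint_to_multiple_of(scale_width * srcWidth, ensureMultipleOf),
        constraint_to_multiple_of(scale_height * srcHeight, ensureMultipleOf),
    ];
}

// 640×480 input, 518×518 target: the height factor (518/480 ≈ 1.079) is closer
// to 1 than the width factor (518/640 ≈ 0.809), so the image is fit by height.
console.log(getResizeOutputSize(640, 480, { width: 518, height: 518 }, 14)); // [686, 518]
```

Note that 1.079 × 640 ≈ 690.7 rounds down to 686 (= 49 × 14), while 1.079 × 480 = 518 is already a multiple of 14, which is why the tested tensor shape is `[1, 3, 518, 686]`.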
20 changes: 20 additions & 0 deletions tests/processors.test.js
@@ -39,6 +39,7 @@ describe('Processors', () => {
detr: 'facebook/detr-resnet-50',
yolos: 'hustvl/yolos-small-300',
dpt: 'Intel/dpt-hybrid-midas',
dpt_2: 'LiheYoung/depth-anything-small-hf',
glpn: 'vinvino02/glpn-kitti',
nougat: 'facebook/nougat-small',
owlvit: 'google/owlvit-base-patch32',
@@ -407,6 +408,25 @@ describe('Processors', () => {
compare(reshaped_input_sizes, [[224, 224]]);
}
}, MAX_TEST_EXECUTION_TIME);

// DPTImageProcessor
// - tests ensure_multiple_of
// - tests keep_aspect_ratio
it(MODELS.dpt_2, async () => {
const processor = await AutoProcessor.from_pretrained(m(MODELS.dpt_2))

{
const image = await load_image(TEST_IMAGES.cats);
const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image);

compare(pixel_values.dims, [1, 3, 518, 686]);
compare(avg(pixel_values.data), 0.30337387323379517);

compare(original_sizes, [[480, 640]]);
compare(reshaped_input_sizes, [[518, 686]]);
}
}, MAX_TEST_EXECUTION_TIME);

});

describe('Audio processors', () => {