Add support for Depth Anything #534

Merged · 5 commits · Jan 25, 2024
1 change: 1 addition & 0 deletions README.md
@@ -287,6 +287,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -22,6 +22,7 @@
1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
9 changes: 9 additions & 0 deletions scripts/supported_models.py
@@ -408,6 +408,15 @@
'Intel/dpt-large',
],
},
'depth_anything': {
# Depth estimation
# NOTE: requires --task depth-estimation
'depth-estimation': [
'LiheYoung/depth-anything-small-hf',
'LiheYoung/depth-anything-base-hf',
'LiheYoung/depth-anything-large-hf',
],
},
'electra': {
# Feature extraction
'feature-extraction': [
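
The three checkpoints registered above can be exercised end-to-end through the existing `depth-estimation` pipeline task. A minimal sketch, assuming ONNX weights are resolvable under the listed checkpoint id (in practice a converted mirror may be needed) and that the pipeline returns the same `predicted_depth` / `depth` fields as the existing DPT and GLPN paths:

```js
import { pipeline } from '@xenova/transformers';

// Depth-estimation pipeline backed by the newly supported model type.
const depth_estimator = await pipeline('depth-estimation', 'LiheYoung/depth-anything-small-hf');

// Any image URL or local path works here; this one is a placeholder.
const url = 'https://example.com/cat.jpg';
const { predicted_depth, depth } = await depth_estimator(url);

console.log(predicted_depth.dims); // raw depth map at the resized input resolution
// `depth` is a single-channel image rescaled to [0, 255] for visualisation.
```
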
11 changes: 11 additions & 0 deletions src/models.js
@@ -4027,6 +4027,16 @@ export class DPTModel extends DPTPreTrainedModel { }
export class DPTForDepthEstimation extends DPTPreTrainedModel { }
//////////////////////////////////////////////////

//////////////////////////////////////////////////
export class DepthAnythingPreTrainedModel extends PreTrainedModel { }

/**
* Depth Anything Model with a depth estimation head on top (consisting of 3 convolutional layers) e.g. for KITTI, NYUv2.
*/
export class DepthAnythingForDepthEstimation extends DepthAnythingPreTrainedModel { }
//////////////////////////////////////////////////


//////////////////////////////////////////////////
export class GLPNPreTrainedModel extends PreTrainedModel { }

@@ -5391,6 +5401,7 @@ const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([

const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
])

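
Below the pipeline level, the new mapping entry is what lets the auto classes route a `depth_anything` config to `DepthAnythingForDepthEstimation`. A rough sketch of that path; the `AutoModelForDepthEstimation` import and the `predicted_depth` output name mirror the existing DPT flow and are assumptions on my part, not part of this diff:

```js
import { AutoProcessor, AutoModelForDepthEstimation, RawImage } from '@xenova/transformers';

const model_id = 'LiheYoung/depth-anything-small-hf';

// A config with `model_type: "depth_anything"` should now resolve to
// DepthAnythingForDepthEstimation via MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES.
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForDepthEstimation.from_pretrained(model_id);

const image = await RawImage.read('cat.jpg'); // placeholder input
const inputs = await processor(image);
const { predicted_depth } = await model(inputs);
console.log(predicted_depth.dims);
```
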
51 changes: 50 additions & 1 deletion src/processors.js
@@ -164,6 +164,29 @@ function validate_audio_inputs(audio, feature_extractor) {
}
}

/**
* Helper function to constrain a value to be a multiple of a number.
* @param {number} val The value to constrain.
* @param {number} multiple The number to constrain to.
* @param {number} [minVal=0] The minimum value to constrain to.
* @param {number} [maxVal=null] The maximum value to constrain to.
* @returns {number} The constrained value.
* @private
*/
function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
let x = Math.round(val / multiple) * multiple;

if (maxVal !== null && x > maxVal) {
x = Math.floor(val / multiple) * multiple;
}

if (x < minVal) {
x = Math.ceil(val / multiple) * multiple;
}

return x;
}
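
For reference, a few hand-checked calls showing how the helper snaps to the nearest multiple and falls back to flooring or ceiling when the optional bounds are violated (illustrative only; the helper is module-private and these lines are not part of the diff):

```js
constraint_to_multiple_of(518, 14);          // 518 (already a multiple of 14)
constraint_to_multiple_of(690.67, 14);       // 686 (nearest multiple of 14)
constraint_to_multiple_of(695, 14, 0, 690);  // 686 (700 exceeds maxVal, so floor instead)
constraint_to_multiple_of(5, 14, 14);        // 14  (0 falls below minVal of 14, so ceil instead)
```
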

/**
* Base class for feature extractors.
*
@@ -465,7 +488,31 @@ export class ImageFeatureExtractor extends FeatureExtractor {

} else if (size !== undefined && size.width !== undefined && size.height !== undefined) {
// If `width` and `height` are set, resize to those dimensions
return [size.width, size.height];

let newWidth = size.width;
let newHeight = size.height;

// Custom for DPT models
if (this.config.keep_aspect_ratio && this.config.ensure_multiple_of) {

// determine new height and width
let scale_height = size.height / srcHeight;
let scale_width = size.width / srcWidth;

// scale as little as possible
if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
// fit width
scale_height = scale_width;
} else {
// fit height
scale_width = scale_height;
}

newHeight = constraint_to_multiple_of(scale_height * srcHeight, this.config.ensure_multiple_of);
newWidth = constraint_to_multiple_of(scale_width * srcWidth, this.config.ensure_multiple_of);
}

return [newWidth, newHeight];

} else if (this.size_divisibility !== undefined) {
// Rounds the height and width down to the closest multiple of size_divisibility
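
To make the new branch concrete: the Depth Anything image processor configs appear to use `size: { width: 518, height: 518 }`, `keep_aspect_ratio: true` and `ensure_multiple_of: 14` (an assumption about the checkpoints' `preprocessor_config.json`, not something shown in this diff), so a 640×480 input is scaled to fit its height and then snapped to multiples of 14:

```js
// Worked example for a 640x480 (width x height) image, assuming
// size = { width: 518, height: 518 } and ensure_multiple_of = 14.
const [srcWidth, srcHeight] = [640, 480];

let scale_height = 518 / srcHeight; // ~1.0792
let scale_width = 518 / srcWidth;   // ~0.8094

// |1 - 0.8094| > |1 - 1.0792|, so "fit height": both axes reuse scale_height.
scale_width = scale_height;

const newHeight = constraint_to_multiple_of(scale_height * srcHeight, 14); // 518
const newWidth = constraint_to_multiple_of(scale_width * srcWidth, 14);    // 686
// -> [686, 518], matching the [1, 3, 518, 686] pixel_values asserted in the new test below.
```
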
@@ -699,6 +746,7 @@ export class SegformerFeatureExtractor extends ImageFeatureExtractor {
return toReturn;
}
}
export class DPTImageProcessor extends ImageFeatureExtractor { }
export class BitImageProcessor extends ImageFeatureExtractor { }
export class DPTFeatureExtractor extends ImageFeatureExtractor { }
export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
@@ -1881,6 +1929,7 @@ export class AutoProcessor {
ConvNextImageProcessor,
SegformerFeatureExtractor,
BitImageProcessor,
DPTImageProcessor,
DPTFeatureExtractor,
GLPNFeatureExtractor,
BeitFeatureExtractor,
20 changes: 20 additions & 0 deletions tests/processors.test.js
@@ -39,6 +39,7 @@ describe('Processors', () => {
detr: 'facebook/detr-resnet-50',
yolos: 'hustvl/yolos-small-300',
dpt: 'Intel/dpt-hybrid-midas',
dpt_2: 'LiheYoung/depth-anything-small-hf',
glpn: 'vinvino02/glpn-kitti',
nougat: 'facebook/nougat-small',
owlvit: 'google/owlvit-base-patch32',
@@ -407,6 +408,25 @@ describe('Processors', () => {
compare(reshaped_input_sizes, [[224, 224]]);
}
}, MAX_TEST_EXECUTION_TIME);

// DPTImageProcessor
// - tests ensure_multiple_of
// - tests keep_aspect_ratio
it(MODELS.dpt_2, async () => {
const processor = await AutoProcessor.from_pretrained(m(MODELS.dpt_2))

{
const image = await load_image(TEST_IMAGES.cats);
const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image);

compare(pixel_values.dims, [1, 3, 518, 686]);
compare(avg(pixel_values.data), 0.30337387323379517);

compare(original_sizes, [[480, 640]]);
compare(reshaped_input_sizes, [[518, 686]]);
}
}, MAX_TEST_EXECUTION_TIME);

});

describe('Audio processors', () => {