diff --git a/README.md b/README.md
index 817d821cd80c..8b4fbb1821cf 100644
--- a/README.md
+++ b/README.md
@@ -89,8 +89,8 @@ Here are a few examples:
 
 In Natural Language Processing:
 - [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
-- [Name Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
-- [Text generation with GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+)
+- [Named Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
+- [Text generation with Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 - [Natural Language Inference with RoBERTa](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal)
 - [Summarization with BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct)
 - [Question answering with DistilBERT](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species)
@@ -100,22 +100,26 @@ In Computer Vision:
 - [Image classification with ViT](https://huggingface.co/google/vit-base-patch16-224)
 - [Object Detection with DETR](https://huggingface.co/facebook/detr-resnet-50)
 - [Semantic Segmentation with SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512)
-- [Panoptic Segmentation with MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco)
-- [Depth Estimation with DPT](https://huggingface.co/docs/transformers/model_doc/dpt)
+- [Panoptic Segmentation with Mask2Former](https://huggingface.co/facebook/mask2former-swin-large-coco-panoptic)
+- [Depth Estimation with Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)
 - [Video Classification with VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)
 - [Universal Segmentation with OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large)
 
 In Audio:
-- [Automatic Speech Recognition with Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h)
+- [Automatic Speech Recognition with Whisper](https://huggingface.co/openai/whisper-large-v3)
 - [Keyword Spotting with Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks)
 - [Audio Classification with Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593)
 
 In Multimodal tasks:
 - [Table Question Answering with TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq)
 - [Visual Question Answering with ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa)
-- [Zero-shot Image Classification with CLIP](https://huggingface.co/openai/clip-vit-large-patch14)
+- [Image captioning with LLaVa](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
+- [Zero-shot Image Classification with SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384)
 - [Document Question Answering with LayoutLM](https://huggingface.co/impira/layoutlm-document-qa)
 - [Zero-shot Video Classification with X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)
+- [Zero-shot Object Detection with OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2)
+- [Zero-shot Image Segmentation with CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)
+- [Automatic Mask Generation with SAM](https://huggingface.co/docs/transformers/model_doc/sam)
 
 ## 100 projects using Transformers
 
diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md
index 7af6bc3dc384..03e849a6c79d 100644
--- a/docs/source/en/tasks/zero_shot_object_detection.md
+++ b/docs/source/en/tasks/zero_shot_object_detection.md
@@ -52,7 +52,7 @@ for zero-shot object detection from a [checkpoint on the Hugging Face Hub](https
 ```python
 >>> from transformers import pipeline
 
->>> checkpoint = "google/owlvit-base-patch32"
+>>> checkpoint = "google/owlv2-base-patch16-ensemble"
 >>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection")
 ```
 
diff --git a/src/transformers/pipelines/depth_estimation.py b/src/transformers/pipelines/depth_estimation.py
index bd6bb0d0db9f..c6431a499717 100644
--- a/src/transformers/pipelines/depth_estimation.py
+++ b/src/transformers/pipelines/depth_estimation.py
@@ -29,7 +29,7 @@ class DepthEstimationPipeline(Pipeline):
     ```python
     >>> from transformers import pipeline
 
-    >>> depth_estimator = pipeline(task="depth-estimation", model="Intel/dpt-large")
+    >>> depth_estimator = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-base-hf")
     >>> output = depth_estimator("http://images.cocodataset.org/val2017/000000039769.jpg")
     >>> # This is a tensor with the values being the depth expressed in meters for each pixel
     >>> output["predicted_depth"].shape
diff --git a/src/transformers/pipelines/zero_shot_image_classification.py b/src/transformers/pipelines/zero_shot_image_classification.py
index d97fe246a2ef..8e40d0e6a5cb 100644
--- a/src/transformers/pipelines/zero_shot_image_classification.py
+++ b/src/transformers/pipelines/zero_shot_image_classification.py
@@ -40,7 +40,7 @@ class ZeroShotImageClassificationPipeline(Pipeline):
     ```python
     >>> from transformers import pipeline
 
-    >>> classifier = pipeline(model="openai/clip-vit-large-patch14")
+    >>> classifier = pipeline(model="google/siglip-so400m-patch14-384")
     >>> classifier(
     ...     "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png",
     ...     candidate_labels=["animals", "humans", "landscape"],