feat: Integrating ChemTEB (#1708)
* Add SMILES, AI Paraphrase and Inter-Source Paragraphs PairClassification Tasks

* Add chemical subsets of NQ and HotpotQA datasets as Retrieval tasks

* Add PubChem Synonyms PairClassification task

* Update task `__init__` for previously added tasks

* Add nomic-bert loader

* Add a script to run the evaluation pipeline for chemical-related tasks

* Add 15 Wikipedia article classification tasks

* Add PairClassification and BitextMining tasks for Coconut SMILES

* Fix naming of some Classification and PairClassification tasks

* Fix naming issues in some classification tasks

* Integrate WANDB with benchmarking script

* Update .gitignore

* Fix `nomic_models.py` issue with retrieval tasks, similar to issue #1115 in the original repo

* Add one chemical model and some SentenceTransformer models

* Fix a naming issue for SentenceTransformer models

* Add OpenAI, bge-m3 and matscibert models

* Add PubChem SMILES Bitext Mining tasks

* Change metric naming to be more descriptive

* Add English e5 and bge v1 models in all sizes

* Add two Wikipedia Clustering tasks

* Add a try-except in the evaluation script to skip faulty models during the benchmark.

* Add bge v1.5 models and clustering score extraction to the JSON parser

* Add Amazon Titan embedding models

* Add Cohere Bedrock models

* Add two SDS Classification tasks

* Add SDS Classification tasks to classification init and chem_eval

* Add a retrieval dataset, update dataset names and revisions

* Update revision for the CoconutRetrieval dataset: handle duplicate SMILES (documents)

* Update `CoconutSMILES2FormulaPC` task

* Change CoconutRetrieval dataset to a smaller one

* Update some models
- Integrate models added in ChemTEB (such as Amazon, Cohere Bedrock and Nomic BERT) with the latest modeling format in mteb.
- Update the metadata for the mentioned models

* Fix a typo
The `open_weights` argument was passed twice

* Update ChemTEB tasks
- Rename some tasks for better readability.
- Merge some BitextMining and PairClassification tasks into a single task with subsets (`PubChemSMILESBitextMining` and `PubChemSMILESPC`)
- Add a new multilingual task (`PubChemWikiPairClassification`) consisting of 12 languages (a rough sketch of a per-subset language mapping follows this list).
- Update dataset paths, revisions and metadata for most tasks.
- Add a `Chemistry` domain to `TaskMetadata`
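
As a rough illustration of the subset layout such merged multilingual tasks use in mteb, a per-subset language mapping might look like the following (subset names and language codes here are hypothetical, not the ones shipped in this PR):

```python
# Hypothetical sketch of a per-subset eval_langs mapping for a multilingual
# PairClassification task; subset names and language codes are illustrative only.
_EVAL_LANGS = {
    "en-de": ["eng-Latn", "deu-Latn"],
    "en-fr": ["eng-Latn", "fra-Latn"],
    "en-zh": ["eng-Latn", "zho-Hans"],
}
```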

* Remove unnecessary files and tasks for MTEB

* Update some ChemTEB tasks
- Move `PubChemSMILESBitextMining` to `eng` folder
- Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
- Update Clustering tasks `category`
- Change `main_score` for `PubChemAISentenceParaphrasePC`

* Create ChemTEB benchmark

* Remove `CoconutRetrieval`

* Update tasks and benchmarks tables with ChemTEB

* Mention ChemTEB in readme

* Fix some issues, update task metadata, lint
- `eval_langs` fixed
- Dataset path was fixed for two datasets
- Metadata was completed for all tasks, mainly following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`
- Ruff lint
- Rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it

* Remove `nomic_bert_model.py` as it is now compatible with SentenceTransformer.

* Remove `WikipediaAIParagraphsParaphrasePC` task due to being trivial.

* Merge `amazon_models.py` and `cohere_bedrock_models.py` into `bedrock_models.py`

* Remove unnecessary `load_data` for some tasks.

* Update `bedrock_models.py`, `openai_models.py` and two dataset revisions
- Text should be truncated for Amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (len > 99th percentile)

* Add a layer of dynamic truncation for amazon models in `bedrock_models.py`

* Replace `metadata_dict` with `self.metadata` in `PubChemSMILESPC.py`

* Fix model meta for bedrock models

* Add reference comment to original Cohere API implementation
HSILA authored Jan 25, 2025
1 parent fa5127a · commit 4d66434
Showing 40 changed files with 1,678 additions and 25 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -517,5 +517,6 @@ You may also want to read and cite the amazing work that has extended MTEB & int
- Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini. "[FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions](https://arxiv.org/abs/2403.15246)" arXiv 2024
- Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. "[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)" arXiv 2024
- Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024

For works that have used MTEB for benchmarking, you can find them on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
33 changes: 22 additions & 11 deletions docs/benchmarks.md

Large diffs are not rendered by default.

55 changes: 41 additions & 14 deletions docs/tasks.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions mteb/abstasks/TaskMetadata.py
@@ -70,6 +70,7 @@
"Web",
"Written",
"Programming",
"Chemistry",
]

SAMPLE_CREATION_METHOD = Literal[
44 changes: 44 additions & 0 deletions mteb/benchmarks/benchmarks.py
@@ -1232,3 +1232,47 @@ def load_results(
primaryClass={cs.CL}
}""",
)

CHEMTEB = Benchmark(
    name="ChemTEB",
    tasks=get_tasks(
        tasks=[
            "PubChemSMILESBitextMining",
            "SDSEyeProtectionClassification",
            "SDSGlovesClassification",
            "WikipediaBioMetChemClassification",
            "WikipediaGreenhouseEnantiopureClassification",
            "WikipediaSolidStateColloidalClassification",
            "WikipediaOrganicInorganicClassification",
            "WikipediaCryobiologySeparationClassification",
            "WikipediaChemistryTopicsClassification",
            "WikipediaTheoreticalAppliedClassification",
            "WikipediaChemFieldsClassification",
            "WikipediaLuminescenceClassification",
            "WikipediaIsotopesFissionClassification",
            "WikipediaSaltsSemiconductorsClassification",
            "WikipediaBiolumNeurochemClassification",
            "WikipediaCrystallographyAnalyticalClassification",
            "WikipediaCompChemSpectroscopyClassification",
            "WikipediaChemEngSpecialtiesClassification",
            "WikipediaChemistryTopicsClustering",
            "WikipediaSpecialtiesInChemistryClustering",
            "PubChemAISentenceParaphrasePC",
            "PubChemSMILESPC",
            "PubChemSynonymPC",
            "PubChemWikiParagraphsPC",
            "PubChemWikiPairClassification",
            "ChemNQRetrieval",
            "ChemHotpotQARetrieval",
        ],
    ),
    description="ChemTEB evaluates the performance of text embedding models on chemical domain data.",
    reference="https://arxiv.org/abs/2412.00532",
    citation="""
@article{kasmaee2024chemteb,
title={ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
author={Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal={arXiv preprint arXiv:2412.00532},
year={2024}
}""",
)
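
For context, a benchmark registered this way can be run through mteb's standard entry points. A minimal sketch (assuming the usual `mteb.get_benchmark` / `MTEB.run` API and any SentenceTransformer-compatible model; the model name and output folder below are illustrative, not part of this commit):

```python
import mteb

# Minimal usage sketch for the ChemTEB benchmark; model and paths are illustrative.
benchmark = mteb.get_benchmark("ChemTEB")
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=benchmark.tasks)
results = evaluation.run(model, output_folder="results")
```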
264 changes: 264 additions & 0 deletions mteb/models/bedrock_models.py
@@ -0,0 +1,264 @@
from __future__ import annotations

import json
import logging
import re
from functools import partial
from typing import Any

import numpy as np
import tqdm

from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
from mteb.models.cohere_models import model_prompts as cohere_model_prompts
from mteb.models.cohere_models import supported_languages as cohere_supported_languages
from mteb.requires_package import requires_package

from .wrapper import Wrapper

logger = logging.getLogger(__name__)


class BedrockWrapper(Wrapper):
    def __init__(
        self,
        model_id: str,
        provider: str,
        max_tokens: int,
        model_prompts: dict[str, str] | None = None,
        **kwargs,
    ) -> None:
        requires_package(self, "boto3", "The AWS SDK for Python")
        import boto3

        boto3_session = boto3.session.Session()
        region_name = boto3_session.region_name
        self._client = boto3.client("bedrock-runtime", region_name)

        self._model_id = model_id
        self._provider = provider.lower()

        if self._provider == "cohere":
            self.model_prompts = (
                self.validate_task_to_prompt_name(model_prompts)
                if model_prompts
                else None
            )
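            # Cohere embed on Bedrock accepts up to 96 texts per request;
            # the character budget below approximates the token limit at ~4 chars/token.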
            self._max_batch_size = 96
            self._max_sequence_length = max_tokens * 4
        else:
            self._max_tokens = max_tokens

    def encode(
        self,
        sentences: list[str],
        *,
        task_name: str | None = None,
        prompt_type: PromptType | None = None,
        **kwargs: Any,
    ) -> np.ndarray:
        requires_package(self, "boto3", "Amazon Bedrock")
        show_progress_bar = (
            False
            if "show_progress_bar" not in kwargs
            else kwargs.pop("show_progress_bar")
        )
        if self._provider == "amazon":
            return self._encode_amazon(sentences, show_progress_bar)
        elif self._provider == "cohere":
            prompt_name = self.get_prompt_name(
                self.model_prompts, task_name, prompt_type
            )
            cohere_task_type = self.model_prompts.get(prompt_name, "search_document")
            return self._encode_cohere(sentences, cohere_task_type, show_progress_bar)
        else:
            raise ValueError(
                f"Unknown provider '{self._provider}'. Must be 'amazon' or 'cohere'."
            )

    def _encode_amazon(
        self, sentences: list[str], show_progress_bar: bool = False
    ) -> np.ndarray:
        from botocore.exceptions import ValidationError

        all_embeddings = []
        # https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
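        # Token limits are enforced server-side; truncate on characters first,
        # using a rough chars-per-token factor, before invoking the model.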
        max_sequence_length = int(self._max_tokens * 4.5)

        for sentence in tqdm.tqdm(
            sentences, leave=False, disable=not show_progress_bar
        ):
            if len(sentence) > max_sequence_length:
                truncated_sentence = sentence[:max_sequence_length]
            else:
                truncated_sentence = sentence

            try:
                embedding = self._embed_amazon(truncated_sentence)
                all_embeddings.append(embedding)

            except ValidationError as e:
                error_str = str(e)
                pattern = r"request input token count:\s*(\d+)"
                match = re.search(pattern, error_str)
                if match:
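                    # Parse the observed token count from the error and retry once
                    # with a character cutoff scaled to the allowed tokens (10% margin).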
                    num_tokens = int(match.group(1))

                    ratio = 0.9 * (self._max_tokens / num_tokens)
                    dynamic_cutoff = int(len(truncated_sentence) * ratio)

                    embedding = self._embed_amazon(truncated_sentence[:dynamic_cutoff])
                    all_embeddings.append(embedding)
                else:
                    raise e

        return np.array(all_embeddings)

    def _encode_cohere(
        self,
        sentences: list[str],
        cohere_task_type: str,
        show_progress_bar: bool = False,
    ) -> np.ndarray:
        batches = [
            sentences[i : i + self._max_batch_size]
            for i in range(0, len(sentences), self._max_batch_size)
        ]

        all_embeddings = []

        for batch in tqdm.tqdm(batches, leave=False, disable=not show_progress_bar):
            response = self._client.invoke_model(
                body=json.dumps(
                    {
                        "texts": [sent[: self._max_sequence_length] for sent in batch],
                        "input_type": cohere_task_type,
                    }
                ),
                modelId=self._model_id,
                accept="*/*",
                contentType="application/json",
            )
            all_embeddings.extend(self._to_numpy(response))

        return np.array(all_embeddings)

    def _embed_amazon(self, sentence: str) -> np.ndarray:
        response = self._client.invoke_model(
            body=json.dumps({"inputText": sentence}),
            modelId=self._model_id,
            accept="application/json",
            contentType="application/json",
        )
        return self._to_numpy(response)

    def _to_numpy(self, embedding_response) -> np.ndarray:
        response = json.loads(embedding_response.get("body").read())
        key = "embedding" if self._provider == "amazon" else "embeddings"
        return np.array(response[key])


amazon_titan_embed_text_v1 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v1",
    revision="1",
    release_date="2023-09-27",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v1",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1536,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

amazon_titan_embed_text_v2 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v2",
    revision="1",
    release_date="2024-04-30",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v2:0",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1024,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)
# Note: For the original Cohere API implementation, refer to:
# https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/cohere_models.py
# This implementation uses the Amazon Bedrock endpoint for Cohere models.
cohere_embed_english_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-english-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-english-v3",
    languages=["eng-Latn"],
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)

cohere_embed_multilingual_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-multilingual-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-multilingual-v3",
    languages=cohere_supported_languages,
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)
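
For reference, a wrapper registered this way is typically loaded by model name through `mteb.get_model`. A minimal sketch of running one of these Bedrock models on a ChemTEB task follows (assuming boto3 is installed and AWS credentials plus a default region with Bedrock access are configured; the output folder is illustrative, not part of this commit):

```python
import mteb

# Minimal usage sketch; requires boto3 and AWS credentials/region with Bedrock access.
model = mteb.get_model("bedrock/amazon-titan-embed-text-v2")
tasks = mteb.get_tasks(tasks=["ChemHotpotQARetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # illustrative output path
```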
2 changes: 2 additions & 0 deletions mteb/models/overview.py
@@ -13,6 +13,7 @@
from mteb.model_meta import ModelMeta
from mteb.models import (
    arctic_models,
    bedrock_models,
    bge_models,
    bm25,
    cde_models,
@@ -100,6 +101,7 @@
    uae_models,
    text2vec_models,
    stella_models,
    bedrock_models,
    uae_models,
    voyage_models,
]
1 change: 1 addition & 0 deletions mteb/tasks/BitextMining/__init__.py
@@ -1,6 +1,7 @@
from __future__ import annotations

from .dan.BornholmskBitextMining import *
from .eng.PubChemSMILESBitextMining import *
from .kat.TbilisiCityHallBitextMining import *
from .multilingual.BibleNLPBitextMining import *
from .multilingual.BUCCBitextMining import *