feat: Integrating ChemTEB (#1708)
* Add SMILES, AI Paraphrase and Inter-Source Paragraphs PairClassification Tasks

* Add chemical subsets of NQ and HotpotQA datasets as Retrieval tasks

* Add PubChem Synonyms PairClassification task

* Update task `__init__` for previously added tasks

* Add nomic-bert loader

* Add a script to run the evaluation pipeline for chemical-related tasks

* Add 15 Wikipedia article classification tasks

* Add PairClassification and BitextMining tasks for Coconut SMILES

* Fix naming of some Classification and PairClassification tasks

* Fix naming issues in some classification tasks

* Integrate WANDB with benchmarking script

* Update .gitignore

* Fix `nomic_models.py` issue with retrieval tasks, similar to issue #1115 in the original repo

* Add one chemical model and some SentenceTransformer models

* Fix a naming issue for SentenceTransformer models

* Add OpenAI, bge-m3 and matscibert models

* Add PubChem SMILES Bitext Mining tasks

* Change metric naming to be more descriptive

* Add English e5 and bge v1 models in all sizes

* Add two Wikipedia Clustering tasks

* Add a try-except in the evaluation script to skip faulty models during the benchmark.

* Add bge v1.5 models and clustering score extraction to the JSON parser

* Add Amazon Titan embedding models

* Add Cohere Bedrock models

* Add two SDS Classification tasks

* Add SDS Classification tasks to classification init and chem_eval

* Add a retrieval dataset, update dataset names and revisions

* Update revision for the CoconutRetrieval dataset: handle duplicate SMILES (documents)

* Update `CoconutSMILES2FormulaPC` task

* Change CoconutRetrieval dataset to a smaller one

* Update some models
- Integrate models added in ChemTEB (such as Amazon, Cohere Bedrock and Nomic BERT) with the latest modeling format in mteb.
- Update the metadata for the mentioned models

* Fix a typo
The `open_weights` argument was passed twice

* Update ChemTEB tasks
- Rename some tasks for better readability.
- Merge some BitextMining and PairClassification tasks into a single task with subsets (`PubChemSMILESBitextMining` and `PubChemSMILESPC`)
- Add a new multilingual task (`PubChemWikiPairClassification`) consisting of 12 languages (a rough sketch of a per-subset language mapping follows this list).
- Update dataset paths, revisions and metadata for most tasks.
- Add a `Chemistry` domain to `TaskMetadata`
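
As a rough illustration of the subset layout such merged multilingual tasks use in mteb, a per-subset language mapping might look like the following (subset names and language codes here are hypothetical, not the ones shipped in this PR):

```python
# Hypothetical sketch of a per-subset eval_langs mapping for a multilingual
# PairClassification task; subset names and language codes are illustrative only.
_EVAL_LANGS = {
    "en-de": ["eng-Latn", "deu-Latn"],
    "en-fr": ["eng-Latn", "fra-Latn"],
    "en-zh": ["eng-Latn", "zho-Hans"],
}
```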

* Remove unnecessary files and tasks for MTEB

* Update some ChemTEB tasks
- Move `PubChemSMILESBitextMining` to `eng` folder
- Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
- Update Clustering tasks `category`
- Change `main_score` for `PubChemAISentenceParaphrasePC`

* Create ChemTEB benchmark

* Remove `CoconutRetrieval`

* Update tasks and benchmarks tables with ChemTEB

* Mention ChemTEB in readme

* Fix some issues, update task metadata, lint
- `eval_langs` fixed
- Dataset path was fixed for two datasets
- Metadata was completed for all tasks, mainly following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`
- Ruff lint
- Rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it

* Remove `nomic_bert_model.py` as it is now compatible with SentenceTransformer.

* Remove `WikipediaAIParagraphsParaphrasePC` task due to being trivial.

* Merge `amazon_models.py` and `cohere_bedrock_models.py` into `bedrock_models.py`

* Remove unnecessary `load_data` for some tasks.

* Update `bedrock_models.py`, `openai_models.py` and two dataset revisions
- Text should be truncated for Amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (len > 99th percentile)

* Add a layer of dynamic truncation for amazon models in `bedrock_models.py`

* Replace `metadata_dict` with `self.metadata` in `PubChemSMILESPC.py`

* Fix model meta for bedrock models

* Add reference comment to original Cohere API implementation
HSILA authored Jan 25, 2025
1 parent fa5127a · commit 4d66434
Showing 40 changed files with 1,678 additions and 25 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -517,5 +517,6 @@ You may also want to read and cite the amazing work that has extended MTEB & int
- Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini. "[FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions](https://arxiv.org/abs/2403.15246)" arXiv 2024
- Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. "[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)" arXiv 2024
- Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024

For works that have used MTEB for benchmarking, you can find them on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
33 changes: 22 additions & 11 deletions docs/benchmarks.md

Large diffs are not rendered by default.

55 changes: 41 additions & 14 deletions docs/tasks.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions mteb/abstasks/TaskMetadata.py
@@ -70,6 +70,7 @@
"Web",
"Written",
"Programming",
"Chemistry",
]

SAMPLE_CREATION_METHOD = Literal[
44 changes: 44 additions & 0 deletions mteb/benchmarks/benchmarks.py
@@ -1232,3 +1232,47 @@ def load_results(
primaryClass={cs.CL}
}""",
)

CHEMTEB = Benchmark(
    name="ChemTEB",
    tasks=get_tasks(
        tasks=[
            "PubChemSMILESBitextMining",
            "SDSEyeProtectionClassification",
            "SDSGlovesClassification",
            "WikipediaBioMetChemClassification",
            "WikipediaGreenhouseEnantiopureClassification",
            "WikipediaSolidStateColloidalClassification",
            "WikipediaOrganicInorganicClassification",
            "WikipediaCryobiologySeparationClassification",
            "WikipediaChemistryTopicsClassification",
            "WikipediaTheoreticalAppliedClassification",
            "WikipediaChemFieldsClassification",
            "WikipediaLuminescenceClassification",
            "WikipediaIsotopesFissionClassification",
            "WikipediaSaltsSemiconductorsClassification",
            "WikipediaBiolumNeurochemClassification",
            "WikipediaCrystallographyAnalyticalClassification",
            "WikipediaCompChemSpectroscopyClassification",
            "WikipediaChemEngSpecialtiesClassification",
            "WikipediaChemistryTopicsClustering",
            "WikipediaSpecialtiesInChemistryClustering",
            "PubChemAISentenceParaphrasePC",
            "PubChemSMILESPC",
            "PubChemSynonymPC",
            "PubChemWikiParagraphsPC",
            "PubChemWikiPairClassification",
            "ChemNQRetrieval",
            "ChemHotpotQARetrieval",
        ],
    ),
    description="ChemTEB evaluates the performance of text embedding models on chemical domain data.",
    reference="https://arxiv.org/abs/2412.00532",
    citation="""
@article{kasmaee2024chemteb,
title={ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
author={Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal={arXiv preprint arXiv:2412.00532},
year={2024}
}""",
)
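
For context, a benchmark registered this way can be run through mteb's standard entry points. A minimal sketch (assuming the usual `mteb.get_benchmark` / `MTEB.run` API and any SentenceTransformer-compatible model; the model name and output folder below are illustrative, not part of this commit):

```python
import mteb

# Minimal usage sketch for the ChemTEB benchmark; model and paths are illustrative.
benchmark = mteb.get_benchmark("ChemTEB")
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=benchmark.tasks)
results = evaluation.run(model, output_folder="results")
```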
264 changes: 264 additions & 0 deletions mteb/models/bedrock_models.py
@@ -0,0 +1,264 @@
from __future__ import annotations

import json
import logging
import re
from functools import partial
from typing import Any

import numpy as np
import tqdm

from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
from mteb.models.cohere_models import model_prompts as cohere_model_prompts
from mteb.models.cohere_models import supported_languages as cohere_supported_languages
from mteb.requires_package import requires_package

from .wrapper import Wrapper

logger = logging.getLogger(__name__)


class BedrockWrapper(Wrapper):
    def __init__(
        self,
        model_id: str,
        provider: str,
        max_tokens: int,
        model_prompts: dict[str, str] | None = None,
        **kwargs,
    ) -> None:
        requires_package(self, "boto3", "The AWS SDK for Python")
        import boto3

        boto3_session = boto3.session.Session()
        region_name = boto3_session.region_name
        self._client = boto3.client("bedrock-runtime", region_name)

        self._model_id = model_id
        self._provider = provider.lower()

        if self._provider == "cohere":
            self.model_prompts = (
                self.validate_task_to_prompt_name(model_prompts)
                if model_prompts
                else None
            )
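            # Cohere embed on Bedrock accepts up to 96 texts per request;
            # the character budget below approximates the token limit at ~4 chars/token.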
            self._max_batch_size = 96
            self._max_sequence_length = max_tokens * 4
        else:
            self._max_tokens = max_tokens

    def encode(
        self,
        sentences: list[str],
        *,
        task_name: str | None = None,
        prompt_type: PromptType | None = None,
        **kwargs: Any,
    ) -> np.ndarray:
        requires_package(self, "boto3", "Amazon Bedrock")
        show_progress_bar = (
            False
            if "show_progress_bar" not in kwargs
            else kwargs.pop("show_progress_bar")
        )
        if self._provider == "amazon":
            return self._encode_amazon(sentences, show_progress_bar)
        elif self._provider == "cohere":
            prompt_name = self.get_prompt_name(
                self.model_prompts, task_name, prompt_type
            )
            cohere_task_type = self.model_prompts.get(prompt_name, "search_document")
            return self._encode_cohere(sentences, cohere_task_type, show_progress_bar)
        else:
            raise ValueError(
                f"Unknown provider '{self._provider}'. Must be 'amazon' or 'cohere'."
            )

    def _encode_amazon(
        self, sentences: list[str], show_progress_bar: bool = False
    ) -> np.ndarray:
        from botocore.exceptions import ValidationError

        all_embeddings = []
        # https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
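        # Token limits are enforced server-side; truncate on characters first,
        # using a rough chars-per-token factor, before invoking the model.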
        max_sequence_length = int(self._max_tokens * 4.5)

        for sentence in tqdm.tqdm(
            sentences, leave=False, disable=not show_progress_bar
        ):
            if len(sentence) > max_sequence_length:
                truncated_sentence = sentence[:max_sequence_length]
            else:
                truncated_sentence = sentence

            try:
                embedding = self._embed_amazon(truncated_sentence)
                all_embeddings.append(embedding)

            except ValidationError as e:
                error_str = str(e)
                pattern = r"request input token count:\s*(\d+)"
                match = re.search(pattern, error_str)
                if match:
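                    # Parse the observed token count from the error and retry once
                    # with a character cutoff scaled to the allowed tokens (10% margin).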
                    num_tokens = int(match.group(1))

                    ratio = 0.9 * (self._max_tokens / num_tokens)
                    dynamic_cutoff = int(len(truncated_sentence) * ratio)

                    embedding = self._embed_amazon(truncated_sentence[:dynamic_cutoff])
                    all_embeddings.append(embedding)
                else:
                    raise e

        return np.array(all_embeddings)

    def _encode_cohere(
        self,
        sentences: list[str],
        cohere_task_type: str,
        show_progress_bar: bool = False,
    ) -> np.ndarray:
        batches = [
            sentences[i : i + self._max_batch_size]
            for i in range(0, len(sentences), self._max_batch_size)
        ]

        all_embeddings = []

        for batch in tqdm.tqdm(batches, leave=False, disable=not show_progress_bar):
            response = self._client.invoke_model(
                body=json.dumps(
                    {
                        "texts": [sent[: self._max_sequence_length] for sent in batch],
                        "input_type": cohere_task_type,
                    }
                ),
                modelId=self._model_id,
                accept="*/*",
                contentType="application/json",
            )
            all_embeddings.extend(self._to_numpy(response))

        return np.array(all_embeddings)

    def _embed_amazon(self, sentence: str) -> np.ndarray:
        response = self._client.invoke_model(
            body=json.dumps({"inputText": sentence}),
            modelId=self._model_id,
            accept="application/json",
            contentType="application/json",
        )
        return self._to_numpy(response)

    def _to_numpy(self, embedding_response) -> np.ndarray:
        response = json.loads(embedding_response.get("body").read())
        key = "embedding" if self._provider == "amazon" else "embeddings"
        return np.array(response[key])


amazon_titan_embed_text_v1 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v1",
    revision="1",
    release_date="2023-09-27",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v1",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1536,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

amazon_titan_embed_text_v2 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v2",
    revision="1",
    release_date="2024-04-30",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v2:0",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1024,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)
# Note: For the original Cohere API implementation, refer to:
# https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/cohere_models.py
# This implementation uses the Amazon Bedrock endpoint for Cohere models.
cohere_embed_english_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-english-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-english-v3",
    languages=["eng-Latn"],
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)

cohere_embed_multilingual_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-multilingual-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-multilingual-v3",
    languages=cohere_supported_languages,
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)
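
For reference, a wrapper registered this way is typically loaded by model name through `mteb.get_model`. A minimal sketch of running one of these Bedrock models on a ChemTEB task follows (assuming boto3 is installed and AWS credentials plus a default region with Bedrock access are configured; the output folder is illustrative, not part of this commit):

```python
import mteb

# Minimal usage sketch; requires boto3 and AWS credentials/region with Bedrock access.
model = mteb.get_model("bedrock/amazon-titan-embed-text-v2")
tasks = mteb.get_tasks(tasks=["ChemHotpotQARetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # illustrative output path
```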
2 changes: 2 additions & 0 deletions mteb/models/overview.py
@@ -13,6 +13,7 @@
from mteb.model_meta import ModelMeta
from mteb.models import (
    arctic_models,
    bedrock_models,
    bge_models,
    bm25,
    cde_models,
@@ -100,6 +101,7 @@
    uae_models,
    text2vec_models,
    stella_models,
    bedrock_models,
    uae_models,
    voyage_models,
]
1 change: 1 addition & 0 deletions mteb/tasks/BitextMining/__init__.py
@@ -1,6 +1,7 @@
from __future__ import annotations

from .dan.BornholmskBitextMining import *
from .eng.PubChemSMILESBitextMining import *
from .kat.TbilisiCityHallBitextMining import *
from .multilingual.BibleNLPBitextMining import *
from .multilingual.BUCCBitextMining import *