* Add SMILES, AI Paraphrase and Inter-Source Paragraphs PairClassification tasks
* Add chemical subsets of the NQ and HotpotQA datasets as Retrieval tasks
* Add PubChem Synonyms PairClassification task
* Update task `__init__` for previously added tasks
* Add nomic-bert loader
* Add a script to run the evaluation pipeline for chemistry-related tasks
* Add 15 Wikipedia article Classification tasks
* Add PairClassification and BitextMining tasks for Coconut SMILES
* Fix naming of some Classification and PairClassification tasks
* Fix naming issues in some Classification tasks
* Integrate WANDB with the benchmarking script
* Update .gitignore
* Fix `nomic_models.py` issue with Retrieval tasks, similar to issue #1115 in the original repo
* Add one chemical model and some SentenceTransformer models
* Fix a naming issue for SentenceTransformer models
* Add OpenAI, bge-m3 and matscibert models
* Add PubChem SMILES BitextMining tasks
* Change metric names to be more descriptive
* Add English e5 and bge v1 models, all sizes
* Add two Wikipedia Clustering tasks
* Add a try-except to the evaluation script to skip faulty models during the benchmark
* Add bge v1.5 models and clustering score extraction to the JSON parser
* Add Amazon Titan embedding models
* Add Cohere Bedrock models
* Add two SDS Classification tasks
* Add SDS Classification tasks to the Classification `__init__` and `chem_eval`
* Add a retrieval dataset; update dataset names and revisions
* Update revision for the CoconutRetrieval dataset: handle duplicate SMILES (documents)
* Update the `CoconutSMILES2FormulaPC` task
* Change the CoconutRetrieval dataset to a smaller one
* Update some models
  - Integrate models added in ChemTEB (such as Amazon, Cohere Bedrock and nomic-bert) with the latest modeling format in mteb.
  - Update the metadata for the mentioned models.
* Fix a typo: the `open_weights` argument was repeated twice
* Update ChemTEB tasks
  - Rename some tasks for better readability.
  - Merge some BitextMining and PairClassification tasks into single tasks with subsets (`PubChemSMILESBitextMining` and `PubChemSMILESPC`).
  - Add a new multilingual task (`PubChemWikiPairClassification`) covering 12 languages.
  - Update dataset paths, revisions and metadata for most tasks.
  - Add a `Chemistry` domain to `TaskMetadata`.
* Remove files and tasks unnecessary for MTEB
* Update some ChemTEB tasks
  - Move `PubChemSMILESBitextMining` to the `eng` folder.
  - Add citations for tasks involving SDS, NQ, HotpotQA and PubChem data.
  - Update the `category` of Clustering tasks.
  - Change the `main_score` for `PubChemAISentenceParaphrasePC`.
* Create the ChemTEB benchmark
* Remove `CoconutRetrieval`
* Update the tasks and benchmarks tables with ChemTEB
* Mention ChemTEB in the README
* Fix some issues, update task metadata, lint
  - `eval_langs` fixed.
  - Dataset path fixed for two datasets.
  - Metadata completed for all tasks, mainly the following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`.
  - ruff lint.
  - Rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it.
* Remove `nomic_bert_model.py` as it is now compatible with SentenceTransformer
* Remove the `WikipediaAIParagraphsParaphrasePC` task due to being trivial
* Merge `amazon_models` and `cohere_bedrock_models.py` into `bedrock_models.py`
* Remove unnecessary `load_data` overrides for some tasks
* Update `bedrock_models.py`, `openai_models.py` and two dataset revisions
  - Text should be truncated for Amazon text embedding models.
  - `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
  - Two datasets are updated, dropping very long samples (length > 99th percentile).
* Add a layer of dynamic truncation for Amazon models in `bedrock_models.py`
* Replace `metadata_dict` with `self.metadata` in `PubChemSMILESPC.py`
* Fix model meta for Bedrock models
* Add a reference comment pointing to the original Cohere API implementation
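With these changes in place, the new benchmark can be run through mteb's standard API. A minimal sketch, assuming the benchmark is registered under the name `ChemTEB` and the Bedrock model under the `bedrock/` prefix defined below (running that model requires AWS credentials with Bedrock access):

```python
# Sketch only: the benchmark and model registry names are taken from this
# commit and may differ across mteb versions; the output folder is illustrative.
import mteb

benchmark = mteb.get_benchmark("ChemTEB")
model = mteb.get_model("bedrock/amazon-titan-embed-text-v2")

evaluation = mteb.MTEB(tasks=benchmark.tasks)
results = evaluation.run(model, output_folder="results/chemteb")
```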
Showing 40 changed files with 1,678 additions and 25 deletions.
In `TaskMetadata.py`, `Chemistry` is added to the list of allowed task domains:

```diff
@@ -70,6 +70,7 @@
     "Web",
     "Written",
     "Programming",
+    "Chemistry",
 ]

 SAMPLE_CREATION_METHOD = Literal[
```
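Because the domain list is a `typing.Literal`, the new value can be sanity-checked at runtime. A minimal sketch, assuming the alias is named `TASK_DOMAIN` (the alias name is not visible in this diff; it is inferred from the `SAMPLE_CREATION_METHOD` naming convention shown above):

```python
# Sketch: confirm "Chemistry" is an accepted domain value.
# TASK_DOMAIN is an assumed alias name, not confirmed by the diff above.
from typing import get_args

from mteb.abstasks.TaskMetadata import TASK_DOMAIN

assert "Chemistry" in get_args(TASK_DOMAIN)
```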
`bedrock_models.py` (new file, +264 lines):

```python
from __future__ import annotations

import json
import logging
import re
from functools import partial
from typing import Any

import numpy as np
import tqdm

from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
from mteb.models.cohere_models import model_prompts as cohere_model_prompts
from mteb.models.cohere_models import supported_languages as cohere_supported_languages
from mteb.requires_package import requires_package

from .wrapper import Wrapper

logger = logging.getLogger(__name__)


class BedrockWrapper(Wrapper):
    def __init__(
        self,
        model_id: str,
        provider: str,
        max_tokens: int,
        model_prompts: dict[str, str] | None = None,
        **kwargs,
    ) -> None:
        requires_package(self, "boto3", "The AWS SDK for Python")
        import boto3

        boto3_session = boto3.session.Session()
        region_name = boto3_session.region_name
        self._client = boto3.client("bedrock-runtime", region_name)

        self._model_id = model_id
        self._provider = provider.lower()

        if self._provider == "cohere":
            self.model_prompts = (
                self.validate_task_to_prompt_name(model_prompts)
                if model_prompts
                else None
            )
            self._max_batch_size = 96
            self._max_sequence_length = max_tokens * 4
        else:
            self._max_tokens = max_tokens

    def encode(
        self,
        sentences: list[str],
        *,
        task_name: str | None = None,
        prompt_type: PromptType | None = None,
        **kwargs: Any,
    ) -> np.ndarray:
        requires_package(self, "boto3", "Amazon Bedrock")
        show_progress_bar = (
            False
            if "show_progress_bar" not in kwargs
            else kwargs.pop("show_progress_bar")
        )
        if self._provider == "amazon":
            return self._encode_amazon(sentences, show_progress_bar)
        elif self._provider == "cohere":
            prompt_name = self.get_prompt_name(
                self.model_prompts, task_name, prompt_type
            )
            cohere_task_type = self.model_prompts.get(prompt_name, "search_document")
            return self._encode_cohere(sentences, cohere_task_type, show_progress_bar)
        else:
            raise ValueError(
                f"Unknown provider '{self._provider}'. Must be 'amazon' or 'cohere'."
            )

    def _encode_amazon(
        self, sentences: list[str], show_progress_bar: bool = False
    ) -> np.ndarray:
        from botocore.exceptions import ValidationError

        all_embeddings = []
        # https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
        max_sequence_length = int(self._max_tokens * 4.5)

        for sentence in tqdm.tqdm(
            sentences, leave=False, disable=not show_progress_bar
        ):
            if len(sentence) > max_sequence_length:
                truncated_sentence = sentence[:max_sequence_length]
            else:
                truncated_sentence = sentence

            try:
                embedding = self._embed_amazon(truncated_sentence)
                all_embeddings.append(embedding)

            except ValidationError as e:
                error_str = str(e)
                pattern = r"request input token count:\s*(\d+)"
                match = re.search(pattern, error_str)
                if match:
                    num_tokens = int(match.group(1))

                    ratio = 0.9 * (self._max_tokens / num_tokens)
                    dynamic_cutoff = int(len(truncated_sentence) * ratio)

                    embedding = self._embed_amazon(truncated_sentence[:dynamic_cutoff])
                    all_embeddings.append(embedding)
                else:
                    raise e

        return np.array(all_embeddings)

    def _encode_cohere(
        self,
        sentences: list[str],
        cohere_task_type: str,
        show_progress_bar: bool = False,
    ) -> np.ndarray:
        batches = [
            sentences[i : i + self._max_batch_size]
            for i in range(0, len(sentences), self._max_batch_size)
        ]

        all_embeddings = []

        for batch in tqdm.tqdm(batches, leave=False, disable=not show_progress_bar):
            response = self._client.invoke_model(
                body=json.dumps(
                    {
                        "texts": [sent[: self._max_sequence_length] for sent in batch],
                        "input_type": cohere_task_type,
                    }
                ),
                modelId=self._model_id,
                accept="*/*",
                contentType="application/json",
            )
            all_embeddings.extend(self._to_numpy(response))

        return np.array(all_embeddings)

    def _embed_amazon(self, sentence: str) -> np.ndarray:
        response = self._client.invoke_model(
            body=json.dumps({"inputText": sentence}),
            modelId=self._model_id,
            accept="application/json",
            contentType="application/json",
        )
        return self._to_numpy(response)

    def _to_numpy(self, embedding_response) -> np.ndarray:
        response = json.loads(embedding_response.get("body").read())
        key = "embedding" if self._provider == "amazon" else "embeddings"
        return np.array(response[key])


amazon_titan_embed_text_v1 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v1",
    revision="1",
    release_date="2023-09-27",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v1",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1536,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

amazon_titan_embed_text_v2 = ModelMeta(
    name="bedrock/amazon-titan-embed-text-v2",
    revision="1",
    release_date="2024-04-30",
    languages=None,  # not specified
    loader=partial(
        BedrockWrapper,
        model_id="amazon.titan-embed-text-v2:0",
        provider="amazon",
        max_tokens=8192,
    ),
    max_tokens=8192,
    embed_dim=1024,
    open_weights=False,
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    license=None,
    reference="https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

# Note: For the original Cohere API implementation, refer to:
# https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/cohere_models.py
# This implementation uses the Amazon Bedrock endpoint for Cohere models.
cohere_embed_english_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-english-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-english-v3",
    languages=["eng-Latn"],
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)

cohere_embed_multilingual_v3 = ModelMeta(
    loader=partial(
        BedrockWrapper,
        model_id="cohere.embed-multilingual-v3",
        provider="cohere",
        max_tokens=512,
        model_prompts=cohere_model_prompts,
    ),
    name="bedrock/cohere-embed-multilingual-v3",
    languages=cohere_supported_languages,
    open_weights=False,
    reference="https://cohere.com/blog/introducing-embed-v3",
    revision="1",
    release_date="2023-11-02",
    n_parameters=None,
    public_training_code=None,
    public_training_data=None,  # assumed
    training_datasets=None,
    max_tokens=512,
    embed_dim=1024,
    license=None,
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=True,
)
```
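The dynamic truncation works in two stages: inputs are first cut to roughly `max_tokens * 4.5` characters (the Titan guidance linked in the code), and if Bedrock still rejects the request, the `ValidationError` handler reads the reported token count, re-truncates to about `0.9 * max_tokens / reported_tokens` of the characters, and retries once. For quick local verification, the wrapper can be exercised directly; a minimal sketch, assuming boto3 is installed and the AWS session has Bedrock access in its default region (the sample sentence is illustrative):

```python
# Hypothetical smoke test for BedrockWrapper; requires valid AWS credentials
# with access to Amazon Bedrock in the session's default region.
from mteb.models.bedrock_models import BedrockWrapper

wrapper = BedrockWrapper(
    model_id="amazon.titan-embed-text-v1",
    provider="amazon",
    max_tokens=8192,
)
embeddings = wrapper.encode(["Aspirin is acetylsalicylic acid."])
print(embeddings.shape)  # expected: (1, 1536) for titan-embed-text-v1
```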