fix: remove * imports #1569

Merged: 78 commits, Dec 9, 2024

Commits
dd5d226
fix: Count unique texts, data leaks in calculate metrics (#1438)
Samoed Nov 14, 2024
04ac3f2
fix: update task metadata to allow for null (#1448)
KennethEnevoldsen Nov 14, 2024
f6a49fe
Update tasks table
github-actions[bot] Nov 14, 2024
78c0e4e
1.19.5
invalid-email-address Nov 14, 2024
4e86cea
Fix: Made data parsing in the leaderboard figure more robust (#1450)
x-tabdeveloping Nov 14, 2024
039d010
Fixed task loading (#1451)
x-tabdeveloping Nov 14, 2024
feb1ab7
fix: publish (#1452)
x-tabdeveloping Nov 14, 2024
3397633
1.19.6
invalid-email-address Nov 14, 2024
14d7523
fix: Fix load external results with `None` mteb_version (#1453)
Samoed Nov 14, 2024
68eb498
1.19.7
invalid-email-address Nov 14, 2024
58c459b
WIP: Polishing up leaderboard UI (#1461)
x-tabdeveloping Nov 15, 2024
1b920ac
fix: loading pre 1.11.0 (#1460)
Samoed Nov 15, 2024
a988fef
1.19.8
invalid-email-address Nov 15, 2024
9b2aece
fix: swap touche2020 to maintain compatibility (#1469)
isaac-chung Nov 17, 2024
8bb4a29
1.19.9
invalid-email-address Nov 17, 2024
2fb6fe7
docs: Add sum per language for task counts (#1468)
isaac-chung Nov 18, 2024
fde124a
fix: pinned datasets to <3.0.0 (#1470)
Napuh Nov 19, 2024
7186e04
1.19.10
invalid-email-address Nov 19, 2024
1cc6c9e
feat: add CUREv1 retrieval dataset (#1459)
dbuades Nov 21, 2024
4408717
Update tasks table
github-actions[bot] Nov 21, 2024
3ff38ec
1.20.0
invalid-email-address Nov 21, 2024
917ad7f
fix: check if `model` attr of model exists (#1499)
Samoed Nov 26, 2024
cde720e
1.20.1
invalid-email-address Nov 26, 2024
0affa31
fix: Leaderboard demo data loading (#1507)
x-tabdeveloping Nov 27, 2024
594f643
1.20.2
invalid-email-address Nov 27, 2024
35245d3
fix: leaderboard only shows models that have ModelMeta (#1508)
x-tabdeveloping Nov 27, 2024
9282796
1.20.3
invalid-email-address Nov 27, 2024
942f212
fix: align readme with current mteb (#1493)
Samoed Nov 27, 2024
09f004c
1.20.4
invalid-email-address Nov 27, 2024
cfd43ac
docs: Add lang family mapping and map to task table (#1486)
isaac-chung Nov 28, 2024
377a63d
Update tasks table
github-actions[bot] Nov 28, 2024
e3d2b54
fix: Ensure that models match the names on embedding-benchmarks/resul…
KennethEnevoldsen Nov 29, 2024
9980c60
1.20.5
invalid-email-address Nov 29, 2024
b02ae82
fix: Adding missing metadata on models and mathcing names up with the…
x-tabdeveloping Nov 29, 2024
ba09b11
1.20.6
invalid-email-address Nov 29, 2024
8e12250
feat: Evaluate missing splits (#1525)
isaac-chung Nov 29, 2024
ee1edac
1.21.0
invalid-email-address Nov 29, 2024
343b6e0
fix: Correct typos superseeded -> superseded (#1532)
isaac-chung Nov 30, 2024
e949d2a
1.21.1
invalid-email-address Nov 30, 2024
5b6f20f
fix: Task load data error for SICK-BR-STS and XStance (#1534)
isaac-chung Dec 1, 2024
ec9413a
1.21.2
invalid-email-address Dec 1, 2024
39349ff
fix: Proprietary models now get correctly shown in leaderboard (#1530)
x-tabdeveloping Dec 2, 2024
d07c29b
1.21.3
invalid-email-address Dec 2, 2024
5fa7b7b
docs: Add Model Meta parameters and metadata (#1536)
isaac-chung Dec 2, 2024
36bab4d
fix: add more model meta (jina, e5) (#1537)
isaac-chung Dec 4, 2024
ac4a706
1.21.4
invalid-email-address Dec 4, 2024
c2f4c26
Add cohere models (#1538)
KennethEnevoldsen Dec 4, 2024
5013df8
fix: add nomic models (#1543)
KennethEnevoldsen Dec 4, 2024
97ab272
fix: Added all-minilm-l12-v2 (#1542)
KennethEnevoldsen Dec 4, 2024
df11c38
fix: Added arctic models (#1541)
KennethEnevoldsen Dec 4, 2024
37fdfa1
fix: add sentence trimming to OpenAIWrapper (#1526)
yjoonjang Dec 4, 2024
1e62184
1.21.5
invalid-email-address Dec 4, 2024
a44a46c
fix: Fixed metadata errors (#1547)
x-tabdeveloping Dec 4, 2024
d713525
1.21.6
invalid-email-address Dec 4, 2024
279a4ee
fix: remove curev1 from multlingual (#1552)
KennethEnevoldsen Dec 5, 2024
e339735
1.21.7
invalid-email-address Dec 5, 2024
2ee8d44
fix: Add Model2vec (#1546)
x-tabdeveloping Dec 6, 2024
2905813
Made result loading more permissive, changed eval splits for HotPotQA…
x-tabdeveloping Dec 6, 2024
a6ce6f9
1.21.8
invalid-email-address Dec 6, 2024
fc64791
docs: Correction of SICK-R metadata (#1558)
rafalposwiata Dec 7, 2024
611b6a1
feat(google_models): fix issues and add support for `text-embedding-0…
dbuades Dec 7, 2024
5e7e033
1.22.0
invalid-email-address Dec 7, 2024
ac44e58
fix(bm25s): search implementation (#1566)
dbuades Dec 7, 2024
346179f
Merge branch 'refs/heads/main' into update_cli
Samoed Dec 7, 2024
b8ff89c
1.22.1
invalid-email-address Dec 7, 2024
03347eb
docs: Fix dependency library name for bm25s (#1568)
isaac-chung Dec 7, 2024
6489fca
fix: Add training dataset to model meta (#1561)
KennethEnevoldsen Dec 8, 2024
1d21818
feat: (cohere_models) cohere_task_type issue, batch requests and tqdm…
dbuades Dec 8, 2024
68bd8ac
fix(publichealth-qa): ignore rows with `None` values in `question` o…
dbuades Dec 8, 2024
2550a27
1.23.0
invalid-email-address Dec 8, 2024
d474451
fix wongnai
Samoed Dec 8, 2024
2015ee5
update inits
Samoed Dec 8, 2024
23fb642
fix tests
Samoed Dec 8, 2024
54a7f5c
lint
Samoed Dec 8, 2024
07f1391
Merge branch 'refs/heads/main' into update_imports
Samoed Dec 8, 2024
d67225b
update imports
Samoed Dec 9, 2024
8653c27
fix tests
Samoed Dec 9, 2024
4ba6ff5
lint
Samoed Dec 9, 2024
31 changes: 27 additions & 4 deletions README.md
@@ -46,10 +46,8 @@ from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
# or directly from huggingface:
# model_name = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_name)
model = mteb.get_model(model_name) # if the model is not implemented in MTEB, this is equivalent to SentenceTransformer(model_name)
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
@@ -221,7 +219,10 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface: an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np


class CustomModel:
def encode(
@@ -245,7 +246,7 @@ class CustomModel:
pass

model = CustomModel()
tasks = mteb.get_task("Banking77Classification")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```
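For reference, a minimal runnable sketch of a model satisfying this interface. This is an illustrative toy, not the repository's implementation: the random embeddings and the 384-dimension choice are placeholders, and the keyword names follow the `encode` signature shown above.

```python
from __future__ import annotations

import numpy as np

from mteb.encoder_interface import PromptType


class RandomEncoder:
    """Toy encoder returning random vectors; shows the expected `encode` signature."""

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        # One vector per input sentence; 384 dimensions is an arbitrary placeholder.
        rng = np.random.default_rng(seed=42)
        return rng.normal(size=(len(sentences), 384))
```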
@@ -379,6 +380,28 @@ results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
```

</details>


<details>
<summary> Annotate Contamination in the training data of a model </summary>

### Annotate Contamination

Have you found contamination in the training data of a model? Please let us know, either by opening an issue or, ideally, by submitting a PR annotating the training datasets of the model:

```py
model_w_contamination = ModelMeta(
    name="model-with-contamination",
    ...  # other metadata fields
    training_datasets={
        "ArguAna": ["test"],  # name of the dataset within MTEB -> splits trained on
    },
    ...
)
```


</details>

<details>
22 changes: 17 additions & 5 deletions docs/create_tasks_table.py
@@ -8,6 +8,7 @@

import mteb
from mteb.abstasks.TaskMetadata import PROGRAMMING_LANGS, TASK_TYPE
from mteb.languages import ISO_TO_FAM_LEVEL0, ISO_TO_LANGUAGE


def author_from_bibtex(bibtex: str | None) -> str:
@@ -82,10 +83,21 @@ def create_task_lang_table(tasks: list[mteb.AbsTask], sort_by_sum=False) -> str:
## Wrangle for polars
pl_table_dict = []
for lang, d in table_dict.items():
d.update({"0-lang": lang}) # for sorting columns
d.update({"0-lang-code": lang}) # for sorting columns
pl_table_dict.append(d)

df = pl.DataFrame(pl_table_dict).sort(by="0-lang")
df = pl.DataFrame(pl_table_dict).sort(by="0-lang-code")
df = df.with_columns(
pl.col("0-lang-code")
.replace_strict(ISO_TO_LANGUAGE, default="unknown")
.alias("1-lang-name")
)
df = df.with_columns(
pl.col("0-lang-code")
.replace_strict(ISO_TO_FAM_LEVEL0, default="Unclassified")
.alias("2-lang-fam")
)

df = df.with_columns(sum=pl.sum_horizontal(get_args(TASK_TYPE)))
df = df.select(sorted(df.columns))
if sort_by_sum:
@@ -96,7 +108,7 @@ def create_task_lang_table(tasks: list[mteb.AbsTask], sort_by_sum=False) -> str:
task_names_md = " | ".join(sorted(get_args(TASK_TYPE)))
horizontal_line_md = "---|---" * (len(sorted(get_args(TASK_TYPE))) + 1)
table = f"""
| Language | {task_names_md} | Sum |
| ISO Code | Language | Family | {task_names_md} | Sum |
|{horizontal_line_md}|
"""

@@ -119,14 +131,14 @@ def insert_tables(
file_path: str, tables: list[str], tags: list[str] = ["TASKS TABLE"]
) -> None:
"""Insert tables within <!-- TABLE START --> and <!-- TABLE END --> or similar tags."""
md = Path(file_path).read_text()
md = Path(file_path).read_text(encoding="utf-8")

for table, tag in zip(tables, tags):
start = f"<!-- {tag} START -->"
end = f"<!-- {tag} END -->"
md = md.replace(md[md.index(start) + len(start) : md.index(end)], table)

Path(file_path).write_text(md)
Path(file_path).write_text(md, encoding="utf-8")


def main():
16 changes: 13 additions & 3 deletions mteb/__init__.py
@@ -10,17 +10,23 @@
MTEB_RETRIEVAL_WITH_INSTRUCTIONS,
CoIR,
)
from mteb.evaluation import *
from mteb.encoder_interface import Encoder
from mteb.evaluation import MTEB
from mteb.load_results import BenchmarkResults, load_results
from mteb.models import get_model, get_model_meta, get_model_metas
from mteb.load_results.task_results import TaskResult
from mteb.models import (
SentenceTransformerWrapper,
get_model,
get_model_meta,
get_model_metas,
)
from mteb.overview import TASKS_REGISTRY, get_task, get_tasks

from .benchmarks.benchmarks import Benchmark
from .benchmarks.get_benchmark import BENCHMARK_REGISTRY, get_benchmark, get_benchmarks

__version__ = version("mteb") # fetch version from install metadata


__all__ = [
"MTEB_ENG_CLASSIC",
"MTEB_MAIN_RU",
@@ -40,4 +46,8 @@
"get_benchmarks",
"BenchmarkResults",
"BENCHMARK_REGISTRY",
"MTEB",
"TaskResult",
"SentenceTransformerWrapper",
"Encoder",
]
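This `__init__.py` rewrite is the core of the PR: the `from mteb.evaluation import *` star import is replaced with explicit imports plus an `__all__` list, so the package's public surface is deterministic and lint-checkable. A short sketch of what the explicit exports enable, assuming an installed `mteb` at this revision:

```python
import mteb
from mteb import MTEB, TaskResult  # now explicit re-exports, not star-import side effects

# With __all__ defined, the public API is enumerable and machine-checkable:
assert "MTEB" in mteb.__all__
assert "SentenceTransformerWrapper" in mteb.__all__

# `from mteb import *` now binds exactly the names listed in __all__, nothing incidental.
print(sorted(mteb.__all__))
```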
6 changes: 3 additions & 3 deletions mteb/abstasks/AbsTask.py
@@ -72,11 +72,11 @@ def __init__(self, seed: int = 42, **kwargs: Any):
torch.manual_seed(self.seed)
torch.cuda.manual_seed_all(self.seed)

def check_if_dataset_is_superseeded(self):
"""Check if the dataset is superseeded by a newer version"""
def check_if_dataset_is_superseded(self):
"""Check if the dataset is superseded by a newer version"""
if self.superseded_by:
logger.warning(
f"Dataset '{self.metadata.name}' is superseeded by '{self.superseded_by}', you might consider using the newer version of the dataset."
f"Dataset '{self.metadata.name}' is superseded by '{self.superseded_by}', you might consider using the newer version of the dataset."
)

def dataset_transform(self):
1 change: 1 addition & 0 deletions mteb/abstasks/TaskMetadata.py
@@ -168,6 +168,7 @@
"cc0-1.0",
"bsd-3-clause",
"gpl-3.0",
"lgpl-3.0",
"cdla-sharing-1.0",
"mpl-2.0",
]
44 changes: 31 additions & 13 deletions mteb/abstasks/__init__.py
@@ -1,15 +1,33 @@
from __future__ import annotations

from ..evaluation.LangMapping import *
from .AbsTask import *
from .AbsTaskBitextMining import *
from .AbsTaskClassification import *
from .AbsTaskClustering import *
from .AbsTaskMultilabelClassification import *
from .AbsTaskPairClassification import *
from .AbsTaskReranking import *
from .AbsTaskRetrieval import *
from .AbsTaskSpeedTask import *
from .AbsTaskSTS import *
from .AbsTaskSummarization import *
from .MultilingualTask import *
from .AbsTask import AbsTask
from .AbsTaskBitextMining import AbsTaskBitextMining
from .AbsTaskClassification import AbsTaskClassification
from .AbsTaskClustering import AbsTaskClustering
from .AbsTaskClusteringFast import AbsTaskClusteringFast
from .AbsTaskMultilabelClassification import AbsTaskMultilabelClassification
from .AbsTaskPairClassification import AbsTaskPairClassification
from .AbsTaskReranking import AbsTaskReranking
from .AbsTaskRetrieval import AbsTaskRetrieval
from .AbsTaskSpeedTask import AbsTaskSpeedTask
from .AbsTaskSTS import AbsTaskSTS
from .AbsTaskSummarization import AbsTaskSummarization
from .MultilingualTask import MultilingualTask
from .TaskMetadata import TaskMetadata

__all__ = [
"AbsTask",
"AbsTaskBitextMining",
"AbsTaskClassification",
"AbsTaskClustering",
"AbsTaskClusteringFast",
"AbsTaskMultilabelClassification",
"AbsTaskPairClassification",
"AbsTaskReranking",
"AbsTaskRetrieval",
"AbsTaskSpeedTask",
"AbsTaskSTS",
"AbsTaskSummarization",
"MultilingualTask",
"TaskMetadata",
]
57 changes: 55 additions & 2 deletions mteb/benchmarks/__init__.py
@@ -1,4 +1,57 @@
from __future__ import annotations

from mteb.benchmarks.benchmarks import *
from mteb.benchmarks.get_benchmark import *
from mteb.benchmarks.benchmarks import (
BRIGHT,
LONG_EMBED,
MTEB_DEU,
MTEB_EN,
MTEB_ENG_CLASSIC,
MTEB_EU,
MTEB_FRA,
MTEB_INDIC,
MTEB_JPN,
MTEB_KOR,
MTEB_MAIN_RU,
MTEB_MINERS_BITEXT_MINING,
MTEB_POL,
MTEB_RETRIEVAL_LAW,
MTEB_RETRIEVAL_MEDICAL,
MTEB_RETRIEVAL_WITH_INSTRUCTIONS,
SEB,
Benchmark,
CoIR,
MTEB_code,
MTEB_multilingual,
)
from mteb.benchmarks.get_benchmark import (
BENCHMARK_REGISTRY,
get_benchmark,
get_benchmarks,
)

__all__ = [
"Benchmark",
"MTEB_EN",
"MTEB_ENG_CLASSIC",
"MTEB_MAIN_RU",
"MTEB_RETRIEVAL_WITH_INSTRUCTIONS",
"MTEB_RETRIEVAL_LAW",
"MTEB_RETRIEVAL_MEDICAL",
"MTEB_MINERS_BITEXT_MINING",
"SEB",
"CoIR",
"MTEB_FRA",
"MTEB_DEU",
"MTEB_KOR",
"MTEB_POL",
"MTEB_code",
"MTEB_multilingual",
"MTEB_JPN",
"MTEB_INDIC",
"MTEB_EU",
"LONG_EMBED",
"BRIGHT",
"BENCHMARK_REGISTRY",
"get_benchmarks",
"get_benchmark",
]
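The benchmarks subpackage gets the same treatment; a brief usage sketch of the now-explicit registry (again assuming an installed `mteb` at this revision):

```python
from mteb.benchmarks import MTEB_ENG_CLASSIC, get_benchmarks

# Every registered benchmark is reachable through the explicit exports.
for benchmark in get_benchmarks():
    print(benchmark.name)

# Individual benchmark objects can also be imported directly:
print(len(MTEB_ENG_CLASSIC.tasks))
```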
44 changes: 44 additions & 0 deletions mteb/descriptive_stats/Classification/Ddisco.json
@@ -0,0 +1,44 @@
{
"test": {
"num_samples": 201,
"number_of_characters": 200062,
"number_texts_intersect_with_train": 1,
"min_text_length": 529,
"average_text_length": 995.3333333333334,
"max_text_length": 2050,
"unique_text": 201,
"unique_labels": 3,
"labels": {
"2": {
"count": 76
},
"3": {
"count": 115
},
"1": {
"count": 10
}
}
},
"train": {
"num_samples": 801,
"number_of_characters": 779241,
"number_texts_intersect_with_train": null,
"min_text_length": 492,
"average_text_length": 972.8352059925094,
"max_text_length": 2411,
"unique_text": 796,
"unique_labels": 3,
"labels": {
"1": {
"count": 30
},
"2": {
"count": 325
},
"3": {
"count": 446
}
}
}
}
@@ -0,0 +1,38 @@
{
"test": {
"num_samples": 1200,
"number_of_characters": 141679,
"number_texts_intersect_with_train": 0,
"min_text_length": 25,
"average_text_length": 118.06583333333333,
"max_text_length": 566,
"unique_text": 1200,
"unique_labels": 2,
"labels": {
"1": {
"count": 600
},
"0": {
"count": 600
}
}
},
"train": {
"num_samples": 330,
"number_of_characters": 37706,
"number_texts_intersect_with_train": null,
"min_text_length": 19,
"average_text_length": 114.26060606060607,
"max_text_length": 315,
"unique_text": 330,
"unique_labels": 2,
"labels": {
"1": {
"count": 165
},
"0": {
"count": 165
}
}
}
}