
Add FaMTEB (Farsi/Persian Text Embedding Benchmark) #1843

Merged: 27 commits into embeddings-benchmark:main on Jan 30, 2025

Conversation

mehran-sarmadi
Contributor

@mehran-sarmadi mehran-sarmadi commented Jan 20, 2025

We are a research team from Sharif University of Technology and MCINext Company developing a text embedding benchmark for the Persian language based on MTEB. So far, we have gathered around 63 datasets spanning 7 tasks (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summary Retrieval), including a mix of existing, translated, and newly generated datasets. Notably, we are introducing the Summary Retrieval task for the first time; it focuses on identifying the correct summary of a paragraph from a set of candidates. We have also evaluated several Persian language models, as well as multilingual text embedding models that support Persian, on this benchmark.

We have also opened a related PR for the results and the leaderboard tab, and we are finalizing a paper on this work, which will be published in the near future.

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform() (a short sketch follows this checklist).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
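
For reference, here is a minimal sketch of the subsampling pattern mentioned in the checklist. It assumes a task class that keeps its data in self.dataset and inherits stratified_subsampling from mteb's AbsTask; the split name, seed handling, and default label column are assumptions, not details taken from this PR.

def dataset_transform(self):
    # Downsample large splits (e.g. >2048 examples) to keep evaluation affordable.
    # Assumptions: "test" is the evaluation split and self.seed comes from the base class.
    self.dataset = self.stratified_subsampling(
        self.dataset, seed=self.seed, splits=["test"]
    )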

@mehran-sarmadi mehran-sarmadi marked this pull request as ready for review January 25, 2025 14:41
Collaborator

@Samoed Samoed left a comment


Great addition! Can you add a mock task for AbsTaskSummaryRetrieval to https://github.com/embeddings-benchmark/mteb/blob/main/tests/test_benchmark/mock_tasks.py?

Review comments (now resolved) were left on:
mteb/abstasks/AbsTaskSummaryRetrieval.py
mteb/tasks/Classification/fas/FaMTEBClassification.py
mteb/tasks/SummaryRetrieval/fas/FaMTEBSummaryRetrieval.py
@Samoed
Collaborator

Samoed commented Jan 25, 2025

Maybe we should move this PR to v2 branch?

@mehran-sarmadi
Contributor Author

Maybe we should move this PR to v2 branch?

I haven’t checked the next version yet, so I’m not sure if any changes are needed. If needed, I’ll make the updates.

@mehran-sarmadi
Contributor Author

Great addition! Can you add a mock task for AbsTaskSummaryRetrieval to https://github.com/embeddings-benchmark/mteb/blob/main/tests/test_benchmark/mock_tasks.py?

Yes, it's done.

@mehran-sarmadi mehran-sarmadi marked this pull request as draft January 27, 2025 14:22
@mehran-sarmadi mehran-sarmadi marked this pull request as ready for review January 27, 2025 15:57
Collaborator

@isaac-chung isaac-chung left a comment


Hi @mehran-sarmadi,

First of all, thank you for your contribution! The community will benefit from the extended language coverage of this benchmark.

There are 2 main points that I'd like to discuss:

  1. While I understand the name of the new task type is summary retrieval, I believe the tasks can actually subclass from AbsTaskBitextMining with minimal changes. AbsTaskSummaryRetrieval and SummaryRetrievalEvaluator have mostly the same code as their BitextMining counterparts. For example, to resolve the difference in column names, we could define a dataset_transform() like:

def dataset_transform(self):
    self.dataset = self.dataset.rename_columns(
        {"text": "sentence1", "summary": "sentence2"}
    )

As for the evaluator, the BitextMiningEvaluator can simply be used after the above change. We can use the task metadata's description to indicate that this is a summary retrieval task. This way, we can revert all changes related to adding the new AbsTask (see the illustrative sketch below for why the reduction works).
  2. We are in the process of releasing an updated leaderboard. As such, we will not be reviewing proposed changes to the current leaderboard. Since you've already added a Benchmark object, it will be made available automatically once the new one is released. No need for additional PRs. We appreciate the foresight and effort, though.

Let me know if you have any further questions.
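
To make the suggested reduction concrete, here is a rough, self-contained illustration of why summary retrieval maps onto bitext mining once the columns are renamed. This is not mteb's actual BitextMiningEvaluator; the function and variable names are invented for this sketch, and any sentence embedding model could produce the inputs.

import numpy as np

def summary_retrieval_accuracy(paragraph_embs: np.ndarray, summary_embs: np.ndarray) -> float:
    # paragraph_embs: (n, d) embeddings of "sentence1" (the paragraphs)
    # summary_embs:   (n, d) embeddings of "sentence2" (the aligned summaries)
    a = paragraph_embs / np.linalg.norm(paragraph_embs, axis=1, keepdims=True)
    b = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    # Cosine similarity of every paragraph against every candidate summary.
    sims = a @ b.T
    # A pair counts as correct when the aligned summary (same row index) is
    # ranked first -- the same matching criterion bitext mining uses for sentence pairs.
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())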

A review comment (now resolved) was left on mteb/benchmarks/benchmarks.py.
@mehran-sarmadi
Contributor Author

Hi @isaac-chung,

Thank you for your detailed feedback and suggestions!

  1. I have updated the task to subclass from AbsTaskBitextMining as suggested.

  2. I understand the leaderboard update, thanks for the clarification!

I appreciate your guidance and the opportunity to contribute to this benchmark. Let me know if there's anything else I should consider or adjust.

Thanks again!

@isaac-chung
Collaborator

@mehran-sarmadi thanks for such a quick turnaround! The changes look good to me. cc @KennethEnevoldsen + @x-tabdeveloping on the added task type.

I think we'll be ready once the datasets that need to be aggregated have been specified.

@mehran-sarmadi
Contributor Author

@isaac-chung Glad to hear the changes look good.
Here are the groups that need to be specified:

"SynPerChatbotConvSAClassification": [
    "SynPerChatbotConvSAAnger",
    "SynPerChatbotConvSAFear",
    "SynPerChatbotConvSAFriendship",
    "SynPerChatbotConvSAHappiness",
    "SynPerChatbotConvSAJealousy",
    "SynPerChatbotConvSALove",
    "SynPerChatbotConvSASadness",
    "SynPerChatbotConvSASatisfaction",
    "SynPerChatbotConvSASurprise"
  ],
  "SynPerChatbotConvSAToneClassification": [
    "SynPerChatbotConvSAToneChatbotClassification",
    "SynPerChatbotConvSAToneUserClassification"
  ],
  "SynPerChatbotRAGToneClassification": [
    "SynPerChatbotRAGToneChatbotClassification",
    "SynPerChatbotRAGToneUserClassification"
  ],
  "SynPerChatbotToneClassification": [
    "SynPerChatbotToneChatbotClassification",
    "SynPerChatbotToneUserClassification"
  ],
  "CQADupstackRetrieval-Fa": [
    "CQADupstackAndroidRetrieval-Fa",
    "CQADupstackEnglishRetrieval-Fa",
    "CQADupstackGamingRetrieval-Fa",
    "CQADupstackGisRetrieval-Fa",
    "CQADupstackMathematicaRetrieval-Fa",
    "CQADupstackPhysicsRetrieval-Fa",
    "CQADupstackProgrammersRetrieval-Fa",
    "CQADupstackStatsRetrieval-Fa",
    "CQADupstackTexRetrieval-Fa",
    "CQADupstackUnixRetrieval-Fa",
    "CQADupstackWebmastersRetrieval-Fa",
    "CQADupstackWordpressRetrieval-Fa"
  ]

However, if you think any of these changes are unnecessary, we can skip them as needed.

@isaac-chung
Collaborator

@mehran-sarmadi thanks again. This is more for me to understand your paper better: Will you be reporting the aggregated scores per group in your paper only, or will you also report the individual task scores for those within groups? In general, we aim to be as close as possible to reproducing what's been reported. So if it is the latter, then these changes are fine as is, and aggregating can be a separate PR. But if it is the former (only reporting aggregated scores), then let's add in the AggregateTasks as well.

@mehran-sarmadi
Contributor Author

@isaac-chung Thanks for your question! For SynPerChatbotConvSAClassification and CQADupstackRetrieval-Fa, since they contain a large number of datasets, we have reported the scores in an aggregated manner.

For the other groups, since each contains only two datasets, we have reported them individually. I'll go ahead and add the AggregateTask for the two aggregated groups and will inform you once it's done.
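
For context, here is a rough sketch of what one of these aggregate definitions might look like, modeled loosely on the repository's existing CQADupstackRetrieval aggregate. The class and field names (AbsTaskAggregate, AggregateTaskMetadata) and the metadata values shown are assumptions for illustration and may not match the code that was actually merged.

import mteb
from mteb.abstasks.aggregated_task import AbsTaskAggregate, AggregateTaskMetadata

class SynPerChatbotConvSAClassification(AbsTaskAggregate):
    # Assumed field names; additional required metadata (reference,
    # bibtex_citation, etc.) is omitted from this sketch.
    metadata = AggregateTaskMetadata(
        name="SynPerChatbotConvSAClassification",
        description="Aggregate of the synthetic Persian chatbot conversational sentiment analysis tasks.",
        tasks=mteb.get_tasks(
            tasks=[
                "SynPerChatbotConvSAAnger",
                "SynPerChatbotConvSAFear",
                "SynPerChatbotConvSAFriendship",
                "SynPerChatbotConvSAHappiness",
                "SynPerChatbotConvSAJealousy",
                "SynPerChatbotConvSALove",
                "SynPerChatbotConvSASadness",
                "SynPerChatbotConvSASatisfaction",
                "SynPerChatbotConvSASurprise",
            ]
        ),
        main_score="accuracy",   # assumption: classification tasks report accuracy
        type="Classification",
        eval_splits=["test"],    # assumption
    )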

@mehran-sarmadi
Contributor Author

Hi,
I have added the combined version of those two sets of datasets. Now, I have just one question: Should I add them here?

@pytest.mark.parametrize("task_name", ["BornholmBitextMining", "CQADupstackRetrieval"])
@pytest.mark.parametrize("eval_splits", [["test"], None])
def test_get_task(task_name: str, eval_splits: list[str] | None):
    task = get_task(task_name, eval_splits=eval_splits)

in tests/test_overview.py, like CQADupstackRetrieval, or not?

If so, it would look like this:

@pytest.mark.parametrize("task_name", ["BornholmBitextMining", "CQADupstackRetrieval", "SynPerChatbotConvSAClassification", "CQADupstackRetrieval-Fa"])
@pytest.mark.parametrize("eval_splits", [["test"], None])
def test_get_task(task_name: str, eval_splits: list[str] | None):
    task = get_task(task_name, eval_splits=eval_splits)

@isaac-chung
Collaborator

Thanks @mehran-sarmadi, good work!

Since the test is to help us feel more confident about the implementation of AggregateTask, I don't think we need to add them there as we're simply using it.

If we want, I'd suggest adding them temporarily to run the test locally, but not committing them in the PR. This is optional.

I'll run the tests now, and if they pass, I think it's good to go.

@mehran-sarmadi
Contributor Author

Thanks, @isaac-chung!

That makes sense. I really appreciate your help.

Collaborator

@isaac-chung isaac-chung left a comment


Looks good. Thanks again!

@isaac-chung isaac-chung merged commit f3404b4 into embeddings-benchmark:main Jan 30, 2025
10 checks passed