
Longformer for longer context #195

Open
JohannesTK opened this issue Apr 16, 2020 · 7 comments

@JohannesTK

Hey,

Thank you for open-sourcing this super useful work! 🤗

I have a question - Longformer by AI2 supports up to 16K tokens and achieves SOTA on many long-text tasks. As described in many issues, long context is something a lot of people are looking for, and it could be extremely useful. Are you possibly looking to integrate Longformer into some of the models?

Thanks & stay healthy,
Johannes

@nreimers
Member

Hi @JohannesTK
Thanks for sharing. Looks really interesting.

I hope it will soon be integrated into Hugging Face Transformers. Then it will automatically also be available in this repository.

A challenge for these long-context models could be finding suitable training data. So far, the usual training data is sentence-based (like the NLI or STSb datasets). I am not sure if and how this would scale to longer inputs like paragraphs. There we often observe quite a performance drop, as it is not clear which information should be extracted into the vector.

Imagine you have paragraphs whose sentences talk about these topics:
Paragraph 1: A, A, A, B, A, A, A
Paragraph 2: C, C, B, C, C

These paragraphs are mostly different; however, they both have a sentence talking about 'B'. What should the vectors look like? Should paragraphs 1 and 2 be dissimilar, as they mainly talk about different things, or should they be similar, as they both include 'B'?

Maybe in these cases a double poly-encoder would be the right thing: Generate multiple vectors for each paragraph and then do some cross-comparison to find matches. In that case, B from paragraph 1 and B from paragraph 2 could match.
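As a rough, untested sketch of that cross-comparison (assuming we already have one embedding per sentence, from any encoder; all names here are placeholders):

```python
import numpy as np

def cross_compare(para_a: np.ndarray, para_b: np.ndarray) -> float:
    """para_a: (n, d) sentence vectors of paragraph A,
    para_b: (m, d) sentence vectors of paragraph B."""
    # Normalize so the dot product equals cosine similarity.
    a = para_a / np.linalg.norm(para_a, axis=1, keepdims=True)
    b = para_b / np.linalg.norm(para_b, axis=1, keepdims=True)
    sim = a @ b.T  # (n, m) pairwise sentence similarities
    # For every sentence in A, take its best match in B, then average.
    # The two 'B' sentences above would contribute one high entry.
    return float(sim.max(axis=1).mean())
```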

In conclusion, it will not be easy to implement and use this. But I will do my best.

Best
Nils Reimers

@JohannesTK
Author

Thank you for the quick reply, @nreimers

Indeed, the unequal mass between source and target documents is a challenge. It seems Facebook has largely addressed this recently in their work Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance (Section 4.4, Handling Imbalanced Document Mass).

Though the dataset from their work A Massive Collection of Cross-Lingual Web-Document Pairs does not seem to be open-sourced yet.

Best,
Johannes

@AlexMRuch

Definitely looking forward to such an addition too! Hopefully it'll get pushed to huggingface's transformers.

@timsuchanek
Contributor

timsuchanek commented May 28, 2020

It seems like both Reformer and Longformer are now available in the transformers library:

Longformer

huggingface/transformers#4352
https://huggingface.co/transformers/model_doc/longformer.html

Reformer

huggingface/transformers#3351
https://huggingface.co/transformers/model_doc/reformer.html

Looking forward to seeing them included here!

@nreimers
Member

Hi @timsuchanek
Looking forward to using them.

I think they can already be used with the models.Transformer class. Just set the model name to a pre-trained Longformer/Reformer checkpoint.

I don't know how they will perform. Will be interesting :)
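For example (untested; 'allenai/longformer-base-4096' is the public checkpoint on the model hub, and mean pooling is just one plausible choice):

```python
from sentence_transformers import SentenceTransformer, models

# Load a pre-trained Longformer as the word-embedding model and add
# mean pooling on top to get one fixed-size vector per input text.
word_embedding_model = models.Transformer('allenai/longformer-base-4096', max_seq_length=4096)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(['A very long document ...'])
```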

@cabhijith

Has anyone tried it out yet for semantic search? How did it perform?

@rjurney

rjurney commented Aug 29, 2020

@JohannesTK one wonders whether the third method, document mass normalization, would have applications for semantic search... Would weighting the query sentence as heavily as all the sentences in the document we’re computing a distance to result in reasonable performance?
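To make the idea concrete, here is a rough, hypothetical sketch (this is not the paper's implementation; the uniform masses and the relaxed transport bound are assumptions): give the whole query the same total mass as the whole document, so a single query sentence weighs as much as all document sentences combined.

```python
import numpy as np

def relaxed_smd(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (n, d), doc_vecs: (m, d) sentence embeddings."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    s = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cost = 1.0 - q @ s.T  # (n, m) cosine distances between sentences
    # Both sides get total mass 1, spread uniformly over their sentences,
    # so a one-sentence query carries as much mass as the whole document.
    q_mass = np.full(len(q), 1.0 / len(q))
    d_mass = np.full(len(s), 1.0 / len(s))
    # Relaxed lower bound on the transport cost: each side ships all of
    # its mass to the nearest sentence on the other side.
    move_q = (q_mass * cost.min(axis=1)).sum()
    move_d = (d_mass * cost.min(axis=0)).sum()
    return max(move_q, move_d)
```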
