Longformer for longer context #195
Comments
Hi @JohannesTK, I hope it will soon be integrated into huggingface transformers. Then it will automatically also be available in this repository. A challenge for Longformer could be finding suitable training data. So far, the usual training data are based on sentences (like the NLI or STSb datasets). It is not clear if and how this would scale to longer documents like paragraphs. Here we often observe quite a performance drop, as it is not clear how to extract this information. Imagine you have two paragraphs whose sentences each talk about a set of topics. The paragraphs would be mostly different; however, they both have a sentence talking about topic 'B'. What should the vectors then look like? Should paragraphs 1 & 2 be dissimilar, as they mainly talk about different things? Or should they be similar, as they both include 'B'? Maybe in these cases a double poly-encoder would be the right thing: generate multiple vectors for each paragraph and then do some cross-comparison to find matches. In that case, 'B' from paragraph 1 and 'B' from paragraph 2 could match. In conclusion, it will not be easy to implement and use this. But I will do my best. Best
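The cross-comparison idea above can be sketched in plain Python. This is only an illustrative sketch, not anything from the repository: it assumes each paragraph has already been encoded into one vector per sentence (the encoder itself is out of scope here), and pairs every sentence vector in one paragraph with its best-matching counterpart in the other, keeping pairs above a similarity threshold.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_compare(para1, para2, threshold=0.8):
    # For each sentence vector in para1, find its best match in para2.
    # Returns (index_in_para1, index_in_para2, similarity) for pairs
    # whose best similarity clears the threshold.
    matches = []
    for i, u in enumerate(para1):
        j, sim = max(
            ((j, cosine(u, v)) for j, v in enumerate(para2)),
            key=lambda t: t[1],
        )
        if sim >= threshold:
            matches.append((i, j, sim))
    return matches

# Toy example: sentence 1 of paragraph 1 and sentence 0 of paragraph 2
# point in nearly the same direction -- they both "talk about B".
para1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1]]
para2 = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(cross_compare(para1, para2))
```

With real sentence embeddings, the threshold would need tuning, and the matching could be made symmetric (best match in both directions), but the cross-comparison structure is the same.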
Thank you for the quick reply, @nreimers. Indeed, the unequal mass between source and target documents is a challenge. It seems Facebook has largely addressed it recently in their work Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance (section 4.4, Handling Imbalanced Document Mass). Though the dataset used in their work, A Massive Collection of Cross-Lingual Web-Document Pairs, seems yet to be open-sourced. Best,
Definitely looking forward to such an addition too! Hopefully it'll get pushed to huggingface's transformers.
It seems like both Longformer (huggingface/transformers#4352) and Reformer (huggingface/transformers#3351) have landed in huggingface/transformers. Looking forward to seeing them included here!
Hi @timsuchanek, I think they can already be used with the models.Transformer class. Just set the model name to a pre-trained Longformer/Reformer. I don't know how they will perform. Will be interesting :)
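For reference, a minimal sketch of what using the models.Transformer class with a long-context model looks like. The checkpoint name 'allenai/longformer-base-4096' is an assumption (any Hugging Face Longformer/Reformer model id should work the same way), and as noted above, how well the resulting embeddings perform is an open question:

```python
from sentence_transformers import SentenceTransformer, models

# Load a pre-trained Longformer as the word-embedding backbone.
# 'allenai/longformer-base-4096' is an assumed checkpoint name.
word_embedding_model = models.Transformer(
    'allenai/longformer-base-4096', max_seq_length=4096
)

# Mean pooling over token embeddings to get one vector per input text.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension()
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
embeddings = model.encode(['A very long document ...'])
```

This follows the standard sentence-transformers pattern of stacking a Transformer module and a Pooling module; the model has not been fine-tuned on any sentence-similarity objective, so the embeddings would likely need task-specific training to be useful.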
Has anyone tried it out yet for semantic search? How did it perform? |
@JohannesTK one wonders whether the third method, document mass normalization, would have applications for semantic search... would weighting the query sentence as heavily as all the sentences in the document we're computing a distance to result in reasonable performance?
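The mass-balancing idea above can be sketched with plain cosine similarities. This is only a crude illustration, assuming sentence vectors are given: the single query sentence carries total mass 1.0, and the document's mass 1.0 is split uniformly over its sentences, so a long document does not dominate the comparison. It is not the optimal-transport formulation from the Facebook paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mass_weighted_similarity(query_vec, doc_vecs):
    # Query has mass 1.0; document mass 1.0 is split uniformly over
    # its sentences, so each sentence contributes 1/len(doc_vecs).
    mass = 1.0 / len(doc_vecs)
    return sum(mass * cosine(query_vec, v) for v in doc_vecs)

# Toy example: query matches one of the two document sentences exactly.
score = mass_weighted_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

Under this weighting, a document that contains the query topic once among many unrelated sentences scores lower than a short document focused on it, which may or may not be the desired behavior for search.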
Hey,
Thank you for open-sourcing this super useful work! 🤗
I have a question: Longformer by AI2 supports up to 16K tokens and achieves SOTA on many long-text tasks. As described in many issues, long context is something a lot of people are looking for, and it could be extremely useful. Are you possibly looking to integrate Longformer into some of the models?
Thanks & stay healthy,
Johannes