
Longformer for longer context #195

Open
JohannesTK opened this issue Apr 16, 2020 · 7 comments

@JohannesTK

Hey,

Thank you for open-sourcing this super useful work! 🤗

I have a question - Longformer by AI2 supports up to 16K tokens and achieves SOTA on many long-text tasks. As described in many issues, long context is something a lot of people are looking for, and it could be extremely useful. Are you possibly looking to integrate Longformer into some of the models?

Thanks & stay healthy,
Johannes

@nreimers
Member

Hi @JohannesTK
Thanks for sharing. Looks really interesting.

I hope it will soon be integrated into Hugging Face Transformers. Then it will automatically also be available in this repository.

A challenge for these long-context models could be finding suitable training data. So far, the usual training data is sentence-based (like the NLI or STSb datasets). I am not sure if and how this would scale to longer inputs like paragraphs. There we often observe quite a performance drop, as it is not clear which information should be extracted into the vector.

Imagine you have paragraphs whose sentences talk about these topics:
Paragraph 1: A, A, A, B, A, A, A
Paragraph 2: C, C, B, C, C

These paragraphs are mostly different; however, they both have a sentence talking about 'B'. What should the vectors look like? Should paragraphs 1 and 2 be dissimilar, as they mainly talk about different things, or should they be similar, as they both include 'B'?

Maybe in these cases a double poly-encoder would be the right thing: Generate multiple vectors for each paragraph and then do some cross-comparison to find matches. In that case, B from paragraph 1 and B from paragraph 2 could match.
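As a rough, untested sketch of that cross-comparison (assuming we already have one embedding per sentence, from any encoder; all names here are placeholders):

```python
import numpy as np

def cross_compare(para_a: np.ndarray, para_b: np.ndarray) -> float:
    """para_a: (n, d) sentence vectors of paragraph A,
    para_b: (m, d) sentence vectors of paragraph B."""
    # Normalize so the dot product equals cosine similarity.
    a = para_a / np.linalg.norm(para_a, axis=1, keepdims=True)
    b = para_b / np.linalg.norm(para_b, axis=1, keepdims=True)
    sim = a @ b.T  # (n, m) pairwise sentence similarities
    # For every sentence in A, take its best match in B, then average.
    # The two 'B' sentences above would contribute one high entry.
    return float(sim.max(axis=1).mean())
```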

In conclusion, it will not be easy to implement and use this. But I will do my best.

Best
Nils Reimers

@JohannesTK
Author

Thank you for the quick reply, @nreimers

Indeed, the unequal mass between source and target documents is a challenge. It seems Facebook has largely addressed this recently in their work Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance (Section 4.4, Handling Imbalanced Document Mass).

Though the dataset from their work A Massive Collection of Cross-Lingual Web-Document Pairs does not seem to be open-sourced yet.

Best,
Johannes

@AlexMRuch

Definitely looking forward to such an addition too! Hopefully it'll get pushed to huggingface's transformers.

@timsuchanek
Contributor

timsuchanek commented May 28, 2020

It seems like both Reformer and Longformer are now available in the transformers library:

Longformer

huggingface/transformers#4352
https://huggingface.co/transformers/model_doc/longformer.html

Reformer

huggingface/transformers#3351
https://huggingface.co/transformers/model_doc/reformer.html

Looking forward to seeing them included here!

@nreimers
Member

Hi @timsuchanek
Looking forward to using them.

I think they can already be used with the models.Transformer class. Just set the model name to a pre-trained Longformer/Reformer checkpoint.

I don't know how they will perform. Will be interesting :)
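For example (untested; 'allenai/longformer-base-4096' is the public checkpoint on the model hub, and mean pooling is just one plausible choice):

```python
from sentence_transformers import SentenceTransformer, models

# Load a pre-trained Longformer as the word-embedding model and add
# mean pooling on top to get one fixed-size vector per input text.
word_embedding_model = models.Transformer('allenai/longformer-base-4096', max_seq_length=4096)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(['A very long document ...'])
```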

@cabhijith

Has anyone tried it out yet for semantic search? How did it perform?

@rjurney

rjurney commented Aug 29, 2020

@JohannesTK one wonders whether the third method, document mass normalization, would have applications for semantic search... Would weighting the query sentence as heavily as all the sentences in the document we’re computing a distance to result in reasonable performance?
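To make the idea concrete, here is a rough, hypothetical sketch (this is not the paper's implementation; the uniform masses and the relaxed transport bound are assumptions): give the whole query the same total mass as the whole document, so a single query sentence weighs as much as all document sentences combined.

```python
import numpy as np

def relaxed_smd(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (n, d), doc_vecs: (m, d) sentence embeddings."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    s = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cost = 1.0 - q @ s.T  # (n, m) cosine distances between sentences
    # Both sides get total mass 1, spread uniformly over their sentences,
    # so a one-sentence query carries as much mass as the whole document.
    q_mass = np.full(len(q), 1.0 / len(q))
    d_mass = np.full(len(s), 1.0 / len(s))
    # Relaxed lower bound on the transport cost: each side ships all of
    # its mass to the nearest sentence on the other side.
    move_q = (q_mass * cost.min(axis=1)).sum()
    move_d = (d_mass * cost.min(axis=0)).sum()
    return max(move_q, move_d)
```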
