
Add BROS #23190

Merged: 120 commits merged into huggingface:main on Sep 14, 2023

Conversation

jinhopark8345 (Contributor) commented May 7, 2023

What does this PR do?

Add BROS (BERT Relying On Spatiality) to 🤗 Transformers

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@NielsRogge

jinhopark8345 force-pushed the add-bros branch 2 times, most recently from 042ab4a to ad32c01, on May 10, 2023
amyeroberts (Collaborator)

@jinhopark8345 Awesome work - looking forward to having this model added! Feel free to ping us when the PR is ready for review or you have any implementation questions in the meantime.

jinhopark8345 (Contributor, Author) commented May 14, 2023

@amyeroberts

I am confused about what needs to be done.

According to the How to add a new model guideline, a big part of the work is porting pretrained models (from the original repo) into Hugging Face Transformers and making sure they are ported correctly by checking the outputs of each layer's forward step.
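
(That layer-by-layer check amounts to something like the sketch below; original_model and hf_model are placeholders for the reference model and the ported one, and outputs are assumed to be plain tensors.)

import torch

def check_forward_equivalence(original_model, hf_model, inputs, atol=1e-4):
    """Run both models on identical inputs and compare their output tensors."""
    with torch.no_grad():
        original_out = original_model(**inputs)
        ported_out = hf_model(**inputs)
    # Raises with a detailed diff if the tensors differ beyond tolerance.
    torch.testing.assert_close(ported_out, original_out, rtol=0.0, atol=atol)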

However, it seems like the authors of the Bros model used transformers-cli to create the boilerplate code, and I don't think there is much to change from the original code.

Do I need to write a conversion script? Or can I skip this step and move on to adding the model tests?

Thanks for the help in advance!

amyeroberts (Collaborator)

@jinhopark8345 Interesting - that will definitely make things easier! In this case, if the files are already on the hub and in the correct format, there's no need for the conversion script. It's possible there might be additional arguments required in the config files or additional files needed in the hub repo, in which case, I'd suggest writing a script to add these. You probably won't be able to write directly to the org's repo, but can open a PR with any necessary changes.
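
(For reference, a hedged sketch of scripting such a hub change with huggingface_hub; the file and repo names are illustrative, and create_pr=True opens a PR rather than writing directly to the org's repo:)

from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="config.json",  # locally updated config
    path_in_repo="config.json",
    repo_id="naver-clova-ocr/bros-base-uncased",  # illustrative repo id
    create_pr=True,  # open a PR instead of pushing directly
)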

jinhopark8345 (Contributor, Author)

@amyeroberts I added the convert_bros_to_pytorch.py script because the bbox_projection layer (a linear layer) moved from BrosTextEmbeddings to the newly added BrosBboxEmbedding.

amyeroberts changed the title from [WIP] Add BROS to Add BROS on Sep 14, 2023
README.md Outdated
Comment on lines 313 to 314
1. **[Bros](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park.
1. **[BROS](https://huggingface.co/docs/transformers/main/model_doc/bros)** (from <FILL INSTITUTION>) released with the paper [<FILL PAPER TITLE>](<FILL ARKIV LINK>) by <FILL AUTHORS>.

Collaborator

This needs to be fixed for the documentation to build. Once resolved we can merge 🤗

Contributor (Author)

Applied the fix!

amyeroberts (Collaborator)

@jinhopark8345 Great - thank you! Could you rebase on main to resolve the conflicts? We should be good to go after that :)

amyeroberts merged commit 17fdd35 into huggingface:main on Sep 14, 2023
amyeroberts (Collaborator)

@jinhopark8345 Thanks for contributing this model! Make sure to share its addition to the library on Twitter/LinkedIn/your medium of choice 🤗

ydshieh (Collaborator) commented Sep 19, 2023

@jinhopark8345 Thank you for adding this model into transformers.

Regarding the test BrosModelIntegrationTest.test_inference_no_head, it fails on our T4 GPU VM because the expected and actual outputs differ too much, as you can see below.

Could you double check on your machines, please? And what's your machine (GPU) type?

Thank you in advance.

(Pdb) outputs.last_hidden_state[0, :3, :3]
tensor([[-0.3165,  0.0830, -0.1203],
        [-0.0089,  0.0031,  0.0736],
        [-0.0461,  0.0146,  0.0880]], device='cuda:0')
(Pdb) expected_slice
tensor([[-0.4027,  0.0756, -0.0647],
        [-0.0192, -0.0065,  0.1042],
        [-0.0671,  0.0214,  0.0960]], device='cuda:0')

You can run the test with

TF_FORCE_GPU_ALLOW_GROWTH=true RUN_SLOW=1 python3 -m pytest -v tests/models/bros/test_modeling_bros.py::BrosModelIntegrationTest::test_inference_no_head
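
(For context, the failing check boils down to a slice comparison along these lines; this is an illustrative sketch, not the exact test code, and outputs is assumed to come from the forward pass shown in the pdb session above.)

import torch

expected_slice = torch.tensor(
    [[-0.4027, 0.0756, -0.0647],
     [-0.0192, -0.0065, 0.1042],
     [-0.0671, 0.0214, 0.0960]]
)
# Fails on the T4 run above because the actual values drift well past tolerance.
torch.testing.assert_close(
    outputs.last_hidden_state[0, :3, :3].cpu(), expected_slice, rtol=1e-4, atol=1e-4
)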

jinhopark8345 (Contributor, Author) commented Sep 19, 2023


@ydshieh Thank you for providing the test command!

I was able to reproduce the issue, but the outputs.last_hidden_state[0, :3, :3] value I obtained was different.
Interestingly, not only was the output different from the expected_slice, but the value of outputs.last_hidden_state[0, :3, :3] also changed every time I ran the command you provided. For testing, I am using an RTX 3090.

After some testing, I found that some weights weren't being initialized properly.
The issue is that the bbox_projection layer (a linear layer) was moved from BrosEmbeddings to the newly added BrosBboxEmbeddings class.

By changing:

model = BrosModel.from_pretrained("naver-clova-ocr/bros-base-uncased").to(torch_device)

to:

model = BrosModel.from_pretrained("jinho8345/bros-base-uncased").to(torch_device)

I was able to get consistent outputs (conversion script: transformers/models/bros/convert_bros_to_pytorch.py).
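
(For anyone reproducing this, a minimal sketch of running the updated checkpoint end to end; the text and bbox values below are illustrative, not the test's actual inputs.)

import torch
from transformers import BrosModel, BrosProcessor

processor = BrosProcessor.from_pretrained("jinho8345/bros-base-uncased")
model = BrosModel.from_pretrained("jinho8345/bros-base-uncased").eval()

encoding = processor("His name is Rocco.", return_tensors="pt")
seq_len = encoding["input_ids"].shape[1]
# One (x0, y0, x1, y1) box per token, normalized to [0, 1]; real boxes
# would come from an OCR engine.
bbox = torch.rand(1, seq_len, 4)

with torch.no_grad():
    outputs = model(input_ids=encoding["input_ids"], bbox=bbox)

print(outputs.last_hidden_state[0, :3, :3])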

I suspect this issue wasn't detected earlier because when running:

python3 -m pytest -v tests/models/bros/test_modeling_bros.py

the torch CUDA seed is manually set to a certain value, perhaps by other tests or for other reasons.

The update is here, but I am not sure how I should apply this patch to the Transformers library.

ydshieh (Collaborator) commented Sep 19, 2023

Hi @jinhopark8345, thanks a lot for looking into this!

You can open a PR to update the checkpoint repo used in the test, or we can do it on our own side.

But is it expected that naver-clova-ocr/bros-base-uncased doesn't have all the weights? What is the difference between these two checkpoints?

jinhopark8345 (Contributor, Author) commented Sep 19, 2023


The naver-clova-ocr/bros-base-uncased checkpoint has all the weights, but some of them have been renamed. So if we load BrosModel from the naver-clova-ocr/bros-base-uncased checkpoint (the original one), the renamed weights won't be initialized with the pretrained values.

These are the renamed weights:

def rename_key(name):
    if name == "embeddings.bbox_projection.weight":
        name = "bbox_embeddings.bbox_projection.weight"

    if name == "embeddings.bbox_sinusoid_emb.x_pos_emb.inv_freq":
        name = "bbox_embeddings.bbox_sinusoid_emb.x_pos_emb.inv_freq"

    if name == "embeddings.bbox_sinusoid_emb.y_pos_emb.inv_freq":
        name = "bbox_embeddings.bbox_sinusoid_emb.y_pos_emb.inv_freq"

    return name
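
(A conversion script would then apply this mapping across the whole state dict, roughly like the sketch below; this is illustrative, not the exact convert_bros_to_pytorch.py code.)

def convert_state_dict(state_dict):
    # Rename parameter keys; the tensor values themselves are untouched.
    return {rename_key(key): value for key, value in state_dict.items()}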

If you confirm updating the checkpoint is okay, I would like to open a PR!

ydshieh (Collaborator) commented Sep 19, 2023

Sure, go for it. BTW, I see a lot of naver-clova-ocr/bros-base-uncased used, in particular in the examples. So just to be sure, is the user expected to use naver-clova-ocr/bros-base-uncased or the renamed one jinho8345/bros-base-uncased?

From your description, I think it is jinho8345/bros-base-uncased. If this is the case, could you update all occurrences (not just in the tests)? Thank you!

jinhopark8345 mentioned this pull request on Sep 20, 2023
ydshieh (Collaborator) commented Sep 20, 2023

Hello @jinhopark8345, thank you again for fixing the checkpoint. I have yet another question that needs your help.

For BrosModel, the bbox_position_embeddings could be None before calling self.encoder (if bbox is None):

bbox_position_embeddings = None
if bbox is not None:
    # if bbox has 2 points (4 float tensors) per token, convert it to 4 points (8 float tensors) per token
    if bbox.shape[-1] == 4:
        bbox = bbox[:, :, [0, 1, 2, 1, 2, 3, 0, 3]]
    scaled_bbox = bbox * self.config.bbox_scale
    bbox_position_embeddings = self.bbox_embeddings(scaled_bbox)
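
(Side note: the index pattern [0, 1, 2, 1, 2, 3, 0, 3] expands a 2-point box (x0, y0, x1, y1) into its four corners, clockwise from the top-left. A tiny illustration:)

import torch

bbox = torch.tensor([[[0.1, 0.2, 0.3, 0.4]]])  # (x0, y0, x1, y1)
print(bbox[:, :, [0, 1, 2, 1, 2, 3, 0, 3]])
# tensor([[[0.1000, 0.2000, 0.3000, 0.2000, 0.3000, 0.4000, 0.1000, 0.4000]]])
# i.e. (x0, y0), (x1, y0), (x1, y1), (x0, y1)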

But eventually, BrosSelfAttention will fail if it receives None for bbox_pos_emb:

bbox_pos_emb = bbox_pos_emb.view(seq_length, seq_length, batch_size, d_head)

Could you double-check whether, in the original implementation, BrosModel only works if bbox is not None? If that is not the case, how is bbox_pos_emb created when bbox is None?

Thank you in advance, again!

jinhopark8345 (Contributor, Author)

Hello @ydshieh, thank you for asking!

The code below is from the original implementation:

        scaled_bbox = bbox * self.config.bbox_scale
        bbox_pos_emb = self.embeddings.calc_bbox_pos_emb(
            scaled_bbox, self.config.pe_type
        )

In the original implementation, BrosModel only works if bbox is not None.

Would it be more helpful to users if we remove

so that BrosModel fails earlier? Or do you suggest a different solution?

ydshieh (Collaborator) commented Sep 20, 2023

Hi! In this case, you can add a try/except at the beginning of the BrosModel.forward method as input validation.
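
(A minimal sketch of such validation, assuming a plain explicit check rather than a literal try/except; the fix that was actually merged may differ.)

# Hypothetical excerpt of the top of BrosModel.forward, not the merged code.
def forward(self, input_ids=None, bbox=None, **kwargs):
    if bbox is None:
        raise ValueError("`bbox` is required for BrosModel but was None.")
    # ... rest of the forward pass ...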

(we might need a few more fixes if CI fails due to this)

Thank you!

jinhopark8345 mentioned this pull request on Sep 20, 2023
NielsRogge (Contributor)

Hi @jinhopark8345 , congrats on this amazing contribution.

Feel free to share about it on Twitter/LinkedIn and we'll amplify.

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* add Bros boilerplate

* copy and pasted modeling_bros.py from official Bros repo

* update copyright of bros files

* copy tokenization_bros.py from official repo and update import path

* copy tokenization_bros_fast.py from official repo and update import path

* copy configuration_bros.py from official repo and update import path

* remove trailing period in copyright line

* copy and paste bros/__init__.py from official repo

* save formatting

* remove unused unnecessary pe_type argument - using only crel type

* resolve import issue

* remove unused model classes

* remove unnecessary tests

* remove unused classes

* fix original code's bug - layer_module's argument order

* clean up modeling auto

* add bbox to prepare_config_and_inputs

* set temporary value to hidden_size (32 is too low because of Bros' positional embedding)

* remove decoder test, update create_and_check* input arguments

* add missing variable to model tests

* do make fixup

* update bros.mdx

* add boilerplate for no_head inference test

* update BROS_PRETRAINED_MODEL_ARCHIVE_LIST (add naver-clova-ocr prefix)

* add prepare_bros_batch_inputs function

* update modeling_common to add bbox inputs in Bros Model Test

* remove unnecessary model inference

* add test case

* add model_doc

* add test case for token_classification

* apply fixup

* update modeling code

* update BrosForTokenClassification loss calculation logic

* revert logits preprocessing logic to make sure logits have original shape

* - update class name

* - add BrosSpadeOutput
- update BrosConfig arguments

* add boilerplate for no_head inference test

* add prepare_bros_batch_inputs function

* add test case

* add test case for token_classification

* update modeling code

* update BrosForTokenClassification loss calculation logic

* revert logits preprocessing logic to make sure logits have original shape

* apply masking on the fly

* add BrosSpadeForTokenLinking

* update class name
put docstring to the beginning of the file

* separate the logits calculation logic and loss calculation logic

* update logic for loss calculation so that logits shape doesn't change
when return

* update typo

* update prepare_config_and_inputs

* update dummy node initialization

* update last_hidden_states getting logic to consider when return_dict is False

* update box first token mask param

* bugfix: remove random attention mask generation

* update keys to ignore on load missing

* run make style and quality

* apply make style and quality of other codes

* update box_first_token_mask to bool type

* update index.md

* apply make style and quality

* apply make fix-copies

* pass check_repo

* update bros model doc

* docstring bugfix

* add checkpoint for doc, tokenizer for doc

* Update README.md

* Update docs/source/en/model_doc/bros.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update bros.md

* Update src/transformers/__init__.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/bros.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* apply suggestions from code review

* apply suggestions from code review

* revert test_processor_markuplm.py

* Update test_processor_markuplm.py

* apply suggestions from code review

* apply suggestions from code review

* apply suggestions from code review

* update BrosSpadeELForTokenClassification head name to entity linker

* add doc string for config params

* update class, var names to more explicit and apply suggestions from code review

* remove unnecessary keys to ignore

* update relation extractor to be initialized with config

* add bros processor

* apply make style and quality

* update bros.md

* remove bros tokenizer, add bros processor that wraps bert tokenizer

* revert change

* apply make fix-copies

* update processor code, update itc -> initial token, stc -> subsequent token

* add type hint

* remove unnecessary condition branches in embedding forward

* fix auto tokenizer fail

* update docstring for each classes

* update bbox input dimension as standard 2 points and convert them to 4
points in forward pass

* update bros docs

* apply suggestions from code review : update Bros -> BROS in bros.md

* 1. box prefix var -> bbox
2. update variable names to be more explicit

* replace einsum with torch matmul

* apply style and quality

* remove unused argument

* remove unused arguments

* update docstrings

* apply suggestions from code review: add BrosBboxEmbeddings, replace
einsum with classical matrix operations

* revert einsum update

* update bros processor

* apply suggestions from code review

* add conversion script for bros

* Apply suggestions from code review

* fix readme

* apply fix-copies

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Prathyusha-Akundi

Hi @jinhopark8345,
Can you please provide examples of how to use logits from BrosSpadeELForTokenClassification to identify the intra-relationships?
TIA

jinhopark8345 (Contributor, Author)

Hi @Prathyusha-Akundi,

You can refer to the example notebook for identifying intra-relationships.

If you are looking for information on entity linking versus entity extraction, you can check out the explanation of entity linking vs. entity extraction here.
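
(For rough intuition, assuming entity-linking logits of shape (batch, seq_len, seq_len) where entry [b, i, j] scores "token i links to token j"; the notebook has the exact decoding, including dummy-node and masking details.)

import torch

# Hypothetical logits; a real run would take them from the model output of
# BrosSpadeELForTokenClassification.
logits = torch.randn(1, 6, 6)
predicted_links = logits.argmax(dim=-1)[0]  # predicted target index per token
for src, tgt in enumerate(predicted_links.tolist()):
    print(f"token {src} -> token {tgt}")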

Prathyusha-Akundi

Thank you @jinhopark8345 , this is extremely helpful!
