Fuyu processor: box coordinates #27083

pcuenca · 2023-10-26T14:00:40Z

What does this PR do?

PoC to post-process box coordinates returned by the model. The following should work:

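import torch
from PIL import Image
from transformers import AutoTokenizer, FuyuForCausalLM, FuyuImageProcessor, FuyuProcessor

# Assumed values; the original snippet does not define them
model_id = "adept/fuyu-8b"
device = "cuda:0"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)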
model = FuyuForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
processor = FuyuProcessor(image_processor=FuyuImageProcessor(), tokenizer=tokenizer)

# Prompt appropriate for bounding box detection
text = "statistics"
prompt = f"When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\n{text}"
image = Image.open("screen2words_ui_example.png")

model_inputs = processor(text=prompt, images=[image]).to(device)
generation_output = model.generate(**model_inputs, max_new_tokens=40)

results = processor.post_process_box_coordinates(generation_output, target_sizes=torch.Tensor([image.size[::-1]]))

# TODO: maybe unbox the <box> here as well??
decoded = processor.decode(results[0], skip_special_tokens=True)
print(decoded)
# <box>60, 124, 100, 268</box>

I'd like to validate whether this approach is appropriate. What do you think, @amyeroberts? If it is, then we can:

Support point coordinates too.
Perform the reverse transformations on input prompts. There's already code in the processor for that purpose, I think we could maybe simplify it a bit.
Maybe provide an optional resizing + padding pre-processing step for images, only for the bounding box detection task. According to our conversations with the original authors (and our tests), this task only works properly when the input image size is close to (1080, 1920). The correct approach is to downscale larger images, and then pad to match that size.
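To make the last item concrete, here is a minimal sketch of the size arithmetic for "downscale, then pad". It is not code from this PR. It assumes that (1080, 1920) means (height, width), that smaller images are never upscaled, and that padding goes on the bottom and right edges; the name `fit_and_pad` is hypothetical.

```python
from typing import Tuple

# Assumption: the target size from the description is (height, width).
TARGET_SIZE: Tuple[int, int] = (1080, 1920)


def fit_and_pad(
    height: int, width: int, target: Tuple[int, int] = TARGET_SIZE
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((new_height, new_width), (pad_bottom, pad_right)).

    Larger images are downscaled with their aspect ratio preserved;
    smaller images keep their size. The padding fills the remainder
    up to the target size.
    """
    target_h, target_w = target
    # Never upscale: cap the scale factor at 1.0.
    scale = min(1.0, target_h / height, target_w / width)
    # Clamp to the target to guard against rounding up by one pixel.
    new_h = min(target_h, round(height * scale))
    new_w = min(target_w, round(width * scale))
    return (new_h, new_w), (target_h - new_h, target_w - new_w)
```

For example, `fit_and_pad(2160, 3840)` gives `((1080, 1920), (0, 0))`, while `fit_and_pad(540, 960)` keeps the image size and reports `(540, 960)` pixels of padding. The same scale factor would be needed later to map predicted coordinates back to the original image.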

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts, @molbap

Co-authored-by: Xingcheng Yao <42709675+yaoxingcheng@users.noreply.github.com>

HuggingFaceDocBuilderDev · 2023-10-26T14:20:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

amyeroberts

@pcuenca Nice! LGTM :)

My vote would be not to add unboxing the <box> in decode, as it's a very common method with a standard API and is traditionally used as the inverse of encoding with the tokenizer.
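If decode stays a plain inverse of encoding, callers who want numeric coordinates could parse the decoded string themselves. The helper below is only a sketch of that idea and is not part of the processor API. The name `parse_boxes` is hypothetical, and the pattern assumes the `<box>a, b, c, d</box>` text format shown in the example output. The thread does not state the coordinate order, so the values are returned unlabelled.

```python
import re
from typing import List, Tuple

# Matches the textual form printed in the PR description,
# e.g. "<box>60, 124, 100, 268</box>".
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(decoded: str) -> List[Tuple[int, ...]]:
    """Return every box found in a decoded string as a tuple of four ints."""
    return [
        tuple(int(value) for value in match.groups())
        for match in BOX_PATTERN.finditer(decoded)
    ]
```

With the output from the description, `parse_boxes("<box>60, 124, 100, 268</box>")` returns `[(60, 124, 100, 268)]`, and a string without any box yields an empty list.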

amyeroberts · 2023-10-26T17:50:57Z

src/transformers/models/fuyu/processing_fuyu.py

+            except:
+                return tokens
+
+            if bbox_end_pos != bbox_start_pos + 5:


Where does the 5 come from here?

pcuenca · 2023-10-27

Sorry, should have explained!

The model returns coordinates in the following format:

Beginning of bbox delimiter, which is a single token id.

4 token ids corresponding to the scaled coordinate numbers, without any delimiters.

End of bbox delimiter, another single token id.

So we find the begin and end delimiters, and verify that there are exactly 4 token ids in-between.
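As a rough illustration of that check (a sketch on plain Python lists, not the code in this PR): if the opening delimiter sits at index `start`, the four coordinate tokens occupy `start + 1` through `start + 4`, so the closing delimiter must be at `start + 5`. The token ids and function names below are hypothetical placeholders; the real ids come from the tokenizer.

```python
from typing import List, Optional, Tuple

# Hypothetical ids, chosen only for this example.
BBOX_OPEN_ID = 1001
BBOX_CLOSE_ID = 1002


def find_box_span(
    tokens: List[int],
    open_id: int = BBOX_OPEN_ID,
    close_id: int = BBOX_CLOSE_ID,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first well-formed box, or None."""
    try:
        start = tokens.index(open_id)
        end = tokens.index(close_id, start + 1)
    except ValueError:
        # One of the delimiters is missing.
        return None
    # Exactly four coordinate tokens must sit between the delimiters.
    if end != start + 5:
        return None
    return start, end


def extract_box_tokens(tokens: List[int]) -> Optional[List[int]]:
    """Return the four coordinate token ids of the first box, if any."""
    span = find_box_span(tokens)
    if span is None:
        return None
    start, end = span
    return tokens[start + 1 : end]
```

Catching only `ValueError` keeps the "delimiter not found" case explicit instead of swallowing unrelated errors.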

The same approach is taken for 2D point coordinates. I'll incorporate them now, as well as the reverse pre-processing transformation.

Thanks a lot for the quick review and comments @amyeroberts!

adhikjoshi · 2023-11-02T08:18:36Z

Can I use AutoModelForCausalLM and AutoProcessor instead of using Fuyu-specific pipelines?

amyeroberts · 2023-11-02T11:44:52Z

Hi @adhikjoshi, yes, you can load both the Fuyu model and its processor using AutoModelForCausalLM and AutoProcessor respectively.
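A minimal sketch of what that could look like, assuming the `adept/fuyu-8b` checkpoint (the thread does not name one) and a hypothetical helper called `load_fuyu`:

```python
from typing import Any, Tuple

# Assumed checkpoint name; substitute the one you actually use.
DEFAULT_MODEL_ID = "adept/fuyu-8b"


def load_fuyu(model_id: str = DEFAULT_MODEL_ID, **model_kwargs: Any) -> Tuple[Any, Any]:
    """Load the processor and model through the Auto classes."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    return processor, model
```

Keyword arguments such as `torch_dtype` or `device_map` are forwarded to `from_pretrained`, for instance `load_fuyu(device_map="cuda:0")`. Calling the helper downloads the checkpoint, so it needs network access and enough memory for the model.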

pcuenca and others added 2 commits October 26, 2023 15:53

Move to device

68b330b

Co-authored-by: Xingcheng Yao <42709675+yaoxingcheng@users.noreply.github.com>

Post-process box coordinates

339c954

pcuenca requested review from molbap and amyeroberts October 26, 2023 14:00

amyeroberts approved these changes Oct 26, 2023

amyeroberts reviewed Oct 26, 2023

pcuenca mentioned this pull request Oct 30, 2023

Fuyu processing: handle coordinates amyeroberts/transformers#113

Merged

This was referenced Oct 31, 2023

_clamp_coord in FuyuProcessor was not defined #27168

Closed

[Fuyu] Add tests #27001

Merged

molbap deleted the branch huggingface:fuyu_follow_up_image_processing November 2, 2023 11:25

molbap closed this Nov 2, 2023


pcuenca Oct 27, 2023
