Fuyu processor: box coordinates #27083
Co-authored-by: Xingcheng Yao <42709675+yaoxingcheng@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@pcuenca Nice! LGTM :)
My vote would be not to add unboxing of the <box> in decode, as it's a very common method with a standard API and is traditionally used as the inverse of encoding with the tokenizer.
    except:
        return tokens

    if bbox_end_pos != bbox_start_pos + 5:
Where does the 5 come from here?
Sorry, should have explained!
The model returns coordinates in the following format:
- Beginning of bbox delimiter, which is a single token id.
- 4 token ids corresponding to the scaled coordinate numbers, without any delimiters.
- End of bbox delimiter, another single token id.
So we find the begin and end delimiters, and verify that there are exactly 4 token ids in-between.
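The check described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: the delimiter ids `BBOX_OPEN_ID` and `BBOX_CLOSE_ID` and the helper name `extract_bbox_ids` are hypothetical, and the real ids would come from the tokenizer. It also returns `None` on failure, whereas the snippet under review returns the tokens unchanged.

```python
from typing import List, Optional

# Hypothetical delimiter ids, chosen only for illustration.
BBOX_OPEN_ID = 1000
BBOX_CLOSE_ID = 1001


def extract_bbox_ids(
    tokens: List[int],
    open_id: int = BBOX_OPEN_ID,
    close_id: int = BBOX_CLOSE_ID,
) -> Optional[List[int]]:
    """Return the 4 coordinate token ids between the first pair of bbox delimiters.

    Returns None if a delimiter is missing or the span does not hold exactly 4 ids.
    """
    try:
        start = tokens.index(open_id)
        end = tokens.index(close_id, start + 1)
    except ValueError:
        # One of the delimiters is absent.
        return None

    # 1 opening delimiter + 4 coordinate ids: the closing delimiter
    # must sit exactly 5 positions after the opening one.
    if end != start + 5:
        return None

    return tokens[start + 1 : end]
```

For example, `extract_bbox_ids([7, 1000, 11, 12, 13, 14, 1001, 9])` yields the four ids between the delimiters, while a span with only two ids is rejected.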
The same approach is taken for 2D point coordinates; I'll incorporate them now, along with the reverse pre-processing transformation.
Thanks a lot for the quick review and comments @amyeroberts!
Can I use AutoModelForCausalLM and AutoProcessor instead of using Fuyu-specific pipelines?
Hi @adhikjoshi, yes, you can load both the Fuyu model and its processor using
What does this PR do?
PoC to post-process box coordinates returned by the model. The following should work:
I'd like to validate whether this approach is appropriate, what do you think @amyeroberts? If it is, then we can:
- Support point coordinates too.
- … (1080, 1920). The correct approach is to downscale larger images, and then pad to match that size.
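The "downscale larger images, then pad" idea can be sketched as plain arithmetic. This is a rough illustration under stated assumptions, not the processor's implementation: it assumes `(1080, 1920)` means `(height, width)`, that aspect ratio is preserved, that smaller images are never upscaled, and that padding is added on the bottom and right. The helper name `resize_and_pad_plan` is hypothetical.

```python
from typing import Tuple

# Assumption: the target size is given in (height, width) order.
TARGET_HEIGHT, TARGET_WIDTH = 1080, 1920


def resize_and_pad_plan(
    height: int,
    width: int,
    target_height: int = TARGET_HEIGHT,
    target_width: int = TARGET_WIDTH,
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Compute the resized (height, width) and the (bottom, right) padding.

    Images larger than the target are downscaled, keeping their aspect ratio;
    smaller images keep their size. Padding fills the rest of the target canvas.
    """
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(target_height / height, target_width / width, 1.0)

    new_height = min(target_height, max(1, round(height * scale)))
    new_width = min(target_width, max(1, round(width * scale)))

    pad_bottom = target_height - new_height
    pad_right = target_width - new_width
    return (new_height, new_width), (pad_bottom, pad_right)
```

Under these assumptions a 2160x3840 image is halved and needs no padding, while a 500x800 image keeps its size and is padded by 580 rows and 1120 columns. Box coordinates predicted on the padded canvas would then need the inverse scale applied to map them back to the original image.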
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts, @molbap