support janus model #1140
I wonder why we need to keep |
I'm not sure I understand your question. This is the standard preprocessing/postprocessing part for transformers-based models (like tokenizers, feature_extractors, image processors, etc.); usually it is an independent object (except in the diffusers case). For VLM models it may be helpful to move it closer to the model, as it becomes more complicated and more tightly bound to it. So we could possibly consider keeping processors for other models as well (this may help align the results of save_pretrained and optimum-cli, which also save processors and tokenizers if they are available).
To clarify, I am just looking at the code in the PR description and wondering why it could not look like this:

```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)
...
inputs = model.preprocess_inputs(input_prompt, image)
streamer = TextStreamer(model.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=100, do_sample=False)
...
images = model.generate_image(image_gen_prompt, parallel_size=1)
```

But from what I understood, your implementation is aligned with diffusers, right?
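For readers following the thread, the two API shapes being compared can be illustrated with a toy sketch. Everything below is hypothetical: `ToyProcessor`, `SeparateStyleModel`, and `BundledStyleModel` are stand-ins invented for illustration and are not classes from optimum-intel or transformers. The sketch only shows where the processor lives in each design, not how the PR implements it.

```python
class ToyProcessor:
    """Hypothetical stand-in for a tokenizer / image processor pair."""

    def __call__(self, prompt: str, image: str) -> dict:
        # Pretend "tokenization" is whitespace splitting and the image passes through.
        return {"input_ids": prompt.split(), "pixel_values": image}


class SeparateStyleModel:
    """Design A: the processor is an independent object that the caller passes in."""

    @staticmethod
    def preprocess_inputs(prompt: str, image: str, processor: ToyProcessor) -> dict:
        return processor(prompt, image)


class BundledStyleModel:
    """Design B: the processor is stored on the model, so the caller never handles it."""

    def __init__(self, processor: ToyProcessor):
        self.processor = processor

    def preprocess_inputs(self, prompt: str, image: str) -> dict:
        return self.processor(prompt, image)


processor = ToyProcessor()

# Design A: caller loads the processor and hands it over on every call.
separate_inputs = SeparateStyleModel.preprocess_inputs(
    "describe the image", "fox.png", processor
)

# Design B: the processor is attached once, at construction time.
bundled_inputs = BundledStyleModel(processor).preprocess_inputs(
    "describe the image", "fox.png"
)
```

Both designs produce the same inputs; the difference is only whether the caller or the model object is responsible for holding the processor, which is the trade-off the comments above are weighing.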
@IlyasMoutawwakil @echarlaix could you please take a look?
Thanks for the addition @eaidova !
What does this PR do?
Conversion requires a fix on the optimum side: huggingface/optimum#2179
Multimodal understanding
Answer:
Text to Image generation
Generated Image
![fox](https://private-user-images.githubusercontent.com/29454499/409448172-5e2c3fd1-e8ce-406e-ba76-55688d69d337.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MjE2ODEsIm5iZiI6MTczOTYyMTM4MSwicGF0aCI6Ii8yOTQ1NDQ5OS80MDk0NDgxNzItNWUyYzNmZDEtZThjZS00MDZlLWJhNzYtNTU2ODhkNjlkMzM3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDEyMDk0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTUyNWMxYjk0YzZiNmIxNTRlNzQxOWFjYWViYzVhZjllNmM0ZDA0MTFmZWZjMmNmMzdhMTQ2NTcxNTZlZGMzYWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.6ORnqpts4Ghclo5V0MsLlGQ8F_FLbAbCg4MGLab-U9E)
Before submitting