
support janus model #1140

Open · wants to merge 10 commits into main from ea/janus
Conversation

@eaidova (Collaborator) commented Feb 4, 2025

What does this PR do?

Conversion requires a fix on the optimum side: huggingface/optimum#2179

```python
from io import BytesIO
from pathlib import Path

import requests
from janus.models import VLChatProcessor
from PIL import Image
from transformers import TextStreamer

from optimum.intel.openvino import OVModelForVisualCausalLM

model_id = "deepseek-ai/Janus-Pro-1B"

model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = VLChatProcessor.from_pretrained(model_id)
```

Multimodal understanding

```python
input_prompt = "Describe image in details"
image_path = Path("cat_in_box.png")

if not image_path.exists():
    response = requests.get(
        "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
    )
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image.save(image_path)

image = Image.open(image_path)

inputs = model.preprocess_inputs(input_prompt, image, processor)
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(**inputs, streamer=streamer, max_new_tokens=100, do_sample=False)
```
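For readers unfamiliar with streaming generation, the streamer pattern can be illustrated with a minimal stand-in (a toy sketch; `ListStreamer` and `fake_generate` are hypothetical names, not part of transformers or optimum-intel):

```python
# Toy stand-in for a text streamer: a callback object that the generate
# loop pushes output chunks into as they are produced, so the caller can
# display text incrementally instead of waiting for the full result.
class ListStreamer:
    def __init__(self):
        self.chunks = []

    def put(self, text):
        # transformers' TextStreamer receives token ids and decodes them;
        # this toy version just collects already-decoded strings.
        self.chunks.append(text)

    def end(self):
        # Called once generation finishes (no-op here).
        pass


def fake_generate(tokens, streamer):
    # Stands in for model.generate(..., streamer=streamer): emit chunks
    # one by one as they are "generated".
    for t in tokens:
        streamer.put(t)
    streamer.end()
    return tokens


streamer = ListStreamer()
fake_generate(["The ", "cat ", "sits."], streamer)
print("".join(streamer.chunks))  # prints: The cat sits.
```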

Answer:

The image shows a gray tabby cat lying inside an open cardboard box on a light-colored carpet. The cat is lying on its back with its belly exposed, legs up in the air, and its tail curled around its body. The background includes a beige couch and a bright, airy room with natural light streaming in, creating a cozy and relaxed atmosphere.

Text-to-image generation

```python
image_gen_prompt = "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors."

images = model.generate_image(processor, image_gen_prompt, parallel_size=1)
images[0].save("fox.png")
```

Generated image: fox.png

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

tests/openvino/utils_tests.py — review thread resolved (outdated)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eaidova eaidova force-pushed the ea/janus branch 4 times, most recently from 9113be6 to fe6dac8 Compare February 5, 2025 15:50
@AlexKoff88 (Collaborator) commented:
I wonder why we need to keep VLChatProcessor instance outside the model class and if we can move it inside?

@eaidova (Collaborator, Author) commented Feb 6, 2025

> I wonder why we need to keep VLChatProcessor instance outside the model class and if we can move it inside?

I'm not sure I understand your question. This is the standard preprocessing/postprocessing component for transformers-based models (like tokenizers, feature extractors, image processors, etc.); it is usually an independent object (except in the diffusers case). For VLM models it may be helpful to move it closer to the model, as the two are becoming more complicated and tightly bound, so possibly we can consider keeping processors for other models as well (it may also help align the results of save_pretrained and optimum-cli, which save processors and tokenizers when they are available).

@AlexKoff88 (Collaborator) commented:
```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = VLChatProcessor.from_pretrained(model_id)
```

To clarify, I am just looking at the code in the PR description and wondering why it could not look like this:

```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)
...
inputs = model.preprocess_inputs(input_prompt, image)
streamer = TextStreamer(model.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=100, do_sample=False)
...
images = model.generate_image(image_gen_prompt, parallel_size=1)
```

So the processor is loaded inside the model and hidden from the user, but it can still be acquired via model.processor.

But from what I understood, your implementation is aligned with diffusers, right?
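The pattern proposed above (processor loaded internally but still reachable) can be sketched with a toy class; this is a hypothetical illustration, not the actual optimum-intel API, and `BundledModel` plus its stubbed internals are invented names:

```python
# Hypothetical sketch of bundling a processor inside the model wrapper
# while still exposing it via a property, as suggested in the comment above.
class BundledModel:
    def __init__(self, processor):
        self._processor = processor

    @classmethod
    def from_pretrained(cls, model_id):
        # A real implementation would call something like
        # VLChatProcessor.from_pretrained(model_id) here; stubbed for the sketch.
        return cls(processor=f"processor-for-{model_id}")

    @property
    def processor(self):
        # Escape hatch: the user can still reach the processor directly.
        return self._processor

    def preprocess_inputs(self, prompt, image=None):
        # Delegates to the internal processor instead of requiring the
        # caller to pass one in explicitly.
        return {"prompt": prompt, "image": image, "processor": self._processor}


model = BundledModel.from_pretrained("deepseek-ai/Janus-Pro-1B")
inputs = model.preprocess_inputs("Describe image in details")
```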

@eaidova (Collaborator, Author) commented Feb 12, 2025

@IlyasMoutawwakil @echarlaix could you please take a look?

@echarlaix (Collaborator) left a comment:

Thanks for the addition @eaidova !

4 participants