
Why is your method better than IP-Adapter in text control capabilities? #7

Open
zhangqizky opened this issue Jan 18, 2024 · 3 comments

Comments

@zhangqizky

Hello, thanks for your nice work. I'm confused about something; could you please help explain it?
IP-Adapter doesn't train the UNet either, and it adds a cross-attention layer just as you do. The big difference is that you use a landmark-guided net, while they use CLIP image embeddings to control the structure (IP-Adapter Plus V2). I don't understand why your method is better than IP-Adapter in text control capabilities. Do you plan to release some training details, e.g. the training data?
Thanks for your nice work!

@haofanwang
Member

@zhangqizky We also see a certain degree of degradation in text editing capability, mainly due to the image cross-attention. But since we introduce IdentityNet, we can maintain identity fidelity while setting a lower weight for the image cross-attention.
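The trade-off described above can be sketched with a decoupled cross-attention scheme in the style of IP-Adapter: the text branch and the image branch are computed separately, and the image branch is added with a tunable scale. A lower scale leaves more of the text prompt's influence intact. The function names and shapes below are purely illustrative, not the actual InstantID or IP-Adapter code:

```python
import numpy as np

def attention(q, k, v):
    """Standard scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.5):
    """Sum text and image cross-attention outputs.

    image_scale < 1 down-weights the image branch, trading some identity
    conditioning in the attention for better text controllability
    (identity fidelity would then come from a separate guidance net).
    """
    text_out = attention(q, *text_kv)
    image_out = attention(q, *image_kv)
    return text_out + image_scale * image_out

# Illustrative shapes: 4 query tokens, 8-dim features, 6 text / 3 image tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
text_kv = (rng.normal(size=(6, 8)), rng.normal(size=(6, 8)))
image_kv = (rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
out = decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.3)
```

With `image_scale=0` the output reduces to plain text cross-attention, which is why lowering this weight recovers text editing capability at the cost of image-based conditioning.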

@zhangqizky
Author

@haofanwang OK, thanks for the reply.

@askerlee

Not sure if it's the reason, but the XL series has better text consistency. Moreover, custom models usually have better text consistency than the base models.
