
Segformer Support #382

Closed
HugeBob opened this issue Sep 13, 2022 · 11 comments
Labels: feature-request (New feature or request)

Comments

HugeBob commented Sep 13, 2022

Feature request

Would love for Optimum to add support for transformers.SegformerForSemanticSegmentation

https://huggingface.co/docs/transformers/model_doc/segformer#transformers.SegformerForSemanticSegmentation

As best I can tell, semantic segmentation is not something that Optimum currently supports for any model (https://huggingface.co/docs/optimum/main/en/pipelines); I would love for this to be improved!

Motivation

I use HuggingFace's Segformer for an image segmentation model I have and would love to improve my inference speeds.

Your contribution

I don't know what a PR is, so I kind of doubt it.

@michaelbenayoun michaelbenayoun added the feature-request New feature or request label Oct 14, 2022
michaelbenayoun (Member)

Hi @HugeBob,
So if I understand correctly, you would like to use a semantic-segmentation pipeline.
It seems that this is not currently supported by transformers, so we will not support it on our end until it is supported there.

TheoMrc (Contributor) commented Nov 24, 2022

Hi @michaelbenayoun,

I had the same thing in mind as Bob.
From my understanding, some of the architectures in the transformers package allow for semantic segmentation, such as transformers.SegformerForSemanticSegmentation.

For example, NVIDIA's SegFormers (https://huggingface.co/nvidia/mit-b0) are apparently based on "a hierarchical Transformer encoder and a lightweight all-MLP decode head".

Even though these transformer-based models are implemented in transformers, can they not be optimized with optimum?

Thank you,

Theo

michaelbenayoun (Member) commented Nov 24, 2022

Hi @TheoMrc
Yes, they can; we just need to support the ONNX export of those models. We do support the Segformer export, so you will be able to export, optimize and quantize a Segformer model with ONNX Runtime.

For pipelines, it might not be usable because the semantic-segmentation pipeline was not available in transformers last time I checked.

What we can do on our end is to add support for an ORTModelForImageSegmentation. I will do it soon.
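
To give an idea of the target API, it would enable something like the following; the class name and exact arguments are not final, so treat this as a rough sketch only (it assumes an already exported SegFormer ONNX model in a local directory):

from PIL import Image
from transformers import SegformerFeatureExtractor
from optimum.onnxruntime import ORTModelForImageSegmentation  # hypothetical class, not released yet

# Load an already exported SegFormer ONNX model from a local directory
model = ORTModelForImageSegmentation.from_pretrained("path/to/segformer_onnx")
feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/mit-b0")

image = Image.open("example.png").convert("RGB")
inputs = feature_extractor(image, return_tensors="pt")

outputs = model(**inputs)
segmentation_map = outputs.logits.argmax(dim=1)[0]  # one class index per (downscaled) pixel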

@michaelbenayoun michaelbenayoun self-assigned this Nov 24, 2022
TheoMrc (Contributor) commented Nov 24, 2022

Hi again,

Thanks for your answer,
If you don't mind, I could use some points of clarification to better understand how things might turn out.

After some very interesting reading time in various documentation pages, I'm guessing from your answer that:

  1. In order to optimize inference speed, I should consider converting my SegFormer PyTorch models to ORT (ONNX Runtime) models, and then applying ORTOptimizer and ORTQuantizer from optimum.onnxruntime.
    Links for mortals like me: Transformers export to ONNX Runtime; Optimum tutorial

  2. Once optimized and quantized in the .onnx format, I should theoretically be able to load and run inference with my model in my Python app through ORT's Python API, with some kind of weird session-based syntax (sketched after the BetterTransformer example below) - Python API ORT tutorial

  3. Optimum is (among other things) some kind of Python wrapper for ORT that allows mortals like me to handily benefit from ORT with Hugging Face's user-friendly syntax. That is what you plan to implement as ORTModelForImageSegmentation.

  4. If my previous points are roughly accurate, the fact that SegFormers use transformer encoder layers might make them candidates for further optimization through BetterTransformer, after which I would convert to ONNX, optimize and quantize. See the BetterTransformer example below.
    Or maybe the "BetterTransformer" stuff only applies to torch-based inference and is not about the layer structure?

BetterTransformer example from Hugging Face

from transformers import AutoModelForSequenceClassification
from optimum.bettertransformer import BetterTransformer
model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model = BetterTransformer.transform(model_hf, keep_original_model=True)
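
For point 2, the kind of session-based syntax I mean would be roughly the following (the model path and input shape are placeholders):

import numpy as np
import onnxruntime as ort

# Load the exported model into an ONNX Runtime inference session (CPU by default)
session = ort.InferenceSession("path/to/model.onnx", providers=["CPUExecutionProvider"])

# Dummy input: one RGB image, 512x512 (a real app would feed preprocessed pixel values)
pixel_values = np.random.rand(1, 3, 512, 512).astype(np.float32)

# Run the graph; None means "return all outputs"
outputs = session.run(None, {"pixel_values": pixel_values})
logits = outputs[0]  # for SegFormer, shape (batch, num_labels, height/4, width/4)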

Anyway thanks a lot for your time,
I'm starting to feel like I should do an internship at Hugging Face after my PhD to learn more about how these things work!

See you around,

Theo

michaelbenayoun (Member)

To answer to each of your points:

  1. Yes, we also support the export in optimum now, and it is the recommended way; check here.
  2. Yes, in theory, but as you mention in point 3, we save you this pain by implementing wrappers hiding this logic.
  3. Yes
  4. It is not possible to mix both for now, since the PyTorch kernels used by BetterTransformer are not supported by ONNX Runtime. That being said, in the general case I would suggest trying both BetterTransformer in PyTorch and ONNX Runtime, and seeing which gives the best latency. In your case, Segformer cannot use BetterTransformer because it has a custom way of computing the FFNs.

Maybe!
In any case do not hesitate if you have any questions, or want to contribute!
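
For point 4, a minimal sketch of the kind of comparison I mean, reusing the BERT example from your message (Segformer itself cannot be transformed); the model and the crude timing loop are only illustrative:

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model_bt = BetterTransformer.transform(model_hf, keep_original_model=True)

inputs = tokenizer("A short test sentence.", return_tensors="pt")

def mean_latency(model, n_runs=20):
    # Average wall-clock latency of a forward pass, without gradient tracking
    with torch.inference_mode():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
    return (time.perf_counter() - start) / n_runs

print("eager PyTorch:", mean_latency(model_hf))
print("BetterTransformer:", mean_latency(model_bt))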

TheoMrc (Contributor) commented Nov 28, 2022

Thanks once again for your answer.

Just a quick follow-up below.

  1. After some late Sunday night investigation of Hugging Face's transformers tutorials, I managed to export my local segformer model to .onnx.
    Code below:
from pathlib import Path

import transformers
from transformers import AutoModelForSemanticSegmentation, SegformerFeatureExtractor
from transformers.onnx import FeaturesManager

# model_path and target_path are defined elsewhere in my script
model = AutoModelForSemanticSegmentation.from_pretrained(model_path)
feature_extractor = SegformerFeatureExtractor()

# Pick the ONNX config matching the semantic-segmentation feature for this architecture
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature="semantic-segmentation")
onnx_config = model_onnx_config(model.config)

# Export to ONNX; the output must be a pathlib.Path pointing to the target .onnx file
onnx_inputs, onnx_outputs = transformers.onnx.export(preprocessor=feature_extractor,
                                                     model=model,
                                                     config=onnx_config,
                                                     opset=13,
                                                     output=Path(target_path))
  2. It turns out that optimization is not supported for segformers.
    Code below:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

optimizer = ORTOptimizer.from_pretrained(onnx_model_path)
optimization_config = OptimizationConfig(optimization_level=99)

>>> KeyError: 'segformer model type is not supported yet. Only albert, bart, [...]

Although I don't mind, since I have no idea what it does 😎. It sounded nice though, since it does not impact model outputs but appears to halve latency in some cases (Optimum tutorial).

  3. On the other hand, quantization worked:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained(torch_model_path)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.quantize(save_dir=quantized_onnx_path, quantization_config=qconfig)

(This one, I did read a bit about the theory.)

  4. I took some inspiration from the transformers and optimum source code for the Pipeline and ORTModel classes, and managed to grasp how they work.
    It turns out SegFormer works just fine with pipeline().

From this, I built my own custom Pipeline class with outputs tailored to my application (I basically want to output the segmentation map, i.e. the argmax of all the logits).

from torch import nn
from transformers import AutoModelForSemanticSegmentation, SegformerFeatureExtractor, pipeline
from transformers.pipelines import ImageSegmentationPipeline


class CustomImageSegmentationPipeline(ImageSegmentationPipeline):
    def postprocess(self, model_outputs):
        # Upsample the logits back to the original image resolution
        logits = model_outputs.logits
        logits = nn.functional.interpolate(
            logits,
            size=model_outputs.target_size[0],  # (height, width)
            mode='bilinear',
            align_corners=False
        )
        # Segmentation map: one class index per pixel
        segmentation_map = logits.argmax(dim=1)[0]
        return segmentation_map


# Creating instances (torch_model_path is a local SegFormer checkpoint directory)
auto_model = AutoModelForSemanticSegmentation.from_pretrained(torch_model_path)
feature_extractor = SegformerFeatureExtractor()
hf_pipe = pipeline("image-segmentation", model=auto_model, feature_extractor=feature_extractor)
custom_pipe = CustomImageSegmentationPipeline(model=auto_model, feature_extractor=feature_extractor)
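
Calling the custom pipeline then looks like this (example.png is just a placeholder for one of my images):

from PIL import Image

pil_image = Image.open("example.png").convert("RGB")
segmentation_map = custom_pipe(pil_image)  # tensor of shape (height, width) holding class indices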

Everything worked perfectly for torch models (tested only on CPU).

inputs = feature_extractor(pil_image, return_tensors="pt")
print('Duration of the prediction with torch model:')
%timeit auto_model(**inputs)
>>> Duration of the prediction with torch model:
>>> 2.82 s ± 73.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. Using the ORTModel class from optimum, I loaded my onnx models and performed predictions, which were around twice as fast as torch inference:
onnx_model = ORTModel(onnx_path)
quantized_model = ORTModel(quantized_path)
onnx_inputs = feature_extractor(pil_image, return_tensors="np")

print('\nDuration of the prediction with onnx model:')
%timeit onnx_model.session.run(None, input_feed=onnx_inputs)
print('\nDuration of the prediction with quantized onnx model:')
%timeit quantized_model.session.run(None, input_feed=onnx_inputs)

>>> Duration of the prediction with onnx model:
>>> 1.68 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> Duration of the prediction with quantized onnx model:
>>> 1.45 s ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, I could not manage to implement a custom transformers.Pipeline around it, because several class attributes are not implemented in ORTModel (ORTModel.config for example) that are necessary during Pipeline.__init__ and Pipeline._sanitize_parameters calls.

For now, I just built custom "pipeline functions" that work on CPU only, but avoid the unnecessary work performed in the existing Pipeline classes, doing only what is necessary for my goal:

import torch
from torch import nn

def custom_onnx_workflow(image, onnx_model):
    # Preprocess the PIL image and keep only the pixel values expected by the ONNX graph
    inputs = feature_extractor(image, return_tensors="np")
    onnx_inputs = {'pixel_values': inputs['pixel_values']}
    outputs = onnx_model.session.run(None, input_feed=onnx_inputs)
    # Upsample the logits back to the input image size before taking the argmax
    upsampled_logits = nn.functional.interpolate(
        torch.from_numpy(outputs[0]),
        size=image.size[::-1],  # (height, width)
        mode='bilinear',
        align_corners=False
    )
    segmentation_map = upsampled_logits.argmax(dim=1)[0]
    return segmentation_map

The next step for me is to enable GPU coverage, which I am sure I will find out how to do in the optimum ORTModel source code, for example in the ORTModelForImageClassification source code.
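
For reference, what I have in mind for GPU coverage is simply passing the CUDA execution provider when creating the session (assuming an onnxruntime-gpu install; the path is a placeholder):

import onnxruntime as ort

# Prefer CUDA when available in this onnxruntime build, fall back to CPU otherwise
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("path/to/model.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually loaded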

I'd love to try to actually implement it and do a PR for ORTModelForSemanticSegmentation, which would be supported in pipelines.
I'm guessing the forward method will have to return a SemanticSegmenterOutput instead of an ImageClassifierOutput.

Apart from this, I think everything will be almost exactly the same as in ORTModelForImageClassification.
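
To check that I understand the shape of the class, here is the rough skeleton I have in mind, modeled on ORTModelForImageClassification; everything here is a guess to be refined in the PR:

import torch
from transformers.modeling_outputs import SemanticSegmenterOutput
from optimum.onnxruntime import ORTModel


class ORTModelForSemanticSegmentation(ORTModel):
    def forward(self, pixel_values, **kwargs):
        # Assumption: self.model is the underlying InferenceSession, as in ORTModelForImageClassification
        onnx_inputs = {"pixel_values": pixel_values.cpu().detach().numpy()}
        onnx_outputs = self.model.run(None, onnx_inputs)
        # Wrap the raw logits in the output class the pipelines expect
        return SemanticSegmenterOutput(logits=torch.from_numpy(onnx_outputs[0]))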

Have you already started writing this class?
If not, any other obvious advice?

Thanks for your time,

Theo

michaelbenayoun (Member)

Hi @TheoMrc,

First, thank you for your feedback, it is very valuable!

About your questions:

  1. You can also convert it using optimum.exporters.onnx. It is the suggested way since it is more up-to-date. The API is mostly similar, and you can do it via the command line:
python -m optimum.exporters.onnx --model model_name --task semantic-segmentation segformer_onnx
  2. Basically, among other things, the ORTOptimizer will look for patterns and try to fuse operations together (such as attention), and we support common patterns (BERT, GPT-2, etc.). I will need to check if segformer can be supported.

  3. That is nice. It would be interesting to check which operators end up quantized; the speed-up compared to the non-quantized ONNX model feels small, and you can probably get more. Performing graph optimization beforehand would also help.

  4. ORTModel does have a config attribute, although it might not always be set. I am currently working on improving and cleaning the ORTModelForXXX classes to avoid such cases and make the API easier. But you're right, adding an ORTModelForImageSegmentation is the first step. Writing such a class would most likely consist of preparing the IO binding (cc @JingyaHuang) and returning the proper output class, as you mentioned. For this, ORTModelForImageClassification is a great example to follow. I have not started working on it, by the way.

You can open a PR and I can help you there, what do you think?

JingyaHuang (Contributor) commented Nov 30, 2022

Hi @TheoMrc,

Just to expand on @michaelbenayoun's second point: as Segformer is based on a transformer encoder architecture, we can apply a BERT-like optimization by registering Segformer in ORTManager.

(But there is a caveat: Segformer's encoder blocks have different hidden_size values, and I am not sure whether this has been taken into consideration in ONNX Runtime (although ORT supports automatic shape inference to get hidden_size and num_head from Reshape nodes), so better to check.)

And if you are interested in contributing the ORTModelForImageSegmentation class, please feel free to tag me.

JingyaHuang (Contributor) commented Nov 30, 2022

With a quick test, the automatic detection of hidden_size and num_head works. I got some fused nodes (level=99) like the following:
[screenshots of the fused nodes omitted]

I am thinking of letting BERT-like models infer hidden_size and num_head themselves instead of reading them from the config in these cases (various hidden_size per block), WDYT? @michaelbenayoun

Ref: https://github.com/microsoft/onnxruntime/blob/441b30b2d26d36ca1db2930ade2fe82622ce0cd4/onnxruntime/python/tools/transformers/onnx_model_bert.py#L47
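
For reference, the auto-detection can be triggered directly through ONNX Runtime's optimizer by passing num_heads=0 and hidden_size=0, if I read the linked code correctly (paths are placeholders):

from onnxruntime.transformers import optimizer

# num_heads=0 and hidden_size=0 ask ONNX Runtime to infer them from the graph
# instead of using fixed values, which matters when the blocks have different hidden_size
optimized_model = optimizer.optimize_model(
    "segformer_onnx/model.onnx",
    model_type="bert",
    num_heads=0,
    hidden_size=0,
)
optimized_model.save_model_to_file("segformer_onnx/model_optimized.onnx")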

TheoMrc (Contributor) commented Nov 30, 2022

Hi @michaelbenayoun and @JingyaHuang,

Thanks to you both for your answers; as a "novice" in the field, I personally find it extremely useful to speak with you. I started ML in Python in a "from scratch" manner in TensorFlow, and then torch, for which I had to grasp more of the theory, create new loss functions, and so on. Hugging Face is very nice but hides most of the complicated stuff, which is very handy for getting working prototypes but surely makes it easy to ignore the way things work. I surely plan to learn and understand everything :)

Michael:

You can open a PR and I can help you there, what do you think?

I will clone the optimum repo and open a PR once I have a first (hopefully working) version of the ORTModelForImageSegmentation, and tag you both for review!

Before my previous answer, following your advice, I had first tried this from the command line with optimum:

python -m optimum.exporters.onnx --model model_name --task semantic-segmentation segformer_onnx

But it initially failed because I passed the path to pytorch_model.bin as model_name instead of the parent directory (actually, it might also be because I did not pass a task).
I then went down a level and managed the conversion through transformers.onnx (cf. my previous answer).
Anyway, it worked out just fine once I tried your command with the right inputs, thanks for the tip!

Michael:

  2. Basically, among other things, the ORTOptimizer will look for patterns and try to fuse operations together (such as attention), and we support common patterns (BERT, GPT-2, etc.). I will need to check if segformer can be supported.

Jingya:

Just to expand on @michaelbenayoun's second point: as Segformer is based on a transformer encoder architecture, we can apply a BERT-like optimization by registering Segformer in ORTManager.

Being a computer vision guy (and a biologist), I only use segformers from Hugging Face:

  • nvidia/mit-b0 for classification of fish orientation on microscope images
  • nvidia/mit-b4 for zebrafish larvae organ segmentation on images.

I'd obviously enjoy any performance gain from segformer optimization support!
Once again, I'd love to contribute in order to improve my understanding of what's going on behind the nice Hugging Face syntax.

Michael:

  3. That is nice. It would be interesting to check which operators end up quantized; the speed-up compared to the non-quantized ONNX model feels small, and you can probably get more. [...]

Of note, my quantized_model.onnx file (117 MB) is about half the size of the original model.onnx (246 MB). Not sure how relevant this is.
I'm guessing that around half the weights were converted from float32 to int8.
32 bits / 8 bits = 4, so the theoretical maximum size gain would be about 4x smaller. (Probably oversimplifying though; the model is obviously not just a bunch of weights.)

Beyond dumping the graph's node types (see below), I don't know how to check what was quantized; maybe you could point me to some documentation?
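
For what it's worth, this is all I came up with: counting node types in the quantized graph and looking for the quantized variants (QLinear*, MatMulInteger, DynamicQuantizeLinear); I am not sure it is the intended way:

import collections
import onnx

model = onnx.load("quantized_model.onnx")
op_counts = collections.Counter(node.op_type for node in model.graph.node)

# Quantized operators typically show up as QLinear*/MatMulInteger/DynamicQuantizeLinear nodes
for op_type, count in sorted(op_counts.items()):
    print(op_type, count)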

Also, I tested inference on my old-ish laptop CPU, which tends to overheat, so latency is quite variable. I'll test inference on my main machine and come back with more reliable latency data.

Thanks again for your time, see you soon after my PR

fxmarty (Contributor) commented Feb 17, 2023

Fixed in #539

@fxmarty fxmarty closed this as completed Feb 17, 2023