
TensorRT convert-to-uff not running under Tensorflow 2.2 #774

Closed

fengye opened this issue Jan 28, 2021 · 14 comments

fengye (Contributor) commented Jan 28, 2021

When I followed the docs to convert a .h5 model to a .uff file for TensorRT, I hit an error:
AttributeError: module 'tensorflow' has no attribute 'gfile'

It's essentially the same issue reported here: https://forums.developer.nvidia.com/t/deepstream-object-detector-ssd-cannot-convert-to-uff-file-in-ds5-0ga/145820

After a bit of investigation I found that the latest TensorRT 7.2.2 is only compatible with TensorFlow 1.15:
https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-7.html#rel_7-2-2

So my workaround is to create another conda environment with TF 1.15 installed:

conda create -n donkey_tf1 python=3.7
conda activate donkey_tf1
# This will install TensorFlow 1.15
pip install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 'tensorflow<2'
# Install TensorRT 7.2.2.3, the tarball file method
# https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-tar
# Assuming ${TensorRT-7.2.2.3-Dir} is the directory you untarred to
cd ${TensorRT-7.2.2.3-Dir}/python
pip install tensorrt-7.2.2.3-cp37-none-linux_x86_64.whl
cd ${TensorRT-7.2.2.3-Dir}/uff
pip install uff-0.6.9-py2.py3-none-any.whl
cd ${TensorRT-7.2.2.3-Dir}/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
cd ${TensorRT-7.2.2.3-Dir}/onnx_graphsurgeon
pip install onnx_graphsurgeon-0.2.6-py2.py3-none-any.whl

# Now we can use convert-to-uff, but it has to run with the Python environment conda provides
~/miniconda3/envs/donkey_tf1/bin/convert-to-uff mypilot.pb

NOTE: if a user wants to convert their model to TensorRT, the current dev branch also has an issue with freezing the model, which has to happen before convert-to-uff is called. PR is here: #773
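
For reference, a minimal sketch of that freeze step (this is an assumption of how it could be done by hand under TF 1.15, not the donkeycar code from the PR above; file names are placeholders):

# Hand-rolled freeze of a Keras .h5 into a frozen graph .pb under TF 1.15,
# so that convert-to-uff can consume it. Paths/names are placeholders.
import tensorflow as tf

tf.keras.backend.set_learning_phase(0)                 # inference mode
model = tf.keras.models.load_model('mypilot.h5')       # your trained Keras model
sess = tf.keras.backend.get_session()
output_names = [out.op.name for out in model.outputs]  # graph output node names
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), output_names)
tf.io.write_graph(frozen, '.', 'mypilot.pb', as_text=False)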

I guess this either needs to be addressed or documented. TensorRT has great potential and it's a pity it is not well supported.

cloud-rocket (Contributor) commented

I started working on this - but never finished.

If somebody wants a reference: dev...cloud-rocket:add-tf2-tensorrt-support

tikurahul (Collaborator) commented

https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#usingtftrt

The conversion process is a bit different starting with TF 2.x.
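
For reference, the TF 2.x route from that guide works on a SavedModel rather than a frozen .pb/.uff; a minimal sketch with the TF 2.2-era API (directory names are placeholders):

# Convert a SavedModel with TF-TRT under TF 2.x; no .uff is involved.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='mypilot_savedmodel',   # placeholder
    conversion_params=params)
converter.convert()
converter.save('mypilot_trt')                     # load this at drive time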

fengye (Contributor, Author) commented Feb 6, 2021

Thanks for the link. I'll take a look when I get some spare time.

fengye (Contributor, Author) commented Jul 17, 2021

It looks like this issue has been addressed by #884, @DocGarbanzo? I haven't merged the latest dev branch, but if so, feel free to close the issue. Thanks!

fengye (Contributor, Author) commented Jul 17, 2021

Seems to be working, although loading (and compiling on the fly?) a .trt model with --type=tensorrt_linear is painfully slow on my Nano. Closing this issue.

fengye closed this as completed Jul 17, 2021
DocGarbanzo (Contributor) commented

@fengye - Great that you tested on the Nano. Yes, the on-the-fly graph optimisation is slow, but it is only called on the first inference. I saw it being very slow on the RPi, but once it finished, the fps rates were close to tflite - about 10% slower. You can also use it on your PC (it should work on all OSes, though I can only confirm Ubuntu and OSX). Creating the TensorRT graph on the target architecture should allow the best optimisation for the respective hardware - at least that's my understanding.

fengye (Contributor, Author) commented Jul 18, 2021

> @fengye - Great that you tested on the Nano. Yes, the on-the-fly graph optimisation is slow, but it is only called on the first inference. I saw it being very slow on the RPi, but once it finished, the fps rates were close to tflite - about 10% slower. You can also use it on your PC (it should work on all OSes, though I can only confirm Ubuntu and OSX). Creating the TensorRT graph on the target architecture should allow the best optimisation for the respective hardware - at least that's my understanding.

Sorry for not being specific, I was referring to the loading time. Once the trt model is up and running, the average time spent on the pilot is around 40ms, and that's not slow at all. Is there a way to avoid this on-the-fly optimisation and serialise the optimised model to disk for TensorRT?

DocGarbanzo (Contributor) commented

Ok, 40ms on the Nano is quite slow, as this equates to 25Hz. Even on the RPi 4 I'm seeing more like 50Hz; I would expect at least 100Hz on the Nano. There is a way to generate the TensorRT model already during training, but afaik it won't be optimised for the target platform. Do you want to have a look at this and try it out (https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)? We can integrate it afterwards if it works.
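
For reference, that guide also describes pre-building the engines at conversion time with converter.build(), so they are serialised into the SavedModel instead of being built on the first inference. A minimal sketch (paths are placeholders, the image shape is the standard 120x160x3), with the caveat from above that the engines are then tuned for the machine doing the conversion rather than for the Nano:

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(input_saved_model_dir='mypilot_savedmodel')
converter.convert()

def input_fn():
    # One dummy batch with the expected input shape, used to build the engines.
    yield (np.zeros((1, 120, 160, 3), dtype=np.float32),)

converter.build(input_fn=input_fn)   # build the TRT engines now, on this machine
converter.save('mypilot_trt')        # engines are saved along with the model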

TCIII (Contributor) commented Jul 18, 2021

@fengye,

Jetpack 4.5.1 includes TensorRT 7.1.3. Why are you running TensorRT 7.2.2?

fengye (Contributor, Author) commented Jul 18, 2021 via email

fengye (Contributor, Author) commented Jul 18, 2021

> Ok, 40ms on the Nano is quite slow, as this equates to 25Hz. Even on the RPi 4 I'm seeing more like 50Hz; I would expect at least 100Hz on the Nano. There is a way to generate the TensorRT model already during training, but afaik it won't be optimised for the target platform. Do you want to have a look at this and try it out (https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)? We can integrate it afterwards if it works.

I might need to double-check the numbers. If memory serves right, my Pi 4 ran at about 10Hz on tflite models and the Nano at about 20-30Hz on trt models in 5W mode, back when I was still exporting .uff files. Either way, it's the loading time that bothers me: it takes about 5 minutes to load and compile a trt model on the Nano every time I start an autopilot drive. I have some work to do on new parts code first, then I'll come back and see if there's anything I can do with the TF-TRT integration.

TCIII (Contributor) commented Jul 18, 2021

@fengye,

NVIDIA has always been behind the curve when it comes to updating their compatibility chart.

DocGarbanzo (Contributor) commented

> Ok, 40ms on the Nano is quite slow, as this equates to 25Hz. Even on the RPi 4 I'm seeing more like 50Hz; I would expect at least 100Hz on the Nano. There is a way to generate the TensorRT model already during training, but afaik it won't be optimised for the target platform. Do you want to have a look at this and try it out (https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)? We can integrate it afterwards if it works.
>
> I might need to double-check the numbers. If memory serves right, my Pi 4 ran at about 10Hz on tflite models and the Nano at about 20-30Hz on trt models in 5W mode, back when I was still exporting .uff files. Either way, it's the loading time that bothers me: it takes about 5 minutes to load and compile a trt model on the Nano every time I start an autopilot drive. I have some work to do on new parts code first, then I'll come back and see if there's anything I can do with the TF-TRT integration.

There is a profile.py script that you can run independently of the car loop. You should get > 50fps on the RPi for the standard linear model in tflite with the standard image size (120x160x3), otherwise your CPU might be throttled. For consistency, you should see the matching execution time (i.e. 1000 / fps, in ms) in the car loop when looking at the performance data that is printed when you stop the car. If you don't see this, then something is wrong with your install.
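
For a quick sanity check outside both profile.py and the car loop, a hand-rolled timing loop like the sketch below can be used (this is not profile.py, just an illustration of the fps/ms relationship; the model path is a placeholder):

import time
import numpy as np
import tensorflow as tf

# Time raw tflite inference and report both fps and ms per call.
interpreter = tf.lite.Interpreter(model_path='mypilot.tflite')  # placeholder
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp['shape'], dtype=inp['dtype'])

n = 200
start = time.time()
for _ in range(n):
    interpreter.set_tensor(inp['index'], dummy)
    interpreter.invoke()
elapsed = time.time() - start
print(f'{n / elapsed:.1f} fps, {1000 * elapsed / n:.2f} ms per inference')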

fengye (Contributor, Author) commented Jul 18, 2021

> Ok, 40ms on the Nano is quite slow, as this equates to 25Hz. Even on the RPi 4 I'm seeing more like 50Hz; I would expect at least 100Hz on the Nano. There is a way to generate the TensorRT model already during training, but afaik it won't be optimised for the target platform. Do you want to have a look at this and try it out (https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)? We can integrate it afterwards if it works.
>
> I might need to double-check the numbers. If memory serves right, my Pi 4 ran at about 10Hz on tflite models and the Nano at about 20-30Hz on trt models in 5W mode, back when I was still exporting .uff files. Either way, it's the loading time that bothers me: it takes about 5 minutes to load and compile a trt model on the Nano every time I start an autopilot drive. I have some work to do on new parts code first, then I'll come back and see if there's anything I can do with the TF-TRT integration.
>
> There is a profile.py script that you can run independently of the car loop. You should get > 50fps on the RPi for the standard linear model in tflite with the standard image size (120x160x3), otherwise your CPU might be throttled. For consistency, you should see the matching execution time (i.e. 1000 / fps, in ms) in the car loop when looking at the performance data that is printed when you stop the car. If you don't see this, then something is wrong with your install.

There are definitely some gaps between profile.py and a manage.py drive (camera loop at 30Hz). For my Nano in 5W mode, profile.py gives 60fps for a trt model, while an actual drive run gives 30ms (33fps) on average for the KerasLinear part. For some reason, the PWM steering and PWM throttle parts are taking way too much time, and that's probably what lowers the overall model inference numbers, depending on how the code profiles the parts (whether it takes the user mode into account, and whether it profiles the main thread or the actual worker thread; I haven't looked into it yet).

Anyway, given the profiler's numbers when running the pure model, I don't have much of a problem with it for now. It's the loading time that slows down the development iteration.
