
Make the onnx importer more robust for internal/external and large models #2794

Merged — 23 commits from dliddell-2765-onnx-import-big into llvm:main, Feb 1, 2024

Conversation

daveliddell (Collaborator)

Fix for #2765

The onnx docs say that you can't do shape inference using the in-memory API for models > 2 GB. This fix replaces that API with the file-based API. Since the new API generates an intermediate file, I also added a --keep switch to retain that file; it is deleted by default.
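
For reference, the file-based shape-inference API the fix switches to looks roughly like this (a minimal sketch with illustrative paths, not the importer's actual arguments):

    import onnx

    # The in-memory onnx.shape_inference.infer_shapes() call hits the 2 GB
    # protobuf limit, so the model is inferred on disk by path instead.
    onnx.shape_inference.infer_shapes_path("model.onnx", "model-inferred.onnx")
    inferred_model = onnx.load("model-inferred.onnx")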

daveliddell marked this pull request as ready for review on January 24, 2024 07:19
@daveliddell (Collaborator Author)

Also tested on my two-node test case. It works on small models, too! :-D

# Do shape inference via files instead of in memory in order to handle
# models > 2 GB. See https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#shape-inference-a-large-onnx-model-2gb
# for details about this technique.
inferred_path = file_path.with_stem(file_path.stem + '-inferred-shape')
Collaborator

I think this is likely to interplay badly with automation and setups where the source directory is not read-write. We ultimately likely need to be a bit more switchy and have a flag for --with-external-data (as described here: https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#loading-an-onnx-model-with-external-data)

Thinking of how to not boil the ocean. How about something like this:

    with tempfile.TemporaryDirectory(dir=flags.temp_dir, delete=not flags.keep_temps) as td:
        # (TemporaryDirectory's delete= keyword requires Python 3.12+.)
        temp_path = Path(td, "inferred.onnx")
        onnx.shape_inference.infer_shapes_path(file_path, temp_path)
        inferred_model = onnx.load(temp_path, load_external_data=False)
        if flags.external_data:
            load_external_data_for_model(inferred_model, td)

Or something like that. It can be made a lot better if anyone cares, but I'm not thrilled to do that without a reason. I think this will at least avoid some of the copies.

I hate proto-based languages. Such a poor design decision.

Collaborator Author

Oh, good idea! I thought about this problem briefly but promptly forgot to deal with it. On a previous product, we had the same issue with a temp index file. We tried /tmp, the user's home dir, and cwd, but had unhappy customers coming to us each time. I think we finally gave them a temp_dir flag and the wailing quieted down to a low grumbling. :-D I'll make these changes tomorrow when I'm feeling better. Thanks for the feedback!

Collaborator Author

Wait, I think I'm still unclear on what we're doing with external data and whether this is related to shape inference or just something else we should be supporting (for large models). This statement in particular would seem to imply that we don't have a choice with external data location when doing shape inference:

Current shape_inference supports models with external data, but for those models larger than 2GB, please use the model path for onnx.shape_inference.infer_shapes_path and the external data needs to be under the same directory.

Collaborator

We don't really support it yet (I think), but I was trying to leave the door open, since we'll need to get it right soon.

There are a few modes for transferring back and forth and we'll probably need to elaborate them.

Collaborator

It is implicated here because if you save the graph to a directory different from the data, you need to use the APIs explicitly to tell it where the original external data was.
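
A minimal sketch of that explicit API, with illustrative paths: load the graph without its external data, then point the loader at the directory that actually holds the tensor files:

    import onnx
    from onnx.external_data_helper import load_external_data_for_model

    # The inferred graph was written to a temp dir, but the external tensor
    # files still live next to the original model, so load them explicitly.
    model = onnx.load("/tmp/work/inferred.onnx", load_external_data=False)
    load_external_data_for_model(model, "/path/to/original/model/dir")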

Collaborator Author

@stellaraccident New code with external data fails in MLIR verify. Considering that the original code works, this is a regression. :-( Debugging it

Collaborator Author

OK, I think I'm ready for you again. After tweaking the export size, everything is now working, both small models and llama. Added unit tests that exercise the command line on two small models, in both the internal and external data scenarios.

Unfortunately, I had to sacrifice running the checker on large models, as I couldn't find a way of pointing the file-based checker to the external data that llama does seem to generate. The in-memory checker fails in the same way as in-memory shape inference, due to the 2 GB limit.

We could do away with the temp directory for the shape-inferred model and just put the inferred model wherever the user requests it. If that location defaults to the same dir as the original model, then I would be able to run the checker. Seems a bit beyond what's needed for now, but your call.
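
For context, the file-based form of the checker is onnx.checker.check_model called on a path rather than a loaded ModelProto; a minimal sketch with an illustrative filename:

    import onnx

    # Passing a path lets the checker handle models over 2 GB, but external
    # data is resolved relative to the model file, which is why an inferred
    # model sitting alone in a temp dir can't be checked here.
    onnx.checker.check_model("model-inferred.onnx")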

@stellaraccident (Collaborator) left a comment

Thanks for sweating through this. A couple of comments that are optional.

onnx.save(onnx_model, model_file)
temp_dir = run_path / "temp"
temp_dir.mkdir(exist_ok=True)
p = subprocess.run([
Collaborator

Pro tip: for cases like these, I often don't spawn a literal subprocess but instead just call into the main entrypoint for importing the model, giving it explicit command-line arguments.

(can ignore / I might fix in a followup)

Collaborator

A little more clarity, since you asked offline:

from torch_mlir.tools.import_onnx import __main__

...


args = __main__.parse_arguments([1, 2, 3])
__main__.main(args)

Something like that. Often, if I'm doing it, I'll have an entry point in the module just for testing, instead of needing to poke at it in two steps like that.
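
One possible shape for such a test-only entry point (hypothetical; the module does not currently expose this):

    # Hypothetical helper inside torch_mlir/tools/import_onnx/__main__.py so
    # tests can drive the importer with a plain list of argument strings.
    def run(argv: list[str]) -> None:
        args = parse_arguments(argv)
        main(args)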

stellaraccident changed the title from "Dliddell 2765 onnx import big" to "Make the onnx importer more robust for internal/external and large models" on Feb 1, 2024
stellaraccident merged commit 04be6ba into llvm:main on Feb 1, 2024
3 checks passed
@daveliddell (Collaborator Author)

@stellaraccident You're too fast for me! :-D Since the ideal features of tempfile.TemporaryDirectory weren't available in 3.11, I was thinking of dropping in tempfile.mkdtemp as a good substitute for the hard-coded temp dir name, with equivalent behavior. Anyway, good to get this in quickly. Thanks for the magic formula for calling main. Works great and way faster, too!
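
A rough sketch of that mkdtemp fallback, assuming the temp_dir/keep_temps flags from the earlier suggestion (names are illustrative):

    import os
    import shutil
    import tempfile

    import onnx

    # mkdtemp gives a private directory whose cleanup we control, standing in
    # for TemporaryDirectory(delete=...), which requires Python 3.12+.
    td = tempfile.mkdtemp(dir=flags.temp_dir)
    try:
        inferred_path = os.path.join(td, "inferred.onnx")
        onnx.shape_inference.infer_shapes_path(str(file_path), inferred_path)
        inferred_model = onnx.load(inferred_path, load_external_data=False)
    finally:
        if not flags.keep_temps:
            shutil.rmtree(td, ignore_errors=True)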

@stellaraccident (Collaborator)

Yeah, that's the way. But we can follow up with it. What you have works and we can improve it as we go. Thanks for the work!

daveliddell deleted the dliddell-2765-onnx-import-big branch on February 2, 2024 04:01