
Make the onnx importer more robust for internal/external and large models #2794

Merged — 23 commits from dliddell-2765-onnx-import-big into llvm:main, Feb 1, 2024

Conversation

daveliddell (Collaborator)

Fix for #2765

The onnx docs say that you can't do shape inference using the in-memory API for models > 2 GB. This fix replaces that API with the file-based API. Since the new API generates an intermediate file, I also added a --keep switch to retain that file; it is deleted by default.
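
For reference, the file-based shape-inference API the fix switches to looks roughly like this (a minimal sketch with illustrative paths, not the importer's actual arguments):

    import onnx

    # The in-memory onnx.shape_inference.infer_shapes() call hits the 2 GB
    # protobuf limit, so the model is inferred on disk by path instead.
    onnx.shape_inference.infer_shapes_path("model.onnx", "model-inferred.onnx")
    inferred_model = onnx.load("model-inferred.onnx")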

daveliddell marked this pull request as ready for review on January 24, 2024 07:19
@daveliddell (Collaborator Author)

Also tested on my two-node test case. It works on small models, too! :-D

# Do shape inference via files instead of in memory in order to handle
# models > 2 GB. See https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#shape-inference-a-large-onnx-model-2gb
# for details about this technique.
inferred_path = file_path.with_stem(file_path.stem + '-inferred-shape')
Collaborator

I think this is likely to interplay badly with automation and setups where the source directory is not read-write. We ultimately likely need to be a bit more switchy and have a flag for --with-external-data (as described here: https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#loading-an-onnx-model-with-external-data)

Thinking of how to not boil the ocean. How about something like this:

    with tempfile.TemporaryDirectory(dir=flags.temp_dir, delete=not flags.keep_temps) as td:
        # (TemporaryDirectory's delete= keyword requires Python 3.12+.)
        temp_path = Path(td, "inferred.onnx")
        onnx.shape_inference.infer_shapes_path(file_path, temp_path)
        inferred_model = onnx.load(temp_path, load_external_data=False)
        if flags.external_data:
            load_external_data_for_model(inferred_model, td)

Or something like that. It can be made a lot better if anyone cares, but I'm not thrilled to do that without a reason. I think this will at least avoid some of the copies.

I hate proto-based languages. Such a poor design decision.

Collaborator Author

Oh, good idea! I thought about this problem briefly but promptly forgot to deal with it. On a previous product, we had the same issue with a temp index file. We tried /tmp, the user's home dir, and cwd, but had unhappy customers coming to us each time. I think we finally gave them a temp_dir flag and the wailing quieted down to a low grumbling. :-D I'll make these changes tomorrow when I'm feeling better. Thanks for the feedback!

Collaborator Author

Wait, I think I'm still unclear on what we're doing with external data and whether this is related to shape inference or just something else we should be supporting (for large models). This statement in particular would seem to imply that we don't have a choice with external data location when doing shape inference:

Current shape_inference supports models with external data, but for those models larger than 2GB, please use the model path for onnx.shape_inference.infer_shapes_path and the external data needs to be under the same directory.

Collaborator

We don't really support it yet (I think), but I was trying to leave the door open, since we'll need to get it right soon.

There are a few modes for transferring back and forth and we'll probably need to elaborate them.

Collaborator

It is implicated here because if you save the graph to a directory different from the data, you need to use the APIs explicitly to tell it where the original external data was.
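
A minimal sketch of that explicit API, with illustrative paths: load the graph without its external data, then point the loader at the directory that actually holds the tensor files:

    import onnx
    from onnx.external_data_helper import load_external_data_for_model

    # The inferred graph was written to a temp dir, but the external tensor
    # files still live next to the original model, so load them explicitly.
    model = onnx.load("/tmp/work/inferred.onnx", load_external_data=False)
    load_external_data_for_model(model, "/path/to/original/model/dir")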

Collaborator Author

@stellaraccident New code with external data fails in MLIR verify. Considering that the original code works, this is a regression. :-( Debugging it

Collaborator Author

OK, I think I'm ready for you again. After tweaking the export size, everything is now working, both small models and llama. Added unit tests that exercise the command line on two small models, in both the internal and external data scenarios.

Unfortunately, I had to sacrifice running the checker on large models, as I couldn't find a way of pointing the file-based checker to the external data that llama does seem to generate. The in-memory checker fails in the same way as in-memory shape inference, due to the 2 GB limit.

We could do away with the temp directory for the shape-inferred model and just put the inferred model wherever the user requests it. If that location defaults to the same dir as the original model, then I would be able to run the checker. Seems a bit beyond what's needed for now, but your call.
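
For context, the file-based form of the checker is onnx.checker.check_model called on a path rather than a loaded ModelProto; a minimal sketch with an illustrative filename:

    import onnx

    # Passing a path lets the checker handle models over 2 GB, but external
    # data is resolved relative to the model file, which is why an inferred
    # model sitting alone in a temp dir can't be checked here.
    onnx.checker.check_model("model-inferred.onnx")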

@stellaraccident (Collaborator) left a comment

Thanks for sweating through this. A couple of comments that are optional.

onnx.save(onnx_model, model_file)
temp_dir = run_path / "temp"
temp_dir.mkdir(exist_ok=True)
p = subprocess.run([
Collaborator

Pro tip: for cases like these, I often don't spawn a literal subprocess but instead just call into the main entrypoint for importing the model, giving it explicit command-line arguments.

(can ignore / I might fix in a followup)

Collaborator

A little more clarity, since you asked offline:

from torch_mlir.tools.import_onnx import __main__

...


args = __main__.parse_arguments([1, 2, 3])
__main__.main(args)

Something like that. Often, if I'm doing it, I'll have an entry point in the module just for testing, instead of needing to poke at it in two steps like that.
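
One possible shape for such a test-only entry point (hypothetical; the module does not currently expose this):

    # Hypothetical helper inside torch_mlir/tools/import_onnx/__main__.py so
    # tests can drive the importer with a plain list of argument strings.
    def run(argv: list[str]) -> None:
        args = parse_arguments(argv)
        main(args)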

stellaraccident changed the title from "Dliddell 2765 onnx import big" to "Make the onnx importer more robust for internal/external and large models" on Feb 1, 2024
stellaraccident merged commit 04be6ba into llvm:main on Feb 1, 2024
3 checks passed
@daveliddell (Collaborator Author)

@stellaraccident You're too fast for me! :-D Since the ideal features of tempfile.TemporaryDirectory weren't available in 3.11, I was thinking of dropping in tempfile.mkdtemp as a good substitute for the hard-coded temp dir name, with equivalent behavior. Anyway, good to get this in quickly. Thanks for the magic formula for calling main. Works great and way faster, too!
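
A rough sketch of that mkdtemp fallback, assuming the temp_dir/keep_temps flags from the earlier suggestion (names are illustrative):

    import os
    import shutil
    import tempfile

    import onnx

    # mkdtemp gives a private directory whose cleanup we control, standing in
    # for TemporaryDirectory(delete=...), which requires Python 3.12+.
    td = tempfile.mkdtemp(dir=flags.temp_dir)
    try:
        inferred_path = os.path.join(td, "inferred.onnx")
        onnx.shape_inference.infer_shapes_path(str(file_path), inferred_path)
        inferred_model = onnx.load(inferred_path, load_external_data=False)
    finally:
        if not flags.keep_temps:
            shutil.rmtree(td, ignore_errors=True)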

@stellaraccident (Collaborator)

Yeah, that's the way. But we can follow up with it. What you have works and we can improve it as we go. Thanks for the work!

daveliddell deleted the dliddell-2765-onnx-import-big branch on February 2, 2024 04:01