Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run reports in a separate Python process. #41

Merged
merged 1 commit into from
May 3, 2024
Merged

Run reports in a separate Python process. #41

merged 1 commit into from
May 3, 2024

Conversation

plietar
Copy link
Member

@plietar plietar commented Apr 9, 2024

The runpy module we've been using to run reports provides very little isolation. In particular, "any side effects (such as cached imports of other modules) will remain in place after the functions have returned". This means that if a user splits their report into multiple files, with one file importing another, the latter will be cached and never reloaded by Python, even if it is modified. Similarly, if the module was already loaded before, the report runs against old code that does not match the files included in the packet.

Using a separate subprocess gives us a much stronger isolation, including all global variables and not sharing any of previously loaded modules. It means a user can run a report, modify some of the imported code, run the report again and have this behave as expected.

The Packet class is refactored a little and its scope is reduced to just creating the packet and its metadata, without inserting it into the repository. The insertion is now handled by a separate function. This is done as a way of distinguishing what happens in the child process and what happens in the parent.

@plietar plietar marked this pull request as ready for review April 9, 2024 18:00
@plietar plietar requested a review from richfitz April 9, 2024 18:00
@plietar plietar force-pushed the mrc-5228 branch 2 times, most recently from 0053fd1 to cedb387 Compare April 9, 2024 18:04
@plietar plietar changed the base branch from mrc-5228 to main April 10, 2024 14:19
@plietar

This comment was marked as outdated.

@plietar plietar force-pushed the mrc-5227 branch 3 times, most recently from 92c0ecf to 7f85351 Compare April 16, 2024 17:20
Copy link
Member

@richfitz richfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked through this a couple of times now, and am glad you've taken the time to sort this all out

from outpack.util import openable_temporary_file


def run_in_sandbox(target, args=(), cwd=None, syspath=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use callr for this sort of manoeuvre in R - does python not have something we can pull in to do this for us? This way is fine though!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All I could find is the multiprocessing module, but it's a bit inconvenient to use and puts a burden on the caller to setup their script in the right way.

except BaseException as e:
# This allows the traceback to be pickled and communicated out of the
# the sandbox.
pickling_support.install(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice job finding this

The runpy module we've been using to run reports provides very little
isolation. In particular, "any side effects (such as cached imports of
other modules) will remain in place after the functions have returned".
This means that if a user splits their report into multiple files, with
one file importing another, the latter will be cached and never reloaded
by Python, even if it is modified. Similarly, if the module was already
loaded before, the report runs against old code that does not match the
files included in the packet.

Using a separate subprocess gives us a much stronger isolation,
including all global variables and not sharing any of previously loaded
modules. It means a user can run a report, modify some of the imported
code, run the report again and have this behave as expected.

The sandbox is implemented by pickling the function and its argument
into a file, starting a new Python process which unpickles the file and
calls the target. Similarly, the function's return value, or any
exception it may throw, it pickled to an output file, which is read by
the parent.

I tried to use the multiprocessing module to implement this, but it puts
inconvenient requirements on the way the top-level source file is
written. The module re-evaluates the file in the subprocess, meaning its
contents must check for `__name__ == "__main__"` to avoid doing the work
again. It also doesn't work too well with the REPL. Using our own
implementation fixes these issues.

The Packet class is refactored a little and its scope is reduced to just
creating the packet and its metadata, without inserting it into the
repository. The insertion is now handled by a separate function. This is
done as a way of distinguishing what happens in the child process and
what happens in the parent.
@plietar plietar merged commit c42fca9 into main May 3, 2024
7 checks passed
@plietar plietar deleted the mrc-5227 branch May 3, 2024 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants