Reconsider how editing environments works #886

Open
krassowski opened this issue Sep 19, 2024 · 11 comments
Labels
area: user experience 👩🏻‍💻 Items impacting the end-user experience impact: high 🟥 This issue affects most of the conda-store users or is a critical issue needs: discussion 💬 This item needs team-level discussion before scoping needs: investigation 🔎 Someone in the team needs to look into this issue before scoping type: enhancement 💅🏼

Comments

@krassowski

Context

Currently, editing an environment:

  • creates a new build directory
  • swaps the symlink

This means that autoreloading does not work. For example, when using conda-store with Jupyter/IPython:

  • the kernelspec needs to be reloaded, which requires switching back and forth between environments to see the changes applied; in my experience this wastes about 3 minutes per environment change (compared to a workflow without conda-store)
  • newly installed packages do not become available in a running kernel, which means that possibly hours of computation may be lost because the kernel needs to be restarted to pick up the smallest change in the env
  • changed packages cannot be picked up by the IPython autoreloader either
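A minimal sketch of why the swap defeats reloading. It assumes (this is not confirmed against conda-store's source) that the environment path is a symlink to a per-build directory and that a running process resolved that path when it started. `swap_symlink` and the directory names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def swap_symlink(link: Path, new_target: Path) -> None:
    """Point `link` at `new_target` by renaming a temporary symlink over it."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(new_target, tmp)
    # On POSIX, rename replaces the old symlink itself in a single step.
    os.replace(tmp, link)


root = Path(tempfile.mkdtemp())
build1 = root / "build-1"
build2 = root / "build-2"
build1.mkdir()
build2.mkdir()

env = root / "env"
swap_symlink(env, build1)

# What a long-running process (e.g. a kernel) would hold after startup.
resolved_at_start = env.resolve()

# Editing the environment: a new build directory, then the symlink is swapped.
swap_symlink(env, build2)

# The process still refers to build-1; only new processes see build-2.
print(resolved_at_start.name, "vs", env.resolve().name)
```

Anything that cached the resolved path keeps looking at the old build directory, which is why a file watcher or autoreloader pointed there never sees the new packages.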

If instead it worked like:

  • copy existing build to a new folder
  • the copied folder becomes the archival build version
  • update the environment:
    • ideally by updating in place rather than rebuilding from scratch
    • if need be by rebuilding (which will still work but will likely change some versions inadvertently if new versions were released, which is another issue)
  • there is no need to swap the symlink so autoreloading and everything else works!
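The proposed flow can be sketched as follows. This is an illustration under stated assumptions, not conda-store code: `archive_then_update` and `add_package` are hypothetical names, and the "environment" here is just a directory of marker files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def archive_then_update(
    live: Path, archive: Path, update: Callable[[Path], None]
) -> None:
    """Snapshot `live` into `archive`, then mutate `live` in place."""
    # symlinks=True copies links as links instead of following them.
    shutil.copytree(live, archive, symlinks=True)
    update(live)


root = Path(tempfile.mkdtemp())
live = root / "env"
live.mkdir()
(live / "pkg_a.txt").write_text("1.0")


def add_package(prefix: Path) -> None:
    # Stand-in for an in-place install step (hypothetical).
    (prefix / "pkg_b.txt").write_text("0.3")


archive = root / "build-1-archive"
archive_then_update(live, archive, add_package)

print(sorted(p.name for p in live.iterdir()))
print(sorted(p.name for p in archive.iterdir()))
```

The live path never changes, so running processes keep a valid prefix, while the archive preserves the previous state. One caveat: a real conda prefix embeds absolute paths in some installed files, so a plain copy may not be directly usable as an environment at a different location; the sketch does not address that.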

Value and/or benefit

Many minutes to hours in productivity gained (or rather not lost) for the use case of interactive environment creation by a senior data analyst.

Anything else?

No response

@krassowski krassowski added the needs: triaging 🚦 Someone needs to have a look at this issue and triage label Sep 19, 2024
@krassowski
Author

@kcpevey mentioned to me that this may be a foot gun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately, if a shared env is changed by someone else, activating it after the change will cause the same issue.

And questions the UX for user awareness:

What if you are running a notebook and stop to kick off a rebuild of the env, which takes 20 minutes, and while that's going you keep working in the notebook? At some point, the env build is complete. What happens to your running notebook? Does the kernel remain on the old env until you restart the kernel? Does the user get a warning that the kernel has been replaced?

Here I would mention that auto-reloading is not enabled by default, and users who enable it know what they are doing. Also, rebuilding an env should not take 20 minutes (but it does). I do, however, agree that a notification that an environment build has completed should be shown when conda-store is used with JupyterLab, which is tracked in:

@dharhas
Member

dharhas commented Sep 19, 2024

As an fyi, historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible bespoke / broken environments because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

@dharhas
Member

dharhas commented Sep 19, 2024

But this does tie in with another discussion I had about packaging at PyCon: we actually have multiple target audiences (devs, end users, etc.) for environment management, and we are using the same tools for all of them.

@dharhas
Member

dharhas commented Sep 19, 2024

newly installed packages do not become available in a running kernel, which means that possibly hours of computation may be lost because the kernel needs to be restarted to pick up the smallest change in the env

Is this actually a valid use case? How reliably does it work? For pure python packages maybe. To me it seems if you change the underlying environment all bets are off on whether your python objects are even valid if an install changed something under the hood. Seems like a better option would be to make sure you serialized your results.

@krassowski
Author

Is this actually a valid use case?

Yes, in IPython, installing or updating packages via the %pip and %conda magics is a supported and valid use case. These magics warn that for some packages restarting the kernel may be required, but when autoreload is on this is rarely the case (it is the case when updating non-pure-Python packages).
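A minimal sketch of the underlying mechanism in plain Python, without IPython: a module that appears on `sys.path` after the process has started can still be imported, as long as the directory the process is looking at is the one that changed. The module name `late_arrival` and the temporary directory are hypothetical stand-ins for a newly installed pure-Python package and the environment's site-packages.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the environment's site-packages directory.
site_dir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(site_dir))

# The package is not installed yet, so the import fails.
try:
    import late_arrival  # noqa: F401
    found_before_install = True
except ImportError:
    found_before_install = False

# "Install" the package while the process keeps running.
(site_dir / "late_arrival.py").write_text("ANSWER = 42\n")

# Import finders cache directory listings; drop those caches before retrying.
importlib.invalidate_caches()
late_arrival = importlib.import_module("late_arrival")

print(found_before_install, late_arrival.ANSWER)
```

This only covers first-time imports of pure-Python code. Replacing a module that is already imported needs a reload step, and compiled extensions generally cannot be swapped in a live process.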

How reliably does it work? For pure python packages maybe.

Very well in my experience. And Databricks considers it a valid use case too, they are contributing enhancements in ipython/ipython#14500.

To me it seems if you change the underlying environment all bets are off on whether your python objects are even valid if an install changed something under the hood.

This is my call as an experienced user to make. I can tell if I will need to restart the kernel or not, and I often know what specific changes will be made. It is not for updating numpy from 1.x to 2.x, it is for grabbing patches for very specific bugfixes.

Seems like a better option would be to make sure you serialized your results.

No, I disagree here. It sounds like blaming the user, but in fact, even with the best serialization and caching, there are operations that always take time, like loading large files or training small/medium models. I don't think that using conda-store should be incompatible with data scientists analysing big data or training baseline/statistical models in notebooks. But maybe I misunderstand the target audience of conda-store.

@krassowski
Author

historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible bespoke / broken environments because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

On the other hand, the current conda-store approach leads to broken notebooks for data scientists who are not used to working with conda-store:

  • one time my Python version was updated when I added a small unrelated package, because a new Python version had been released; it broke half of the packages I had installed
  • my numpy was upgraded to 2.0, which had just been released, when I was making an unrelated change in the env, and it broke my code
  • another time, when I upgraded one dependency, it downgraded another dependency without telling me

Why not give advanced users a choice on whether to update in place or not? If the old environment is copied as an archival build that has 0 risk, right?

Thinking about it, what I really miss is:

  • a) hot-reloading support (as discussed in this issue)
  • b) a confirmation screen saying what changes will be made so that I can adjust pins
  • c) an easy way to pin all packages to their current versions when modifying the environment (for example, if numpy 1.2 was auto-installed, I would want to apply a "numpy>=1.2.0,<2.0" pin)
    • right now the version information is actually a bit hidden
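Item (b) could be sketched as a comparison of resolved versions before and after a solve. This is a hedged illustration: `diff_builds` is a hypothetical helper, and the two dictionaries stand in for whatever the old and new lockfiles record.

```python
def diff_builds(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Describe added (+), removed (-), and changed (~) packages between builds."""
    lines = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue  # unchanged packages are not reported
        if before is None:
            lines.append(f"+ {name} {after}")
        elif after is None:
            lines.append(f"- {name} {before}")
        else:
            lines.append(f"~ {name} {before} -> {after}")
    return lines


# Example versions are made up for illustration.
old_build = {"numpy": "1.26.4", "scipy": "1.11.0"}
new_build = {"numpy": "2.0.0", "scipy": "1.11.0", "rich": "13.7.0"}

for line in diff_builds(old_build, new_build):
    print(line)
```

Showing this list before the build is accepted would surface an unexpected major-version jump while there is still time to add a pin.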

@krassowski
Author

Part of the delay is that even after the environment is built, we need to wait up to a minute for it to be refreshed on the nb_conda_kernels side: https://github.com/anaconda/nb_conda_kernels/blob/04c5fc605c08a4ced0cc45d2a6507dea40897600/nb_conda_kernels/manager.py#L18

So, as a user, I keep restarting the environment until it clicks. If my new/edited dependency is used lower in the notebook, I can waste many minutes there.

For the interactive use case we need to somehow rewrite nb_conda_kernels to watch for kernelspec changes on disk, or emit an event from conda-store and make nb_conda_kernels refresh. Interaction with nb_conda_kernels appears to be in scope, as it is included in the Dockerfile:

python=${python_version} nb_conda_kernels nodejs=18 yarn constructor \
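The "emit an event and refresh" idea can be sketched as a time-based cache with an explicit invalidation hook. This is not the nb_conda_kernels API: `KernelSpecCache`, its parameters, and the 60-second default (taken from the "up to a minute" observation above) are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class KernelSpecCache:
    """Cache the result of `load` for `ttl` seconds unless invalidated earlier."""

    def __init__(
        self,
        load: Callable[[], dict],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[dict] = None
        self._expires = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now >= self._expires:
            self._value = self._load()
            self._expires = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        """Hook to call when a 'build finished' event arrives."""
        self._value = None


# A frozen clock makes the example deterministic.
specs = {"build": 1}
cache = KernelSpecCache(lambda: dict(specs), ttl=60.0, clock=lambda: 0.0)

first = cache.get()
specs["build"] = 2       # a new build finishes
stale = cache.get()      # still the old answer until the TTL expires
cache.invalidate()       # event-driven refresh
fresh = cache.get()

print(first, stale, fresh)
```

With only the timeout, the user waits for expiry; with the hook, the refresh happens as soon as the build-complete event is delivered.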

@peytondmurray
Contributor

peytondmurray commented Oct 3, 2024

Part of the goal of the 2024 conda-store roadmap is to eliminate some of the painfully slow tasks that users regularly encounter, and from the description this seems like a major UX annoyance. From what I can tell this is particularly irritating because

  • There's no user feedback to indicate when an environment is updated
  • Once it is updated, the new environment may have versions of dependencies that differ from what the user expects (as in the case where numpy 2.0 was released and got picked up), with no feedback to the user
  • Swapping environments back and forth takes a long time because the kernelspec needs to be reloaded to see the applied changes
  • Running kernels can't take advantage of new packages that are installed, so if you're partway through executing a notebook and you realize you need another package, you have to reexecute everything
  • The IPython autoreloader doesn't work with the current symlink swapping scheme

So it's a combination of slow iteration and not enough feedback to the user. It sounds like there are downsides to both symlinking and update-in-place. However I feel like I don't have the full context, especially with regards to:

historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible bespoke / broken environments because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

In conda-store, each build is a separate specification with a corresponding lockfile, though. The idea of adding a new package to an existing environment with a conda install <package> doesn't apply. Or am I missing something?

@kcpevey mentioned to me that this may be a foot gun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately, if a shared env is changed by someone else, activating it after the change will cause the same issue.

This really sounds like a problem that could be fixed by notifying the user. We could also possibly give them the option to stay on the old build or bump their own build to the latest version as well, although if we do opt for hot-reloading by eliminating the symlinking mechanism, users who stick with the old build would need to be the ones who reload (to target the old build)?

@trallard trallard added type: enhancement 💅🏼 area: user experience 👩🏻‍💻 Items impacting the end-user experience impact: high 🟥 This issue affects most of the conda-store users or is a critical issue and removed needs: triaging 🚦 Someone needs to have a look at this issue and triage labels Oct 14, 2024
@krassowski
Author

From what I can tell this is particularly irritating because

I agree with your summary. Just one more thing:

In conda-store, each build is a separate specification with a corresponding lockfile, though. The idea of adding a new package to an existing environment with a conda install <package> doesn't apply. Or am I missing something?

How can a user achieve the closest possible thing to "add a new package to the environment without changing dependencies of anything else unless necessary", as with conda install <package>? Is my only choice to manually add pins for every single one of the ~30 packages that I have in the "Requested packages" section?

As a user, I now have a fear of adding anything to an environment (but I have to!). What is the safe path?

@peytondmurray peytondmurray added needs: discussion 💬 This item needs team-level discussion before scoping needs: investigation 🔎 Someone in the team needs to look into this issue before scoping labels Oct 25, 2024
@peytondmurray
Contributor

How can a user achieve the closest possible thing to "add a new package to the environment without changing dependencies of anything else unless necessary", as with conda install <package>? Is my only choice to manually add pins for every single one of the ~30 packages that I have in the "Requested packages" section?

With conda-store you can't do this at the moment because the environment gets re-solved when a new specification is submitted. This comes from previous experiences with incremental updates:

As an fyi, historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible bespoke / broken environments because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

I have trouble understanding how we'd be able to reproduce the environment you'd find yourself with if there were an option to add a new package without changing dependencies of anything else unless necessary; isn't this what happens when you pip install <package>? pip downloads the requested version of your package, and if the currently installed dependencies meet the requirements of the package you're trying to install, those dependencies don't change. The only way you'd be able to get back to your particular environment would be either by

  1. Pinning every dependency, or
  2. Recreating the incremental build steps that you followed from the initial state of your environment

But maybe I'm missing something or there's another way to do this?


About user-facing messaging: are we currently passing messages to the jupyterlab-conda-store extension? If not, it sounds like we'll need to do so in order to fix the messaging/notification part of this issue.

@krassowski
Author

I have trouble understanding how we'd be able to reproduce the environment you'd find yourself with if there were an option to add a new package without changing dependencies of anything else unless necessary

Well, by having a lock file/pip freeze output committed. Right now conda-store does not solve the problem, just shifts it. Instead of pinning every dependency in a freeze/lock file I have to pin it in the specification file.
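A minimal sketch of capturing such a snapshot from a running interpreter with the standard library's `importlib.metadata`. The `freeze` helper is hypothetical and only approximates `pip freeze`: it lists Python distributions visible to this interpreter, so conda packages that are not Python distributions would not appear.

```python
from importlib import metadata


def freeze() -> list[str]:
    """Return 'name==version' lines for installed Python distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with unreadable metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


snapshot = freeze()
print("\n".join(snapshot[:5]))
```

Committing a file like this alongside the notebook records what was actually installed after an incremental change, independently of the specification.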

The conda-store approach might be fine, but if the only sane way to add a new package is to have everything pinned, then I think it should have a button to populate pins for all packages in spec from the currently installed versions.
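What such a button might compute can be sketched as below. `pin_to_current_major` is a hypothetical helper, and the pin shape (at least the installed version, below the next major) follows the numpy example given earlier in this thread; it assumes plain dotted versions with a numeric first component.

```python
def pin_to_current_major(installed: dict[str, str]) -> list[str]:
    """Turn installed versions into '>=current,<next major' pins."""
    pins = []
    for name, version in sorted(installed.items()):
        # Assumes versions like "1.2.0"; epochs such as "1!2.0" are not handled.
        major = int(version.split(".")[0])
        pins.append(f"{name}>={version},<{major + 1}.0")
    return pins


# Example versions are made up for illustration.
for pin in pin_to_current_major({"numpy": "1.2.0", "pandas": "2.1.4"}):
    print(pin)
```

Pinning below the next major version still allows patch and minor updates, which is a looser choice than freezing exact versions; either policy could sit behind the same button.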

4 participants