Use mamba instead of conda in spark images #1351

Merged
merged 3 commits into from
Jun 12, 2021

mathbunnyru
Member

@mathbunnyru mathbunnyru commented Jun 1, 2021

I decided to try mamba separately, and I will look at what changes when using it.

Fix: #1352

@mathbunnyru
Member Author

I'm switching to mamba for the pyspark and all-spark images, so I will document what changes during the install steps for these two images.

@mathbunnyru
Member Author

mathbunnyru commented Jun 1, 2021

Current master build: https://github.com/jupyter/docker-stacks/runs/2719734258

pyspark conda install: https://github.com/jupyter/docker-stacks/runs/2719734258#step:5:6072

#11 238.7 
#11 238.7   abseil-cpp         conda-forge/linux-64::abseil-cpp-20210324.0-h9c3ff4c_0
#11 238.7   arrow-cpp          conda-forge/linux-64::arrow-cpp-4.0.0-py39h3f173ad_1_cpu
#11 238.7   aws-c-cal          conda-forge/linux-64::aws-c-cal-0.5.9-h3622835_0
#11 238.7   aws-c-common       conda-forge/linux-64::aws-c-common-0.5.11-h7f98852_0
#11 238.7   aws-c-event-stream conda-forge/linux-64::aws-c-event-stream-0.2.7-h7b1828e_7
#11 238.7   aws-c-io           conda-forge/linux-64::aws-c-io-0.9.14-h8007ed0_1
#11 238.7   aws-checksums      conda-forge/linux-64::aws-checksums-0.1.11-hc0e0e8b_6
#11 238.7   aws-sdk-cpp        conda-forge/linux-64::aws-sdk-cpp-1.8.186-h9ad65fb_2
#11 238.7   blas-devel         conda-forge/linux-64::blas-devel-3.9.0-9_openblas
#11 238.7   gflags             conda-forge/linux-64::gflags-2.2.2-he1b5a44_1004
#11 238.7   glog               conda-forge/linux-64::glog-0.4.0-h49b9bf7_3
#11 238.7   grpc-cpp           conda-forge/linux-64::grpc-cpp-1.37.1-h36de60a_0
#11 238.7   libevent           conda-forge/linux-64::libevent-2.1.10-hcdb4288_3
#11 238.7   liblapacke         conda-forge/linux-64::liblapacke-3.9.0-9_openblas
#11 238.7   libthrift          conda-forge/linux-64::libthrift-0.14.1-he6d91bd_1
#11 238.7   libutf8proc        conda-forge/linux-64::libutf8proc-2.6.1-h7f98852_0
#11 238.7   llvm-openmp        conda-forge/linux-64::llvm-openmp-11.1.0-h4bd325d_1
#11 238.7   orc                conda-forge/linux-64::orc-1.6.7-heec2584_1
#11 238.7   parquet-cpp        conda-forge/noarch::parquet-cpp-1.5.1-2
#11 238.7   pyarrow            conda-forge/linux-64::pyarrow-4.0.0-py39h3ebc44c_1_cpu
#11 238.7   re2                conda-forge/linux-64::re2-2021.04.01-h9c3ff4c_0
#11 238.7   s2n                conda-forge/linux-64::s2n-1.0.9-h9b69904_0
#11 238.7 
#11 238.7 The following packages will be REMOVED:
#11 238.7 
#11 238.7   jbig-2.1-h7f98852_2003
#11 238.7   libgomp-9.3.0-h2828fa1_19
#11 238.7 
#11 238.7 The following packages will be UPDATED:
#11 238.7 
#11 238.7   blas                                         1.1-openblas --> 2.109-openblas
#11 238.7   jupyterlab_server                      2.5.2-pyhd8ed1ab_0 --> 2.6.0-pyhd8ed1ab_0
#11 238.7   python                           3.9.2-hffdb5ce_0_cpython --> 3.9.4-hffdb5ce_0_cpython
#11 238.7 
#11 238.7 The following packages will be DOWNGRADED:
#11 238.7 
#11 238.7   _openmp_mutex                                   4.5-1_gnu --> 4.5-1_llvm
#11 238.7   imagecodecs                      2021.3.31-py39h7572904_1 --> 2021.3.31-py39h559889c_0
#11 238.7   libarchive                               3.5.1-hccf745f_2 --> 3.5.1-h3f442fb_1
#11 238.7   libtiff                                  4.3.0-hf544144_1 --> 4.2.0-hbd63e13_2
#11 238.7   zstd                                     1.5.0-ha95c52a_0 --> 1.4.9-ha95c52a_0


@mathbunnyru
Member Author

Current PR build: https://github.com/jupyter/docker-stacks/pull/1351/checks

pyspark mamba install: https://github.com/jupyter/docker-stacks/pull/1351/checks#step:5:6073

#11 16.32   Install:
#11 16.32 ────────────────────────────────────────────────────────────────────────────────────────
#11 16.32 
#11 16.32   abseil-cpp          20210324.1  h9c3ff4c_0          conda-forge/linux-64     1015 KB
#11 16.32   arrow-cpp                4.0.1  py39h1f788f4_0_cpu  conda-forge/linux-64       22 MB
#11 16.32   aws-c-cal                0.5.9  h3622835_0          conda-forge/linux-64       37 KB
#11 16.32   aws-c-common            0.5.11  h7f98852_0          conda-forge/linux-64      165 KB
#11 16.32   aws-c-event-stream       0.2.7  h7b1828e_7          conda-forge/linux-64       47 KB
#11 16.32   aws-c-io                0.9.14  h8007ed0_1          conda-forge/linux-64      121 KB
#11 16.32   aws-checksums           0.1.11  hc0e0e8b_6          conda-forge/linux-64       50 KB
#11 16.32   aws-sdk-cpp            1.8.186  h9ad65fb_2          conda-forge/linux-64        5 MB
#11 16.32   gflags                   2.2.2  he1b5a44_1004       conda-forge/linux-64      114 KB
#11 16.32   glog                     0.5.0  h48cff8f_0          conda-forge/linux-64      104 KB
#11 16.32   grpc-cpp                1.38.0  h2519f57_0          conda-forge/linux-64        4 MB
#11 16.32   libevent                2.1.10  hcdb4288_3          conda-forge/linux-64        1 MB
#11 16.32   libthrift               0.14.1  he6d91bd_1          conda-forge/linux-64        5 MB
#11 16.32   libutf8proc              2.6.1  h7f98852_0          conda-forge/linux-64       95 KB
#11 16.32   orc                      1.6.8  h58a87f1_0          conda-forge/linux-64      740 KB
#11 16.32   parquet-cpp              1.5.1  1                   conda-forge/linux-64        3 KB
#11 16.32   pyarrow                  4.0.1  py39h3ebc44c_0_cpu  conda-forge/linux-64        3 MB
#11 16.32   re2                 2021.04.01  h9c3ff4c_0          conda-forge/linux-64      218 KB
#11 16.32   s2n                      1.0.9  h9b69904_0          conda-forge/linux-64      432 KB
#11 16.32 
#11 16.32   Upgrade:
#11 16.32 ────────────────────────────────────────────────────────────────────────────────────────
#11 16.32 
#11 16.32   libprotobuf             3.15.8  h780b84a_0          installed                       
#11 16.32   libprotobuf             3.16.0  h780b84a_0          conda-forge/linux-64        2 MB
#11 16.32   protobuf                3.15.8  py39he80948d_0      installed                       
#11 16.32   protobuf                3.16.0  py39he80948d_0      conda-forge/linux-64      342 KB

@mathbunnyru
Member Author

After removing what the two lists have in common:
conda:

Install:
blas-devel         conda-forge/linux-64::blas-devel-3.9.0-9_openblas
liblapacke         conda-forge/linux-64::liblapacke-3.9.0-9_openblas
llvm-openmp        conda-forge/linux-64::llvm-openmp-11.1.0-h4bd325d_1
pyarrow            conda-forge/linux-64::pyarrow-4.0.0-py39h3ebc44c_1_cpu

Remove:
jbig-2.1-h7f98852_2003
libgomp-9.3.0-h2828fa1_19

Upgrade:
blas                                         1.1-openblas --> 2.109-openblas
jupyterlab_server                      2.5.2-pyhd8ed1ab_0 --> 2.6.0-pyhd8ed1ab_0
python                           3.9.2-hffdb5ce_0_cpython --> 3.9.4-hffdb5ce_0_cpython

Downgrade:
_openmp_mutex                                   4.5-1_gnu --> 4.5-1_llvm
imagecodecs                      2021.3.31-py39h7572904_1 --> 2021.3.31-py39h559889c_0
libarchive                               3.5.1-hccf745f_2 --> 3.5.1-h3f442fb_1
libtiff                                  4.3.0-hf544144_1 --> 4.2.0-hbd63e13_2
zstd                                     1.5.0-ha95c52a_0 --> 1.4.9-ha95c52a_0

mamba:

Install:
pyarrow                  4.0.1  py39h3ebc44c_0_cpu  conda-forge/linux-64        3 MB

Upgrade:
libprotobuf             3.15.8  h780b84a_0          installed                       
libprotobuf             3.16.0  h780b84a_0          conda-forge/linux-64        2 MB
protobuf                3.15.8  py39he80948d_0      installed                       
protobuf                3.16.0  py39he80948d_0      conda-forge/linux-64      342 KB

I have no idea why conda updates python (I noticed it today and created issue #1352).
Also, conda uses pyarrow 4.0.0, though 4.0.1 is available.
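
The manual comparison above can also be scripted. The following is only a rough sketch: it assumes the two line layouts seen in the logs in this thread (conda's `channel/subdir::name-version-build` and mamba's `name version build channel size` columns), and the helper names `parse_conda_line`, `parse_mamba_line`, and `diff_plans` are made up for illustration and are not part of conda or mamba.

```python
from typing import Dict, List, Tuple

# package name -> (version, build string)
Plan = Dict[str, Tuple[str, str]]


def parse_conda_line(line: str) -> Tuple[str, str, str]:
    """Parse a conda plan line such as
    'pyarrow  conda-forge/linux-64::pyarrow-4.0.0-py39h3ebc44c_1_cpu'."""
    spec = line.split()[-1]
    dist = spec.split("::", 1)[-1]
    # The last two '-' separated fields are version and build;
    # the package name itself may contain '-'.
    name, version, build = dist.rsplit("-", 2)
    return name, version, build


def parse_mamba_line(line: str) -> Tuple[str, str, str]:
    """Parse a mamba plan line such as
    'pyarrow  4.0.1  py39h3ebc44c_0_cpu  conda-forge/linux-64  3 MB'."""
    name, version, build = line.split()[:3]
    return name, version, build


def to_plan(lines: List[str], parser) -> Plan:
    plan: Plan = {}
    for line in lines:
        name, version, build = parser(line)
        plan[name] = (version, build)
    return plan


def diff_plans(conda: Plan, mamba: Plan):
    """Return (only in conda, only in mamba, in both but different)."""
    only_conda = sorted(set(conda) - set(mamba))
    only_mamba = sorted(set(mamba) - set(conda))
    changed = {
        name: (conda[name], mamba[name])
        for name in sorted(set(conda) & set(mamba))
        if conda[name] != mamba[name]
    }
    return only_conda, only_mamba, changed


# A few lines copied from the two build logs in this thread.
CONDA_LINES = [
    "blas-devel         conda-forge/linux-64::blas-devel-3.9.0-9_openblas",
    "pyarrow            conda-forge/linux-64::pyarrow-4.0.0-py39h3ebc44c_1_cpu",
    "s2n                conda-forge/linux-64::s2n-1.0.9-h9b69904_0",
]
MAMBA_LINES = [
    "pyarrow                  4.0.1  py39h3ebc44c_0_cpu  conda-forge/linux-64        3 MB",
    "s2n                      1.0.9  h9b69904_0          conda-forge/linux-64      432 KB",
]

only_conda, only_mamba, changed = diff_plans(
    to_plan(CONDA_LINES, parse_conda_line),
    to_plan(MAMBA_LINES, parse_mamba_line),
)
print("only conda:", only_conda)
print("only mamba:", only_mamba)
print("different:", changed)
```

Unlike the hand-made summary, this sketch also reports packages whose version or build differs between the two plans, so it would list more entries if fed the full logs.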

@mathbunnyru
Member Author

@wolfv maybe you know why there are differences between conda and mamba?

@wolfv

wolfv commented Jun 1, 2021

There are usually a bunch of "working" solutions and the optimization criteria are not exactly clear.

From the installation list I would suspect that both solutions would work almost equally well?

@mathbunnyru
Member Author

> There are usually a bunch of "working" solutions and the optimization criteria are not exactly clear.
>
> From the installation list I would suspect that both solutions would work almost equally well?

I will check (I have never used the pyarrow package, so I will try to find a good example).
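
One possible smoke test for pyarrow, offered only as a sketch and not as the test used in this repository: write a small table to Parquet in memory and read it back. The function name `parquet_roundtrip_ok` is made up for illustration; it returns `None` when pyarrow is not installed so the script can also run outside the image.

```python
import importlib.util


def parquet_roundtrip_ok():
    """Return True if a Parquet round trip preserves a small table,
    False if it does not, and None if pyarrow is not installed."""
    if importlib.util.find_spec("pyarrow") is None:
        return None

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Write to an in-memory buffer instead of a file on disk.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)

    # Read the same bytes back and compare with the original table.
    restored = pq.read_table(pa.BufferReader(sink.getvalue()))
    return restored.equals(table)


print("parquet round trip:", parquet_roundtrip_ok())
```

Running this in images built with conda and with mamba would give a quick indication of whether both solutions produce a working pyarrow, although it exercises only a small part of the library.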

Also, there is a reproducible example now (you asked for that in another PR) and the diff is almost like the one I gave:

FROM jupyter/scipy-notebook:eb2f7453798b

# Install pyarrow
RUN conda install --quiet --yes --satisfied-skip-solve \
    'pyarrow=4.0.*' && \
    conda clean --all -f -y && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

@wolfv

wolfv commented Jun 4, 2021

I've just analyzed a similar "issue" over here: mamba-org/mamba#924
As you can see, it's often hard to say which solution is really "better". I think for this use case, a more minimal installation is desirable. If you ran mamba update, you should see python, openblas, etc. being updated.
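
To illustrate the point about several "working" solutions, here is a deliberately tiny toy model. It is not how conda or mamba actually solve environments: the package index, the installed set, and both ranking functions are assumptions invented for this sketch. It only shows that the same set of valid solutions yields different answers under different preferences.

```python
from itertools import product

# Hypothetical toy index: package -> available versions (all satisfy 'pyarrow=4.0.*').
INDEX = {
    "python": [(3, 9, 2), (3, 9, 4)],
    "pyarrow": [(4, 0, 0), (4, 0, 1)],
}
# What is already in the (toy) environment before the install.
INSTALLED = {"python": (3, 9, 2)}


def solutions():
    """Enumerate every combination of versions; all are valid in this toy."""
    names = sorted(INDEX)
    for combo in product(*(INDEX[n] for n in names)):
        yield dict(zip(names, combo))


def changes(solution):
    """Count installed packages whose version would change."""
    return sum(1 for name, ver in INSTALLED.items() if solution[name] != ver)


def prefer_newest(solution):
    """Rank by version numbers only: newer is better."""
    return tuple(solution[n] for n in sorted(solution))


def prefer_minimal(solution):
    """Rank by fewest changes first, then by newest versions."""
    return (-changes(solution), prefer_newest(solution))


newest = max(solutions(), key=prefer_newest)
minimal = max(solutions(), key=prefer_minimal)

print("all valid solutions:", len(list(solutions())))
print("newest-first pick:  ", newest)
print("minimal-change pick:", minimal)
```

Both picks satisfy the request; they differ only in whether the already installed python is touched, which mirrors the kind of difference seen between the two install plans in this thread.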

@mathbunnyru
Member Author

mathbunnyru commented Jun 4, 2021

> I've just analyzed a similar "issue" over here: mamba-org/mamba#924
> As you can see, it's often hard to say which solution is really "better". I think for this use case, a more minimal installation is desirable. If you ran mamba update, you should see python, openblas, etc. being updated.

I think you're right, and I do like mamba's more minimal installation.
I also pin the python version in another review, but conda still installs the same packages (except that it no longer updates python).

Collaborator

@consideRatio consideRatio left a comment

This looks excellent to me! I absolutely support this change based on previous experience of making the same change in other projects, especially now that you have covered the differences explicitly.

Thank you so much for your work on this @mathbunnyru and thank you @wolfv for mamba and your input now!

Collaborator

@romainx romainx left a comment

Hello,

Thanks for this change. In my opinion, let's try it, since the tests are OK.
We have tests that load packages, and also a test involving arrow that was written as part of issue #1198.
If issues are reported, we will add corresponding tests and roll back to conda, which is pretty easy to do.

Best

@mathbunnyru
Member Author

Great, thanks, let's try it out!

@mathbunnyru mathbunnyru merged commit 7291526 into jupyter:master Jun 12, 2021
Successfully merging this pull request may close these issues.

Python version changes in pyspark-notebook (and all-spark-notebook)
4 participants