Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyscf: 1.7.6.post1 -> 2.0.1 #144253

Merged
merged 6 commits into from
Nov 28, 2021
Merged

pyscf: 1.7.6.post1 -> 2.0.1 #144253

merged 6 commits into from
Nov 28, 2021

Conversation

sheepforce
Copy link
Member

@sheepforce sheepforce commented Nov 2, 2021

Motivation for this change

This updates PySCF to its latest release. The build system has changed somewhat and now uses a lot of CMake. I've enabled the tests, as inspired by an older commit of @drewrisinger .
The CPPE library for polarisable embedding has been added, which is an optional dependency of PySCF. I am not entirely sure if i did the python binding stuff correctly there. CPPE is at least not in the PYTHONPATH from within a nix-shell.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@ofborg ofborg bot added 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Nov 2, 2021
@sheepforce
Copy link
Member Author

As written I've enabled the test suite. However, it is huge and runs for at least 2 h on my 10 cores and some single test are failing and need to be disabled. I guess it might be better to disable the full test suite? What do you think?

@markuskowa
Copy link
Member

As written I've enabled the test suite. However, it is huge and runs for at least 2 h on my 10 cores and some single test are failing and need to be disabled. I guess it might be better to disable the full test suite? What do you think?

Let's disable the test suite for now. It is too big and not reliable. We still have a test in NixOS-QChem to catch basic errors.

@drewrisinger
Copy link
Contributor

As written I've enabled the test suite. However, it is huge and runs for at least 2 h on my 10 cores and some single test are failing and need to be disabled. I guess it might be better to disable the full test suite? What do you think?

Even PySCF mainline doesn't run all tests that they include in their source: https://github.com/pyscf/pyscf/blob/master/.github/workflows/run_tests.sh

Here are the tests that I excluded, feel free to use it as a starting point. I think tests typically ran for about 20-30 mins with all these disabled? https://github.com/drewrisinger/nur-packages/blob/77d8267ff1b432c76ae90fc57db9e9787640137f/pkgs/python-modules/pyscf/default.nix#L94-L255

Copy link
Contributor

@drewrisinger drewrisinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments about structure of cppe derivation, and a few details on the hash types & build variables

@drewrisinger
Copy link
Contributor

As written I've enabled the test suite. However, it is huge and runs for at least 2 h on my 10 cores and some single test are failing and need to be disabled. I guess it might be better to disable the full test suite? What do you think?

Let's disable the test suite for now. It is too big and not reliable. We still have a test in NixOS-QChem to catch basic errors.

That might be true, but the issue is that any packaging errors should ideally be caught before merge, not after. Do you expect a potential reviewer of future pyscf updates to know to go to NixOS-QChem, figure out how to run tests there against a PR, and then review? Vs the standard nixpkgs review process of nix-review pr ...?

@markuskowa
Copy link
Member

As written I've enabled the test suite. However, it is huge and runs for at least 2 h on my 10 cores and some single test are failing and need to be disabled. I guess it might be better to disable the full test suite? What do you think?

Let's disable the test suite for now. It is too big and not reliable. We still have a test in NixOS-QChem to catch basic errors.

That might be true, but the issue is that any packaging errors should ideally be caught before merge, not after. Do you expect a potential reviewer of future pyscf updates to know to go to NixOS-QChem, figure out how to run tests there against a PR, and then review? Vs the standard nixpkgs review process of nix-review pr ...?

I agree with you, but unreliable test suites are worse than (or as good as) not having a test at all. I would not expect that some reviewer uses an external test. I look at the NixOS-QChem tests as a fallback option, which catches the error at some point later.

@sheepforce sheepforce force-pushed the pyscf branch 2 times, most recently from b7126e3 to f5d9be5 Compare November 4, 2021 09:56
@sheepforce
Copy link
Member Author

Thank you for the valuable feedback @drewrisinger , this solved some problems. I am doing final tunings on the test suite to get it running properly, and then it should be done.

@ofborg ofborg bot requested a review from drewrisinger November 4, 2021 10:07
@ofborg ofborg bot added 10.rebuild-darwin: 1-10 and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin labels Nov 4, 2021
@drewrisinger
Copy link
Contributor

drewrisinger commented Nov 4, 2021

FWIW, release 2.0.1 was posted a few hours ago. Might want to roll that into this PR if possible, since it's not finalized? https://github.com/pyscf/pyscf/releases/tag/v2.0.1

Copy link
Contributor

@drewrisinger drewrisinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor suggestions

@sheepforce sheepforce changed the title pyscf: 1.7.6.post1 -> 2.0.0 pyscf: 1.7.6.post1 -> 2.0.1 Nov 8, 2021
@sheepforce
Copy link
Member Author

sheepforce commented Nov 11, 2021

I get a few errors when building pyscf, even when single-threaded. The most alarming is `` ... File "/build/source/pyscf/gto/basis/parse_nwchem.py", line 219, in search_seg with open(basisfile, 'r') as fin: OSError: [errno 24] Too many open files '/build/source/pyscf/gto/basis/ano.dat'

Rank 1820 tests in 2536.987 s FAILED (SKIP=4, errors=6)

I am afraid I cannot reproduce any of those 😐 I have it running now on all of my three different CPUs (i7-7700K, Xeon W2155 and the old Xeon E5420). Unfortunately I do not have access to others, such as the AMD Epyc machines of @markuskowa . But in general numerical effects this large are worrying. However, I don't see what we could change about it, except disabling those tests.

I am forcing the test to be always single core in the preCheck and the only parallel part is BLAS and FFT. Running them multi-core gives even less stable results.

Regarding your failing NWChem parser test; I suppose this might actually depend on how many other files you have open during your build? I don't think we can change this. ulimit -n $LARGENUMBER is not allowed and the number of open file handles would need to be set in the sysctl configuration. I've increased this number on my machines from the defaults some time ago.

Upstream builds with netlib BLAS in the CI apparently. I mean we could force blas to be blas-reference, but I guess this it not the way forward ...

@sheepforce
Copy link
Member Author

@ofborg build python3.pkgs.pyscf

@sheepforce
Copy link
Member Author

ofBorg built the current version just fine without any errors. 🙂

@drewrisinger
Copy link
Contributor

Ok, I'm willing to chalk the errors up to a local build issue, I'd never seen them on my previous builds of my NUR package version.

libcint: formatting and features


libcint: platforms
libxc: formatting


libxc: platforms
fields: expose package


fields: formatting


fields: platforms


fields: platforms


fields: remove redundant platform
polarizationsolver: expose


polyrizationsolver: formatting


polarizationsolver: platforms


polarizationsolver: platforms


polarizationsolver: license


polarizationsolver: remove redundant platform
cppe: move pytestCheckHook to checkInputs


cppe: hash


cppe: license and hash


cppe: formatting


python3.pkgs.cppe: more tests


cppe: formatting


cppe: formatting


cppe: platforms


cppe: platforms
pyscf: hash


pyscf: limit test suite to single core


pyscf: adapting test suite


pyscf: fix pythonpath for tests


pyscf: formatting


pyscf: platforms


remove log


pyscf: enable uadc module


pyscf: platforms


pyscf: formatting


pyscf: disable instable N3 CI test


pyscf: formating


pyscf: increase ulimit


pyscf: ulimit files


pyscf: remove ulimit -n
@markuskowa
Copy link
Member

I am still getting random errors from the test suite:

  MemoryError: Insufficient memory! MP2 memory usage 0 MB (currently available -1 MB)
  -------------------- >> begin captured stdout << ---------------------
  output file: /dev/null
  
  --------------------- >> end captured stdout << ----------------------
  
  ----------------------------------------------------------------------
  Ran 1820 tests in 2249.433s
  
  FAILED (SKIP=4, errors=2)

Note, that I ran this on a machine that had over 100 GB of free RAM!

Copy link
Member

@markuskowa markuskowa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to come up with a solution for the test suite.

Copy link
Member

@markuskowa markuskowa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't reproduce the errors anymore. Let's merge it and give it try.

@markuskowa markuskowa merged commit c0d9398 into NixOS:master Nov 28, 2021
@github-actions
Copy link
Contributor

Successfully created backport PR #147740 for release-21.11.

@markuskowa
Copy link
Member

markuskowa commented Nov 30, 2021

@sheepforce
Copy link
Member Author

Hm this is similar to this problem. Neither my machines, nor ofBorg had problems with those. The number of open files can unfortunately not be increased from within the job, but is a kernel setting. Interestingly they all fail by the NWChem parser: File "/build/source/pyscf/gto/basis/parse_nwchem.py", line 219, in search_seg

I will take a look if I can spot something there.

@sheepforce
Copy link
Member Author

I've tried to figure out what's going on in the code there. The nwchem basis parser opens a file for each atom (or at least element) for each basis set, but it does so safely in a bracket pattern and closes them immediately. So there is no apparent reason why especially the nwchem basis set parser (with which one of the hydra jobs fails) should have too many open files. I've no idea why this is not working on Hydra and for this build.

The other Hydra jobs fails with numerical issues (huge ones), which very much looks like some faulty behaviour with respect to BLAS/LAPACK - numpy interaction or something like this. I still can't reproduce any of these errors on any of my Intel machines. Hydra has some AMD Epyc instances, correct? I've also built with mkl as the blas and lapack provider for all packages (global overlay) and can also not reproduce any of the errors with MKL.
@markuskowa , could you try building on your AMD machines with MKL for everything just once?
I've also tried to use blas-reference and lapack-reference (this is what upstream does) but numpy refuses to build with those, due to an attribute conflict in its derivation.

If MKL also does not solve the problem I will open an issue upstream and try to get an idea from the developers.

@markuskowa
Copy link
Member

markuskowa commented Dec 3, 2021

I can give it a try with MKL, but my suspicion is that this will not solve all the problems. It fails also on aarch64 (which we could deactivate?), where MKL does not work. I think the problem is somewhere else: pyscf-2.0.1 builds just fine on my internal Hydras on x86_64-linux with AMD Epyc as well as on Intel Xeon with the Openblas default.
Looking at logs of the build failures, it looks more like resource limit of Hydra's builders (it complains about too many open files).

@markuskowa
Copy link
Member

FYI: building pyscf with MKL leads to the following errors. Note that they fail just above the test's threshold.

======================================================================
FAIL: test_c_ragf2 (test_c_agf2.KnownValues)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/pyscf/agf2/test/test_c_agf2.py", line 45, in test_c_ragf2
    self.assertAlmostEqual(np.max(np.absolute(vv1-vv2)), 0.0, 10)
AssertionError: 2.000888343900442e-10 != 0.0 within 10 places (2.000888343900442e-10 difference)

======================================================================
FAIL: test_c_uagf2 (test_c_agf2.KnownValues)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/pyscf/agf2/test/test_c_agf2.py", line 66, in test_c_uagf2
    self.assertAlmostEqual(np.max(np.absolute(vv1-vv2)), 0.0, 10)
AssertionError: 2.346496330574155e-10 != 0.0 within 10 places (2.346496330574155e-10 difference)

----------------------------------------------------------------------
Ran 1820 tests in 2344.443s

FAILED (SKIP=4, failures=2)
builder for '/nix/store/07675hcr0ylcm4x7nkjddcrwqba23r7q-python3.9-pyscf-2.0.1.drv' failed with exit code 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants