
Adding support for sanitised builds #182

Merged: 50 commits merged into master from the sanitisers branch on Nov 30, 2021

Conversation

@csegarragonz (Collaborator) commented Nov 25, 2021

In this PR I introduce a task to build the tests with different sanitisers.

I also add a workflow that runs in parallel to the tests and checks for newly introduced sanitiser errors. It checks against a set of files committed in this PR that contain the output of running the sanitisers as of today.

After offline discussion, I think the changes that allow for sanitised builds are to be kept. The actual workflow file is a different discussion: before using it, the codebase would have to be in a state where we trigger no warnings. The workflow file is usable and ready to be merged in; it only runs on workflow_dispatch, i.e. if manually triggered from the UI.
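
For illustration, a minimal skeleton of such a manually-triggered workflow could look like the following. This is a sketch rather than the actual contents of the workflow file; the runner setup and job layout are assumptions, and only the "inv dev.cmake -b Debug" build step appears verbatim later in this PR.

name: Sanitisers

# run only when triggered manually from the Actions UI
on: workflow_dispatch

jobs:
  address-sanitiser:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: "Build dependencies to be shared by all sanitiser runs"
        run: inv dev.cmake -b Debug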

The low-hanging fruit to reduce a bunch of the warnings:

  • Address Sanitiser: Redis::dequeueBase memory leak #183
  • Thread Sanitiser: disable ZeroMQ-related code and look into our latch implementation.
  • Undefined Sanitiser: most warnings come from unaligned ints in snapshots.
  • Leak Sanitiser: the leak sanitiser is a subset of the address sanitiser, so we could consider not using it at all.
  • Memory Sanitiser: don't use it in any hypothetical workflow, and instead turn -Wuninitialized and -Wmaybe-uninitialized into errors.

@csegarragonz csegarragonz self-assigned this Nov 25, 2021
@csegarragonz changed the title from "Sanitisers" to "Adding support for sanitised builds" Nov 25, 2021
@eigenraven (Collaborator)

I'd have a couple of general comments on this:

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.
  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.
  • If you are capturing logs into a file, pipe through tee instead of redirecting with > so they also show up in the Actions log, otherwise debugging will be difficult. You can also use an always() condition to run a specific step, like a cat of the log file, regardless of whether the build succeeds or fails (see the sketch below).
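
A minimal sketch of that last point, with the step names, test binary path and log path all illustrative rather than taken from the workflow:

# tee keeps the output visible in the Actions log while also writing the file;
# "shell: bash" enables pipefail, so the step still fails on the test binary's exit code
- name: "Run the sanitised tests"
  shell: bash
  run: ./bin/faabric_tests 2>&1 | tee /tmp/sanitiser.log

# if the output is instead redirected with >, dump the file in a step that runs
# whether or not the test step succeeded
- name: "Print the captured sanitiser log"
  if: always()
  run: cat /tmp/sanitiser.log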

@csegarragonz (Collaborator, Author)

@KubaSz thanks for your points; some replies:

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.

We usually run CMake through Python tasks, and I've added the typo checking there.

  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.

Oh, so you are saying the output of sanitised runs is non-deterministic? That could be an issue. The problem is that there are currently many warnings, and what I was trying to do was check only for newly added ones.

I was deliberately not using tee for that same reason: the output at the moment is unintelligible.

@eigenraven (Collaborator) commented Nov 25, 2021

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.
    We usually run CMake through Python tasks, and I've added the typo checking there.

True, I just like having stringly-typed values checked at every step if it doesn't impact performance. The scripts might evolve at some point, the CMake and Python sides might get out of sync, and without a check in CMake a typo could silently pass the tests without the sanitiser enabled.

  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.

Oh, so you are saying the output of sanitised runs is non-deterministic? That could be an issue. The problem is that there are currently many warnings, and what I was trying to do was check only for newly added ones.

I don't remember the exact format of those outputs, but at least the leak checker did output general memory statistics when I last used it, so the output would be mostly deterministic, but it would change with any code change due to added/removed allocations.

I was deliberately not using tee for that same reason: the output at the moment is unintelligible.

Even if it is hard to read, it's useful to have it, and in the Actions view you can just collapse that section if you don't want to look at it.

@csegarragonz force-pushed the sanitisers branch 2 times, most recently from 135b66d to 495b26a, on November 25, 2021 at 16:08
@eigenraven (Collaborator)

This is now ready for review. The only thing I would potentially look into before merging is the MPI TSan failures (currently disabled in the ignore-list file): send/recv do some things with buffers that I don't quite understand, and TSan sees it as a data race. It might be a false positive (if the synchronisation happens through sockets, which is invisible to TSan), or it might not.

@eigenraven eigenraven requested a review from Shillaker November 29, 2021 10:13
@eigenraven (Collaborator)

@csegarragonz It would be good if you reviewed this too; I can't request a review via GitHub as you opened the PR.

@Shillaker (Collaborator) left a comment

@KubaSz @csegarragonz awesome work guys, this looks great. A few nitpicky comments mixed in with some small changes/questions.

.github/workflows/sanitisers.yml (outdated, resolved)
- uses: actions/cache@v2
with:
path: '~/.conan'
key: ${{ runner.os }}-${{ steps.get-conan-version.outputs.conan-version }}-${{ hashFiles('cmake/ExternalProjects.cmake') }}
Collaborator:

How large is the resulting cache file? I know there's a limit on GHA caches, at which point the oldest keys get evicted; I just want to make sure this isn't causing us to hit that limit.

Collaborator:

430 MB; the limit is 10 GB per repo (https://github.com/actions/cache), and this should be a cache hit most of the time as the dependency versions don't change all that often.

@csegarragonz (Collaborator, Author) Nov 29, 2021:

I also think this will yield a great speedup to the runtime of the general tests if we use the caching there (#187), but @KubaSz it looks as though we may be using the wrong key: https://github.com/faasm/faabric/runs/4337385997?check_suite_focus=true#step:7:4

(if the logs are to be trusted, of course)

Could you double-check?

Collaborator:

Huh, seems that hashFiles doesn't work, let me look into it

Collaborator:

I had to replace it with manual hashing in bash + sha256sums; it works in the most recent commit. @csegarragonz this change will probably need porting over to #187.
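
For reference, one way the manual hashing could be wired up. The step id and output name here are illustrative, the cache block is copied from the hunk above with only the hashFiles() part swapped out, and the get-conan-version step is assumed to exist as in the original workflow:

- name: "Hash the dependency specification"
  id: hash-deps
  # sha256sum prints "<hash>  <file>"; keep only the hash and expose it as a step output
  run: echo "::set-output name=deps-hash::$(sha256sum cmake/ExternalProjects.cmake | cut -d' ' -f1)"

- uses: actions/cache@v2
  with:
    path: '~/.conan'
    key: ${{ runner.os }}-${{ steps.get-conan-version.outputs.conan-version }}-${{ steps.hash-deps.outputs.deps-hash }}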

tasks/dev.py (outdated, resolved)
src/util/testing.cpp (outdated, resolved)
- name: "Build dependencies to be shared by all sanitiser runs"
run: inv dev.cmake -b Debug

address-sanitiser:
Collaborator:

Do we need a job for each different sanitizer? This file is quite bloated as a result, so I'd be interested to know the time difference in running them sequentially in a single job, repeating the call to dev.sanitise and faabric_tests each time. The GHA docs say each job in a workflow runs in a "fresh instance" of the VM, but I don't know whether this means a completely separate VM on a different host, or if it's actually just cramming lots of VMs together on the same allocated physical resources (in which case it might be counterproductive).

Collaborator:

It all runs in parallel, and each sanitizer requires a full rebuild of the source (but not dependencies), so this saves about 20 mins of time on tests

@Shillaker (Collaborator) Nov 30, 2021:

Ok cool, is that in theory or in practice? I'm just surprised that GHA gives out resources that freely. I don't think we should change it, am just curious.
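
As an aside, a build matrix is one way to keep the per-sanitiser parallelism while trimming the duplicated YAML the review comment mentions above; a rough sketch, where the sanitiser names and the dev.sanitise invocation are assumptions rather than what the PR actually does:

jobs:
  sanitisers:
    runs-on: ubuntu-latest
    strategy:
      # keep the other sanitisers running even if one of them reports errors
      fail-fast: false
      matrix:
        sanitiser: [Address, Thread, Undefined]
    steps:
      - uses: actions/checkout@v2
      - name: "Build and run the tests with the ${{ matrix.sanitiser }} sanitiser"
        # hypothetical task arguments; each matrix job still pays for a full source rebuild
        run: inv dev.sanitise ${{ matrix.sanitiser }}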

src/redis/Redis.cpp (outdated, resolved)
src/redis/Redis.cpp (resolved)
src/scheduler/Executor.cpp (outdated, resolved)
src/scheduler/Scheduler.cpp (resolved)
src/state/StateKeyValue.cpp (outdated, resolved)
# Catch2 allocates in its signal handler, this prevents showing the wrong crash report
signal:*

# TODO: Remove: There's something weird going on with MPI code I don't understand
@csegarragonz (Collaborator, Author):

So, for me to see the races TSan is complaining about, I'd just have to re-run with the next line commented out, right?

Collaborator:

Yes, it would terminate on the first detected race. For me that was in the AllGather implementation
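
For context, TSan picks up its runtime suppressions and error handling from the TSAN_OPTIONS environment variable; a hedged sketch of how such a run might be wired, where the suppressions file name, test binary path and halt_on_error setting are assumptions rather than anything stated in this PR:

- name: "Run the tests under ThreadSanitizer"
  run: ./bin/faabric_tests
  env:
    # suppressions= points TSan at a runtime suppressions file like the one quoted above;
    # halt_on_error=1 makes the process exit on the first reported race
    TSAN_OPTIONS: "suppressions=./thread-sanitizer-ignorelist.txt halt_on_error=1"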

@csegarragonz (Collaborator, Author)

The only thing I would add to Simon's comments is to see if you can bump the faabric dependency in faasm as well; just to double check nothing is broken.

@Shillaker (Collaborator) left a comment

Awesome LGTM 👍

@eigenraven (Collaborator)

It looks like a faasm distributed test fails with this merged (on top of PR faasm/faasm#540):
@csegarragonz Does this look like it could be caused directly by my changes, or is it something to do with the races in distributed tests?

[20:44:04:619] [1] [info] =============================================
[20:44:04:619] [1] [info] TEST: Test running an MPI function spanning two hosts
[20:44:04:619] [1] [info] =============================================
[20:44:04:619] [1] [debug] Resetting scheduler
[20:44:04:619] [1] [debug] Resetting scheduler thread-local cache for thread 1
[20:44:04:620] [1] [debug] Resetting dirty tracking after pushing diffs mpi/mpi_bcast
[20:44:04:620] [1] [debug] Resetting dirty tracking
[20:44:04:620] [1] [debug] Scheduling 1/1 calls to mpi/mpi_bcast locally
[20:44:04:622] [1] [debug] Scaling mpi/mpi_bcast from 0 -> 1
[20:44:04:622] [1] [debug] Starting executor 172.18.0.7_966818832
[20:44:04:622] [1] [debug] WAVM module cache initialising mpi/mpi_bcast
[20:44:04:622] [1] [debug] Loading main module mpi/mpi_bcast
[20:44:04:631] [1] [debug] Instantiating module mpi/mpi_bcast  
[20:44:04:633] [1] [debug] Finished instantiating module mpi/mpi_bcast  
[20:44:04:633] [1] [debug] Opened preopened fd at /
[20:44:04:633] [1] [debug] Opened preopened fd at .
[20:44:04:633] [1] [debug] Creating 2 thread stacks
[20:44:04:633] [1] [debug] Successfully executed __wasm_call_ctors for mpi/mpi_bcast
[20:44:04:633] [1] [debug] Successfully executed zygote for mpi/mpi_bcast
[20:44:04:633] [1] [debug] heap_top=11075584 initial_pages=169 initial_table=5
[20:44:04:648] [1] [debug] Wrote snapshot mpi/mpi_bcast_reset to fd 96
[20:44:04:648] [37] [debug] Thread pool thread 172.18.0.7_966818832:0 starting up
[20:44:04:648] [37] [debug] Added thread id 37 to /sys/fs/cgroup/cpu/faasm/tasks
[20:44:04:648] [37] [debug] Not using network ns, support off
[20:44:04:649] [37] [debug] S - args_sizes_get - 4194296 4194300
[20:44:04:649] [37] [debug] S - sbrk - 0
[20:44:04:649] [37] [debug] S - sbrk - 65536
[20:44:04:649] [37] [debug] S - args_get - 11075632 11075600
[20:44:04:649] [37] [debug] S - MPI_Init (create) 0 0
[20:44:04:649] [37] [debug] Initialising world 966818837
[20:44:04:650] [37] [debug] Adding new function call client for 172.18.0.6
[20:44:04:652] [37] [debug] Resetting dirty tracking after pushing diffs mpi/mpi_bcast
[20:44:04:652] [37] [debug] Resetting dirty tracking
[20:44:04:652] [37] [debug] Scheduling 2/3 calls to mpi/mpi_bcast on 172.18.0.6
[20:44:04:653] [37] [debug] Scheduling 1/3 calls to mpi/mpi_bcast locally
[20:44:04:653] [37] [debug] Scaling mpi/mpi_bcast from 1 -> 2
[20:44:04:653] [37] [debug] Starting executor 172.18.0.7_966818849
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Comm_rank 4196616 4194284
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Comm_size 4196616 4194280
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Bcast 4194240 4 4196620 2 4196616
[20:44:04:663] [38] [debug] Thread pool thread 172.18.0.7_966818849:0 starting up
[20:44:04:663] [38] [debug] Added thread id 38 to /sys/fs/cgroup/cpu/faasm/tasks
[20:44:04:663] [38] [debug] Not using network ns, support off
[20:44:04:663] [38] [debug] S - args_sizes_get - 4194296 4194300
[20:44:04:663] [38] [debug] S - sbrk - 0
[20:44:04:663] [38] [debug] S - sbrk - 65536
[20:44:04:663] [38] [debug] S - args_get - 11075632 11075600
[20:44:04:663] [38] [debug] S - MPI_Init (join) 0 0
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Comm_rank 4196616 4194284
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Comm_size 4196616 4194280
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Bcast 4194240 4 4196620 2 4196616
[20:44:04:820] [37] [debug] S - fd_fdstat_get - 1 4194200 (/dev/stdout)
Broadcast succeeded
[20:44:04:821] [37] [debug] MPI-0 S - MPI_Finalize
[20:44:04:821] [37] [debug] Resetting after mpi/mpi_bcast:966818830 (snap key mpi/mpi_bcast_reset)
[20:44:04:823] [38] [debug] S - fd_fdstat_get - 1 4194200 (/dev/stdout)
Broadcast succeeded
[20:44:04:823] [38] [debug] MPI-1 S - MPI_Finalize
[20:44:04:824] [37] [debug] Memory already correct size for snapshot (11075584)
[20:44:04:824] [38] [debug] Resetting after mpi/mpi_bcast:966818840 (snap key mpi/mpi_bcast_reset)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dist_tests is a Catch v2.13.7 host application.
Run with -? for options

-------------------------------------------------------------------------------
Test running an MPI function spanning two hosts
-------------------------------------------------------------------------------
/usr/local/code/faasm/tests/dist/mpi/test_remote_execution.cpp:13
...............................................................................

/usr/local/code/faasm/tests/dist/mpi/test_remote_execution.cpp:44: FAILED:
  REQUIRE( hosts.size() == 2 )
with expansion:
  3 == 2

@csegarragonz (Collaborator, Author) commented Nov 30, 2021

@KubaSz yes, this should be fixed by #181. But wait, you said you patched on top of faasm/faasm#540; that should already contain #181 in a way 🤔

@eigenraven (Collaborator)

@csegarragonz It won't contain it, as that PR is a couple of commits behind on faabric, and the way the faabric replacement works is that it fully replaces the tree with this one. I will merge master and rerun the tests just to make sure; then it'll be ready to merge (I can't merge it myself as I don't have permissions on this repo).

@eigenraven (Collaborator)

This is a moment where a tool like https://github.com/bors-ng/bors-ng is useful, but I don't think faasm is at a scale where it's worth setting up yet; this is just a temporary influx of PRs with all of us working on some changes right now.

@csegarragonz (Collaborator, Author)

@KubaSz yes, this is ready to go; once it's ready, ping me here and I will merge it. Thanks again, this looks great.

@eigenraven (Collaborator)

Huh: "Warning: Docker pull failed with exit code 143, back off 2.786 seconds before retry." when fetching containers for ubsan

@eigenraven (Collaborator)

@csegarragonz I think it's ready to merge: it fails on the same things your PR was failing on in faasm, and only in the dist-tests, so I don't think it needs to be blocked by that.

@csegarragonz csegarragonz merged commit 27ffd58 into master Nov 30, 2021
@csegarragonz csegarragonz deleted the sanitisers branch November 30, 2021 11:00