This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Graphene doesn't run under Valgrind #1919

Closed
pwmarcz opened this issue Oct 26, 2020 · 24 comments

Comments

@pwmarcz
Contributor

pwmarcz commented Oct 26, 2020

Steps to reproduce

I was trying to profile the code using Callgrind.

  1. Install Valgrind
  2. Compile Graphene with DEBUG=1
  3. Go to LibOS/shim/test/regression, make
  4. Run the bootstrap example using valgrind --tool=callgrind:
     valgrind --tool=callgrind -- .../graphene/Runtime/pal-Linux \
         .../graphene/Pal/src/host/Linux/libpal.so init ./bootstrap
    
    (you need to refer to the pal-Linux binary directly, not through pal_loader)

Tested on current master (fbb2fca) and Ubuntu 18 (Valgrind 3.13.0)

Expected results

The program runs and exits with 0 status.

Actual results

The code fails initialization:

[P4003:T1:bootstrap] loading file:../../../../Runtime/ld-linux-x86-64.so.2: failed to find an address for shared object
[P4003:T1:bootstrap] shim_init() in init_loader (-22)

Investigation

Digging a bit deeper, it seems that _DkGetAvailableUserAddressRange returns a really small range of addresses (0x10000 to 0x108000), not enough to load ld-linux.so into that range.

This is because the code uses the text section address (TEXT_START) as the end of the range. When running Graphene directly, this is a high pointer (somewhere below 0x7fff_ffff_ffff). However, Valgrind always loads the text section at the fixed address 0x108000.

There is a patch attached, but as far as I can tell, it doesn't necessarily help: it does not care whether the address is high enough, it only tries to find an area that is free.

Possible fix

Can the user address range (for the Linux host) be determined in some other way? I don't know enough about either Graphene or Linux internals to understand whether there is any harm in mapping memory above the text section.

Can we detect Valgrind's behaviour and do something different? For what it's worth, the following patch makes the above example (bootstrap) work with Valgrind:

--- a/Pal/src/host/Linux/db_main.c
+++ b/Pal/src/host/Linux/db_main.c
@@ -105,6 +105,12 @@ void _DkGetAvailableUserAddressRange(PAL_PTR* start, PAL_PTR* end) {
     void* end_addr = (void*)ALLOC_ALIGN_DOWN_PTR(TEXT_START);
     void* start_addr = (void*)USER_ADDRESS_LOWEST;
 
+    if (TEXT_START == (void*)0x108000) {
+        printf("Running under Valgrind, adjusting end_addr\n");
+        end_addr = (void*)ALLOC_ALIGN_DOWN_PTR((void*)0x7fffffffffff);
+    }
+
+    printf("%p %p\n", start_addr, end_addr);
     assert(IS_ALLOC_ALIGNED_PTR(start_addr) && IS_ALLOC_ALIGNED_PTR(end_addr));
 
     while (1) {
@dimakuv

dimakuv commented Oct 27, 2020

Some explanations.

Graphene has two parts: LibOS and PAL. (Below I'm talking about Linux PAL; Linux-SGX PAL is quite different.)

PAL starts first (it serves as a bootloader) -- in the case of the Linux (non-SGX) PAL, this is the pal-Linux (== ../Pal/src/host/Linux/libpal.so) executable. The TEXT_START macro is a wrapper around __text_start, a global variable containing the address at which the code segment of pal-Linux is placed by the Linux/Valgrind loader.

Then PAL starts LibOS. LibOS has its own memory allocator, so LibOS must ask PAL: "give me a huge contiguous memory region that I can use for my memory allocations". This region should be vast -- hundreds of GBs -- because LibOS must serve huge applications like giant databases.

Now this function _DkGetAvailableUserAddressRange() is used to give the LibOS the memory region as USER_ADDRESS_LOWEST .. TEXT_START == 0x10000 .. pal-Linux.code_segment_start_addr.

So here we have a bad design assumption: LibOS assumes that the Linux-PAL executable is loaded at a high address. In other words, the design assumes that the Linux loader (or, in this case, the Valgrind loader) maps the main executable (here, pal-Linux) to some very high address, which is clearly not the case with Valgrind.

@dimakuv

dimakuv commented Oct 27, 2020

@pwmarcz Your proposed patch will break Graphene.

What your patch forces Linux PAL to do: "Hey, LibOS, use 0x10000 .. 0x7fffffffffff as you wish, it is free memory".

Whereas in reality, the Linux PAL executable itself occupies the range 0x108000 .. 0x121000 (the upper limit is approximate, based on my quick look at the size via ll -h Pal/src/host/Linux/libpal.so). 1

So LibOS may overwrite the range 0x108000 .. 0x121000 (because it thinks it is free), and then the PAL layer contains garbage. As soon as the application on top of LibOS wants to do some syscall or smth and it goes down to PAL code, we will see segfaults.

1 Actually, the upper limit is even more because it doesn't count the current 64MB of static arrays (used for initial memory management).

@pwmarcz
Contributor Author

pwmarcz commented Oct 27, 2020 via email

@dimakuv

dimakuv commented Oct 27, 2020

I see two solutions.

  1. Introduce holes in the LibOS memory range. _DkGetAvailableUserAddressRange() can return an array of contiguous memory ranges instead of the current single range. This is painful though: it requires some re-implementation at both the LibOS and PAL layers. (IIRC, we even had holes before, but removed them since it was too cumbersome and ugly.)

  2. Hack around this: find a huge, still-available memory region (see the rest of _DkGetAvailableUserAddressRange() with its while loop -- it's ugly legacy code, but it checks whether the start address of this range is free, and if it is not, the start address is adjusted upwards). I think simply trying mmap(*start, *end - *start, MAP_FIXED) and doing some kind of binary search if it fails will do.

@dimakuv

dimakuv commented Oct 27, 2020

Summoning @mkow @boryspoplawski @yamahata .

@pwmarcz
Contributor Author

pwmarcz commented Oct 27, 2020

And how do you feel about parsing /proc/self/maps to find the largest free area?

@dimakuv

dimakuv commented Oct 27, 2020

We can also parse /proc/self/maps, but this sounds like a harder thing to do (we'll have to use raw syscalls to open/read/etc.). Whatever you feel is easier to implement.

@mkow
Member

mkow commented Oct 27, 2020

Small correction: it should be MAP_FIXED_NOREPLACE, not MAP_FIXED.

@boryspoplawski
Contributor

boryspoplawski commented Oct 27, 2020

IMO the ideal solution (it seems doable, and that was the plan IIRC) would be to cede all memory management to LibOS. PAL would then mark some regions as used by it (early in init; something like the complement of @dimakuv's option 1) and could request more memory if needed via callbacks to LibOS.
There are also issues like with SGX, where we want to compact the Graphene internals (both PAL and LibOS) to leave as much contiguous memory for the user app as possible.

@yamahata
Contributor

The ideal solution would be something like option 1 with a twist: PAL also reports the regions used by PAL itself.
It would be something similar to the e820 or EFI memory map. In our case it would be much simpler: only two memory types, free and reserved-for-PAL.

In early LibOS initialization, LibOS should somehow take over physical memory management.

@pwmarcz
Contributor Author

pwmarcz commented Oct 28, 2020

Okay, looks like it's not that simple...

Even if the user address range is changed to some static area (I used 0x20_0000_0000 .. 0x500_0000_0000), loading a program such as Python crashes. This is because the Python executable is loaded at 0x400000, which actually overlaps libpal.so (specifically the 64MB of memory in g_mem_pool).

As far as I can tell, this is because the Python executable is not PIE, which would suggest there's no good way of using Valgrind+Graphene with such executables except patching Valgrind. Am I right?

@dimakuv

dimakuv commented Oct 28, 2020

This is because the Python executable is loaded at 0x400000, which actually overlaps libpal_so (specifically the 64M of memory in g_mem_pool).

Heh, that's sad. This means that without fixing this "Valgrind puts main executable at address 0x108000" issue, we cannot support any non-PIE executable bigger than 0x400000 - 0x108000 = ~3MB.

@dimakuv

dimakuv commented Oct 28, 2020

But then how can Valgrind relocate a non-PIE python executable to 0x108000 and not break Python? Because Python code would assume that it was put at 0x400000, and all offsets are wrong. Shouldn't Python-under-Valgrind just fail?

@pwmarcz
Contributor Author

pwmarcz commented Oct 28, 2020

I think the difference is that libpal.so is PIE: Valgrind selects 0x108000 as the address for PIE executables only, and loads non-PIE ones at their specified address.

Heh, that's sad. This means that without fixing this "Valgrind puts main executable at address 0x108000" issue, we cannot support any non-PIE executable bigger than 0x400000 - 0x108000 = ~3MB.

The other way around: the main executable is libpal.so, so it would need to be no larger than 3MB...

@mkow
Member

mkow commented Oct 28, 2020

I guess 0x108000 is used as a default base only for PIE binaries; non-PIEs have to be loaded at their selected address.

[edit] seems I raced with Paweł with the reply :)

@pwmarcz
Contributor Author

pwmarcz commented Oct 29, 2020

I added a comment to the bug above, since I guess we would be happy with a command-line override in Valgrind.

Also, would it make sense to create a non-PIE executable for Graphene, to ensure it's loaded high, just for this purpose? Is that even possible?

@dimakuv

dimakuv commented Oct 29, 2020

Also, would it make sense create a non-PIE executable for Graphene to ensure it's loaded high, just for this purpose? Is that even possible?

Can you create a non-PIE executable with a hard-coded address that is not 0x400000? If yes, then this would solve our problem with Valgrind. In the worst case, we'll have something like this:

  1. Graphene (pal-Linux executable) is built as non-PIE with some high address like 0x7fffffff00000000.
  2. Valgrind has no option but to load Graphene at 0x7fffffff00000000.
  3. Graphene then proceeds to load the application (non-PIE Python) at 0x400000.

@pwmarcz
Contributor Author

pwmarcz commented Nov 4, 2020

After overriding the hardcoded address in Valgrind to a high enough pointer, some simple workloads work (I tried Python). But when I tried testing something more complicated (specifically Tensorflow), I ran into another bug: Running signal handler with alternate stack allocated on current stack crashes callgrind.

The approach with non-PIE Graphene binary sounds promising, I'll try it out sometime.

@dimakuv

dimakuv commented Nov 4, 2020

I ran into another bug: Running signal handler with alternate stack allocated on current stack crashes callgrind.

Signal handling and alternate stacks are not correctly implemented in Graphene. This is a well-known problem that was almost fixed by this PR: #1218. Parts of that PR were merged as other PRs, but the alternate stack thingy is still biting us.

But it warms my heart that Graphene is not the only project that keeps postponing the fix of this issue :)

@mkow
Member

mkow commented Feb 8, 2021

@pwmarcz Could you check if this issue is still present after merging #2090?

@pwmarcz
Contributor Author

pwmarcz commented Feb 11, 2021

Unfortunately, yes.

  • The memory issues are still there: Valgrind maps PAL at a low address, which prevents Graphene from finding a big enough chunk of memory to load the application. For testing, I added the following hack to Valgrind:
diff --git a/coregrind/m_ume/elf.c b/coregrind/m_ume/elf.c
index 21eb52bcb..07c961be6 100644
--- a/coregrind/m_ume/elf.c
+++ b/coregrind/m_ume/elf.c
@@ -586,7 +586,7 @@ Int VG_(load_ELF)(Int fd, const HChar* name, /*MOD*/ExeInfo* info)
          ebase = 0x100000;
 #     else
       vg_assert(VKI_PAGE_SIZE >= 4096); /* stay sane */
-      ESZ(Addr) hacky_load_address = 0x100000 + 8 * VKI_PAGE_SIZE;
+      ESZ(Addr) hacky_load_address = 0x60000000000;
       if (ebase < hacky_load_address)
          ebase = hacky_load_address;
 #     endif
  • A Tensorflow workload that I mentioned earlier still crashes Valgrind in the same way (both before and after [LibOS] Rework signal handling #2090, and also with the newest Valgrind git version):
Callgrind: threads.c:249 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.

host stacktrace:
==101369==    at 0x5802C90A: show_sched_status_wrk (m_libcassert.c:388)
==101369==    by 0x5802CA27: report_and_quit (m_libcassert.c:459)
==101369==    by 0x5802CBB9: vgPlain_assert_fail (m_libcassert.c:525)
==101369==    by 0x580299AF: vgCallgrind_post_signal (threads.c:249)
==101369==    by 0x58094777: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (syswrap-amd64-linux.c:238)
==101369==    by 0x58083CC1: vgPlain_client_syscall (syswrap-main.c:1863)
==101369==    by 0x58080602: handle_syscall (scheduler.c:1209)
==101369==    by 0x5808227E: vgPlain_scheduler (scheduler.c:1531)
==101369==    by 0x580EE0C0: run_a_thread_NORETURN (syswrap-linux.c:103)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 15 (lwpid 101369)
==101369==    at 0x6000001055D: pipe_read (db_pipes.c:293)
client stack range: ??????? client SP: 0x5FFFFFB89C0
valgrind stack range: [0x100287E000 0x100297DFFF] top usage: 13608 of 1048576

Thread 2: status = VgTs_WaitSys syscall 271 (lwpid 101370)
==101369==    at 0x6000000F8A9: _DkStreamsWaitEvents (db_object.c:115)
==101369==    by 0x60000003EAF: DkStreamsWaitEvents (db_object.c:85)
==101369==    by 0x4847115: shim_ipc_helper (shim_ipc_helper.c:752)
client stack range: ??????? client SP: 0x5FFFFD42D90
valgrind stack range: [0x1004C01000 0x1004D00FFF] top usage: 5368 of 1048576

Thread 18: status = VgTs_WaitSys syscall 271 (lwpid 101394)
==101369==    at 0x6000000F8A9: _DkStreamsWaitEvents (db_object.c:115)
==101369==    by 0x60000003EAF: DkStreamsWaitEvents (db_object.c:85)
==101369==    by 0x480C60C: shim_async_helper (shim_async.c:285)
==101369==    by 0x60000016B7D: pal_thread_init (db_threading.c:118)
==101369==    by 0x60000018665: clone (clone-x86_64.S:91)
client stack range: [0x6000022B000 0x60000453FFF] client SP: 0x60000453810
valgrind stack range: [0x101EE7B000 0x101EF7AFFF] top usage: 2856 of 1048576
  • I have an interesting result for one of the regression tests (sighandler_divbyzero). It triggers an assertion in shim_context-x86_64.c (before [LibOS] Rework signal handling #2090, the test got stuck in an infinite loop when running under Valgrind):
$ ./pal_loader ./sighandler_divbyzero
Got signal 8
Hello, world
Got 1 SIGFPE signal(s)
TEST OK

$ ~/valgrind/vg-in/place --tool=callgrind ~/graphene/Runtime/pal-Linux ~/graphene/Runtime/libpal-Linux.so init ./sighandler_divbyzero
...
assert failed shim_context-x86_64.c:113 IS_ALIGNED_PTR(xstate, SHIM_XSTATE_ALIGN)

@mkow
Member

mkow commented Sep 15, 2021

@pwmarcz: Is this still relevant? I remember that some recent changes might have fixed Valgrind. Also, will we actually want Valgrind support once we have UBSan+ASan?

@pwmarcz
Contributor Author

pwmarcz commented Sep 15, 2021

I think the issue with the load address is still there?

Anyway, yes, I don't think this is a promising avenue. Initially I wanted to use Callgrind for profiling, but perf works pretty well. Our UBSan and ASan integration will probably not cover everything Valgrind has to offer (we have no tracebacks, no stack instrumentation, etc.), but it seems less invasive and should be easier to extend than getting Valgrind to work.

@mkow
Member

mkow commented Sep 15, 2021

Thanks for the details! I'll close this issue then.
