Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pointer position is not valid without pointer screen, so validate pointer screen before obtaining pointer position on screen. #1894

Conversation

MarkMielke
Copy link
Contributor

@MarkMielke MarkMielke commented Jan 11, 2025

This is also important to prevent miPointerGetPosition() from being called when the pointer is disabled, as this could result in a segmentation fault.

In Xfce, it is really easy to disable the mouse and a few users have done it. Without this fix, VNC would fail immediately with a segmentation fault, and this would persist after VNC restart. The traceback:

(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/Xvnc (OsLookupColor+0x13d) [0x55e5397b59cd]
(EE) 1: /lib64/libpthread.so.0 (funlockfile+0x50) [0x7f30306a2d70]
(EE) 2: /usr/bin/Xvnc (miPointerGetPosition+0x47) [0x55e5397a19e7]
(EE) 3: /usr/bin/Xvnc (vncGetPointerPos+0x36) [0x55e5397d2486]
(EE) 4: /usr/bin/Xvnc (_ZN14XserverDesktop12blockHandlerEPi+0x101) [0x55e5397d0561]
(EE) 5: /usr/bin/Xvnc (vncCallBlockHandlers+0x31) [0x55e5397c3a61]
(EE) 6: /usr/bin/Xvnc (BlockHandler+0x40) [0x55e539765ac0]
(EE) 7: /usr/bin/Xvnc (WaitForSomething+0xd9) [0x55e5397afd09]
(EE) 8: /usr/bin/Xvnc (Dispatch+0xab) [0x55e539760e4b]
(EE) 9: /usr/bin/Xvnc (dix_main+0x376) [0x55e539765136]
(EE) 10: /lib64/libc.so.6 (__libc_start_main+0xe5) [0x7f302e4fb8a5]
(EE) 11: /usr/bin/Xvnc (_start+0x2e) [0x55e539678f2e]
(EE) 
(EE) Segmentation fault at address 0x2c
(EE) 
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE) 

The issue is that miPointerGetPosition will try to dereference NULL if the pointer is disabled.

I have adjusted the code to be more correct, and no longer fail. I didn't see a way to specifically check if the mouse was disabled, but I didn't need to as miPointerGetScreen validates that both the pointer is not null before dereferencing the pointer to the pointer screen, returning NULL if either of these are NULL. It is also more correct to only fetch the position on the screen if there is a valid screen set. This change is just more correct, and doesn't fail with segmentation fault.

However, the user symptoms go from VNC failing with segmentation fault, to the user being unable to use the mouse. This is exactly what they asked for, and it should be more self-evident as they will see that their mouse is now listed as not enabled in the UI, but it's difficult to re-enable mouse without an enabled mouse. I do not know why the users think this is a good idea - but at least with this fix it will behave exactly as they configured it to.

The prior code happened to fail in miPointerGetPosition first, but it is also worth noting that were that prior code to "work", by picking arbitrary position information like (0, 0) in the case that the pointer is disabled, the TigerVNC code following that dereferences the screen to get the screen (x, y) would have also failed with segmentation fault, due to attempt to dereference NULL. This change addresses this later issue as well.

pointer screen before obtaining pointer position on screen.

This is also important to prevent miPointerGetPosition() from
being called when the pointer is disabled, as this could result
in a segmentation fault.
@CendioOssman
Copy link
Member

Thanks for the report!

This does smell a bit like a bug in Xorg, though. I can see some calls internally in Xorg that don't do this check. So I wonder if we're squashing the right bug.

What are the steps to reproduce this?

@MarkMielke
Copy link
Contributor Author

MarkMielke commented Jan 12, 2025

There may be something recent in Xorg that has raised this (i.e. problem introduced in Xorg), but I am doubtful since Xorg development has reduced substantially in the last few years, and this problems occurs on both the 1.20 and 21.1 releases.

The underlying change is correct in any case, since miPointerGetScreen() is defined to return NULL in some cases:

/**
 * @return The current screen of the given device or NULL.
 */
ScreenPtr
miPointerGetScreen(DeviceIntPtr pDev)
{
    miPointerPtr pPointer = MIPOINTER(pDev);

    return (pPointer) ? pPointer->pScreen : NULL;
}

The line which can return NULL dates back to 2008-05-22.

This prior code in TigerVNC does not take NULL into account:

               /* Pointer coordinates are screen relative */
               ptrScreen = miPointerGetScreen(vncPointerDev);
               cursorPosX += ptrScreen->x;
               cursorPosY += ptrScreen->y;

There is no validation that ptrScreen != NULL, so the prior code is not correct.

In the 5 cases we observed, we found the backtrace to fail on miPointerGetPosition(), which has the same problem, just it happens earlier:

void
miPointerGetPosition(DeviceIntPtr pDev, int *x, int *y)
{
    *x = MIPOINTER(pDev)->x;
    *y = MIPOINTER(pDev)->y;
}

This unconditionally dereferences the device to a pointer, and then dereferences this to the x and y member fields. This method does not return any indication of success or failure. So, the only ways this method can exit are:

  1. Signal, like segmentation fault, due to dereference of NULL.
  2. Silently not updating x or y, which could cause the caller a problem.
  3. Updating x and y to some arbitrary value such as (0, 0), which could cause the caller a problem.

The implementation clearly chooses the segmentation fault path, which I do not think is necessarily a wrong choice. The API is limited. If the API were altered to return a boolean as to whether or not there is a pointer position available, this could be an alternative option, but the code is not written this way, and it is unlikely to change.

My first conclusion for miPointerGetPosition(), was that I must find a way to validate that MIPOINTER(pDev) != NULL, and since MIPOINTER(pDev) is an internal macro, not available to the caller, I found another method that would have the same effect, and I found that miPointerGetScreen() works correctly, validating MIPOINTER(pDev) != NULL or returning NULL. By moving the ptrScreen != NULL check prior to calling miPointerGetPosition and also prior to calling cursorPosX += ptrScreen->x; this solves both problems. It also aligns with the comment:

               /* Pointer coordinates are screen relative */

It is logical to validate that the screen is not NULL, before obtaining coordinates that are screen relative.

Questions about if Xorg API is "good" or "correct" are academically interesting - but this change seems to be clearly correct in any case, and also makes it work with the current Xorg API.

For reproducing, it is as simple as:

  1. Use Xfce. I am sure you can do this with GNOME or KDE, but all of the users who reported this, as well as myself, only use Xfce.
  2. Open "Settings" ... "Mouse and Touchpad"
  3. Under "Devices" (first tab), and "Device: TigerVNC pointer" toggle the button from ON to OFF
  4. The session immediately fails, and this change is persistent so later sessions also fail every time it reaches this code.

After the fix, the behaviour is:

  1. The session remains up (Time in control panel, etc.), and continues to take keyboard update (if you can get a terminal in focus, you can interact with it).

I think disabling the TigerVNC pointer is silly - but at least with the problem fixed the user can clearly see what mistake they made and may be able to use keyboard accelerators to navigate and re-enable the pointer, or they can raise the right question to the Linux support team to get help, rather than the previous behaviour of it failing and them trying to guess what happened, and the Linux support team trying to diagnose why TigerVNC is failing with segmentation fault.

I have considered if Xfce can be configured to not allow the TigerVNC pointer to be disabled - but have not found a way yet, and in any case - this does not justify the TigerVNC code being wrong. This fix should still be applied, it would just avoid users encountering the new 4. behaviour.

@MarkMielke
Copy link
Contributor Author

MarkMielke commented Jan 12, 2025

... also worth pointing out that my proposed change does not call any new API, and does not change how frequently the API methods are called. It solely changes the order, and adds one additional NULL check. It is the absolute minimum change to make that is still correct, and additionally it reads nicely. This minimizes risk of any new breakage. Before the change, was an edge case that was not handled. After the change, the edge case is properly handled.

@CendioOssman
Copy link
Member

There may be something recent in Xorg that has raised this (i.e. problem introduced in Xorg), but I am doubtful since Xorg development has reduced substantially in the last few years, and this problems occurs on both the 1.20 and 21.1 releases.

Are you sure you've seen the issue with 1.20? I'm seeing the problem with 21.1.15, but not with 21.1.8. Which would suggest that this is a recent change.

In the earlier version, we never get a NULL in those places.

Still, I agree we could have more robustness here.

I'm concerned that this is just a band-aid, though. The larger issue seems to be that we make assumptions about the active screen. I think more changes might be needed. Let me have a closer look.

@CendioOssman
Copy link
Member

The crash seems to be caused by the device becoming "floating". Which seems to happen implicitly when you disable a device on newer Xorg, but not older.

Still, devices can become floating for various reasons. So we should be prepared for that either way.

It's probably best we explicitly check for that, rather than guess via the associated screen.

I also checked the multi screen behaviour, and it seems to be fine as we are somewhat prepared for coordinates outside the current screen.

@CendioOssman
Copy link
Member

I suggest just this diff instead:

diff --git a/unix/xserver/hw/vnc/vncInput.c b/unix/xserver/hw/vnc/vncInput.c
index a705a85a6..7569c3187 100644
--- a/unix/xserver/hw/vnc/vncInput.c
+++ b/unix/xserver/hw/vnc/vncInput.c
@@ -167,7 +167,7 @@ void vncPointerMove(int x, int y)
 
 void vncGetPointerPos(int *x, int *y)
 {
-       if (vncPointerDev != NULL) {
+       if ((vncPointerDev != NULL) && !IsFloating(vncPointerDev)) {
                ScreenPtr ptrScreen;
 
                miPointerGetPosition(vncPointerDev, &cursorPosX, &cursorPosY);

@CendioOssman
Copy link
Member

After digging a bit more, it does seem to be some kind of bug in the Xorg code, as the behaviour isn't really consistent:

https://gitlab.freedesktop.org/xorg/xserver/-/issues/1782

We still need a workaround on our end, though, but it isn't as apparent what the best one is. I think IsFloating() should work just fine, but we could also consider looking at the enabled or on attributes.

@CendioOssman CendioOssman added the bug Something isn't working label Jan 13, 2025
@CendioOssman
Copy link
Member

I've committed ba73536 as a fix for this now. We'll see if upstream provide any more insight later.

Thanks for reporting this and investigating!

@MarkMielke
Copy link
Contributor Author

If de-referencing ptrScreen, and miPointerGetScreen() is documented to sometimes return NULL, then I think it is important to check for NULL. It could fail for many reasons current and future - not just the new one discovered.

If the one new case understood can be detected correctly with vncPointerDev->public.on, then it allows for a better logical decision to be made earlier, which is your change - so this makes sense to me to call it out explicitly as a known condition that we should be aware of.

But, I think it still leaves the possibility of other unknown factors causing a segmentation fault. :-)

As the code base is pretty stable / end of life at this point, it may not matter which approach - or if both approaches are taken. Maybe there will never be another case added where miPointerGetScreen() returns NULL, and either approach yields the correct result in all cases that could occur for real, so it doesn't matter which one is taken.

I will try to test it out with next iteration. Right now I have deployed the fix I posted and it is a bit of effort to set it up again. If your method works, I will probably drop my patch to keep it as close to upstream as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants