-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threaded OpenGL API calls #2014
Conversation
@@ -30,7 +30,6 @@ void ColorBufferReaderWithBufferStorage::_initBuffers() | |||
// Initialize Pixel Buffer Objects | |||
for (u32 index = 0; index < m_numPBO; ++index) { | |||
m_bindBuffer->bind(Parameter(GL_PIXEL_PACK_BUFFER), ObjectHandle(m_PBO[index])); | |||
m_fence[index] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer using fences since they can hurt performance and this is not really trying to be that accurate in async mode.
@@ -152,9 +173,7 @@ void CachedTextureUnpackAlignment::setTextureUnpackAlignment(s32 _param) | |||
/*---------------CachedFunctions-------------*/ | |||
|
|||
CachedFunctions::CachedFunctions(const GLInfo & _glinfo) | |||
: m_bindFramebuffer(GET_GL_FUNCTION(glBindFramebuffer)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to get any function pointers directly.
Performance improvements vary by settings, but with my phone and synced color buffer copies to RDRAM enabled, performance is 30% to 40% faster in the best case scenario. With async, I'm seeing about a 10-15% performance improvement. |
Performance could also be further improved. I'm still newing things up a lot. I also found a thread safe memory pool implementation which helped significantly, but if I used it too much, it actually slowed things down due to synchronization. I think eventually I will replace that thread safe memory pool with a lock free ring buffer which should further improve performance since allocating in the heap is so expensive. |
I just tested it. 32bit mupen64plus build crashes on start of any rom. |
Ok, thanks, I'll investigate. Android works correctly though. I have the feeling that it is probably something wrong with the initialization of GL function pointers. |
I started debugging this in Windows. It doesn't crash for me, but it does hang at startup. I'll try to figure out what that's about. |
4a1405e
to
50d57ff
Compare
Ok, it now correctly runs in Windows and there are no warnings in the 64 bit build. I am encountering a strange issue that I didn't see in the previous version of this. Disabling the speed limiter in Project 64 actually makes it run slower.... Trying to debug. |
Ok, it was running slow because I was doing a debug build instead of release build. In Windows though, threaded mode seems to run slower then not threaded. I have the feeling it's the synchronized memory pool I'm using. I'm going to have to to move a lock free ring buffer memory pool. I couldn't find one around anywhere, so I will have to make my own unfortunately. (Although I do realize that synchronization is required even with a "lock-free" implementation, but a ring buffer big enough should almost never actually pause to lock). |
50d57ff
to
7079108
Compare
Even after introducing a lock free ring buffer pool, performance still seems to be about 10-20% worse in Windows when threaded mode is enabled.... I don't quite understand why. It seems to be the opposite in Android. @gonetz Can you check if it helps N64 depth compare? My GPU doesn't seem to support that very well. |
Sure, I'll check it. |
@fzurita How do you test performance? |
Turn threaded mode on and off. It can be done using the GUI or the GLideN64.ini. I usually just compare FPS.
|
It is not reliable. FPS counter constantly jumps. |
I guess a more reliable way would be total frames rendered over a longer period of time, like total frames over a minute since the start of the game. There is no easy way to get that though. Also, I know the source of the slow down on windows though. The opengl command queue consumer is slow to respond to the producer. It's fast in Linux. It seems like mutexes and condition variables are not as efficient in Windows (in my use case). I'm going see if I can figure out a workaround. |
Also, for N64 depth compare, which seem really slow according to you in Banjo Tooie, you should be able to perceive a difference if the change in performance is large enough. I think it could help this scenario since it's a GPU limited case. |
7dfa17c
to
a12732e
Compare
Performance seems to be ok in threaded mode in Windows now, only around 5% performance loss by just looking at the FPS counter with no limiter. It should help though in cases where GPU drivers have to wait, which doesn't seem to happen in Windows at all it seems. |
I believe some OpenGL drivers are already threaded, so this kind of thing would probably be mostly duplicated in the driver, which is why in some instances you probably won't see much difference or even slightly worse results |
Yeah, I'm aware of that. Which is probably why I don't see much of a performance improvement in Windows. I was hoping it would help in cases where the GPU driver is stalling, like N64 depth compare. I'm surprised there is no slow down at all in Windows with an NVidia Geforce 980 GTX with sync color buffer to RDRAM. |
Ok, it would help if I had frame buffer emulation enabled. I'm seeing a significant performance improvement in windows with sync color buffer to RDRAM enabled. About a 25% improvement. |
With frame buffer emulation enabled, enabling threading is definitely faster in all cases I have seen so far in Windows. |
After finding the bottleneck in Windows and fixing it there, it has vastly improved android performance. It went from a 10-15% performance improvement to about a 40% performance improvement in Android. |
Ok. I have experienced similar things in Android. That's the reason I automatically kill the emulation process when a game is exited. I can do this because mupen64plus-ae consists of two processes, one for the gallery and another for the emulation. |
52f9f4d
to
3bfbef9
Compare
Rebasing with master |
I did some testing in the Nexus Player. I'm seeing a 78% performance increase with threaded mode. Edit: In my devices, it seems to be running faster or close to Glide64mk2 now, it's pretty impressive. |
I haven't forgotten this :) I'll try and test it on the rpi3 B+ tonight. I still don't have a great way of benchmarking differences in performance (I know many games with slowdowns on the pi3, but it's not always clear what bottleneck is at play) |
@dankcushions you can disable the frame limiter and enable the fps display in GLideN64. |
34d8347
to
bc03282
Compare
Some additional fixes, added GL_LUMINANCE as a valid color format, aligned the Ring Buffer to 4 bytes, it turns out that some devices don't like OpenGL commands with data that is not 4 byte aligned. |
5640caf
to
994ccc4
Compare
Added fix for #2029. The fix is to increase the buffered drawer buffer size from 4MB to 8MB. |
There seems to be some kind of issue with |
Figured out the problem... In cases where the OpenGL command queue gets rather big due to the GPU driver falling behind, the device is literally running out of memory. |
994ccc4
to
3e2a074
Compare
Ok, all known issues should be fixed. |
@@ -0,0 +1,675 @@ | |||
// ©2013-2016 Cameron Desrochers. | |||
// Distributed under the simplified BSD license (see the license file that | |||
// should have come with this header). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a copy of BSD license file in our repo?
Ok, I can pull it in. |
3e2a074
to
cc242ec
Compare
The license file has been added for that library. |
Thanks! I think it is ready for merge now. |
Great, thanks! :-) |
I'm trying to get some real benchmarking data on the raspberry pi 3 for this one so I can justify turning it on by default for everyone in RetroPie, however i can't seem to get --nospeedlimit working via mupen64plus. we don't use a GUI so it's sent as a command line argument. i am using the dummy audio plugin, VerticalSync is False in my mupen64plus.cfg, but still the games will never exceed 60VI/sec, even in title screens (eg Mario Kart 64), or staring at a wall in goldeneye. If I switch to gles2n64 video plugin, the speed of MK64 is around double. Weirdly, if I try the latest beta (ie, on their respective github releases pages) versions of mupen64plus and gliden64 for windows, it runs without a speed limit. Either this is broken in the latest master, or there's something strange going on with RPI ? |
It could be vertical sync being enabled somehow at the OS level. |
I built with the verbose flag, hoping that might shed some light:
not very useful, although i can confirm at least that a) the --nospeedlimit is being respected at the mupen64plus, as all of the 'waiting' messages have dissappeared from the mupen64plus consoleoutput and b) that on my pi vsync is always on - i switched the display mode to 1080 24Hz, and mariokart runs at 24fps. I think the reason i saw speed ups on gles2n64 plugin, even with vsync locked on, is because it had enough headroom to go from 30fps to 60fps, but gliden64 does not. I'll get my pi fully updated and see if I can do some sort of standalone gl test to see if vsync is functional. |
so i finally figured out my issue! the problem was that VSync was always on because of this issue: #2070 with that fix, I'm able to exceed vsync with --nospeed limit, so here's some figures for an rpi3B+: Threaded True: False: |
Great! Sounds like a good improvement. |
This is a redo of the implementation of #1452. This version has almost no impact on the main code and it just replaces the gl calls with the threaded version since the API is now the same.
This was specifically tailored to work with GLideN64, but it could theoretically work with other projects with minor modification.
Downsides are that not all OpenGL calls are implemented, only the ones the GLideN64 uses. If new OpenGL calls are needed, they will have to be added, but that is no different than before. Adding calls does become a little more difficult though. Also, if threading is disabled, it will behave identically as before.
Most everything is in the src/Graphics/OpenGLContext/ThreadedOpenGl folder, so it's kept out of the way.
And one last thing, I only checked compilation in Windows. I haven't tested this version in Windows, but it should work the same as the older pull request.