Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threaded OpenGL API calls #2014

Merged
merged 1 commit into from
Apr 8, 2019
Merged

Conversation

fzurita
Copy link
Contributor

@fzurita fzurita commented Feb 23, 2019

This is a redo of the implementation of #1452. This version has almost no impact on the main code and it just replaces the gl calls with the threaded version since the API is now the same.

This was specifically tailored to work with GLideN64, but it could theoretically work with other projects with minor modification.

Downsides are that not all OpenGL calls are implemented, only the ones the GLideN64 uses. If new OpenGL calls are needed, they will have to be added, but that is no different than before. Adding calls does become a little more difficult though. Also, if threading is disabled, it will behave identically as before.

Most everything is in the src/Graphics/OpenGLContext/ThreadedOpenGl folder, so it's kept out of the way.

And one last thing, I only checked compilation in Windows. I haven't tested this version in Windows, but it should work the same as the older pull request.

@@ -30,7 +30,6 @@ void ColorBufferReaderWithBufferStorage::_initBuffers()
// Initialize Pixel Buffer Objects
for (u32 index = 0; index < m_numPBO; ++index) {
m_bindBuffer->bind(Parameter(GL_PIXEL_PACK_BUFFER), ObjectHandle(m_PBO[index]));
m_fence[index] = 0;
Copy link
Contributor Author

@fzurita fzurita Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer using fences since they can hurt performance and this is not really trying to be that accurate in async mode.

@@ -152,9 +173,7 @@ void CachedTextureUnpackAlignment::setTextureUnpackAlignment(s32 _param)
/*---------------CachedFunctions-------------*/

CachedFunctions::CachedFunctions(const GLInfo & _glinfo)
: m_bindFramebuffer(GET_GL_FUNCTION(glBindFramebuffer))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to get any function pointers directly.

@fzurita
Copy link
Contributor Author

fzurita commented Feb 23, 2019

Performance improvements vary by settings, but with my phone and synced color buffer copies to RDRAM enabled, performance is 30% to 40% faster in the best case scenario. With async, I'm seeing about a 10-15% performance improvement.

@fzurita
Copy link
Contributor Author

fzurita commented Feb 23, 2019

Performance could also be further improved. I'm still newing things up a lot. I also found a thread safe memory pool implementation which helped significantly, but if I used it too much, it actually slowed things down due to synchronization.

I think eventually I will replace that thread safe memory pool with a lock free ring buffer which should further improve performance since allocating in the heap is so expensive.

@gonetz
Copy link
Owner

gonetz commented Feb 24, 2019

I haven't tested this version in Windows

I just tested it. 32bit mupen64plus build crashes on start of any rom.
Compilation of 64bit version gives me tons of warnings like:
src\graphics\openglcontext\threadedopengl\opengl_WrappedFunctions.h(991): warning C4244: 'initializing': conversion from '__int64' to 'unsigned long', possible loss of data (compiling source file ..\..\src\Graphics\OpenGLContext\ThreadedOpenGl\opengl_WrappedFunctions.cpp)
The build crashes too.

@fzurita
Copy link
Contributor Author

fzurita commented Feb 24, 2019

Ok, thanks, I'll investigate. Android works correctly though. I have the feeling that it is probably something wrong with the initialization of GL function pointers.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 5, 2019

I started debugging this in Windows. It doesn't crash for me, but it does hang at startup. I'll try to figure out what that's about.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch 2 times, most recently from 4a1405e to 50d57ff Compare March 7, 2019 06:43
@fzurita
Copy link
Contributor Author

fzurita commented Mar 7, 2019

Ok, it now correctly runs in Windows and there are no warnings in the 64 bit build.

I am encountering a strange issue that I didn't see in the previous version of this. Disabling the speed limiter in Project 64 actually makes it run slower.... Trying to debug.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 7, 2019

Ok, it was running slow because I was doing a debug build instead of release build. In Windows though, threaded mode seems to run slower then not threaded. I have the feeling it's the synchronized memory pool I'm using. I'm going to have to to move a lock free ring buffer memory pool. I couldn't find one around anywhere, so I will have to make my own unfortunately. (Although I do realize that synchronization is required even with a "lock-free" implementation, but a ring buffer big enough should almost never actually pause to lock).

@fzurita
Copy link
Contributor Author

fzurita commented Mar 14, 2019

Even after introducing a lock free ring buffer pool, performance still seems to be about 10-20% worse in Windows when threaded mode is enabled.... I don't quite understand why. It seems to be the opposite in Android.

@gonetz Can you check if it helps N64 depth compare? My GPU doesn't seem to support that very well.

@gonetz
Copy link
Owner

gonetz commented Mar 14, 2019

Sure, I'll check it.

@gonetz
Copy link
Owner

gonetz commented Mar 14, 2019

@fzurita How do you test performance?

@fzurita
Copy link
Contributor Author

fzurita commented Mar 14, 2019 via email

@gonetz
Copy link
Owner

gonetz commented Mar 14, 2019

I usually just compare FPS.

It is not reliable. FPS counter constantly jumps.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 14, 2019

I guess a more reliable way would be total frames rendered over a longer period of time, like total frames over a minute since the start of the game. There is no easy way to get that though.

Also, I know the source of the slow down on windows though. The opengl command queue consumer is slow to respond to the producer. It's fast in Linux. It seems like mutexes and condition variables are not as efficient in Windows (in my use case). I'm going see if I can figure out a workaround.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 14, 2019

Also, for N64 depth compare, which seem really slow according to you in Banjo Tooie, you should be able to perceive a difference if the change in performance is large enough. I think it could help this scenario since it's a GPU limited case.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch 4 times, most recently from 7dfa17c to a12732e Compare March 16, 2019 13:08
@fzurita
Copy link
Contributor Author

fzurita commented Mar 16, 2019

Performance seems to be ok in threaded mode in Windows now, only around 5% performance loss by just looking at the FPS counter with no limiter.

It should help though in cases where GPU drivers have to wait, which doesn't seem to happen in Windows at all it seems.

@loganmc10
Copy link
Contributor

I believe some OpenGL drivers are already threaded, so this kind of thing would probably be mostly duplicated in the driver, which is why in some instances you probably won't see much difference or even slightly worse results

@fzurita
Copy link
Contributor Author

fzurita commented Mar 16, 2019

Yeah, I'm aware of that. Which is probably why I don't see much of a performance improvement in Windows. I was hoping it would help in cases where the GPU driver is stalling, like N64 depth compare.

I'm surprised there is no slow down at all in Windows with an NVidia Geforce 980 GTX with sync color buffer to RDRAM.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 16, 2019

Ok, it would help if I had frame buffer emulation enabled.

I'm seeing a significant performance improvement in windows with sync color buffer to RDRAM enabled. About a 25% improvement.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 16, 2019

With frame buffer emulation enabled, enabling threading is definitely faster in all cases I have seen so far in Windows.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 16, 2019

After finding the bottleneck in Windows and fixing it there, it has vastly improved android performance. It went from a 10-15% performance improvement to about a 40% performance improvement in Android.

@fzurita
Copy link
Contributor Author

fzurita commented Mar 24, 2019

Ok. I have experienced similar things in Android. That's the reason I automatically kill the emulation process when a game is exited.

I can do this because mupen64plus-ae consists of two processes, one for the gallery and another for the emulation.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch from 52f9f4d to 3bfbef9 Compare March 26, 2019 02:58
@fzurita
Copy link
Contributor Author

fzurita commented Mar 26, 2019

Rebasing with master

@fzurita
Copy link
Contributor Author

fzurita commented Mar 26, 2019

I did some testing in the Nexus Player. I'm seeing a 78% performance increase with threaded mode.

Edit: In my devices, it seems to be running faster or close to Glide64mk2 now, it's pretty impressive.

@dankcushions
Copy link
Contributor

I haven't forgotten this :) I'll try and test it on the rpi3 B+ tonight. I still don't have a great way of benchmarking differences in performance (I know many games with slowdowns on the pi3, but it's not always clear what bottleneck is at play)

@fzurita
Copy link
Contributor Author

fzurita commented Mar 26, 2019

@dankcushions you can disable the frame limiter and enable the fps display in GLideN64.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch from 34d8347 to bc03282 Compare March 30, 2019 03:03
@fzurita
Copy link
Contributor Author

fzurita commented Mar 30, 2019

Some additional fixes, added GL_LUMINANCE as a valid color format, aligned the Ring Buffer to 4 bytes, it turns out that some devices don't like OpenGL commands with data that is not 4 byte aligned.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch 3 times, most recently from 5640caf to 994ccc4 Compare April 1, 2019 02:12
@fzurita
Copy link
Contributor Author

fzurita commented Apr 1, 2019

Added fix for #2029.

The fix is to increase the buffered drawer buffer size from 4MB to 8MB.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 1, 2019

There seems to be some kind of issue with GlDrawElementsUnbufferedCommand that seems to manifest only sometimes. I'm trying to debug.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 6, 2019

Figured out the problem... In cases where the OpenGL command queue gets rather big due to the GPU driver falling behind, the device is literally running out of memory. GlDrawElementsUnbufferedCommand just happens to be the command that is called the most. I'm going to have to figure out a way to limit it.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch from 994ccc4 to 3e2a074 Compare April 7, 2019 02:15
@fzurita
Copy link
Contributor Author

fzurita commented Apr 7, 2019

Ok, all known issues should be fixed.

@@ -0,0 +1,675 @@
// ©2013-2016 Cameron Desrochers.
// Distributed under the simplified BSD license (see the license file that
// should have come with this header).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a copy of BSD license file in our repo?

@fzurita
Copy link
Contributor Author

fzurita commented Apr 7, 2019

Ok, I can pull it in.

@fzurita fzurita force-pushed the threaded_GLideN64_v2 branch from 3e2a074 to cc242ec Compare April 7, 2019 11:57
@fzurita
Copy link
Contributor Author

fzurita commented Apr 7, 2019

The license file has been added for that library.

@gonetz
Copy link
Owner

gonetz commented Apr 8, 2019

Thanks! I think it is ready for merge now.
Great work, Francisco!

@gonetz gonetz merged commit 5df3f9d into gonetz:master Apr 8, 2019
@fzurita
Copy link
Contributor Author

fzurita commented Apr 8, 2019

Great, thanks! :-)

@dankcushions
Copy link
Contributor

I'm trying to get some real benchmarking data on the raspberry pi 3 for this one so I can justify turning it on by default for everyone in RetroPie, however i can't seem to get --nospeedlimit working via mupen64plus. we don't use a GUI so it's sent as a command line argument. i am using the dummy audio plugin, VerticalSync is False in my mupen64plus.cfg, but still the games will never exceed 60VI/sec, even in title screens (eg Mario Kart 64), or staring at a wall in goldeneye.

If I switch to gles2n64 video plugin, the speed of MK64 is around double.

Weirdly, if I try the latest beta (ie, on their respective github releases pages) versions of mupen64plus and gliden64 for windows, it runs without a speed limit.

Either this is broken in the latest master, or there's something strange going on with RPI ?

@fzurita
Copy link
Contributor Author

fzurita commented Jun 28, 2019

It could be vertical sync being enabled somehow at the OS level.

@dankcushions
Copy link
Contributor

I built with the verbose flag, hoping that might shed some light:

2019/06/28,15:06:39.564,mupen64plus_DisplayWindow.cpp:53,VERBOSE, "[gles2GlideN64]: _setAttributes"
2019/06/28,15:06:39.614,mupen64plus_DisplayWindow.cpp:97,VERBOSE, "[gles2GlideN64]: Create setting videomode 320x240"
2019/06/28,15:06:39.615,opengl_GLInfo.cpp:27,VERBOSE, "OpenGL ES major version: 2"
2019/06/28,15:06:40.510,opengl_GLInfo.cpp:28,VERBOSE, "OpenGL ES minor version: 0"
2019/06/28,15:06:40.510,opengl_GLInfo.cpp:31,VERBOSE, "OpenGL vendor: Broadcom"
2019/06/28,15:06:40.511,opengl_GLInfo.cpp:50,VERBOSE, "OpenGL renderer: VideoCore IV HW"
2019/06/28,15:06:40.512,opengl_GLInfo.cpp:130,WARNING, "Async color buffer copies are not supported on GLES2"
2019/06/28,15:06:40.512,opengl_GLInfo.cpp:136,WARNING, "LOD emulation not possible on this device"
2019/06/28,15:06:40.523,TextDrawer.cpp:191,VERBOSE, "Generated a 747 x 14 (10 kb) texture atlas"
2019/06/28,15:06:47.936,GBI.cpp:372,VERBOSE, "Load microcode type: 0 crc: 0x6932365f romname: SUPER MARIO 64"

not very useful, although i can confirm at least that

a) the --nospeedlimit is being respected at the mupen64plus, as all of the 'waiting' messages have dissappeared from the mupen64plus consoleoutput

and b) that on my pi vsync is always on - i switched the display mode to 1080 24Hz, and mariokart runs at 24fps. I think the reason i saw speed ups on gles2n64 plugin, even with vsync locked on, is because it had enough headroom to go from 30fps to 60fps, but gliden64 does not.

I'll get my pi fully updated and see if I can do some sort of standalone gl test to see if vsync is functional.

@dankcushions
Copy link
Contributor

so i finally figured out my issue! the problem was that VSync was always on because of this issue: #2070

with that fix, I'm able to exceed vsync with --nospeed limit, so here's some figures for an rpi3B+:

Threaded True:
mario64 title screen: 95-105 VI/S

False:
mario64 title screen: 72-85 VI/S

@fzurita
Copy link
Contributor Author

fzurita commented Jul 12, 2019

Great! Sounds like a good improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants