Threaded OpenGL API calls #1452
The performance penalty comes from the unbuffered drawer, probably because I can't cache the vertices any more. I think I know how to fix this: I will use the old version of the unbuffered drawer in single-threaded mode, and either the current version or the buffered drawer (if the hardware allows) in threaded mode. Edit: OK, done. I made a "thread safe" version of the UnbufferedDrawer. The plugin runs about 25% slower with that version. It will be used in threaded mode whenever the hardware doesn't support the Buffer Storage extension. I can't think of a clean way to avoid that cost on devices without the extension; hopefully the threading makes up for it on those devices. Edit2: |
942cf85
to
9eca4bf
Compare
Ok, I enabled threaded mode and of course it didn't work right off the bat. I'm getting
Anyone have any ideas? |
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMapBufferRange.xhtml There are many cases for INVALID_OPERATION. I'd check "if zero is bound to target" and "is already in a mapped state" first. |
I found at least that problem. There were instances of the old unwrapped gl calls in the code still left. |
OK, good news, it works! For some reason the FPS display is not working, so I can't judge performance, and the thread is not exiting when the core exits. Also, I expect performance to be identical at first, since many of the operations we are doing stall the pipeline. Until I fix those, I don't expect any improvement. Edit: I think a good solution for now would be to block the main thread whenever there is more than one swap buffers call in the queue, until the video plugin catches up. I think that will work pretty well. |
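The throttling idea from the edit above could look roughly like the sketch below. It is only an illustration: `GlCommandQueue`, `pushSwap`, `runOne` and `maxQueuedSwaps` are hypothetical names, not the classes in this PR, and the limit of one queued swap is an assumption about the intended threshold.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical command queue with back-pressure on swap commands.
class GlCommandQueue {
public:
    explicit GlCommandQueue(std::size_t maxQueuedSwaps = 1)
        : m_maxQueuedSwaps(maxQueuedSwaps) {
        assert(maxQueuedSwaps > 0);
    }

    // Producer (emulation thread): queue an ordinary GL command.
    void push(std::function<void()> cmd) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_commands.push_back(Entry{std::move(cmd), false});
    }

    // Producer: queue a swap. Blocks while the limit of unexecuted swaps
    // is already reached, so the core cannot run far ahead of the GL thread.
    void pushSwap(std::function<void()> swap) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_swapDone.wait(lock, [this] { return m_queuedSwaps < m_maxQueuedSwaps; });
        ++m_queuedSwaps;
        m_commands.push_back(Entry{std::move(swap), true});
    }

    // Consumer (GL thread): run one command; returns false if the queue is empty.
    bool runOne() {
        Entry entry{};
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_commands.empty()) return false;
            entry = std::move(m_commands.front());
            m_commands.pop_front();
        }
        entry.cmd();  // executed outside the lock so producers are not stalled
        if (entry.isSwap) {
            std::lock_guard<std::mutex> lock(m_mutex);
            assert(m_queuedSwaps > 0);
            --m_queuedSwaps;  // a swap counts as queued until it has executed
            m_swapDone.notify_all();
        }
        return true;
    }

    std::size_t queuedSwaps() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_queuedSwaps;
    }

private:
    struct Entry {
        std::function<void()> cmd;
        bool isSwap;
    };

    std::mutex m_mutex;
    std::condition_variable m_swapDone;
    std::deque<Entry> m_commands;
    std::size_t m_queuedSwaps = 0;
    const std::size_t m_maxQueuedSwaps;
};
```

A real GL thread would presumably sleep on a second condition variable instead of polling `runOne()`; that part is left out to keep the sketch short.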
OK, after performing several optimizations on the threading, I'm seeing about a 25% improvement in FPS in CBFD. I suspect the percentage improvement in FPS will correspond to how much time the emulation spends in the core. |
Is this on desktop or mobile, and a 25% improvement on what baseline? (Unthreaded, or unoptimized threaded?) |
This is on mobile, unthreaded vs. threaded. Threaded has 25% better performance in that game. |
Let me know when you feel it's ready for testing (and how to turn it on and off). I suspect it might be driver dependent. I think Nvidia has threaded optimizations built into the driver, so this might just add extra overhead on top of that, but testing will reveal that. |
Isn't AMD notoriously poor at threading? This might help reduce CPU overheads on AMD. |
Right, I don't think it's ready yet. There are a few bugs I need to work out. For example, in Mario 64 the core runs so fast when fast-forwarding that the OpenGL command queue keeps growing indefinitely. I need to figure out a way to drop non-essential commands and skip frames. |
This will only help improve performance by allowing the GPU to keep doing work while the video plugin is not executing. So if the emulator spends 80% of the time in the video plugin, we can get an additional 20% performance. Also, this will not help at all if we have to do read backs from video memory to CPU memory in sync mode. For example, if we need to read the color buffer in sync. If we read the color buffer in async mode though, this will help. |
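As a rough check of the 80% figure, here is a small sketch of the best case, assuming the core and the video plugin overlap perfectly and nothing forces a sync. The function names are hypothetical and the model is an idealisation, not a measurement.

```cpp
#include <algorithm>
#include <cassert>

// Best-case threaded frame time, as a fraction of the single-threaded frame
// time, when core and video work overlap completely (an idealised assumption).
// videoFraction: share of the single-threaded frame spent in the video plugin.
double idealThreadedFrameTime(double videoFraction) {
    assert(videoFraction >= 0.0 && videoFraction <= 1.0);
    const double coreFraction = 1.0 - videoFraction;
    // The longer of the two parts bounds the frame once they run in parallel.
    return std::max(videoFraction, coreFraction);
}

// Relative FPS gain implied by that frame time (0.25 means 25% more FPS).
double idealFpsGain(double videoFraction) {
    return 1.0 / idealThreadedFrameTime(videoFraction) - 1.0;
}
```

With `videoFraction = 0.8` the frame time drops to 0.8 of the original, i.e. 20% less time per frame, which is the same thing as 25% more FPS under this model. Any synchronous read back shrinks the overlap and with it the gain.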
1d51917
to
c55b2e2
Compare
Isn't this essentially the problem with running the OpenGL calls in their own thread? Won't this de-sync the video from the rest of the emulator? I'm still having a hard time understanding how this would improve performance without causing other issues with the emulation |
@@ -60,7 +62,7 @@ void ContextImpl::init() | |||
} | |||
|
|||
{ | |||
if ((m_glInfo.isGLESX && (m_glInfo.bufferStorage && m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32)) || !m_glInfo.isGLESX) | |||
if ((m_glInfo.isGLESX && m_glInfo.bufferStorage) || !m_glInfo.isGLESX) |
Why was m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32
removed?
BufferedDrawer depends on glDrawElementsBaseVertex, which is only available in GLES 3.2+
Yeah, good point. I should put that back.
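For reference, the restored condition could be factored into a helper along these lines. `GlInfoSketch` and `canUseBufferedDrawer` are hypothetical names standing in for the `m_glInfo` fields used in the diff. Note one assumption: the sketch uses `>= 32` to match "GLES 3.2+" as stated above, whereas the removed line as quoted uses `> 32`, which would exclude 3.2 itself.

```cpp
// Hypothetical stand-in for the m_glInfo fields used in the condition.
struct GlInfoSketch {
    bool isGLESX;
    bool bufferStorage;
    int majorVersion;
    int minorVersion;
};

// Desktop GL always qualifies (the !isGLESX branch in the diff). On GLES,
// require buffer storage plus version 3.2 or later, since
// glDrawElementsBaseVertex is core in OpenGL ES starting with 3.2.
constexpr bool canUseBufferedDrawer(const GlInfoSketch& info) {
    return !info.isGLESX ||
           (info.bufferStorage &&
            info.majorVersion * 10 + info.minorVersion >= 32);
}
```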
} | ||
FunctionWrapper::glLineWidth(_width); | ||
FunctionWrapper::glDrawArrays(GL_LINES, 0, 2); | ||
} |
Needs a newline here
@@ -26,4 +26,4 @@ namespace opengl { | |||
std::array<const void*, MaxAttribIndex> m_attribsData; | |||
}; | |||
|
|||
} | |||
} |
Needs a newline here
namespace opengl { | ||
|
||
std::array<std::shared_ptr<std::vector<char>>, MaxAttribIndex> GlVertexAttribPointerUnbufferedCommand::m_attribsData; | ||
} |
Needs a newline here
::CoreVideo_GL_SwapBuffers(); | ||
} | ||
}; | ||
} |
Needs a newline here
{ | ||
executeCommand(std::make_shared<GlTextureSubImage2DUnbufferedCommand<pixelType>>(texture, level, xoffset, yoffset, width, height, format, type, std::move(pixels))); | ||
} | ||
} |
Newline
By the way, this should allow us to implement frame skipping based on command queue size. If there is more than one "swap buffers" command in the queue, we should be able to drop some commands to catch up. I just need to figure out which commands are safe to drop. |
c55b2e2
to
a0b8720
Compare
OK, I'm making a lot of headway and have fixed numerous bugs. Also, before swapping buffers, we now check whether a swap buffers command is already queued ahead of us. If there is one, we wait until it has executed before continuing. This prevents the GL command queue from falling too far behind the emulation core. In the long run, though, I think we want to skip frames. Here are a few more things I need to do to call this complete:
I may add to this as I see problems. |
@loganmc10 Can you give this a test? You can enable it by setting "ThreadedVideo" to true in mupen64plus. Edit: Also, keep in mind that this only works with mupen64plus, and I have yet to update the CMake file. |
Here goes! Now, for whatever reason I've never been able to get Mario 64 idling outside castle inside front door idling at start of bob-omb battlefield Goldeneye 64 during bond's gun barrel walk. dam intro pan idling at start of dam I'd say there's no regressions and it seems like a slight improvement, but I feel the benchmarking I am able to do is not definitive :( I wonder if @psyke83 or @gizmo98 can think of a better way? |
I'm glad to hear that at least it's not a performance penalty. I think implementing the object pool helped with that. I think this code really helps when Async color to RDRAM is enabled which the RPI doesn't have. |
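For readers unfamiliar with the object pool mentioned above, a minimal sketch of the general technique follows. It is not the pool from this PR: `ObjectPool`, `release`, `allocated` and `DrawCommandSketch` are hypothetical names, and the design (a mutex-guarded free list of `std::unique_ptr`) is only one of several ways to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical command type used only to exercise the pool.
struct DrawCommandSketch {
    int first = 0;
    int count = 0;
};

// Hypothetical pool: recycles objects instead of allocating one per GL call.
template <typename T>
class ObjectPool {
public:
    // Hand out a recycled object if one is free, otherwise allocate a new one.
    std::unique_ptr<T> get() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_free.empty()) {
            ++m_allocated;
            return std::unique_ptr<T>(new T());
        }
        std::unique_ptr<T> obj = std::move(m_free.back());
        m_free.pop_back();
        return obj;
    }

    // Return an object to the pool once the GL thread has executed it.
    void release(std::unique_ptr<T> obj) {
        assert(obj);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_free.push_back(std::move(obj));
    }

    // Total number of objects ever allocated by this pool.
    std::size_t allocated() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_allocated;
    }

private:
    std::mutex m_mutex;
    std::vector<std::unique_ptr<T>> m_free;
    std::size_t m_allocated = 0;
};
```

The point of the pattern is that a steady stream of per-call command objects stops hitting the allocator once the pool has warmed up; recycled objects keep their old field values, so the caller has to reinitialise them after `get()`.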
Configuration can only be set using mupen64plus for now
4986397
to
bc518a1
Compare
Rebased against latest master. Tested a little and the performance boost still seems to be there, especially with Async Color buffer to RDRAM enabled. Around a 10% performance improvement on my device, but it will depend on how fast color buffer copies are. |
d28eb0c
to
c59c65b
Compare
With WGL function creation must be done in the same thread as the context.
The incorrect assumption was made that the plugin would start clean every time.
So that threading mode can be changed on the fly
GL commands must now be created using a "get" method
c59c65b
to
618d3c7
Compare
Merged to fzurita-threaded_GLideN64 branch. |
This is the ongoing work to get a threaded implementation of the OpenGL API going.
This is not threaded yet, but it's the refactoring I have had to do so far to support that.
Currently this does seem to have a performance regression; I have to figure out where it's coming from.