Possible solutions to optimize cpu load for HLS #3040
Would you kindly provide us with the streams you are using to test resource consumption? It would be good for us to know the characteristics of the streams.
I simply used Bento4 mp42hls to convert tears_h264_high_uhd_30000.mp4 in https://storage.googleapis.com/wvmedia/clear/h264/tears/tears_uhd.mpd to an HLS-TS stream. The m3u8s are as follows: The stream is on our internal server right now and I'm afraid it can't be accessed from outside.
I think these are some valid points by zhanghuicuc on ExoPlayer's HLS capability, and the proposed enhancements to the way ExoPlayer handles HLS look very interesting.

At the root of these HLS cpu peaks there is a more generic choice in how the default ExoPlayer data loading operates. It works in a peak pattern that tries to fetch data as quickly as possible at relatively large intervals (15s), and thus you get these cpu utilization peaks at regular intervals for TS-based HLS (see #2083). As far as I understand, this loading approach is geared towards energy efficiency for battery-operated devices (also taking into account the energy for the network transfer). For a non-battery-powered TV / media box solution like the Mstar6A938, one is probably more concerned with keeping the cpu clock low and having as few frame drops as possible, so one would instead want to spread the loading over time to get better visual performance.

The cpu utilization side effect of the default loader choice grows with the bitrate, as the loader works in the time domain only. So for a 30mbit stream it is by default trying to load and parse 15 seconds of data (56mb) as quickly as possible. Ironically this also means that the higher your connection speed to the server, the more peak utilization the player will put on one of the cpus, which in turn can cause scheduling issues on the system the player runs on.

I think that adapting the loader priority and doing concurrent TS loading would definitely improve the experience for an externally powered device, as both methods effectively counter the default loader's cpu utilization side effects as the bitrate increases. I'm not sure how splitting off the extraction and lowering the priority will affect energy consumption though; maybe the ExoPlayer team can do some tests for this on a Nexus device. I'm also not sure how much the splitting will affect adaptive selection / bandwidth measurement.

Seeing the increasing use of ExoPlayer in AndroidTV-style devices, maybe it is an idea to develop different load controls for non-battery-operated devices, or make the default one adapt to the higher bitrates that are more common for (high quality) TV-oriented delivery.

I do think that the current BUFFER_PACKET_COUNT setting (5) in TsExtractor.java is definitely too small. This seems like a relatively simple improvement; is there any reasoning behind why it is set to 5?
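To make the tuning knob concrete, here is a minimal sketch of how a packet-count constant like this typically translates into read size (an assumption about the shape described in this thread; the exact field names may differ across ExoPlayer versions):

```java
// Illustrative only: how a BUFFER_PACKET_COUNT-style constant drives the size of each read,
// so raising it turns many small 188-byte reads into fewer, larger ones.
final class TsReadSizing {
  static final int TS_PACKET_SIZE = 188;
  static final int BUFFER_PACKET_COUNT = 5; // the value questioned above; ~20 is suggested later in the thread
  static final int BUFFER_SIZE = TS_PACKET_SIZE * BUFFER_PACKET_COUNT; // 940 bytes with the default
}
```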
Agreed there are some valid points of investigation, and thanks a lot for the analyses. A few thoughts/comments:
Thanks!
It would also be a good idea to profile the extractor to see if there are any hotspots that we can optimize. If we're able to make it cheaper, that's going to be a better solution than spreading the cost (e.g. by lowering thread priority) :).
For looking at increasing |
Not sure if I managed to get the message across properly before. The current ExoPlayer implementation requires a single cpu core to parse up to 15 seconds of container data at roughly 15s intervals. It will do this at the "maximum" speed at which the data source can provide it, and it assumes this will not cause a task scheduling conflict with other related system tasks. If there is a clash with another task that happens to be the audio/video OMX decoder (spawned by the driver/OS layer), one will eventually end up with a frame drop. So the issue seems to be about peak cpu load, not cpu load in general.

I would argue that this issue is not just related to the limited cpu processing power of a device, but more about the assumed relation between cpu processing power and the device's network IO throughput. With the current loader approach, the quicker the network IO, the more computation power is required from the device in order to avoid a potential scheduling issue. So if not dealt with, this issue will only worsen for existing devices over time, as connection speeds to servers will probably improve but the device's cpu processing power will not. TS streams are the obvious candidate to hit it first as they are the most expensive to parse, but encrypted containers or streaming over https (heavy cipher suites) for other formats could eventually hit similar behavior as their bitrates increase.
In general, for our project we did not find a more effective approach than tuning the ExoPlayer parameters to our intended use until we reached reasonable performance, which of course goes directly against ExoPlayer's purpose of providing abstraction. We came to a value of 20 for BUFFER_PACKET_COUNT in our ExoPlayer 1.5.x based experiment, together with avoiding the Android 4.x CipherInputStream mess. However, in our experiment we use HLS streams that typically do not go beyond 15mbit (with optional AES envelope encryption) and we have a fairly powerful cpu at hand. We did not need completely drop-free playback at that moment; near fluent was good enough. Drop-free would become more important once we can have something like dynamic HDMI output rate switching. Is there any form of support for this in ExoPlayer v2 in combination with recent Android versions?
Really low hanging fruit optimization for TS extraction. ParsableBitArray is quite expensive. In particular readBits contains at least 2 if blocks and a for loop, and was being called 5 times per 188 byte packet (4 times via readBit). A separate change will follow that optimizes readBit, but for this particular case there's no real value to using a ParsableBitArray anyway; use of ParsableBitArray IMO only really becomes useful when you need to parse a bitstream more than 4 bytes long, or where parsing the bitstream requires some control flow (if/for). There are probably other places where we're using ParsableBitArray over-zealously. I'll roll that into a tracking bug for looking in more detail at all extractors.

Issue: #3040
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161650940
ParsableBitArray.readBit in particular was doing an excessive amount of work. The new implementation is ~20% faster on desktop.

Issue: #3040
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161666420
Apply the same learnings as in ParsableBitArray.

Issue: #3040
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161674119
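As an illustration of the kind of change described in these commits, here is a hedged sketch (not the actual ExoPlayer code) of parsing the 4-byte TS packet header with integer masks instead of several per-bit ParsableBitArray calls; the field layout follows the MPEG-TS header specification:

```java
// Illustrative only: read the 4-byte TS packet header as one int and extract fields with masks.
static void parseTsHeader(byte[] packet) {
  int header = ((packet[0] & 0xFF) << 24) | ((packet[1] & 0xFF) << 16)
      | ((packet[2] & 0xFF) << 8) | (packet[3] & 0xFF);
  boolean transportErrorIndicator = (header & 0x800000) != 0;
  boolean payloadUnitStartIndicator = (header & 0x400000) != 0;
  int pid = (header & 0x1FFF00) >> 8;
  boolean adaptationFieldExists = (header & 0x20) != 0;
  boolean payloadExists = (header & 0x10) != 0;
  int continuityCounter = header & 0xF;
  // ... use the extracted fields ...
}
```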
OK, I'll do some tests and let you know once I've got any result. |
Hi,
It seems that ojw28's commits are really effective, and lowering the Loader's thread priority can make some improvement. Please let me know if you need more information; I'm glad to do more tests if you want. Thanks.
Thanks; this is excellent data! One additional data point I'd be interested in is what happens in case (3) if you drop the thread priority to |
Nice to see some good progress on this; these already look like promising improvements. @zhanghuicuc, are you using a different 4k video than the uploaded one, or are you using a 100mbit link? The BUFFER_PACKET_COUNT result when using 100 is more or less what we experienced on our armv7 Android 4.3 based platform. Unfortunately we do not have an Mstar based system to test on; if one could be provided we would be willing to help with some additional analysis for this issue ;). Though our platform is very different from the Mstar6A938, it would still be interesting to see if the value of 20 for BUFFER_PACKET_COUNT we arrived at also works better on the Mstar platform. For us, 20 was the point after which things did not improve anymore, and further out (above 40) things started to perform worse again.
@dbedev I'm using a different 4k video with a bitrate of 15Mbps. Actually, it was just transcoded from the uploaded one using ffmpeg and converted to an HLS stream using Bento4. The network link is about 10MBps. I've just done some tests with the Loader thread priority of THREAD_PRIORITY_LESS_FAVORABLE. The video (duration = 1 hour) was played twice. We saw 225 dropped frames per hour. It seems that THREAD_PRIORITY_LESS_FAVORABLE is not low enough. I'll do some tests with THREAD_PRIORITY_BACKGROUND and see what happens. And I'll also do some tests with different BUFFER_PACKET_COUNT values.
We currently read at most 5 packets at a time from the extractor input. Whether this is inefficient depends on how efficiently the underlying DataSource handles lots of small reads. It seems likely, however, that DataSource implementations will in general more efficiently handle fewer larger reads, and in the case of this extractor it's trivial to do this.

Notes:
- The change appears to make little difference in my testing with DefaultHttpDataSource, although analysis in #3040 suggests that it does help.
- This change shouldn't have any negative implications (i.e. at worst it should be neutral wrt performance). In particular it should not make buffering any more likely, because the underlying DataSource should return fewer bytes than are being requested in the case that it cannot fully satisfy the requested amount.

Issue: #3040
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=162206761
Hi, I'll also do some tests with different BUFFER_PACKET_COUNT. Thanks
Hi |
@zhanghuicuc, PRIORITY_DOWNLOAD is used in a component that is under development, and will be released soon. Have you profiled the provided patch? I wouldn't expect a great difference there, considering the sparsity of PSI data in comparison with 4k PES data.
@AquilesCanta |
Agreed. I briefly considered this also, but profiling suggested it wasn't really worth it.
I've noticed that in |
It's arbitrary. I'm not convinced allowing frames to be "more late" is necessarily better than dropping them though, from a visual point of view. Does the result actually look smoother when you increase the value, or does it just decrease the dropped frame count and look just as janky? In any case, it's clear that "minimizing dropped frames" isn't in isolation the goal to be aiming for. It's trivial to get to 0 simply by never dropping and showing frames arbitrarily late instead, but that doesn't make it a good idea :). |
Just my 2 cents: a -30..11ms window for frame dropping is rather tight for 24/25fps content, as every drop causes a 41.6/40 ms jump in time. Not sure though whether increasing this would have a noticeable negative effect on things like A/V sync.

Besides that, another thing we have observed with this control loop is that some MediaCodec implementations use quite a lot of cpu resources (we use byte buffer output instead of direct surface coupling). This sometimes causes the codec to fall behind to a point where our renderer (which uses the same control loop as the default renderer) decides to simply drop everything, as there is no option to drop individual frames at the decoder level. It would probably be better to have a minimum service level that only drops X frames consecutively, so the user still sees something instead of just reporting drop counters. Not sure though if this case ever occurs when decoding to a surface directly.

Another thing to consider might be to move away from the deprecated getOutputBuffers() and use the getOutputBuffer(index) approach on API 21+ devices, as this should allow the codec to run in a more optimized way according to the documentation. Is there an explicit reason that ExoPlayer still uses this "deprecated" method on new platforms?
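For readers following along, a hedged sketch of the kind of lateness check being discussed (the method names here are illustrative placeholders, not the exact ExoPlayer renderer code; the -30000/11000 µs thresholds are the window referred to above):

```java
// Illustrative only: decide what to do with a decoded frame based on how early/late it is.
static void processOutputFrame(long bufferPresentationTimeUs, long positionUs) {
  long earlyUs = bufferPresentationTimeUs - positionUs; // negative means the frame is late
  if (earlyUs < -30_000) {
    // More than ~30 ms late: considered unrecoverable and dropped,
    // which for 24/25 fps content means a visible 41.6/40 ms jump.
    dropFrame();   // hypothetical helper: releases the output buffer without rendering
  } else if (earlyUs < 11_000) {
    // Within ~11 ms of its presentation time: render it now.
    renderFrame(); // hypothetical helper: releases the output buffer to the surface
  }
  // Otherwise keep the frame and re-evaluate on the next pass of the render loop.
}
```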
@zhanghuicuc - Could you clarify the measured frame drops when, all other things kept equal and using the latest |
(a) what it is now: 204 dropped frames per hour

For cases (a), (c) and (d), the same video was played 3 times. For case (b), the same video was played 2 times. So there may be some deviation here.
There's little evidence that this is still a problem, given modern devices and a continued shift toward more efficient container formats (i.e., FMP4). Closing as obsolete.
Hi,
As mentioned in #1227 and #1412, the Loader:HLS thread consumes a lot of CPU resources. We have used ARM Streamline to analyze the CPU activity and got the result as follows:
We can see that cores 1 and 2 can be more than 60% busy at times. On some low performance cpus like the Mstar6A938, this can lead to a large number of dropped frames, especially when playing 4K high bitrate streams. The same issue happens on both 1.5.x and 2.x.
Possible solutions to optimize cpu load for HLS may be: a) lower the thread priority of Loader:HLS, and b) load and extract TS packets concurrently.
a)
changeThreadPrio.diff
In this way, OMX may have more resources to do decode-related work. The number of dropped frames can be reduced by 60% according to our 3-hour long-running test results, and we have not found any side effects so far.
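For reference, a minimal sketch of what lowering the loading thread's priority could look like (this is an assumption about the general shape of the attached patch, not its actual contents; the thread name and the place where the priority is set are illustrative):

```java
import android.os.Process;

// Illustrative only: lower the priority of the thread that performs loading/extraction so
// the platform's decoder threads win scheduling contention during download bursts.
final class LowPriorityLoaderThread extends Thread {
  LowPriorityLoaderThread() {
    super("Loader:HLS");
  }

  @Override
  public void run() {
    // THREAD_PRIORITY_BACKGROUND gives this thread less CPU share than the default priority.
    Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND);
    // ... perform the load() / extraction work here ...
  }
}
```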
b)
loadAndparseConcurrently.diff
Loading several TS packets at a time can reduce the number of IO operations. If we do so, we also need to load and extract TS packets concurrently to prevent the player from getting stuck in the buffering state. The cpu activity now is as follows:
(loading 50 packets each time)
(loading 100 packets each time)
We can see that the cpu load has been reduced a lot and the number of dropped frames is also reduced.
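As a rough illustration of the load-and-extract-concurrently idea in (b) (a minimal sketch, not the attached loadAndparseConcurrently.diff; readPackets and parseTsPackets are hypothetical stand-ins for the real DataSource read and TsExtractor parse calls):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative producer/consumer split: one thread reads batches of TS packets from the
// network while another parses them, so network I/O and extraction no longer compete on
// a single core during download bursts.
final class ConcurrentTsPipeline {
  private final BlockingQueue<byte[]> chunks = new LinkedBlockingQueue<>(4);

  void start() {
    Thread loader = new Thread(() -> {
      try {
        byte[] chunk;
        while ((chunk = readPackets(50)) != null) { // e.g. 50 packets (188 bytes each) per read
          chunks.put(chunk);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "ts-loader");

    Thread extractor = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          parseTsPackets(chunks.take());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "ts-extractor");

    loader.start();
    extractor.start();
  }

  // Hypothetical helpers standing in for the real network read and TS parsing.
  private byte[] readPackets(int packetCount) { return null; }
  private void parseTsPackets(byte[] chunk) {}
}
```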
Maybe switching to fragmented mp4 or DASH is the best solution, but there are still a lot of HLS-TS streams we need to deal with. I'd appreciate any thoughts on these possible solutions, and I'm happy to prepare a PR for any potential fix.
Thanks!