Inside WebM Technology: The VP8 Alternate Reference Frame
Thursday, May 27, 2010 | 4:45 PM
Labels: inside webm, vp8
Since the WebM project was open-sourced just a week ago, we've seen blog posts and articles about its capabilities. As an open project, we welcome technical scrutiny and contributions that improve the codec. We know from our extensive testing that VP8 can match or exceed other leading codecs, but to get the best results, it helps to understand more about how the codec works. In this first of a series of blog posts, I'll explain some of the fundamental techniques in VP8, along with examples and metrics.
The alternate reference frame is one of the most exciting quality innovations in VP8. Let’s delve into how VP8 uses these frames to improve prediction and thereby overall video quality.
Alternate Reference Frames in VP8
VP8 uses three types of reference frames for inter prediction: the last frame, a "golden" frame (one frame's worth of decompressed data from the arbitrarily distant past) and an alternate reference frame. Overall, this design has a much smaller memory footprint on both encoders and decoders than designs with many more reference frames. In video compression, it is very rare for more than three reference frames to provide a significant quality benefit, while the increase in memory footprint from the extra frames is substantial.
Unlike other types of reference frames used in video compression, which are displayed to the user by the decoder, the VP8 alternate reference frame is decoded normally but is never shown to the user. It is used solely as a reference to improve inter prediction for other coded frames. Because alternate reference frames are not displayed, VP8 encoders can use them to transmit any data that are helpful to compression. For example, a VP8 encoder can construct one alternate reference frame from multiple source frames, or it can create an alternate reference frame using different macroblocks from hundreds of different video frames.
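A minimal sketch of what "decoded normally but never shown" means for a decoder loop. The `CodedFrame` record, its field names and `run_decoder` are hypothetical stand-ins for illustration, not the libvpx API, and "decoding" is reduced to copying a label. The only behavior modeled is that every frame is decoded and may refresh reference buffers, while only frames flagged as shown reach the display.

```python
from dataclasses import dataclass


@dataclass
class CodedFrame:
    payload: str      # stand-in for the compressed data of one frame
    show: bool        # False for an alternate reference frame
    updates: tuple    # names of the reference buffers this frame refreshes


def run_decoder(stream):
    """Decode every frame, refresh references, display only shown frames."""
    refs = {"last": None, "golden": None, "altref": None}
    displayed = []
    for frame in stream:
        picture = frame.payload          # pretend this is the decoded picture
        for name in frame.updates:
            refs[name] = picture
        if frame.show:                   # alternate reference frames skip this
            displayed.append(picture)
    return displayed, refs


# Hypothetical stream: a key frame, a hidden alternate reference, two shown frames.
stream = [
    CodedFrame("key0", True, ("last", "golden", "altref")),
    CodedFrame("arf", False, ("altref",)),
    CodedFrame("p1", True, ("last",)),
    CodedFrame("p2", True, ("last",)),
]
displayed, refs = run_decoder(stream)
```

In this toy stream four frames are decoded but only three are displayed, and the hidden frame stays available in the `altref` slot for later prediction.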
The current VP8 implementation enables two different types of usage for the alternate reference frame: noise-reduced prediction and past/future directional prediction.
Noise-Reduced Prediction
The alternate reference frame is transmitted and decoded like any other frame, so using it adds no extra computation to decoding. The VP8 encoder, however, is free to use more sophisticated processing to create it in off-line encoding. One application of the alternate reference frame is noise-reduced prediction. In this application, the VP8 encoder uses multiple input source frames to construct one reference frame through temporal or spatial noise filtering. This "noise-free" alternate reference frame is then used to improve prediction for encoding subsequent frames.
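The idea of building one reference from several source frames can be sketched as a simple temporal filter. This is an illustrative assumption, not the filter the VP8 encoder actually uses: frames are flat lists of 8-bit luma samples, there is no motion compensation, and the weighting rule (including the constant 8) and the name `temporal_filter` are made up for the example. `strength` must be greater than zero.

```python
def temporal_filter(frames, center, strength):
    """Blend co-located samples from several frames into one denoised frame.

    Samples that differ strongly from the center frame get a low (or zero)
    weight, so moving content is not smeared into the result.
    """
    ref = frames[center]
    out = []
    for i, ref_px in enumerate(ref):
        acc = 0.0
        weight_sum = 0.0
        for frame in frames:
            diff = abs(frame[i] - ref_px)
            # Hypothetical weighting: falls linearly to zero as diff grows.
            weight = max(0.0, 1.0 - diff / (8.0 * strength))
            acc += weight * frame[i]
            weight_sum += weight
        # The center frame always contributes weight 1, so weight_sum > 0.
        out.append(round(acc / weight_sum))
    return out


# Five noisy frames of two samples each. The first sample is static with
# small noise; the second has one outlier, as if an object passed through.
frames = [[100, 50], [102, 50], [98, 50], [101, 200], [99, 50]]
filtered = temporal_filter(frames, center=2, strength=3)
```

The static sample is pulled toward the average of its neighbors in time, while the outlier in the fourth frame is rejected because its weight drops to zero.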
You can make use of this feature by setting ARNR parameters in VP8 encoding, where ARNR stands for "Alternate Reference Noise Reduction." A sample two-pass encoding setting with the parameters:
--arnr-maxframes=5 --arnr-strength=3
enables the encoder to use "5" consecutive input source frames to produce one alternate reference frame using a filtering strength of "3". Here is an example showing the quality benefit of using this experimental "ARNR" feature on the standard test clip "Hall Monitor." (Each line on the graph represents the quality of an encoded stream on a given clip at multiple datarates. Higher points on the Y axis (PSNR) indicate the stream with the better quality.)

The only difference between the two curves in the graph is that VP8_ARNR was produced by encodings with ARNR parameters and VP8_NO_ARNR was not. As we can see from the graph, noise reduced prediction is very helpful to compression quality when encoding noisy sources. We've just started to explore this idea but have already seen strong improvements on noisy input clips similar to this "Hall Monitor." We feel there's a lot more we can do in this area.
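For readers who want to reproduce numbers like these, here is a minimal sketch of a whole-clip ("global") PSNR, next to the mean of per-frame PSNR values it is sometimes confused with. It assumes 8-bit samples and frames given as flat lists; the function names are hypothetical and this is not the measurement tool used for the graph.

```python
import math


def squared_error(ref, test):
    """Sum of squared differences between two equally sized frames."""
    return sum((a - b) ** 2 for a, b in zip(ref, test))


def psnr(sq_err, n_samples, peak=255):
    """PSNR in dB from a total squared error over n_samples samples."""
    if sq_err == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak * n_samples / sq_err)


def global_psnr(ref_frames, test_frames):
    """Treat the whole clip as one big frame: pool the error, then take the log."""
    total_err = sum(squared_error(r, t) for r, t in zip(ref_frames, test_frames))
    total_n = sum(len(r) for r in ref_frames)
    return psnr(total_err, total_n)


def average_psnr(ref_frames, test_frames):
    """Arithmetic mean of the per-frame PSNR values (log domain)."""
    values = [psnr(squared_error(r, t), len(r))
              for r, t in zip(ref_frames, test_frames)]
    return sum(values) / len(values)


# Two tiny frames: the first is off by 1 everywhere, the second by 3.
ref_frames = [[10, 20, 30, 40], [10, 20, 30, 40]]
test_frames = [[11, 21, 31, 41], [13, 23, 33, 43]]
g = global_psnr(ref_frames, test_frames)
a = average_psnr(ref_frames, test_frames)
```

Because the global figure pools squared error before taking the logarithm, a badly coded frame pulls it down more than it pulls down the per-frame mean, so the two numbers generally differ on the same clip.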
Improving Prediction without B Frames
The lack of B frames in VP8 has sparked some discussion about its ability to achieve competitive compression efficiency. VP8 encoders, however, can make intelligent use of the golden reference and the alternate reference frames to compensate for this. The VP8 encoder can choose to transmit an alternate reference frame similar to a "future" frame, and encoding of subsequent frames can make use of information from the past (last frame and golden frame) and from the future (alternate reference frame). Effectively, this helps the encoder to achieve results similar to bidirectional (B frame) prediction without requiring frame reordering in the decoder. In two-pass encoding mode, the VP8 encoder can improve compression by using encoding parameters that enable lagged encoding and automatic placement of alternate reference frames:
--auto-alt-ref=1 --lag-in-frames=16
Used this way, the VP8 encoder can achieve improved prediction and compression efficiency without increasing the decoder’s complexity:

In the video compression community, "Mobile and calendar" is known as a clip that benefits significantly from the use of B frames. The graph above illustrates that the alternate reference frame benefits VP8 significantly on this clip without using B frames.
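To make the past/future idea concrete, here is a minimal sketch of an encoder choosing, for one block, among the last, golden and alternate reference frames. It is a simplification under stated assumptions: blocks are flat lists of samples, only co-located blocks are compared (no motion search), and the cost is a plain sum of absolute differences rather than the rate-distortion decision a real encoder would make. The names `sad` and `pick_reference` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def pick_reference(source_block, candidates):
    """Return the name of the candidate reference block closest to the source."""
    return min(candidates, key=lambda name: sad(source_block, candidates[name]))


# A block of background that was hidden behind a bright object in the past
# frames and is uncovered in the current frame. The alternate reference was
# built from a "future" frame in which the background is already visible.
current = [32, 31, 30, 33]
candidates = {
    "last": [200, 198, 201, 199],    # object still covers the block
    "golden": [205, 202, 200, 204],  # object also covers the block
    "altref": [30, 30, 30, 30],      # background, as seen in the future
}
best = pick_reference(current, candidates)
```

Neither past reference predicts the uncovered background well, so the block is predicted from the alternate reference, which is the kind of case where B frames traditionally help.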
Keep an eye on this blog for more posts about VP8 encoding. You can find more information on the encoding parameters above, along with other detailed instructions for using our VP8 encoders, on our site, or join our discussion list.
Yaowu Xu, Ph.D. is a codec engineer at Google.


28 comments:
Martin said...
What would the codec do, if the input was 100% black for, let's say 10 seconds?
Would it recognize that the source hasn't changed?
May 27, 2010 6:41 PM
Kevin Gadd said...
Most codecs can trivially handle that case, Martin - assuming the input is actually 100% black. Pretty much any input captured from film or a digital sensor is not going to be 100% black unless you process it carefully to make it so. On many televisions it's also the case that 'black' is not encoded as RGB 0,0,0 but instead a value slightly above zero, so that's another reason why you won't necessarily encounter frames that are '100% black'. Luckily, most modern codecs operate by looking at differences between frames, so they don't have to care what color the screen is filled with, just whether it's changing.
May 27, 2010 7:13 PM
Yan said...
I don't understand your last point - using the alternate reference frame to mimic B frames. In AVC the B frames are not necessarily bi-directionally predicted. Instead they can be predicted using, for example, 2 frames both from the past. So the B frames are just a special case of multiple-hypothesis prediction (using 2 hypotheses); and there is no restriction on prediction "direction" at all. So my question is, even if VP8 can form an alternate reference frame using one or more frames from the future, does it allow combination of 2 prediction blocks (as in B frames)? If not, then I don't see how the AR frames are similar to B frames.
May 27, 2010 8:52 PM
Suman said...
AR frames can help achieve results similar to B frames but they should not be confused with B frames. Similar to AVC, where intermediate frames can be predicted bi-directionally or from 2 past frames, VP8 has the ability to predict frames from a combination of last, golden and AR frames. For results similar to bi-directional prediction, the intermediate frame uses a transmitted AR frame similar to a future frame and either of the last/golden frames for the past frame. On the other hand, VP8 can also use the AR frame as one of the past frames and choose between last/golden/AR frames for prediction from 2 past frames. Since the AR frame is never displayed, it can be a combination of any blocks/frames.
May 28, 2010 1:34 AM
spsatendra said...
Since AR frames are not displayed, does it mean the decoder needs to decode more frames to maintain the display frame rate?
May 28, 2010 7:53 AM
kidjan said...
Two comments, mostly on your methods:
1. PSNR? Yuck! Optimizing for that will result in a blurry mess. At the bare minimum you should use an objective measurement method that better correlates with MOS. At a bare minimum, I'd recommend SSIM.
2. Your graphs need some serious work. Is that the average value observed...median value...something else...? The correct way to show this is with side-by-side box plots (which show you median, inter quartile ranges, whiskers and outliers), not a linear line graph. Note that the average is not robust against outliers, so your video could easily have some total garbage scenes in it that are averaged out by more normal values. Average PSNR/SSIM is _wrong_ with video encoding.
May 28, 2010 12:50 PM
kidjan said...
May 28, 2010 12:54 PM
Edchick said...
Quote from Deskwarrior, Which is the same questions I have in mind.
Wondering, is there a plan to release a proper spec? 'vp8_bitstream.pdf' is basically a post-factum description of the reference source code; as a detailed walkthrough of the algo, it's OK, but as a spec, it is disorganized, incomplete, and has errors and omissions. Just to start on section 9: 'decodeframe.c' should not be part of the spec (and is spelled wrong), ordering of fields in uncompressed data chunk is not given, neither is byte order for storing the chunk; byte order for image dims is bizarrely given as an incomplete and unnecessary 'swap2' function, instead of just saying what the byte order is (and C code contains 16-bit reads of unaligned addresses); not specified if image dims are before or after scaling; in 9.3 some items are not given names by which they could be referred to later (and some which are, have typos); What is a 'segment feature' and what are the predefined lengths; this is not clarified in section 10 as claimed... etc etc...
May 29, 2010 2:36 AM
Ali said...
Thanks for the insightful post.
Keep us updated, I've already bookmarked this blog.
May 29, 2010 4:05 PM
Yaowu Xu, PhD said...
@kidjan,
Yes, you are correct that a bit more text could explain things more clearly.
However, the psnr value of each point in the graphs is the "global" psnr value. It is a single number for any given encoded file, no average operation is involved.
May 30, 2010 2:39 AM
kidjan said...
I don't understand. Could you just post the algorithm for how you compute your "global" PSNR? If it's the total error measured with mean squared error, divided by the number of frames, then yes, this is "average error."
And again, I have to seriously question the use of PSNR (read http://www.ece.uwaterloo.ca/~z70wang/publications/SPM09.pdf ). There are documented metrics that correlate better with MOS, so there is no reason to use PSNR. I'd use SSIM.
May 30, 2010 11:11 PM
Gregory said...
Global PSNR is the PSNR you get when you compute PSNR for the whole clip as you would for a single frame. In that sense it's an "average" just as single frame SSIM or PSNR numbers are an "average" over a single frame.
Usually the words average PSNR as opposed to global refers to taking the arithmetic mean of the log domain numbers for each frame.
For this kind of "two techniques in a single format" PSNR is not usually unreasonable, though I might worry about it overstating the usefulness of the de-noised reference.
If you're unhappy with the use of PSNR: Conduct your own test. The software is all available, they gave you the commands. There is no need for the aggressive complaining.
June 1, 2010 11:26 AM
Gregory said...
FWIW, I measured on hall_monitor CIF with and without the arnr at the 250kbps level and saw 0.5027dB PSNR improvement (consistent with the graph here) and 0.7435dB SSIM improvement.
June 1, 2010 12:26 PM
kidjan said...
Gregory, thanks for the PSNR comments. That is functionally equivalent to "average PSNR," which is still deficient for video quality measurement.
As for "aggressive complaining," I believe you're in left field: there is a huge difference between peer review and general critique, and "complaining." You call it the latter, and I think this accusation is completely misplaced.
FWIW, I work in this industry, and I responded to the post here:
http://goldfishforthought.blogspot.com/2010/05/hd-video-standard-cont.html
June 1, 2010 3:26 PM
dave said...
kidjan, as long as you're not comparing two competing codecs, or two different clips, PSNR is a totally valid measure. I notice you actually suggest in your blog that not having an H.264 comparison on the PSNR graph is an error, which is of course incorrect. Comparing across codecs as you suggest would be a misleading use of PSNR. But that is not what is shown here.
Here is a reference for you:
Scope of validity of PSNR in image/video quality assessment
by Q Huynh-Thu - 2008 (you can download the PDF if you go via Google to avoid the paywall)
http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBIQFjAA&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F2220%2F4550681%2F04550695.pdf%3Farnumber%3D4550695&ei=lTMGTJPrHIqI0wSK38ilDA&usg=AFQjCNGL_lsAtdpka50LPy9e4Qoui2K-3w&sig2=QhUSSsPZeXpDmq6-6PgZUg
"Figs. 1 and 2 indicates that, for a specified content, PSNR always monotonically increases with subjective quality as the bit rate increases.
The implication is that, within a specified codec and fixed content, the variation of PSNR is a reliable indicator of the variation of quality. Hence, in the context of codec optimisation, PSNR can therefore be used as a performance metric as it correlates highly with subjective quality when the content is fixed. For example, PSNR can be used for testing different codec optimisation strategies designed to maximise the subjective quality of a specified content (i.e. the content remains the same between the optimisations)."
June 2, 2010 6:44 AM
André said...
@dave
I'm sincerely stupefied with your comment. I'm sorry, but I have no other words for it.
Have you really seriously considered the article you indicated?
The article definitely doesn't prove in any way that PSNR is a "totally valid measure".
It just uses results of simple psnr x rate and subj_quality x rate plots to say that both monotonically increase! And then it remarks that, as PSNR and subj_qual are both increasing along the rate axis, that PSNR is "a reliable indicator"! Oh, come on..
I suggest you read the article kidjan indicated.
PSNR is a very bad objective quality measure to target, and WebM developers shouldn't be seriously using it.
June 2, 2010 2:55 PM
dave said...
@André: I realise that it's cool to bash PSNR because the x264 devs use SSIM for their psy-optimisations, but do you have any actual issue with PSNR as used in this blog post?
Gregory who commented above reran the test and showed that SSIM had also gone up. Was this genuinely a surprise to you, that PSNR and SSIM (and presumably subjective quality) would show the same general result? Did you really expect that it had remained flat or gone down when you knew that PSNR had gone up?
Or are you fully aware that if you add a B-frame like technique on a clip that benefits from B-frames and see a PSNR increase then that shows that the technique is working as intended i.e. the point of the post.
I know PSNR doesn't tell the whole story, for starters it says nothing about code complexity, encoding or decoding time, potential patent problems etc. etc. before even getting into subjective quality. If Google starts optimising for PSNR and ignores all those other factors then there will be trouble. But why don't we wait until they actually do or say something along those lines, then we can get up on our high horses about it?
June 3, 2010 4:56 AM
Jacky_Wing said...
Would you create a website, like firefogg.org, to help people create WebM videos easily? Thanks.
June 3, 2010 11:16 AM
kidjan said...
<< Comparing across codecs as you suggest would be a misleading use of PSNR. But that is not what is shown here. >>
It is _absolutely_ what is being shown here. Sure, both are VP8, but it's with and without specific features, which means it's tweaking encoder settings. There is no functional difference between testing the same encoder with different features, and different codecs entirely. I have no idea why you would make the above claim.
Furthermore, you are dead wrong about PSNR, and I am not some rabid x264 fanboy. SSIM is simply a better metric; numerous published and peer-reviewed articles have established this as a basic fact: it correlates better with mean opinion scores.
I do not believe PSNR is worthless as a metric--it does give you some notion of quality--but to think it's valid as applied to image data is to make the following assumptions, per the Z. Wang article I linked to in my blog:
1) Signal fidelity is independent of temporal or spatial relationships between the samples of the original signal. In other words, if the original and distorted signals are randomly re-ordered in the same way, then the MSE between them will be unchanged.
2) Signal fidelity is independent of any relationship between the original signal and the error signal. For a given error signal, the MSE remains unchanged, regardless of which original signal it is added to.
3) Signal fidelity is independent of the signs of the error signal samples.
4) All signal samples are equally important to signal fidelity.
Simply put, each of these assumptions is wrong. And measuring your quality with PSNR is wrong. And expecting to make improvements in your encoder when you're using a bogus metric is dead wrong.
Lastly, there is zero functional difference between "global PSNR" and "mean PSNR". Global avoids the issue of black frames, but it's still a mean, which means it conveys diddly squat about the overall video quality. And FWIW, mean SSIM would have the same problem.
When people start thinking about how to make their measurements better and display it in a legitimate format--instead of defending the same bad method they've been using for the last thirty years--is when you'll make real progress.
June 4, 2010 12:13 AM
Gregory said...
Kidjan, I see you haven't yet made a comparison of these modes yourself.
If you were half as interested in studying the real behaviour as you are in proving how wrong everyone is and how much smarter you are then perhaps we would have learned something interesting by now.
I posted a global SSIM measurement, where is your contribution to the particular question of the effectiveness of the alt-ref?
If you'd temper your position somewhat you wouldn't have to tell us that you're not a rabid fanboy, we'd be able to tell from your dialog.
Your tone and demeanour are all the less forgiveable given the weakness of your arguments.
"no functional difference"? Come on. This is simply not a reasonable position. We should ask ourselves if the differences of the behaviour is one we would expect to invalidate our measurement criteria. Would you really argue that PSNR could not fairly be used to determine if a more or less coarse quantizer produced higher or lower quality in a single codec?
It's important to understand limits, but to reject everything for imperfection is an even greater form of ignorance... particularly that there exist a perfect solution.
The short test clips in question are basically stationary in behaviour, this avoids one of the problems with global numbers (combining scenes with different characteristics).
Amusingly, you tout the demonstrated performance of frame SSIM as related to objective scores... but these are themselves _global_ numbers, averages over single images. And PSNR was also shown to correlate well, though clearly less well than SSIM.
Typical images have much more diversity, leading to opportunities for differences in the perceptual weight, over space than these clips have over time.
No simple but complete solution to combining the performance from many frames has been demonstrated to be effective as far as I'm aware.
Your proposed "box plot" graphing isn't common in the literature in this field. I could see uses for it, but there are obvious limitations. For example, it would misrepresent the performance if you replaced every other frame with the one before it. On most clips the perceptual harm is minor (especially with 50/60fps input!) but the worst-case single-frame objective measurements will be terrible. A global number would also understate the performance in this case, but nowhere nearly as badly as showing the minimum or bottom quartile.
Of course, if the codec isn't degrading the image in this way then it might not be a hazard, but the same can be said of your scrambled pixel example— which is why PSNR is useful at all.
For any kind of test you can find blind-spots. Making a fuss over a widely used and well understood methodology without even arguing that the particular usage is walking into the blind spots simply isn't productive.
June 4, 2010 4:14 AM
foxyshadis said...
Gregory, x264 doesn't optimize for SSIM. They spot-optimize small functions for PSNR or SSIM but generally only fully optimize visually. They've repeatedly found that optimizing for either PSNR or SSIM results in a lower visual quality when using psychovisual optimizations that metrics will downgrade. On a detailed scene, PSNR will always rate higher flattening fine texture and blurring blocks over retaining detail. Replacing real grain with synthesized grain, or otherwise synching grain and dct noise, will always make your scores go down even further, but will generally look much better when done right. Metrics have their place as pointers, but if you don't understand the weaknesses, you'll simply continue all of the problems On2 had for so many years with their devotion to PSNR above all else. Simply moving to SSIM won't fix that even if it's closer.
This image is one popular example where the blurry image has much higher PSNR:
http://x264.nl/developers/Dark_Shikari/parkrun_psnr.png
http://x264.nl/developers/Dark_Shikari/parkrun_ssim.png
(Yes, that is from the x264 developer, so take that with a grain of salt if you wish.)
June 7, 2010 3:02 PM
foxyshadis said...
I should mention, though, that I actually think this codec has a lot of real promise, and that it does appear that properly used, the alternate frames could make decent substitute frames. It might look like a bit of a hack, since the encoder still needs to reference all previous frames and then spend bits to populate the "frame" with the best candidate blocks, but if that's what it takes to avoid any patents, it seems like a fairly elegant solution once optimized. It's essentially a pre-digested motion search if I understand it right.
June 7, 2010 3:09 PM
kidjan said...
<< If you were half as interested in studying the real behaviour as you are in proving how wrong everyone is and how much smarter you are then perhaps we would have learned something interesting by now. >>
This is not my intention. My intention is to critique the methods used. You seem to have some trouble with the idea of peer review.
<< Would you really argue that PSNR could not fairly be used to determine if a more or less coarse quantizer produced higher or lower quality in a single codec? >>
Yes, because PSNR does not correlate well with mean opinion scores (i.e. actual human interpretation of video quality). Showing that your video has "higher PSNR" as an indication of some feature's contribution to video quality is simply not as meaningful as showing improved SSIM.
<< but to reject everything for imperfection is an even greater form of ignorance... particularly that there exist a perfect solution. >>
I'm doing no such thing, Gregory. The point is not that SSIM is "perfect"; the point is that it's _better_ than PSNR, markedly so, and this has been objectively shown by numerous peer-review articles (see aforementioned paper).
<< Amusingly, you tout the demonstrated performance of frame SSIM as related to objective scores... but these are themselves _global_ numbers, averages over single images. >>
I have no idea what you're talking about. The SSIM scores are not "global" numbers, nor are any of the representations averages.
<< And PSNR was also shown to correlate well, though clearly less well than SSIM. >>
So why would anyone use PSNR if it's been shown to correlate "clearly less well" than SSIM? That's my whole point, Gregory.
<< For example, it would misrepresent the performance if you replaced every other frame with the one before it. >>
No it wouldn't. To the contrary, using averages you would get misrepresented performance. On this point, you're simply wrong. Using a box-plot, you'd still have a good idea of your median frame quality, where the quartiles fall, and the existence of any outliers.
And of course my "box plot" graphing isn't "common in the literature in this field," but that's not a good argument. You're also talking about a field that steadfastly uses an inferior metric despite the existence of objectively better ones.
June 11, 2010 6:57 PM
André said...
@dave: ok, you're probably right about this specific issue, that the visual quality is probably increasing in this case.
But the point was that you argued that PSNR can be freely used for comparison for the same codec, citing a specific article that supposedly would prove it...but the article's results do not present anything to support that conclusion. The article is a shame itself, and can't be considered seriously.
I think it is important to state that we can't optimize using PSNR. It is important to have this discussion.
At least On2ers/Googlers who will post on the blog will know that there is intelligent life on the other side.
But the main point is that we agree that "If Google starts optimising for PSNR (..) there will be trouble".
@Gregory: the problem is more about the methods used, and not about this specific alt-ref case.
Of course PSNR is well understood etc, but if everybody knows SSIM is better (as it correlates better), why don't we use it?
It is a tiny simple step that could give more relevant results to the work.
@foxyshadis:
Just to make it clear, I'm evidently not arguing we should optimize for SSIM either.
PS: "I am not some rabid x264 fanboy" either. I just study codecs at a graduate level. :)
June 11, 2010 7:13 PM
Gregory said...
June 13, 2010 5:59 PM
Mickey said...
June 22, 2010 4:51 AM
Mickey said...
I have the same question from spsatendra above:"since AR frames are not displayed, does it mean decoder need to decode more frames to maintain the display frame rate."
If yes, is there any limitation of the number of AR frames in the specification or the current encoder implementation in the SDK?
I couldn't find it. If it could be theoretically 30 frames for 30fps video, I think that a decoder then has to be capable of decoding 60 fps for 30 fps display and 30 fps AR decoding. Is my understanding correct?
June 22, 2010 6:10 AM
Mickey said...
In my question above, "theoretically 30 frames" means the max. number of altref frames in the worst case. Of course, there might be more altref frames which are not actually referred to in the decoding process. But that case is not realistic.
June 22, 2010 6:13 AM