Tuesday, June 15, 2010

VP8 Codec Optimization Update

Since WebM launched in May, the team has been working hard to make the VP8 video codec faster. Our community members have contributed improvements, but there's more work to be done in some interesting areas related to performance (more on those below).


Encoder

The VP8 encoder is ripe for speed optimizations. Scott LaVarnway's work on an x86 assembly version of the quantizer will help significantly here, as the quantizer is called many times while the encoder decides how much detail from the image will be transmitted.
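
To give a rough idea of the kind of inner loop involved, here is a minimal sketch of a uniform scalar quantizer over one 4x4 block of coefficients. It is an illustration only, not the libvpx routine: the function name, the single `step` parameter, and the round-to-nearest rule are all assumptions made for the example, and the real quantizer is more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_COEFFS 16 /* one 4x4 block of transform coefficients */

/*
 * Hypothetical uniform scalar quantizer (not the libvpx code).
 * Divides each coefficient by `step`, rounding to nearest, and returns
 * the index one past the last nonzero output (the "end of block").
 */
int quantize_block(const int16_t *coeff, int16_t *qcoeff, int step)
{
    int eob = 0;
    int i;

    assert(step > 0);
    for (i = 0; i < BLOCK_COEFFS; i++) {
        int mag = abs(coeff[i]);
        int q = (mag + step / 2) / step; /* round to nearest */

        qcoeff[i] = (int16_t)(coeff[i] < 0 ? -q : q);
        if (q != 0)
            eob = i + 1;
    }
    return eob;
}
```

A loop like this runs once per block for every candidate the encoder evaluates, which is why even small per-coefficient savings add up.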

For those of you eager to get involved, one piece of low-hanging fruit is writing a SIMD version of the ARNR temporal filtering code. Also, much of the assembly code uses only the SSE2 instruction set, and there are surely newer extensions that could be put to use. There is also redundant code to remove and other general cleanup to do (Yaowu Xu has submitted some changes for these).

At a higher level, someone can explore some alternative motion search strategies in the encoder. Eventually the motion search can be decoupled entirely to allow motion fields to be calculated elsewhere (for example, on a graphics processor).


Decoder

Decoder optimizations can bring higher resolutions and smoother playback to less powerful hardware.

Jeff Muizelaar has submitted some changes which combine the IDCT and summation with the predicted block into a single function, helping us avoid storing the intermediate result, thus reducing memory transfers and avoiding cache pollution. This changes the assembly code in a fundamental way, so we will need to sync the other platforms up or switch them to a generic C implementation and accept the performance regression. Johann Koenig is working on implementing this change for ARM processors, and we'll merge these changes into the mainline soon.
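
The idea of fusing the inverse transform with the prediction add can be sketched for the simplest case, a block where only the DC coefficient is nonzero. This is a simplified illustration and not Jeff's patch: the function names, the shared `stride` for both buffers, and the `(dc + 4) >> 3` rounding are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clamp_pixel(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Two-pass form: the residual is stored to a buffer, then read back. */
void dc_idct_then_add(int16_t dc, const uint8_t *pred, uint8_t *dst, int stride)
{
    int16_t residual[16];
    /* Assumes >> on a negative int is an arithmetic shift. */
    int a = (dc + 4) >> 3;
    int r, c, i;

    assert(stride >= 4);
    for (i = 0; i < 16; i++)
        residual[i] = (int16_t)a;              /* pass 1: store residual */
    for (r = 0; r < 4; r++)                    /* pass 2: load and add */
        for (c = 0; c < 4; c++)
            dst[r * stride + c] =
                clamp_pixel(pred[r * stride + c] + residual[r * 4 + c]);
}

/* Fused form: same result, with no intermediate residual buffer. */
void dc_idct_add_fused(int16_t dc, const uint8_t *pred, uint8_t *dst, int stride)
{
    int a = (dc + 4) >> 3;
    int r, c;

    assert(stride >= 4);
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            dst[r * stride + c] = clamp_pixel(pred[r * stride + c] + a);
}
```

Both functions produce identical pixels; the fused one simply skips the store and reload of the residual, which is where the savings in memory traffic come from.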

In addition, Tim Terriberry is working on a different method of bounds checking in the "bool decoder." The bool decoder is performance-critical, as it is called several times for each bit in the input stream. The current code handles this check with a simple clamp in the innermost loops and a less-frequent copy into a circular buffer. This can be expensive at higher data rates. Tim's patch removes the circular buffer, but uses a more complex clamp in the innermost loops. These inner loops have historically been troublesome on embedded platforms.
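
To show where such a check sits, here is a minimal sketch of a binary range decoder of the general kind VP8 uses, with a bounds check on every byte fetch. It is neither the current libvpx code nor Tim's patch: the struct layout, the function names, and the choice to feed zero bytes once the input runs out are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf; /* compressed input */
    size_t size;        /* number of valid bytes in buf */
    size_t pos;         /* index of the next byte to fetch */
    uint32_t value;     /* current two-byte window of the stream */
    uint32_t range;     /* kept in 128..255 after normalization */
    int bit_count;      /* shifts since the last byte was loaded */
} bool_decoder;

/* Bounds-checked fetch: past the end of the input, supply zero bytes. */
static uint32_t next_byte(bool_decoder *d)
{
    return d->pos < d->size ? d->buf[d->pos++] : 0;
}

void bool_decoder_init(bool_decoder *d, const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    d->buf = buf;
    d->size = size;
    d->pos = 0;
    d->value = next_byte(d) << 8;
    d->value |= next_byte(d);
    d->range = 255;
    d->bit_count = 0;
}

/* Decode one bool; prob/256 is the probability that the result is 0. */
int bool_decoder_read(bool_decoder *d, int prob)
{
    uint32_t split, bigsplit;
    int bit;

    assert(prob >= 0 && prob <= 255);
    split = 1 + (((d->range - 1) * (uint32_t)prob) >> 8);
    bigsplit = split << 8;
    if (d->value >= bigsplit) {
        bit = 1;
        d->range -= split;
        d->value -= bigsplit;
    } else {
        bit = 0;
        d->range = split;
    }
    while (d->range < 128) { /* renormalize, refilling a byte every 8 shifts */
        d->value <<= 1;
        d->range <<= 1;
        if (++d->bit_count == 8) {
            d->bit_count = 0;
            d->value |= next_byte(d);
        }
    }
    return bit;
}
```

In this sketch the check costs one compare per byte fetched. Any scheme that changes how or where that test happens shows up directly in the renormalization loop, which is the hot path.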

To contribute to these efforts, I've started working on rewriting higher-level parts of the decoder. I believe there is an opportunity to improve performance by paying better attention to data locality and cache layout, and reducing memory bus traffic in general. Another area I plan to explore is improving utilization in the multi-threaded decoder by separating the bitstream decoding from the rest of the image reconstruction, using work units larger than a single macroblock, and not tying functionality to a specific thread. To get involved in these areas, subscribe to the codec-devel mailing list and provide feedback on the code as it's written.

Embedded Processors

We want to optimize for multiple platforms, not just desktops. Fritz Koenig has already started looking at the performance of VP8 on the Intel Atom platform. This platform needs some attention, as we wrote our current x86 assembly code with an out-of-order processor in mind. Since Atom is an in-order processor (much like the original Pentium), the instruction scheduling of all of the x86 assembly code needs to be reexamined. One option we're looking at is scheduling the code for the Atom processor and seeing if that impacts the performance on other x86 platforms such as the VIA C3 and AMD Geode. This is shaping up to be a lot of work, but doing it would provide us with an opportunity to tighten up our assembly code.

These issues, along with wanting to make better use of the larger register file on x86_64, may reignite every assembly programmer's (least?) favorite debate: whether or not to use intrinsics. Yunqing Wang has been experimenting with this a bit, but initial results aren't promising. If you have experience in dealing with a lot of assembly code across several similar-but-kinda-different platforms, these maintainability issues might be familiar to you. I hope you'll share your thoughts and experiences on the codec-devel mailing list.

Optimizing codecs is an iterative (some would say never-ending) process, so stay tuned for more posts on the progress we're making, and by all means, start hacking yourself.

It's exciting to see that we're starting to get substantial code contributions from developers outside of Google, and I look forward to more as WebM grows into a strong community effort.

John Koleszar is a software engineer at Google.


Polite, on-topic comments are welcomed on the webm-discuss mailing list. Please link to this post when commenting.