Write-combining optimization in x264

Write combining

Filed under: gcc, speed, ugly code, x264


Let’s say we need to copy a few variables from one array to another. The obvious way is something like this:

byte array1[4] = {1,2,3,4};
byte array2[4];
int i;
for(i = 0; i < 4; i++) array2[i] = array1[i];

But this is suboptimal for several reasons. For one, we’re doing 8-bit reads and writes, which on 32-bit systems may actually be slower than 32-bit reads and writes; i.e. a single 32-bit read/write may be faster than a single 8-bit read/write. But the main issue is that we could be doing this:

DECLARE_ALIGNED_4(byte array1[4] = {1,2,3,4});
DECLARE_ALIGNED_4(byte array2[4]);
*(uint32_t*)array2 = *(uint32_t*)array1;

In a single operation instead of 4, we just copied the whole array. It’s faster, and the code is shorter, too. The alignment ensures that we don’t copy between unaligned arrays, which could crash on non-x86 architectures (e.g. PowerPC) and would also go slightly slower on x86 (but still faster than the uncombined write). But, one might ask, can’t the compiler do this? Well, there are many reasons it doesn’t happen. We’ll start with the easiest case and work up to the hardest.
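For reference, here is a minimal sketch of the two copies side by side. Some assumptions: the definition of DECLARE_ALIGNED_4 isn't shown above, so C11's _Alignas stands in for it; uint8_t replaces the byte shorthand; and the function names are made up for illustration. The combined copy goes through a fixed-size memcpy rather than the pointer cast, because the cast technically breaks C's strict-aliasing rules, while compilers typically lower a 4-byte memcpy to a single 32-bit load and store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: _Alignas(4) plays the role of DECLARE_ALIGNED_4 from the text. */
static _Alignas(4) const uint8_t array1[4] = {1, 2, 3, 4};

/* Baseline: four separate 8-bit loads and stores. */
static void copy4_bytewise(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[i];
}

/* Combined: the four bytes move as one 32-bit value. */
static void copy4_combined(uint8_t *dst, const uint8_t *src)
{
    uint32_t tmp;
    memcpy(&tmp, src, sizeof tmp);  /* one 32-bit load  */
    memcpy(dst, &tmp, sizeof tmp);  /* one 32-bit store */
}

/* Both versions must produce the same bytes. */
static void check_copy4(void)
{
    _Alignas(4) uint8_t a[4] = {0};
    _Alignas(4) uint8_t b[4] = {0};

    copy4_bytewise(a, array1);
    copy4_combined(b, array1);

    assert(memcmp(a, array1, 4) == 0);
    assert(memcmp(b, array1, 4) == 0);
}
```

Whether the memcpy form really becomes a single move depends on the compiler and optimization level, so it's worth checking the generated assembly.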

The easiest case is a simple zeroing of a struct (say s={a,b} where a and b are 16-bit integers). The struct is likely to be aligned by the compiler to begin with, and writing zero to {a,b} is the same as writing a 32-bit zero to the whole struct. But GCC doesn’t even optimize this; it still assigns the zeroes separately! How stupid.
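As a rough sketch of the struct case, a union lets the same four bytes be written either as two 16-bit fields or as one 32-bit word. The type and function names here are hypothetical, and compilers newer than the one discussed may well merge the two stores by themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct from the text: two 16-bit fields. */
typedef struct { int16_t a, b; } pair16;

/* The same 4 bytes, seen as the struct or as one 32-bit word. */
typedef union {
    pair16   s;
    uint32_t word;
} pair16_view;

_Static_assert(sizeof(pair16) == sizeof(uint32_t), "pair16 must be 4 bytes");

/* Uncombined: two 16-bit stores. */
static void zero_separately(pair16_view *p)
{
    p->s.a = 0;
    p->s.b = 0;
}

/* Write-combined: one 32-bit store clears both fields. */
static void zero_combined(pair16_view *p)
{
    p->word = 0;
}

/* Both ways of zeroing must leave a == 0 and b == 0. */
static void check_zeroing(void)
{
    pair16_view x = { .s = { 123, -456 } };
    pair16_view y = x;

    zero_separately(&x);
    zero_combined(&y);

    assert(x.s.a == 0 && x.s.b == 0);
    assert(y.s.a == 0 && y.s.b == 0);
}
```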

The second-easiest case is the generalization of this: if the function accesses the arrays directly (rather than through pointers to arrays, whose alignment it may not know) and assigns zero or a constant value, write-combining is trivial. But again, GCC doesn’t do it.
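The constant-value case can be sketched like this: replicate the byte into every lane of a 32-bit word, then store the word once. The helper names are made up for illustration, and the memcpy is there only to keep the store free of aliasing problems.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all four lanes of a 32-bit word,
 * e.g. 0x2A becomes 0x2A2A2A2A. */
static uint32_t splat8(uint8_t v)
{
    return (uint32_t)v * 0x01010101u;
}

/* Fill a 4-byte array with one 32-bit store instead of four 8-bit stores. */
static void fill4_combined(uint8_t *dst, uint8_t v)
{
    uint32_t word = splat8(v);
    memcpy(dst, &word, sizeof word);
}

/* Every byte of the array must end up equal to the fill value. */
static void check_fill4(void)
{
    uint8_t buf[4] = {0};

    fill4_combined(buf, 0x2A);

    for (int i = 0; i < 4; i++)
        assert(buf[i] == 0x2A);
}
```

Because all four bytes are equal, the result doesn't depend on byte order.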

Now, we get to the harder stuff. What if we’re copying between two arrays, both of which are directly accessed? Now the compiler has to detect the sequential copying and merge it. This is basically a simple form of autovectorization; it’s no surprise at all that GCC doesn’t do this.

The hardest, and in fact nearly impossible, case is the one in which we’re dealing with pointers to arrays as arguments; the compiler really has no reliable way of knowing that the pointers are aligned (though we as programmers might know that they always are). There are cases where it could accurately derive (by annotating pointers passed between functions) whether they are aligned, in which case it might be able to do write combining; this would of course be very difficult. Of course, on x86 it’s still worthwhile to combine even if there’s a misalignment risk, since it will only go slightly slower rather than crash.
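To make the pointer case concrete, here is a sketch of two ways the programmer can supply the alignment knowledge the compiler lacks. One function simply promises alignment (using GCC/Clang's __builtin_assume_aligned where available), the other checks at run time and falls back to byte copies. All function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GCC and Clang provide __builtin_assume_aligned; elsewhere this is a no-op. */
#if defined(__GNUC__)
#define ASSUME_ALIGNED_4(p) __builtin_assume_aligned((p), 4)
#else
#define ASSUME_ALIGNED_4(p) (p)
#endif

static int is_aligned_4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Caller promises that both pointers are 4-byte aligned. */
static void copy4_trusted(uint8_t *dst, const uint8_t *src)
{
    memcpy(ASSUME_ALIGNED_4(dst), ASSUME_ALIGNED_4(src), 4);
}

/* No promise: check at run time and fall back to 8-bit copies. */
static void copy4_checked(uint8_t *dst, const uint8_t *src)
{
    if (is_aligned_4(dst) && is_aligned_4(src)) {
        copy4_trusted(dst, src);
    } else {
        for (int i = 0; i < 4; i++)
            dst[i] = src[i];
    }
}

/* Exercise both the aligned path and the misaligned fallback. */
static void check_copy4_pointers(void)
{
    _Alignas(4) uint8_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    _Alignas(4) uint8_t dst[8] = {0};

    copy4_checked(dst, src);          /* both aligned: combined path */
    copy4_checked(dst + 4, src + 1);  /* src misaligned: byte path   */

    assert(dst[0] == 1 && dst[3] == 4);
    assert(dst[4] == 2 && dst[7] == 5);
}
```

The run-time check costs a branch, so in a hot loop the trusted variant is the one that actually matches the technique described above.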

The end result of this kind of operation is a massive speed boost in such functions; for example, in the section where motion vectors are cached (in macroblock_cache_save) I got over double the speed by converting 16-bit copies to write-combined copies. This of course is only on a 32-bit system; on a 64-bit system we could do even better. The code uses 64-bit copies so that a 64-bit binary does as well as it can; the compiler is smart enough to split those copies on 32-bit systems.
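A sketch of what such a 64-bit copy could look like, assuming four 16-bit values stored back to back (the function name is hypothetical, and this is not the actual macroblock_cache_save code). On a 64-bit target this typically becomes one load and one store; on a 32-bit target the compiler splits it into two 32-bit moves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy four 16-bit values (8 bytes) as a single 64-bit unit. */
static void copy_4x16(int16_t *dst, const int16_t *src)
{
    uint64_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}

/* The combined copy must reproduce all four values, sign included. */
static void check_copy_4x16(void)
{
    const int16_t src[4] = {10, -20, 30, -40};
    int16_t dst[4] = {0};

    copy_4x16(dst, src);

    for (int i = 0; i < 4; i++)
        assert(dst[i] == src[i]);
}
```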

We could actually do even better if we were willing to use MMX or SSE, since MMX could be used for 64-bit copies on 32-bit systems and SSE could be used for 128-bit copies. Unfortunately, this would completely sacrifice portability, and at this point the speed boost over the current merged copies would be pretty small.
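For illustration only, a 128-bit copy might be sketched as below. This uses SSE2 integer intrinsics (an assumption; the text only says "SSE") and falls back to memcpy when SSE2 isn't available, which also shows the portability cost: the fast path exists only on x86. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy 16 bytes as one 128-bit unit where SSE2 is available. */
static void copy16(uint8_t *dst, const uint8_t *src)
{
#if defined(__SSE2__)
    /* Unaligned load/store variants, so no alignment promise is needed. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
#else
    memcpy(dst, src, 16);
#endif
}

/* The 16 copied bytes must match the source exactly. */
static void check_copy16(void)
{
    uint8_t src[16];
    uint8_t dst[16] = {0};

    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(i * 3);

    copy16(dst, src);

    assert(memcmp(dst, src, 16) == 0);
}
```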

One of the big tricks currently is the ability to treat two motion vectors as one, and since all motion vectors come in pairs (X and Y, 16-bit signed integers each), it’s quite easy to manipulate them as pairs. This allowed me to drastically speed up a lot of the manipulation involved in motion vector prediction and general copying and storing. The result of all the issues described in the article is this massive diff.
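As a small sketch of the pairing idea, an X/Y pair of 16-bit components can be viewed as one 32-bit word, so that copying or comparing a pair takes one operation instead of two. The type and helper names below are hypothetical and are not taken from the x264 source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: an X/Y pair of 16-bit signed components. */
typedef struct { int16_t x, y; } mv_t;

_Static_assert(sizeof(mv_t) == sizeof(uint32_t), "mv_t must be 4 bytes");

/* View the pair as a single 32-bit word. */
static uint32_t mv_as_u32(const mv_t *mv)
{
    uint32_t w;
    memcpy(&w, mv, sizeof w);
    return w;
}

/* Copy both components with one 32-bit move. */
static void mv_copy(mv_t *dst, const mv_t *src)
{
    uint32_t w = mv_as_u32(src);
    memcpy(dst, &w, sizeof w);
}

/* Compare both components with one 32-bit comparison. */
static int mv_equal(const mv_t *a, const mv_t *b)
{
    return mv_as_u32(a) == mv_as_u32(b);
}

/* Copies must compare equal; changing one component must break equality. */
static void check_mv_pairs(void)
{
    mv_t a = {3, -7};
    mv_t b = {0, 0};

    mv_copy(&b, &a);
    assert(b.x == 3 && b.y == -7);
    assert(mv_equal(&a, &b));

    b.y = 7;
    assert(!mv_equal(&a, &b));
}
```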
