Auto-vectorization in VS 2012

Visual Studio offers new performance benefits for C/C++ programmers by automatically applying vectorization where possible.

Visual Studio 2012 might not look as good as Visual Studio 2010 (though that may just be a matter of taste and getting used to it), but it comes with some really advanced features. One of those features is the new C++ compiler, which not only supports more of the C++11 standard, but also takes some really important optimization steps for you.

A long time ago I came to the conclusion that the Intel compiler is the one to use if you care about performance (on Intel machines and also in general). The Microsoft compiler was certainly more advanced in areas like optimizing algorithms and avoiding memory leaks, but it could not match the speed for elementary operations. This issue is now gone with the newly introduced auto-vectorization. This feature makes use of SIMD instruction sets such as SSE2, which process several values with a single instruction and result in great performance benefits. If we wanted to use these possibilities before, we either had to use a really advanced compiler (like the one from Intel) or make use of compiler intrinsics.
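To make this concrete, here is a minimal sketch of the kind of loop an auto-vectorizer typically handles well: a simple counted loop with unit stride and no dependencies between iterations. The function names `add_arrays` and `add_vectors` are made up for this illustration and are not taken from the benchmark code. With Visual Studio 2012 the vectorizer is active in optimized builds, and the `/Qvec-report:1` switch should report which loops were vectorized; whether this particular loop is picked up depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise sum c[i] = a[i] + b[i].
// Plain scalar code: no intrinsics, no pragmas. Each iteration is independent
// of the others, which makes the loop a candidate for auto-vectorization.
// (Hypothetical helper, not part of the original benchmark.)
void add_arrays(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Convenience wrapper for std::vector; both inputs must have the same length.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    add_arrays(a.data(), b.data(), c.data(), a.size());
    return c;
}
```

The point is that the source stays ordinary C++; any SIMD code is generated by the compiler, not written by hand.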

Those compiler intrinsics look like ordinary function calls, but the compiler knows them and maps each one to one or a few machine instructions. The main difference to inline assembly is that intrinsics can still be optimized by the compiler (the compiler decides how to allocate registers and how to schedule the instructions), while inline assembly is taken as written.
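For comparison, this is roughly what the hand-written route looks like for a simple element-wise addition. It is only a sketch under a few assumptions: an x86/x64 target with SSE, unaligned loads and stores (so the arrays need no special alignment), and a function name, `add_arrays_sse`, that I made up for this example. The intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Element-wise sum c[i] = a[i] + b[i], four floats per step.
// (Hypothetical helper, not part of the original benchmark.)
void add_arrays_sse(const float* a, const float* b, float* c, std::size_t n)
{
    assert(n == 0 || (a && b && c));

    std::size_t i = 0;

    // Main loop: process 4 floats at once in one 128-bit register.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // load a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);           // load b[i..i+3]
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // store the four sums
    }

    // Remainder loop: handle the last n % 4 elements one by one.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Note how much bookkeeping (vector width, remainder handling) ends up in the source code; this is exactly the work an auto-vectorizer takes over.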

The code used for this benchmark can be downloaded. It calls a subroutine three times, with array sizes chosen so that the data should fit into the L1 cache, the L2 cache and RAM, respectively. The function executes some elementary operations with a growing number of arithmetic operations per element. After each operation has finished, the same operation is executed again using compiler intrinsics.
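The structure of such a benchmark can be sketched as follows. This is not the downloadable code, only an illustration under assumptions: the names `pow8_diff`, `Kernel` and `time_kernel` are invented here, the kernel is my reading of the most expensive expression in the table (a to the eighth power minus b to the eighth power), and the timing uses `std::chrono` rather than whatever timer the original uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Scalar kernel: c[i] = a[i]^8 - b[i]^8, computed by repeated squaring.
// (Hypothetical stand-in for one of the benchmark operations.)
void pow8_diff(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float a2 = a[i] * a[i], a4 = a2 * a2;
        float b2 = b[i] * b[i], b4 = b2 * b2;
        c[i] = a4 * a4 - b4 * b4;
    }
}

typedef void (*Kernel)(const float*, const float*, float*, std::size_t);

// Runs `kernel` `repeats` times over three arrays of n floats each and
// returns the elapsed wall-clock time in seconds.
double time_kernel(Kernel kernel, std::size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    std::vector<float> a(n, 1.5f), b(n, 0.5f), c(n);

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        kernel(a.data(), b.data(), c.data(), n);
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as `time_kernel(pow8_diff, 2048, 100000)` would then time the small case. Choosing n so that the three arrays fit into a given cache level is up to the caller; for example, 2048 floats per array is 24 KB in total, which should fit a 32 KB L1 data cache, but the right sizes depend on the CPU. A real measurement should also make sure the compiler cannot drop the repeated calls, for example by reading the results afterwards.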

Let's have a look at the data first:

| Operation | Intel i3 VS2010 (Normal) | Intel i3 VS2010 (Intrinsics) | AMD Athlon64 VS2012 (Normal) | AMD Athlon64 VS2012 (Intrinsics) | Intel i7 VS2012 (Normal) | Intel i7 VS2012 (Intrinsics) |
|---|---|---|---|---|---|---|
| L1: c = a + b | 12.105 | 3.822 | 22.745 | 18.081 | 1.024 | 0.992 |
| L1: c = a^2 - b^2 | 13.556 | 3.963 | 31.496 | 30.732 | 1.293 | 1.265 |
| L1: c = a^4 - b^4 | 40.279 | 4.96 | 57.642 | 65.132 | 2.218 | 1.986 |
| L1: c = a^8 - b^8 | 93.585 | 11.606 | 109.248 | 133.366 | 4.65 | 4.63 |
| L2: c = a + b | 12.511 | 5.71 | 25.163 | 19.703 | 1.68 | 1.594 |
| L2: c = a^2 - b^2 | 15.054 | 5.631 | 33.166 | 33.088 | 1.719 | 1.603 |
| L2: c = a^4 - b^4 | 39.234 | 5.257 | 58.142 | 65.396 | 2.419 | 2.186 |
| L2: c = a^8 - b^8 | 92.68 | 11.154 | 109.809 | 134.333 | 4.626 | 4.586 |
| RAM: c = a + b | 17.082 | 15.444 | 27.488 | 23.322 | 7.327 | 7.475 |
| RAM: c = a^2 - b^2 | 18.174 | 15.32 | 34.18 | 35.256 | 7.974 | 8.016 |
| RAM: c = a^4 - b^4 | 38.891 | 15.195 | 59.327 | 68.344 | 8.461 | 8.148 |
| RAM: c = a^8 - b^8 | 92.399 | 17.144 | 110.885 | 144.207 | 9.035 | 9.103 |

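With the new compiler (included in Visual Studio 2012) hand-written intrinsics gain us almost nothing: the speedup of the intrinsics version over the normal version is always around 1, at most 1.1 on the new i7 and 1.27 on the old AMD. In other words, the plain code is now about as fast as the hand-tuned code. With the old compiler the speedup from intrinsics was quite large, with a factor of 8.3 as the maximum.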
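The speedup quoted here is simply the ratio of the two timings in a row, normal divided by intrinsics. As a small sketch (the helper name `speedup` is made up for this illustration):

```cpp
#include <cassert>

// Speedup of the intrinsics version over the normal version for one table
// row. Values above 1 mean the intrinsics version is faster.
// (Hypothetical helper for illustration.)
double speedup(double normal, double intrinsics)
{
    assert(intrinsics > 0.0);
    return normal / intrinsics;
}
```

For example, `speedup(92.68, 11.154)` gives roughly 8.3 (i3 with VS2010, L2, highest power), `speedup(25.163, 19.703)` gives roughly 1.28 (AMD with VS2012, L2, c = a + b), and `speedup(110.885, 144.207)` gives roughly 0.77, so for that row on the AMD the intrinsics version is actually slower.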

Additionally, we see the performance benefit of using an i7 compared to a really old AMD Athlon64 3200+ running at 2 GHz. Larger L1 and L2 caches, faster RAM and an extended set of registers all work in favor of the modern CPU.

Download main.cpp (6 kB)
