Introduction to MMX
Andreas Jönsson, August 2002

In this tutorial I will use NASM (The Netwide Assembler) as the assembler. You should be able to convert the code to whatever assembler you prefer without any major difficulties, as assembly language is fairly standard across assemblers. I chose NASM because it suits my purposes and is free to use. NASM is available for download at nasm.sourceforge.net.

I'm going to assume that you have some experience with assembler as I want to concentrate on MMX and not basic assembly language. You will need to know how to write assembler functions that can be called from C/C++. If you don't have this knowledge already I suggest you read the excellent PC Assembly Tutorial, available from drpaulcarter.com.

Why you would want to use MMX

Normally I wouldn't recommend programming in assembler, as today's compilers are very good at optimizing code using all the rules of pairing, register variables, etc. Writing your own code in assembler might make it a little faster, but weighed against the extra work and the risk of bugs it usually isn't worth it.

The compilers are however not smart enough to know when to use MMX or even SSE. That is perhaps a good thing, because not all processors support these instruction sets. The reason the compiler doesn't know when to use MMX is that it isn't just a rearrangement of instructions to gain pipeline performance, or a matter of using registers instead of memory. What the MMX instructions do is pack several numbers into one register and then perform operations on all these numbers in parallel. It is not an easy task to automate just how these numbers should be combined in the registers, and that is why it is up to you to optimize for MMX. You have the most to gain from MMX when the same operation is applied to many independent values, so that several of them can be computed in parallel. Examples of this are image processing, voice recognition, etc.
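
To make the idea of packing several numbers into one register concrete, here is a minimal sketch in plain C++ that adds four bytes stored in one 32-bit integer, lane by lane. It only illustrates the principle; it is not MMX code, and the name AddPackedBytes is made up for this example. An MMX instruction such as paddb does the same kind of thing on 8 bytes with a single instruction and without the masking tricks.

```cpp
#include <cstdint>

// Adds four unsigned bytes packed into one 32-bit integer, lane by lane,
// with wraparound inside each byte. The top bit of every byte is handled
// separately so that a carry can never spill into the neighbouring byte.
std::uint32_t AddPackedBytes(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t low7 = 0x7F7F7F7Fu;  // lower 7 bits of every byte
    const std::uint32_t top1 = 0x80808080u;  // top bit of every byte

    // Adding only the lower 7 bits means a carry stops at bit 7 of its byte
    std::uint32_t sum = (a & low7) + (b & low7);

    // Add the top bits with XOR, which is an addition without carry out
    return sum ^ ((a ^ b) & top1);
}
```

For example, AddPackedBytes(0x01020304, 0x10203040) gives 0x11223344, and a byte that overflows wraps around without disturbing the byte next to it.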

For example, combining two images with blending is highly parallel, since each output pixel is processed in exactly the same way, only with different offsets. Programming this in C++ you would process the pixels one by one, and a good compiler will be able to speed up the process by pipelining so that the program processes two pixels at a time. With MMX you can write the program to process 4 pixels at once, and with some careful arrangement of instructions you can obtain the same pipelining and parallel execution as the C++ compiler, so that the CPU is able to process 8 pixels at the same time.

In the example above the code runs four times as fast (8 pixels at a time instead of 2). Of course these numbers are very naive and in practice you will probably not reach these performance gains. This is because a simple add between 2 values needs only one instruction, but to compute 4 adds in parallel you will need to do some juggling of the values to move them into the registers for adding. An increase of 50% to 100% though shouldn't be too difficult to obtain.

Verifying availability of MMX

A complication of MMX is that not all processors support it. Trying to execute an MMX instruction on a CPU that doesn't support it leads to an Illegal Instruction Exception, which most likely won't explain to the users why the program didn't work on their computer. Thus it's a good idea to check for MMX support before using it.

The assembler instruction cpuid can be used on all processors that follow the standard to verify support for MMX. When cpuid is executed with EAX = 1 it returns, among other things, the supported feature set in the EDX register. If bit 23 in the EDX register is set after the call, MMX is supported on the processor. Verifying support for the extended MMX instructions that were made available after the initial introduction is more complicated and may differ from processor to processor. For your convenience I've written a small assembler function that checks if the processor supports the standard MMX instruction set.

; bool DoesCPUSupportMMX();
global _DoesCPUSupportMMX

_DoesCPUSupportMMX:

    ; Save the registers affected
    push   ebx                  
    push   ecx
    push   edx

    ; Check feature flag 23 in EDX for MMX support
    mov    eax, 1               
    cpuid                       
    mov    eax, edx           
    shr    eax, 23              
    and    eax, 1                   

    ; Restore registers
    pop    edx                  
    pop    ecx
    pop    ebx
    ret
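
The bit test at the heart of this routine can also be written in C++, which may be handy if you already have the EDX value from somewhere else. This is only a sketch; the names HasFeatureBit and EdxReportsMMX are made up for this example, and the function takes the EDX value as a parameter instead of executing cpuid itself.

```cpp
#include <cstdint>

// Bit position of the MMX flag in EDX after cpuid with EAX = 1
const unsigned kMMXFeatureBit = 23;

// Returns true if the given feature bit is set in a cpuid EDX value
bool HasFeatureBit(std::uint32_t edx, unsigned bit)
{
    return ((edx >> bit) & 1u) != 0;
}

// Same test as the shr/and pair in the assembler routine
bool EdxReportsMMX(std::uint32_t edx)
{
    return HasFeatureBit(edx, kMMXFeatureBit);
}
```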

The MMX instruction set was introduced with the later versions of the Pentium and should be available on all Pentium II or compatible processors. The instruction set was expanded with the introduction of the Pentium III, and AMD and Cyrix also made their own extensions to the set. Using the expanded set requires a more detailed verification of the processor type, and this is even more important than verifying support for the standard set. It is after all quite likely that a user has a Pentium II or later today, but much less likely that the processor supports the expanded set.

The cpuid instruction was introduced with the later versions of the 486, and is available on all Pentium and later processors. Support for cpuid is verified by checking whether bit 21 (ID) in EFLAGS can be toggled.

If you want to know more about how to determine CPU type and feature support I suggest you take a look at the manuals available from the vendors' homepages. Intel has a good manual for checking the type of Intel processor available at intel.com. AMD also has a good manual for their processors available from amd.com.

A dangerous pitfall

A dangerous pitfall of MMX is that the 8 64-bit MMX registers share their space with the floating point registers. Switching between MMX routines and floating point routines is therefore not to be taken lightly. The switch must also be done manually by executing the emms instruction, which empties the MMX state so that the registers can be used by floating point instructions again. Failing to execute this instruction before any floating point operation will most likely lead to unwanted results, such as application crashes.

The emms instruction is not cheap; it can take up to 50 clock cycles, so it is recommended that you group your MMX calls and avoid floating point instructions between them, possibly replacing them with fixed point math. Fixed point math is where you use a normal integer and treat the lower half of its bits as the fractional part of the number. Working with fixed point math can be a complicated matter so I will leave that to another tutorial.
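
Just to make the idea concrete, here is a minimal sketch of 16.16 fixed point in C++, where the upper 16 bits of a 32-bit integer hold the whole part and the lower 16 bits hold the fraction. The type and function names are made up for this example, and the conversion helpers use floating point only for convenience when setting up values.

```cpp
#include <cstdint>

typedef std::int32_t Fixed;                  // 16.16 fixed point value
const int   kFractionBits = 16;              // lower half holds the fraction
const Fixed kFixedOne     = 1 << kFractionBits;

// Conversion helpers (these use floating point, so keep them out of MMX code)
Fixed  ToFixed(double v)  { return static_cast<Fixed>(v * kFixedOne); }
double FromFixed(Fixed v) { return static_cast<double>(v) / kFixedOne; }

// Addition works directly on the integers
Fixed FixedAdd(Fixed a, Fixed b) { return a + b; }

// Multiplication needs a wider intermediate and a shift back down.
// Assumes an arithmetic right shift for negative values, which is what
// common compilers do.
Fixed FixedMul(Fixed a, Fixed b)
{
    std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<Fixed>(wide >> kFractionBits);
}
```

With this representation 1.5 is stored as 98304, and FixedMul(ToFixed(1.5), ToFixed(2.0)) gives the same integer as ToFixed(3.0).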

I suggest that you write the following routine:

; void EndMMX();
global _EndMMX

_EndMMX:
    emms     ; Allow CPU to use floating point
    ret

And then call this function from your C++ program with EndMMX(); when you are finished with the MMX calculations.

Quick look on MMX instructions

Let's take a look at what instructions are included in MMX. The MMX instructions are recognizable in that they operate on the 8 MMX registers, MM0-MM7. The following MMX instructions were introduced with the late Pentium and Pentium II processors.

`emms`	Empties the MMX state so that the MMX registers can be used by floating point operations.
`movd`	Moves a double word (32 bits) between an MMX register and a general purpose register or memory.
`movq`	Moves a quad word (64 bits) between MMX registers or to/from memory.
`packssdw packsswb packuswb`	Packs 128 bits of words or double words into 64 bits by narrowing each unit to half its size with signed or unsigned saturation.
`paddb paddw paddd`	Adds two groups of bytes, words, or double words.
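`paddq`	Adds two quad words. Note that this instruction is not part of the original MMX set; it was added later with SSE2 on the Pentium 4.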
`paddsb paddsw paddusb paddusw`	Adds signed bytes or words with signed or unsigned saturation.
`pand pandn`	Bitwise AND and AND NOT.
`pcmpeqb pcmpeqw pcmpeqd`	Compares each unit in the groups for equality.
`pcmpgtb pcmpgtw pcmpgtd`	Compares each unit in the groups for signed greater than.
`pmaddwd`	Multiplies signed words and adds adjacent pairs of products into double words.
`pmulhw pmullw`	Multiplies signed words and returns the high or low word of each product.
`por`	Bitwise OR.
`psllw pslld psllq`	Logical left shift per unit.
`psraw psrad`	Arithmetic right shift per unit.
`psrlw psrld psrlq`	Logical right shift per unit.
`psubb psubw psubd`	Subtract integers.
`psubsb psubsw psubusb psubusw`	Subtract integers with signed or unsigned saturation.
`punpckhbw punpckhwd punpckhdq punpcklbw punpcklwd punpckldq`	Double the unit size by interleaving units from two sources.
`pxor`	Bitwise XOR.

With the Pentium III the MMX instruction set was extended with the following instructions.

`maskmovq`	Write bytes to memory from register using a mask to select which bytes to write.
`movntq`	Moves quad word to memory bypassing the cache.
`pavgb pavgw`	Compute the average of unsigned bytes or words.
`pextrw`	Extracts a specified word from a group.
`pinsrw`	Inserts a word at specified location in group.
`pmaxsw`	Compares signed words and stores the largest.
`pmaxub`	Compares unsigned bytes and stores the largest.
`pminsw`	Compares signed words and stores the smallest.
`pminub`	Compares unsigned bytes and stores the smallest.
`pmovmskb`	Creates an 8-bit integer from the most significant bit in each byte.
`pmulhuw`	Multiply unsigned words and return high word.
`psadbw`	Computes the absolute pairwise difference of 8 unsigned bytes and sums them into 1 word.
`pshufw`	Shuffles words.

Apart from emms, the instructions take 1 clock cycle to execute, except for the multiplications, which take 3. Most of the instructions can also be paired for execution in parallel. For a more detailed description of the instructions, read the NASM manual or the Intel Reference Manual, which is available from intel.com.
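
Several of the instructions above come in a wraparound version and a saturating version. The difference is easiest to see on a single byte, so here is a small C++ sketch that models what one lane of paddb, paddusb, and paddsb computes. The function names are made up for this example; the real instructions do this for all 8 bytes of a register at once.

```cpp
#include <cstdint>

// One lane of paddb: the result wraps around on overflow
std::uint8_t AddWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// One lane of paddusb: the result is clamped to the range 0..255
std::uint8_t AddUnsignedSaturate(std::uint8_t a, std::uint8_t b)
{
    int sum = a + b;
    return static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
}

// One lane of paddsb: the result is clamped to the range -128..127
std::int8_t AddSignedSaturate(std::int8_t a, std::int8_t b)
{
    int sum = a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return static_cast<std::int8_t>(sum);
}
```

Adding 200 and 100 gives 44 with wraparound but 255 with unsigned saturation, which is usually what you want when adding color values.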

A practical example

To give you a better understanding of what can be done with MMX I've written a small function that blends two 32-bit ARGB pixels using 4 8-bit factors, one for each channel. To do this in C++ you would have to do the blending channel by channel. But with MMX we can blend all channels at once.

The blending factor is a one byte value between 0 and 255, as are the channel components. Each channel is blended using the following formula.

res = (a*fa + b*(255-fa))/255

Writing this in C++ is as easy as it looks, and even writing it in assembler is quite straightforward. For MMX, however, we have a problem: there is no packed division operation available. We can divide by shifting the bits to the right by 8, but that divides by 256 rather than 255. This small difference might not be too important if we are doing only one blending pass, but as the number of passes increases the artifacts increase as well. The solution is to increase the range of the factor to be between 0 and 256, which is simple to do by adding 1 if the factor is above 127. The formula then becomes res = (a*fa + b*(256-fa))/256.
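
The difference between the variants can be seen in a small C++ sketch for a single channel. The function names are made up for this example. The first function is the exact formula, the second replaces the division with a shift and nothing else, and the third also widens the factor to the range 0 to 256 as described above.

```cpp
#include <cstdint>

// Exact formula: res = (a*fa + b*(255-fa))/255
std::uint8_t BlendExact(std::uint8_t a, std::uint8_t b, std::uint8_t fa)
{
    return static_cast<std::uint8_t>((a * fa + b * (255 - fa)) / 255);
}

// Shift instead of division, factor left at 0..255: divides by 256,
// so the result comes out slightly too dark
std::uint8_t BlendNaiveShift(std::uint8_t a, std::uint8_t b, std::uint8_t fa)
{
    return static_cast<std::uint8_t>((a * fa + b * (255 - fa)) >> 8);
}

// Shift instead of division, factor widened to 0..256 first
std::uint8_t BlendShift(std::uint8_t a, std::uint8_t b, std::uint8_t fa)
{
    unsigned f = fa + (fa >> 7);    // adds 1 if the factor is above 127
    return static_cast<std::uint8_t>((a * f + b * (256 - f)) >> 8);
}
```

Blending 255 over 0 with a factor of 255 should give 255. BlendExact and BlendShift both do, while BlendNaiveShift gives 254, and that error is what accumulates over several passes.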

Without further comments here is the assembler function for blending two ARGB pixels.

; DWORD LerpARGB(DWORD a, DWORD b, DWORD f);
global _LerpARGB   

_LerpARGB:

    ; load the pixels and expand to 4 words
    movd        mm1, [esp+4]    ; mm1 = 0 0 0 0 aA aR aG aB
    movd        mm2, [esp+8]    ; mm2 = 0 0 0 0 bA bR bG bB
    pxor        mm5, mm5        ; mm5 = 0 0 0 0 0 0 0 0
    punpcklbw   mm1, mm5        ; mm1 = 0 aA 0 aR 0 aG 0 aB
    punpcklbw   mm2, mm5        ; mm2 = 0 bA 0 bR 0 bG 0 bB

    ; load the factor and increase range to [0-256]
    movd        mm3, [esp+12]   ; mm3 = 0 0 0 0 faA faR faG faB
    punpcklbw   mm3, mm5        ; mm3 = 0 faA 0 faR 0 faG 0 faB
    movq        mm6, mm3        ; mm6 = faA faR faG faB [0 - 255]
    psrlw       mm6, 7          ; mm6 = faA faR faG faB [0 - 1]
    paddw       mm3, mm6        ; mm3 = faA faR faG faB [0 - 256] 

    ; fb = 256 - fa
    pcmpeqw     mm4, mm4        ; mm4 = 0xFFFF 0xFFFF 0xFFFF 0xFFFF
    psrlw       mm4, 15         ; mm4 =   1   1   1   1 
    psllw       mm4, 8          ; mm4 = 256 256 256 256 
    psubw       mm4, mm3        ; mm4 = fbA fbR fbG fbB 

    ; res = (a*fa + b*fb)/256
    pmullw      mm1, mm3        ; mm1 = a*fa for each channel (fits in 16 bits)
    pmullw      mm2, mm4        ; mm2 = b*fb for each channel
    paddw       mm1, mm2        ; mm1 = a*fa + b*fb (at most 255*256)
    psrlw       mm1, 8          ; mm1 = 0 rA 0 rR 0 rG 0 rB

    ; pack into eax
    packuswb    mm1, mm1        ; mm1 = 0 0 0 0 rA rR rG rB
    movd        eax, mm1        ; eax = rA rR rG rB

    ret

You should note that I've written this function for clarity and not for speed. I have not tried to optimize it for speed by pairing instructions that can be executed in parallel as described by the Intel Optimization Manual, available from intel.com.
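
When experimenting with a routine like this it helps to have a plain C++ version to compare the results against. The following is a sketch of such a reference, performing the same steps as the assembler function one channel at a time. The name LerpARGBReference is made up for this example, and std::uint32_t stands in for DWORD.

```cpp
#include <cstdint>

// Blends two ARGB pixels channel by channel, following the same steps
// as the MMX routine: widen the factor to 0..256, compute fb = 256 - fa,
// then res = (a*fa + b*fb) >> 8.
std::uint32_t LerpARGBReference(std::uint32_t a, std::uint32_t b, std::uint32_t f)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        std::uint32_t ca = (a >> shift) & 0xFF;   // channel of pixel a
        std::uint32_t cb = (b >> shift) & 0xFF;   // channel of pixel b
        std::uint32_t fa = (f >> shift) & 0xFF;   // factor for this channel

        fa += fa >> 7;                            // range 0..255 -> 0..256
        std::uint32_t fb = 256 - fa;

        std::uint32_t res = (ca * fa + cb * fb) >> 8;
        result |= res << shift;
    }
    return result;
}
```

A factor of 0xFFFFFFFF should return pixel a unchanged and a factor of 0 should return pixel b, which makes for two easy sanity checks of the MMX version as well.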

Conclusion

With this tutorial I have hopefully been able to inspire some interest in low-level optimization using assembler and MMX. Now, go ahead and play around with the MMX instructions and see what you can do. Once you feel comfortable using MMX, don't forget that most of the time it just isn't worth it. But for those few times when it is worth it, it is going to show that you know your business.

Questions, comments, and suggestions are as usual more than welcome. After all I'm writing this in hope that I will learn something from you too.

Thanks to Axel Gneiting and Graham Reeds for telling me about some errors in the article. I also thank them for giving me some extra information on AMD and Cyrix processors, even though I ended up not including it in the article.
