assembly - Why does GCC choose dword movl to copy a long shift count to CL?

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

In the third chapter of Computer Systems: A Programmer's Perspective, an example program is given in the discussion of shift operations:

long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

Its assembly code is as follows (reproducible with GCC 10.2 -O1 for x86-64 on the Godbolt compiler explorer; -O2 schedules the instructions in a different order but still uses movl to ECX):

shift_left4_rightn:
    endbr64
    movq    %rdi, %rax      # Get x
    salq    $4, %rax        # x <<= 4
    movl    %esi, %ecx      # Get n
    sarq    %cl, %rax       # x >>= n
    ret

I wonder why the instruction that gets n is movl %esi, %ecx instead of movq %rsi, %rcx, since n is a quad word.

On the other hand, movb %sil, %cl might be more suitable if optimization is considered, since the shift amount only uses the single-byte register %cl and the higher bits are all ignored.

As a result, I can't figure out the reason for using "movl %esi, %ecx" when dealing with a long integer.
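
To make the point about ignored high bits concrete, here is a minimal C sketch of what sarq %cl, %rax computes. The function name model_sarq is hypothetical, and the model assumes the documented x86-64 behavior that a 64-bit shift uses only the low 6 bits of CL. It also relies on GCC and Clang implementing >> on a negative signed value as an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical model of `sarq %cl, %rax` (not compiler output).
   count_reg stands for the full 64-bit RCX; only its low byte (CL) is read,
   and for 64-bit operand size the hardware masks that to 6 bits. */
int64_t model_sarq(int64_t value, uint64_t count_reg)
{
    unsigned cl = (unsigned)(count_reg & 0xFF); /* the instruction only sees CL */
    unsigned count = cl & 63;                   /* masked to the range 0..63 */

    /* Assumption: the compiler implements >> on negative values as an
       arithmetic shift (true for GCC and Clang), matching sar. */
    return value >> count;
}
```

With this model, model_sarq(x, 4) and model_sarq(x, 0x100000004) give the same result, which is why any mov that gets the low byte of n into CL is sufficient.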


1 Answer

深蓝 · Answer 1 · 2021-10-23T21:39:41+0000

Yes, GCC realizes that upper bits are ignored by sar.
Then movl is the natural consequence of applying two simple optimization rules:

Avoid writing partial registers (i.e. 8 or 16-bit, where writing merges into the old value instead of zero-extending). See "Why doesn't GCC use partial registers?" for the various reasons across different microarchitectures, including in this case a false dependency on the old value of RCX.
Prefer 32-bit operand size because it's the default in x86-64 machine code, not needing any prefixes. And it's at least as fast as any other operand-size for any instruction.
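
The difference between merging and zero-extending can be sketched in C as a model of how each operand size updates the full 64-bit RCX. The function names are hypothetical and this only models the architectural result, not any particular microarchitecture's renaming behavior.

```c
#include <stdint.h>

/* Hypothetical model of writes to RCX at different operand sizes. */

/* movb ..., %cl: the low 8 bits are replaced, the upper 56 bits of the
   old value are kept, so the result depends on old_rcx. */
uint64_t write_low8(uint64_t old_rcx, uint8_t value)
{
    return (old_rcx & ~UINT64_C(0xFF)) | value;
}

/* movw ..., %cx: the low 16 bits are replaced, the upper 48 bits are kept. */
uint64_t write_low16(uint64_t old_rcx, uint16_t value)
{
    return (old_rcx & ~UINT64_C(0xFFFF)) | value;
}

/* movl ..., %ecx: the result is zero-extended into RCX, so the old value
   is not an input at all. */
uint64_t write_low32(uint64_t old_rcx, uint32_t value)
{
    (void)old_rcx; /* deliberately unused: no dependency on the old value */
    return value;
}
```

In the 8-bit and 16-bit cases old_rcx is an input to the result, which is the dependency on the old register value that the first rule avoids. The 32-bit write has no such input.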

Fun fact: even if the arg had been uint8_t, compilers would still hopefully use movl %esi, %ecx. You'd think reading a wider register when the arg value is only in SIL could create a partial-register stall, but an unofficial extension to the x86-64 System V calling convention is that callers should zero- or sign-extend narrow args to at least 32 bits. So we can assume the register was written with at least a 32-bit operation.

The specific downsides of some other choices:

movq %rsi, %rcx - waste of a REX prefix (code-size downside).
movb %sil, %cl - writes a partial register, and still needs a REX prefix to access SIL.
movzbl %sil, %ecx - code size: 2-byte opcode, and needs a REX to read SIL. Also, AMD CPUs only do mov-elimination (zero latency) for movl / movq, not movzx.
movw %si, %cx - zero advantages, needs an operand-size prefix and writes a partial register.
movzwl %si, %ecx - Tied with movq for code-size, but defeats mov-elimination even on Intel CPUs.
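
The code-size comparison above can be checked against the machine-code bytes. The table below is a sketch with the encodings I'd expect GNU as to emit for each candidate; the struct and function names are hypothetical. An assembler could pick the alternative opcode form of a reg-reg mov, but the lengths would be the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of the candidate instructions and their encodings. */
struct encoding {
    const char *asm_text; /* AT&T syntax, as in the answer */
    uint8_t bytes[4];     /* machine code, unused tail bytes are zero */
    size_t length;        /* number of valid bytes */
};

static const struct encoding candidates[] = {
    { "movl   %esi, %ecx", { 0x89, 0xF1 },             2 }, /* no prefix */
    { "movq   %rsi, %rcx", { 0x48, 0x89, 0xF1 },       3 }, /* REX.W prefix */
    { "movb   %sil, %cl",  { 0x40, 0x88, 0xF1 },       3 }, /* REX needed for SIL */
    { "movzbl %sil, %ecx", { 0x40, 0x0F, 0xB6, 0xCE }, 4 }, /* REX + 2-byte opcode */
    { "movw   %si, %cx",   { 0x66, 0x89, 0xF1 },       3 }, /* operand-size prefix */
    { "movzwl %si, %ecx",  { 0x0F, 0xB7, 0xCE },       3 }, /* 2-byte opcode */
};

/* Number of entries in the table. */
size_t candidate_count(void)
{
    return sizeof candidates / sizeof candidates[0];
}

/* Encoded length in bytes of the candidate at the given index. */
size_t encoded_length(size_t index)
{
    return candidates[index].length;
}
```

Only the movl form fits in 2 bytes; every alternative pays for a prefix or a longer opcode, and movzwl ties with movq at 3 bytes.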

Fun fact: if we pad with a dummy arg so n arrives in RDX, GCC still chooses movl %edx, %ecx, even though movb %dl, %cl is the same code-size (no REX needed to access DL). So yes, GCC is definitely avoiding byte operand-size.

Fun fact 2: Clang unfortunately does waste a REX on movq, missing this optimization. https://godbolt.org/z/6GWhMd

But if we make the count arg unsigned char, clang and GCC do both use movl instead of movb, fortunately. https://godbolt.org/z/e95WP8
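
For anyone who wants to repeat these experiments, the variants described above might look like the following. This is a sketch: the names shift_padded and shift_u8 are hypothetical, and the exact sources behind the Godbolt links are not reproduced here.

```c
/* Original function from the question: x in RDI, n in RSI. */
long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

/* Hypothetical variant with a dummy second arg, so that n arrives in RDX
   (third integer arg in the x86-64 System V calling convention). */
long shift_padded(long x, long dummy, long n)
{
    (void)dummy; /* only present to shift n into the next arg register */
    x <<= 4;
    x >>= n;
    return x;
}

/* Hypothetical variant with a narrow count: the value is only guaranteed
   to be in SIL, but callers extend it to at least 32 bits in practice. */
long shift_u8(long x, unsigned char n)
{
    x <<= 4;
    x >>= n;
    return x;
}
```

Compiling these with -O1 and comparing the instruction that copies the count to ECX is one way to see which operand size each compiler picks.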
