assembly - Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

Question

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely. (See "SIMD instructions lowering CPU frequency" and "Dynamically determining where a rogue AVX-512 instruction is executing".) Presumably Ice Lake is similar.

(Workaround: not a problem for zmm16..31, according to @BeeOnRope's comments, which I quoted in "Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?". So this strlen could just use vpxord xmm16,xmm16,xmm16 and vpcmpeqb with zmm16.)

How to test this if you have hardware:

@BeeOnRope posted test code in an RWT thread: replace vbroadcastsd zmm15, [zero_dp] with vpcmpeqb k0, zmm0, [rdi] as the "dirtying" instruction and see if the loop after that runs slow or fast.

I assume executing any 512-bit uop will trigger reduced turbo temporarily (along with shutting down port 1 for vector ALU uops while the 512-bit uop is actually in the back-end), but the question is: Will the CPU recover on its own if you never use vzeroupper after just reading a ZMM register?

(And/or will later SSE or AVX instructions have transition penalties or false dependencies?)

Specifically, does a strlen using insns like this need a vzeroupper before returning? (In practice on any real CPU, and/or as documented by Intel for future-proof best practices.) Assume that later instructions may include non-VEX SSE and/or VEX-encoded AVX1/2, not just GP integer, in case that's relevant to a dirty-upper-256 situation keeping turbo reduced.

; check 64 bytes for zero, strlen building block.
    vpxor     xmm0,xmm0,xmm0    ; zmm0 = 0 using AVX1 implicit zero-extension
    vpcmpeqb  k0, zmm0, [rdi]   ; 512-bit load + ALU, not micro-fused
    ;kortestq k0,k0 / jnz or whatever

    kmovq     rax, k0
    tzcnt     rax, rax

  ;vzeroupper  before lots of code that goes a long time before another 512-bit uop?

(Inspired by the strlen in "AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?", which would look like this if zeroing its vector reg was properly optimized to use a shorter VEX instead of EVEX instruction.)
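For readers without AVX-512 hardware, here is a minimal scalar sketch in C of what the building block above computes. It only models the architectural result (a 64-bit mask of zero bytes, then the index of the first one) and says nothing about turbo or dirty-upper behaviour. The function names zero_byte_mask64 and tzcnt64 are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of: vpxor xmm0,xmm0,xmm0 / vpcmpeqb k0, zmm0, [p] / kmovq rax, k0.
 * Bit i of the result is set iff byte p[i] equals zero.
 * (Hypothetical helper name, not from any library.) */
uint64_t zero_byte_mask64(const unsigned char *p)
{
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++) {
        if (p[i] == 0)
            mask |= (uint64_t)1 << i;
    }
    return mask;
}

/* Scalar model of tzcnt with a 64-bit operand: index of the lowest set bit,
 * or 64 (the operand size) when the input is zero. */
unsigned tzcnt64(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```

Calling tzcnt64(zero_byte_mask64(p)) on a 64-byte chunk gives the offset of the first zero byte, or 64 if the chunk contains none, which is the case where a real strlen would move on to the next chunk.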

The key instruction is the vpcmpeqb k0, zmm0, [rdi] which decodes on SKX or CNL to 2 separate uops (not micro-fused: retire-slots = 2.0): a 512-bit load (into a 512-bit physical register?) and an ALU compare into a mask register.

But no architectural ZMM register is ever written explicitly, only read. So presumably at least an xsave/xrstor would clear any "dirty upper" condition, if one exists after this. (This won't happen on Linux unless there's an actual context switch to a different user-space process on that core, or the thread migrates; merely entering the kernel for interrupts won't cause it. So this is actually still testable under a mainstream OS, if you have the hardware; I don't.)

Possibilities I can imagine for SKX/CNL, and/or Ice Lake:

1. No long-term effect: max turbo recovers just as quickly as with vzeroupper.
2. Max turbo limited to 512-bit speed until a context switch. (xrstor or equivalent clears any dirty-upper state flag because the architectural regs are clean.)
3. Max turbo limited to 512-bit speed even across context switches, just like if you'd run vaddps zmm0,zmm0,zmm0. (The dirty-upper flag is saved and restored with the architectural state.) Plausible because xsaveopt does skip saving the upper 128 or 256 bits of vector regs if it's known they're clean.

I assume kmovq won't reduce max turbo or trigger any of the other 512-bit uop effects. The upper 32 bits of mask registers normally only come into play with AVX512BW for 64-byte vectors, but presumably they don't power-gate the top 32 bits of mask regs separately, only the top 32 bytes of vector regs. There are use-cases like using kshift or kunpack to deal with 64-bit chunks of masks (for load/store or transfer to integer regs) even if you only ever generate or use them 32 bits at a time with AVX512VL with YMM or XMM regs.

PS: Xeon Phi is not subject to these effects; it's not built to upclock beyond heavy AVX512 when running other code because it's made to run AVX512. And in fact vzeroupper is very slow and not recommended on KNL / KNM.

The fact that my example uses AVX512BW is really not relevant to the question, but all mainstream (not Xeon Phi) CPUs with AVX512 have AVX512BW. It just makes a nice real use-case, and the fact that using AVX512BW excludes KNL is irrelevant.

1 Answer

深蓝 · Answer 1 · 2021-10-17T03:06:52+0000

No, a vpcmpeqb into a mask register does not trigger slow mode if you use a zmm register as one of the comparands, at least on SKX.

This is also true of any other instruction (as far as I tested) which only reads the key 512-bit registers (the key registers being zmm0 - zmm15). For example, vpxord zmm16, zmm0, zmm1 also does not dirty the uppers: although it involves zmm0 and zmm1, which are key registers, it only reads from them, and the register it writes, zmm16, is not a key register.

I tested this using avx-turbo on a Xeon W-2104, which has a nominal speed of 3.2 GHz, L1 turbo license (AVX2 turbo) of 2.8 GHz, and a L2 license (AVX-512 turbo) of 2.4 GHz. I used the --dirty-upper option to dirty the uppers before each test with vpxord zmm15, zmm14, zmm15. This causes any test that uses any SIMD registers at all (including scalar SSE FP) to run at the slower 2.8 GHz speed, as shown in these results (look at the A/M-MHz column for cpu frequency):

CPUID highest leaf  : [16h]
Running as root     : [YES]
MSR reads supported : [YES]
CPU pinning enabled : [YES]
CPU supports AVX2   : [YES]
CPU supports AVX-512: [YES]
cpuid = eax = 2, ebx = 266, ecx = 0, edx = 0
cpu: family = 6, model = 85, stepping = 4
tsc_freq = 3191.8 MHz (from calibration loop)
CPU brand string: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
4 available CPUs: [0, 1, 2, 3]
4 physical cores: [0, 1, 2, 3]
Will test up to 1 CPUs
Cores | ID                  | Description                     | OVRLP1 | OVRLP2 | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | pause_only          | pause instruction               |  1.000 |  1.000 | 1.000  | 2256 |      0.99 |    3173 | 1.00       
1     | ucomis_clean        | scalar ucomis (w/ vzeroupper)   |  1.000 |  1.000 | 1.000  |  790 |      1.00 |    3192 | 1.00       
1     | ucomis_dirty        | scalar ucomis (no vzeroupper)   |  1.000 |  1.000 | 1.000  |  466 |      0.88 |    2793 | 1.00       
1     | scalar_iadd         | Scalar integer adds             |  1.000 |  1.000 | 1.000  | 3192 |      0.99 |    3165 | 1.00       
1     | avx128_iadd         | 128-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx256_iadd         | 256-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2793 |      0.87 |    2793 | 1.00       
1     | avx512_iadd         | 512-bit integer adds            |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_iadd_t       | 128-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8380 |      0.88 |    2793 | 1.00       
1     | avx256_iadd_t       | 256-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8380 |      0.88 |    2793 | 1.00       
1     | avx128_mov_sparse   | 128-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx256_mov_sparse   | 256-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx512_mov_sparse   | 512-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2794 |      0.87 |    2793 | 1.00       
1     | avx128_merge_sparse | 128-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx256_merge_sparse | 256-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx512_merge_sparse | 512-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_vshift       | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx256_vshift       | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx512_vshift       | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_vshift_t     | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5587 |      0.88 |    2793 | 1.00       
1     | avx256_vshift_t     | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5588 |      0.88 |    2793 | 1.00       
1     | avx512_vshift_t     | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_imul         | 128-bit integer muls            |  1.000 |  1.000 | 1.000  |  559 |      0.88 |    2793 | 1.00       
1     | avx256_imul         | 256-bit integer muls            |  1.000 |  1.000 | 1.000  |  559 |      0.88 |    2793 | 1.00       
1     | avx512_imul         | 512-bit integer muls            |  1.000 |  1.000 | 1.000  |  559 |      0.88 |    2793 | 1.00       
1     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx128_fma          | 128-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  698 |      0.88 |    2793 | 1.00       
1     | avx256_fma          | 256-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  698 |      0.87 |    2793 | 1.00       
1     | avx512_fma          | 512-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  698 |      0.88 |    2793 | 1.00       
1     | avx128_fma_t        | 128-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 4789 |      0.75 |    2394 | 1.00       
1     | avx256_fma_t        | 256-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 4790 |      0.75 |    2394 | 1.00       
1     | avx512_fma_t        | 512-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 2394 |      0.75 |    2394 | 1.00       
1     | avx512_vpermw       | 512-bit serial WORD permute     |  1.000 |  1.000 | 1.000  |  466 |      0.88 |    2793 | 1.00       
1     | avx512_vpermw_t     | 512-bit parallel WORD permute   |  1.000 |  1.000 | 1.000  | 1397 |      0.87 |    2793 | 1.00       
1     | avx512_vpermd       | 512-bit serial DWORD permute    |  1.000 |  1.000 | 1.000  |  931 |      0.87 |    2793 | 1.00       
1     | avx512_vpermd_t     | 512-bit parallel DWORD permute  |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00

The only tests that ran at full speed were "Scalar integer adds", which has no SSE/AVX register use at all, and "scalar ucomis (w/ vzeroupper)", which has an explicit vzeroupper before each test and so doesn't execute with dirty uppers.
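As a reading aid for the A/M-MHz column, here is a small C sketch that maps a measured frequency to the nearest of the three speeds quoted above for this Xeon W-2104 (3.2 GHz nominal, 2.8 GHz L1 license, 2.4 GHz L2 license). The enum and function names are hypothetical, and "nearest of three fixed values" is just an assumption that happens to suit this one chip. It is not how avx-turbo itself reports anything.

```c
#include <stdlib.h>

/* Speed levels quoted in the answer for the Xeon W-2104 (hypothetical names). */
enum speed_level {
    SPEED_NOMINAL = 0,  /* 3.2 GHz, no AVX license in effect */
    SPEED_L1      = 1,  /* 2.8 GHz, "AVX2 turbo" license      */
    SPEED_L2      = 2   /* 2.4 GHz, "AVX-512 turbo" license   */
};

/* Return the level whose quoted frequency is closest to the measured MHz.
 * The three constants are specific to this CPU model (assumption). */
enum speed_level classify_mhz(int measured_mhz)
{
    static const int level_mhz[3] = { 3200, 2800, 2400 };
    int best = 0;
    for (int i = 1; i < 3; i++) {
        if (abs(measured_mhz - level_mhz[i]) < abs(measured_mhz - level_mhz[best]))
            best = i;
    }
    return (enum speed_level)best;
}
```

With the numbers from the table, 3192 MHz (ucomis_clean) classifies as nominal, 2793 MHz (ucomis_dirty) as L1, and 2394 MHz (avx512_fma_t) as L2.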

Then, I changed the dirtying instruction to the vpcmpeqb k0, zmm0, [rsp] instruction you are interested in. The new results:

Cores | ID                  | Description                     | OVRLP1 | OVRLP2 | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | pause_only          | pause instruction               |  1.000 |  1.000 | 1.000  | 2256 |      1.00 |    3192 | 1.00       
1     | ucomis_clean        | scalar ucomis (w/ vzeroupper)   |  1.000 |  1.000 | 1.000  |  790 |      1.00 |    3192 | 1.00       
1     | ucomis_dirty        | scalar ucomis (no vzeroupper)   |  1.000 |  1.000 | 1.000  |  790 |      1.00 |    3192 | 1.00       
1     | scalar_iadd         | Scalar integer adds             |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx128_iadd         | 128-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3190 | 1.00       
1     | avx256_iadd         | 256-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx512_iadd         | 512-bit integer adds            |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_iadd_t       | 128-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 9575 |      1.00 |    3192 | 1.00       
1     | avx256_iadd_t       | 256-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 9577 |      1.00 |    3192 | 1.00       
1     | avx128_mov_sparse   | 128-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx256_mov_sparse   | 256-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx512_mov_sparse   | 512-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx128_merge_sparse | 128-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx256_merge_sparse | 256-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx512_merge_sparse | 512-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2793 |      0.88 |    2793 | 1.00       
1     | avx128_vshift       | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx256_vshift       | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx512_vshift       | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_vshift_t     | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 6386 |      1.00 |    3192 | 1.00       
1     | avx256_vshift_t     | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 6386 |      1.00 |    3192 | 1.00       
1     | avx512_vshift_t     | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2794 |      0.88 |    2793 | 1.00       
1     | avx128_imul         | 128-bit integer muls            |  1.000 |  1.000 | 1.000  |  638 |      1.00 |    3192 | 1.00       
1     | avx256_imul         | 256-bit integer muls            |  1.000 |  1.000 | 1.000  |  639 |      1.00 |    3192 | 1.00       
1     | avx512_imul         | 512-bit integer muls            |  1.000 |  1.000 | 1.000  |  559 |      0.88 |    2793 | 1.00       
1     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 3193 |      1.00 |    3192 | 1.00       
1     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2793 |      0.87 |    2793 | 1.00       
1     | avx128_fma          | 128-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  798 |      1.00 |    3192 | 1.00       
1     | avx256_fma          | 256-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  798 |      1.00 |    3192 | 1.00       
1     | avx512_fma          | 512-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  698 |      0.88 |    2793 | 1.00       
1     | avx128_fma_t        | 128-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 6384 |      1.00 |    3192 | 1.00       
1     | avx256_fma_t        | 256-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 5587 |      0.87 |    2793 | 1.00       
1     | avx512_fma_t        | 512-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 2394 |      0.75 |    2394 | 1.00       
1     | avx512_vpermw       | 512-bit serial WORD permute     |  1.000 |  1.000 | 1.000  |  466 |      0.87 |    2793 | 1.00       
1     | avx512_vpermw_t     | 512-bit parallel WORD permute   |  1.000 |  1.000 | 1.000  | 1397 |      0.88 |    2793 | 1.00
