sub rsp, 8 /
add rsp, 8 at the start/end of your function to re-align the stack to 16 bytes before your function does a
Or better push/pop a dummy register, e.g.
push rdx /
pop rcx, or a call-preserved register like RBP you actually wanted to save anyway. You need the total change to RSP to be an odd multiple of 8 counting all pushes and
sub rsp, from function entry to any
8 + 16*n bytes for whole number
On function entry, RSP is 8 bytes away from 16-byte alignment because the
call pushed an 8-byte return address. See Printing floating point numbers from x86-64 seems to require %rbp to be saved,
main and stack alignment, and Calling printf in x86_64 using GNU assembler. This is an ABI requirement which you used to be able to get away with violating when there weren't any FP args for printf. But not any more.
See also Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
gcc's code-gen for glibc scanf now depends on 16-byte stack alignment
AL == 0.
It seems to have auto-vectorized copying 16 bytes somewhere in
__GI__IO_vfscanf, which regular
scanf calls after spilling its register args to the stack1. (The many similar ways to call scanf share one big implementation as a back end to the various libc entry points like
I downloaded Ubuntu 18.04's libc6 binary package: https://packages.ubuntu.com/bionic/amd64/libc6/download and extracted the files (with
7z x blah.deb and
tar xf data.tar, because 7z knows how to extract a lot of file formats).
I can repro your bug with
LD_LIBRARY_PATH=/tmp/bionic-libc/lib/x86_64-linux-gnu ./bad-printf, and also it turns out with the system glibc 2.27-3 on my Arch Linux desktop.
With GDB, I ran it on your program and did
set env LD_LIBRARY_PATH /tmp/bionic-libc/lib/x86_64-linux-gnu then
layout reg, the disassembly window looks like this at the point where it received SIGSEGV:
│0x7ffff786b49a <_IO_vfscanf+602> cmp r12b,0x25 │
│0x7ffff786b49e <_IO_vfscanf+606> jne 0x7ffff786b3ff <_IO_vfscanf+447> │
│0x7ffff786b4a4 <_IO_vfscanf+612> mov rax,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ab <_IO_vfscanf+619> add rax,QWORD PTR [rbp-0x458] │
│0x7ffff786b4b2 <_IO_vfscanf+626> movq xmm0,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ba <_IO_vfscanf+634> mov DWORD PTR [rbp-0x678],0x0 │
│0x7ffff786b4c4 <_IO_vfscanf+644> mov QWORD PTR [rbp-0x608],rax │
│0x7ffff786b4cb <_IO_vfscanf+651> movzx eax,BYTE PTR [rbx+0x1] │
│0x7ffff786b4cf <_IO_vfscanf+655> movhps xmm0,QWORD PTR [rbp-0x608] │
>│0x7ffff786b4d6 <_IO_vfscanf+662> movaps XMMWORD PTR [rbp-0x470],xmm0 │
So it copied two 8-byte objects to the stack with
movhps to load and
movaps to store. But with the stack misaligned,
movaps [rbp-0x470],xmm0 faults.
I didn't grab a debug build to find out exactly which part of the C source turned into this, but the function is written in C and compiled by GCC with optimization enabled. GCC has always been allowed to do this, but only recently did it get smart enough to take better advantage of SSE2 this way.
Footnote 1: printf / scanf with
AL != 0 has always required 16-byte alignment because gcc's code-gen for variadic functions uses test al,al / je to spill the full 16-byte XMM regs xmm0..7 with aligned stores in that case.
__m128i can be an argument to a variadic function, not just
double, and gcc doesn't check whether the function ever actually reads any 16-byte FP args.