[[Category:Research]] {{DISPLAYTITLE:nopl}}
During research for my [[Geode Repair]] projects, I found an amazing saga based around a CPU instruction. Nobody else has written this up from what I can see, so here's my take.

== Background ==
The AMD Geode LX800 CPU is an incredibly strange CPU. Despite being branded AMD, it's a descendant of the Cyrix MediaGX with a ton of features stapled on.

It implements the following instruction sets:

* Pentium (i586)
* Pentium Pro (i686)
* MMX
* Extended MMX
* 3DNow!
* Enhanced 3DNow!
* Extra AMD Geode instructions

Notably the machine does not support PAE (Physical Address Extension). This may be why the CPUID identifies as family 5 (Pentium), not family 6 (Pentium Pro). This is all according to my reading of the [https://www.amd.com/system/files/TechDocs/33234H_LX_databook.pdf AMD Geode LX Processors Data Book].
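As a rough illustration of what "identifies as family 5" means, here is a minimal sketch of how the family number is decoded from the EAX value returned by CPUID leaf 1. The function name and the two sample signature values are my own placeholders for illustration, not values taken from the data book.

```python
def cpuid_family(eax):
    """Decode the CPU family from the EAX value of CPUID leaf 1."""
    base_family = (eax >> 8) & 0xF        # bits 11:8
    extended_family = (eax >> 20) & 0xFF  # bits 27:20
    # The extended family field only counts when the base family is 0xF.
    if base_family == 0xF:
        return base_family + extended_family
    return base_family


# Illustrative signatures only (assumed values, not read from real hardware).
FAMILY_5_SIGNATURE = 0x000005A2
FAMILY_6_SIGNATURE = 0x00000619

for signature in (FAMILY_5_SIGNATURE, FAMILY_6_SIGNATURE):
    print(hex(signature), "-> family", cpuid_family(signature))
```

Software that picks between "i586" and "i686" code paths usually keys off this one number, which is why a CPU's choice of family matters so much in the rest of this story.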

In contrast, some CPUs like the VIA C3 marked their CPU as family 6 without implementing the full instruction set specified in the [https://stuff.mit.edu/afs/sipb/contrib/doc/specs/ic/cpu/x86/pentium-pro/vol2.pdf Pentium Pro Family Developer's Manual Volume 2], most notably the 'CMOV' instruction.

But in theory, you could run an i686 system on the Geode without PAE. In practice it's a bit more complicated.

== Undocumented i686 instructions ==
In 1995 Intel released the Pentium Pro and its developer manual documenting the entire instruction set.

By 1997 Christian Ludloff had created a [https://web.archive.org/web/19970411042846/http://www.sandpile.org/80x86/opcodes2.shtml map of 2 byte x86 opcodes]. This confirmed Intel's documentation, but included a few unknown opcodes: 0F 34, 0F 35, as well as 0F 18 through 0F 1F.

In 1997 those first two opcodes 0F 34 and 0F 35 were determined to be Pentium II SYSENTER and SYSEXIT instructions despite Intel only documenting this later in 1999. It turns out the instructions were available on the Pentium Pro but broken. See [https://www.os2museum.com/wp/sysenter-where-are-you/ SYSENTER, Where Are you?] for a good summary of this situation.

Hilariously enough, the Geode LX800 identifies as a Pentium but supports the SYSENTER and SYSEXIT instructions introduced in the Pentium Pro and finally made useful in the Pentium II. Weird.

In 1998 Christian Ludloff documented in his [https://web.archive.org/web/19981205142152/http://sandpile.org:80/80x86/opcodes2.shtml updated map of 2 byte x86 opcodes] that the 0F 18 through 0F 1F range of opcodes were hinting NOPs. The first of these, the 0F 18 opcode, maps to the PREFETCHh instructions. I believe this information was documented first in the [https://www.cs.cmu.edu/afs/cs/academic/class/15213-s01/docs/intel-opt.pdf Intel Architecture Optimization Reference Manual].

Later in 2003 Christian Ludloff clarified in an email thread [http://www.sandpile.org/post/msgs/20004129.htm Undocumented opcodes (HINT_NOP)] that these hinting NOPs were declared by Intel in their 1995 patent [https://patents.google.com/patent/US5701442A/en US5701442]. The idea behind this patent from my reading is that you can encode a program written in another ISA as a series of opcodes that are run as NOPs on older machines and the new ISA on a newer machine.

I'm not sure why, but third party x86 CPUs aside from AMD didn't implement these NOPs. Perhaps Intel kept this patent close to their heart? Or maybe it's just not worth spending silicon and research on NOPs that nobody used?

== The birth of multi-byte NOP ==
In 2006 Intel released an [https://ragestorm.net/downloads/25366719.pdf updated IA-32 Intel Architecture Software Developer's Manual]. In addition to the standard one byte NOP opcode, a multi-byte NOP opcode was documented. This NOP could be from 2 to 9 bytes long, much longer than a single byte NOP. This is a pretty useful instruction for aligning code and data in memory. The only weird thing about this instruction is that it was marked as available on Pentium Pro and newer machines despite only being documented in 2006. It turned out that Intel had recycled one of their hinting NOPs (0F 1F) as a new instruction.
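To make the 2 to 9 byte range concrete, here is a small sketch that pads a gap with the recommended NOP byte sequences as I know them from the NOP entry in Intel's manual. The table name and the padding function are my own; real assemblers choose their padding with more care than this greedy loop.

```python
# Recommended NOP encodings by length in bytes. Everything from 3 bytes up
# is the recycled 0F 1F opcode with a progressively longer memory operand.
INTEL_NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def pad_with_nops(length):
    """Fill `length` bytes using as few NOP instructions as possible."""
    if length < 0:
        raise ValueError("length must not be negative")
    padding = bytearray()
    while length > 0:
        chunk = min(length, 9)  # the longest table entry is 9 bytes
        padding += INTEL_NOPS[chunk]
        length -= chunk
    return bytes(padding)


print(pad_with_nops(11).hex(" "))
```

The point of the longer forms is that one 9 byte instruction is cheaper to decode and retire than nine separate 90 bytes.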

Someone pointed this out on the Intel forum in the thread [https://community.intel.com/t5/Software-Archive/Multi-byte-NOP-opcode-made-official/td-p/932580 Multi-byte NOP opcode made official]. They had tested and verified the feature and even pointed out that it works on AMD processors despite this instruction not being documented anywhere. They asked a few hard hitting questions to Intel:

* Why was this opcode secret?
* Why does it work on AMD CPUs?
* Why does AMD recommend an opcode of 66 66 66 90 for multi-byte NOPs?

Intel got back to them with a 'this information is Intel Confidential and would require an NDA to discuss' reply. NOPs are definitely serious business.

I'm going to just go out and guess that AMD recommended the 66 90 opcode series because their CPUs optimized it and it worked on older machines. Intel's solution, by contrast, seems to have been to recycle their trash.
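For comparison, the prefix-padded style is just the ordinary one byte NOP (90) with operand-size prefixes (66) stacked in front of it. A minimal sketch follows; the function name is mine, and the cap of three prefixes is an assumption based on the 66 66 66 90 form quoted in the forum thread, not a limit I have verified against AMD's documentation.

```python
def prefixed_nop(length):
    """Build a NOP of `length` bytes from 66 prefixes and a final 90."""
    # Assumption: at most three prefixes, matching the 66 66 66 90 example.
    if not 1 <= length <= 4:
        raise ValueError("this sketch only builds NOPs of 1 to 4 bytes")
    return b"\x66" * (length - 1) + b"\x90"


for size in range(1, 5):
    print(size, prefixed_nop(size).hex(" "))
```

Both the 66 prefix and the 90 opcode exist on every 32-bit x86 CPU, so this form never hits an unknown opcode, which fits my guess about why AMD preferred it.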

In 2007 Symantec wrote a blog post [https://web.archive.org/web/20070221081630/http://www.symantec.com/enterprise/security_response/weblog/2007/02/x86_fetchdecode_anomalies.html x86 Fetch-Decode Anomalies] showing that Intel's hinting NOP opcode they assigned to the multi-byte NOP actually will attempt to fetch memory (and even page fault) if you instruct it to. This isn't a problem in practice, but it gives more evidence this opcode isn't strictly a NOP.

== Linux fallout ==
In 2006 [https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=1596541188b1a4080ab7bce6578c09626193dfd0 PATCH: Add "nop memory" for i386/x86-64] was committed to the GNU Assembler. It added support for the 'nopl' and 'nopw' assembly instructions that map to multi-byte NOP code.

In 2007 [http://lkml.iu.edu/hypermail/linux/kernel/0709.2/2726.html x86: multi-byte single instruction NOPs] was committed to Linux. This added a set of 'P6 NOPs' built from the multi-byte NOP opcodes and used them on i686 or newer x86 CPUs. Which type of NOPs to use was decided at runtime, so running an i686 kernel on an i586 machine would not cause any issues with this. Strangely, on 64-bit systems the NOPs were only used if your CPU vendor was Intel.

In 2008 the Debian bugs [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=463606 Linux 2.6.24 fails to boot on MS Virtual PC 2007] and [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=464962 -686 build uses long noops, that are unsupported by Transmeta Crusoe, immediate crash on boot] were reported. After a lot of discussion, the following CPUs were reported to not support multi-byte NOPs:

* VIA C3 Nehemiah
* Transmeta Crusoe TM5800
* Microsoft Virtual PC 2007
* QEMU 0.9.1
* AMD Geode LX800

Interestingly enough the TM5800 reports its CPUID as family 5 like the LX800 does. But for the TM5800 the kernel promoted its status to i686 by changing the reported family at runtime.

The LX800 reports its family as 5, so shouldn't have failed due to the patch. I looked up other reports from the time and found the bug report [https://linux-kernel.vger.kernel.narkive.com/ICAshwKv/2-6-24-rc8-hangs-at-mfgpt-timer 2.6.24-rc8 hangs at mfgpt-timer] which seems to fit better, especially since the reporter didn't post any logs.

In parallel [https://linux.kernel.narkive.com/9RuAfy4c/bug-x86-kenel-won-t-boot-under-virtual-pc <nowiki>[BUG] x86 kenel won't boot under Virtual PC</nowiki>] was reported to the Linux mailing list. This has a bit of a more focused discussion about how to address this problem. To summarize the discussion:

* GNU Assembler shouldn't generate multi-byte NOPs if not all i686 CPUs support it
* This is all too much effort for NOPs

A flurry of fixes appeared in Linux 2.6.25 to stop the bleeding:

* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a7ef94e6889186848573a10c5bdb8271405f44de x86: do not promote TM3x00/TM5x00 to i686-class]
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7343b3b3a627eb30e24e921f004f659c8ebb91c5 x86: require family >= 6 if we are using P6 NOPs] 
* [https://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=959b3be64cab9160cd74532a49b89cdd918d38e9 x86: don't use P6_NOPs if compiling with CONFIG_X86_GENERIC]

This decided Transmeta Crusoe CPUs weren't i686, and limited use of multi-byte NOPs to i686 non-generic kernels.

A little while later in Linux 2.6.27 this amazing chain of patches happened:

* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b6734c35af028f06772c0b2c836c7d579e6d4dad x86: add NOPL as a synthetic CPU feature bit]
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=14469a8dd23677921db5e7354a602c98d9c6300f x86: disable static NOPLs on 32 bits]
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=28f7e66fc1da53997a545684b21b91fb3ca3f321 x86: prevent binutils from being "smart" and generating NOPLs for us]
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba0593bf553c450a03dbc5f8c1f0ff58b778a0c8 x86: completely disable NOPL on 32 bits]

Instead of only adding multi-byte NOPs to i686 and better machines during the kernel build, the plan was to only use multi-byte NOPs at runtime based on whether the CPU supported running multi-byte NOPs.

Unfortunately the developers found the GNU Assembler would add multi-byte NOPs all the time for i686 and Virtual PC would freeze during detection of multi-byte NOP support. In the end they just threw their hands up and said 'no multi-byte NOPs at all on 32-bit x86'.

As a result, the 'nopl' flag found in /proc/cpuinfo which would have shown if the CPU supported multi-byte NOPs is always hidden on 32-bit x86 and always shown on 64-bit x86, making it effectively useless.
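If you want to check the flag yourself, here is a minimal sketch that looks for 'nopl' on the flags line of /proc/cpuinfo text. The helper names and the sample text are made up for illustration; the sample is not a dump from a real Geode.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def has_nopl_flag(cpuinfo_text):
    return "nopl" in cpu_flags(cpuinfo_text)


# Made-up sample in the shape of a 32-bit /proc/cpuinfo entry.
SAMPLE_CPUINFO = """processor\t: 0
cpu family\t: 5
flags\t\t: fpu de pse tsc msr cx8 sep pge cmov mmx mmxext 3dnowext 3dnow
"""

print(has_nopl_flag(SAMPLE_CPUINFO))
```

On a real machine you would pass in the contents of /proc/cpuinfo instead of the sample. Keep in mind that, per the above, the answer tells you whether the kernel is 32-bit or 64-bit rather than anything about the CPU.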

== GNU Assembler confusion ==
In 2008 in the midst of the kernel fallout the GNU Assembler bug [https://sourceware.org/bugzilla/show_bug.cgi?id=6957 i386 NOPs must be derived from march not mtune] was reported.

When compiling code you can specify which CPU family to support (march) and which specific CPU to optimize for (mtune). The kernel developers found that the GNU Assembler wouldn't add multi-byte NOPs if you targeted the i686 family; instead, it only added them if you tuned for the i686 CPU (the Pentium Pro). This was confusing for a few reasons:

* Multi-byte NOPs weren't used on all CPUs that supported them by default
* Optimizing for a specific CPU could break compatibility with the family

Reading the bug report, you can see two schools of thought on what the problem is here and how to fix it.

The bug reporters believed:

* Multi-byte NOPs are not part of the i686 family
* Optimizing for the i686 CPU is adding Pentium Pro-only instructions
* The i686 architecture should not emit multi-byte NOPs

The GNU Assembler developers believed:

* Multi-byte NOPs are indeed part of the i686 family
* Optimizing for the i686 CPU is using i686 instructions
* Developers should build against the i586 architecture if they need wider compatibility

Something important to note here is that most software projects didn't ask GNU Assembler to target the i686 CPU, so this bug didn't really affect many projects in practice. A workaround was to optimize for the 'generic32' CPU which didn't use the nopl instructions.

== glibc fallout ==
In 2010 [https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=01f1f5ee Pass -mtune=i686 to assembler when compiling for i686] was committed to glibc. This told GNU Assembler to optimize for i686 CPUs (Pentium Pro), and as I mentioned in the previous section, this used multi-byte NOPs. 

A month later the Arch Linux bug [https://bugs.archlinux.org/task/19733 Update to glibc 2.12-2 on VIA C3 Nehemia makes system unusable] and Fedora bug [https://bugzilla.redhat.com/show_bug.cgi?id=579838 glibc not compatible with AMD Geode LX] were reported. glibc being a core component of most GNU systems meant updating completely crashed people's machines. Oops.

Unlike the Linux and GNU Assembler discussions, the Arch Linux and Fedora discussions were from the perspective of people building and packaging software. Finding out what was broken was a little tricky.

* Was it GNU Assembler for adding nopls to code?
* Was it glibc for tuning for i686 CPUs?
* Was it the Linux distros for running i686 binaries on non-i686 CPUs?

Things were a little tricky for Fedora here as they explicitly supported the AMD Geode LX800 as it was used in millions of laptops for the One Laptop per Child project. While the LX800 isn't i686, it ran i686 binaries fine. They would have to support not just i686 but i586 too for their entire distribution just to support this laptop.

Around this time the GNU Assembler developers committed [https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=2210942396dab942a86cb6777c705554b84ebb0e Don't generate multi-byte NOPs for i686.] This patch restricted generating multi-byte NOPs to Intel and AMD CPUs. Strangely enough the i586 AMD K6-2 CPU was marked as supporting multi-byte NOPs, which was fixed in the 2013 commit [https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=d56da83e58816c45a4bc70503776a6e62a66bf89 Remove CpuNop from CPU_K6_2_FLAGS].

After a few months of discussion and without a new GNU Assembler release, Arch and Fedora decided to just revert glibc's change. This at least fixed things and made i686 builds of their distributions run on CPUs they supported.

== Kernel emulation ==
In 2010 a kernel developer proposed the patch [https://lists.archive.carbon60.com/linux/kernel/1268554 AMD Geode NOPL emulation for kernel 2.6.36-rc2]. This patch would trap the invalid opcode exception that CPUs without multi-byte NOP support generate, emulate the instruction, then return to the program. This was a bit controversial.

Arguments for the patch:

* Distributions aren't going to care long term
* Proprietary software isn't easily fixable
* A similar patch was used to emulate CMOV instructions on i586 CPUs

Arguments against the patch:

* NOP isn't supposed to spend thousands of CPU cycles jumping to the kernel and back
* With the GNU Assembler fix distributions can avoid adding multi-byte NOPs
* The CMOV emulation patch wasn't accepted into Linux
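To give an idea of what "emulating" a NOP involves, the handler has to work out how long the faulting instruction is so it can skip over it. Here is a minimal sketch of that length decoding for the 0F 1F form. It is my own illustration and not code from the patch, and it assumes 32-bit addressing with no 67 address-size prefix.

```python
def nopl_length(code):
    """Return the byte length of a 0F 1F NOP at the start of `code`, or 0."""
    i = 0
    # Skip any 66 operand-size prefixes in front of the opcode.
    while i < len(code) and code[i] == 0x66:
        i += 1
    # Need the two opcode bytes plus a ModRM byte.
    if code[i:i + 2] != b"\x0f\x1f" or i + 2 >= len(code):
        return 0
    modrm = code[i + 2]
    i += 3
    mod, rm = modrm >> 6, modrm & 7  # the reg field is not checked here
    if mod == 3:
        return i                     # register operand, nothing follows
    if rm == 4:                      # a SIB byte follows the ModRM byte
        if i >= len(code):
            return 0
        sib = code[i]
        i += 1
        if mod == 0 and (sib & 7) == 5:
            i += 4                   # SIB with no base register: disp32
    elif mod == 0 and rm == 5:
        i += 4                       # absolute disp32
    if mod == 1:
        i += 1                       # disp8
    elif mod == 2:
        i += 4                       # disp32
    return i if i <= len(code) else 0


print(nopl_length(bytes.fromhex("660f1f840000000000")))
```

A handler along these lines would advance the instruction pointer by the returned length and resume the program. Doing all that on every padding instruction is exactly the overhead the first argument against the patch complains about.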

A bit later someone started the mailing list thread [http://lkml.iu.edu/hypermail/linux/kernel/1009.0/02825.html Promoting Crusoe and Geode Processors to i686 Status] which took a look at the overall situation for those two CPUs. It argued that both CPUs supported the full i686 instruction set and that NOPL was not standard i686. As far as I can tell not much was done in response to this.

In 2021 the patch [https://lkml.org/lkml/2021/6/26/132 x86: add NOPL and CMOV emulation] was proposed to the kernel again. As most 32-bit x86 distributions compiled for the i686 architecture this would let i586 or better CPUs run modern day 32-bit Linux distributions. This is especially useful for CPUs still manufactured and used today like Vortex86 CPUs. As it turns out, old machines don't just disappear. They just run out of date software.

Unfortunately a few days later [https://lkml.org/lkml/2021/6/29/687 the author followed up with some bad news]. The Pentium Pro introduced conditional floating point operations and when used on systems that don't support them they silently fail instead of throwing an unknown instruction exception. This makes it effectively impossible to fully emulate the i686 instructions on i586 systems.

== LLVM fallout ==
In 2010 [https://github.com/llvm/llvm-project/commit/c26ddccf3818ddcebc84e98b9310a2aa76692572 r96988] was committed to LLVM. It made the compiler unconditionally output multi-byte NOPs for 32-bit and 64-bit x86 code. This happened regardless of whether the target architecture supported them, so output could break on systems that weren't even supposed to support multi-byte NOPs, like i586 or i386.

In 2011 someone reported [https://lists.freebsd.org/pipermail/freebsd-current/2011-October/028588.html 9.0 RC1/Clang / illegal instruction (Signal 4) in gengtype while building cc_tools on i586.] to the FreeBSD mailing lists and in 2012 [https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=168253 clang crashes on Geode] was reported to the FreeBSD bug tracker.

After the first bug report, [https://bugs.llvm.org/show_bug.cgi?id=11212 X86AsmBackend::WriteNopData uses long nops unconditionally] was filed upstream to LLVM.

Later in 2012 [https://github.com/llvm/llvm-project/commit/5dd4ccb4020173a569bc54ba559232b5be2cef01 r164132] was committed to LLVM, adding a 'geode' CPU target to LLVM that didn't use multi-byte NOPs. This meant building for i686 without using multi-byte NOPs required building for Geode CPUs. That was not very useful for generic i686 releases or for i586 and older machines that weren't supposed to support multi-byte NOPs.

In 2014 [https://github.com/llvm/llvm-project/commit/1b8bfdaae3264efdba964321956965a6ab47540a r195679] was committed to LLVM to flat out avoid using multi-byte NOPs on i686, i586 and specific non-Intel and non-AMD CPU models that didn't support multi-byte NOPs.