Analyzing Binaries with Hopper’s Decompiler
by abadidea - @0xabad1dea
(this is now also on the corporate blog with an expanded introduction: clickie )
No source code? No problem!
This is aimed at beginners in static analysis. The binary we examine is non-malicious and non-obfuscated, and is not run through the highest optimization settings of the compiler. We will start at line one and proceed linearly, just to get a feel for how to read decompiled code.
For this tutorial you will need a strong knowledge of C, only the slightest familiarity with assembly, the ability to understand Unix man pages, and Hopper Disassembler/Decompiler, which is $29. It is only for OSX, but it is literally cheaper to buy a Mac Mini and Hopper than the (naturally more mature and well-featured) x86 Hex-Rays decompiler. Heck, you could get an iMac and still have change for a coffee.
Hopper is a disassembler with a very-close-to-C “pseudocode” decompiler that does not roundtrip with your C compiler but is quite good for examining other people’s binaries. It supports both 32-bit and 64-bit executables for Windows and OSX (no Linux or iOS/ARM support yet). At the time of writing, the newest version is 2.2.0; the App Store is still holding it hostage for review, but you can also buy the app directly from the creator. It is still under active development, and (like every other decompiler) is not perfect, so learning how to spot when it goes awry is an important facet of making use of it.
(I am not affiliated with Hopper or its creator nor am I receiving a handsome sum to promote it. Honest.)
We will be looking at an ancient piece of C code for Unix called metamail, which was a mimetype helper that received base64-encoded attachments from email clients and opened the appropriate viewing program. Having reviewed it before, I strongly recommend that you do not use it on a production system (not that there is any use for it in this millennium). The source code is yonder, but it is so old that I had to change a few functions to return 0 instead of void to get llvm to compile it for OSX. I also ran the strip utility to remove any debug symbol names of internal functions that may be there, which would make it too easy (you will almost never see these in distributed closed-source programs). Download my metamail binary here.
When you open metamail with the “Read Executable” button in the upper left, you will be dumped here:

This is the prologue before we reach C’s main. Skip down to the first call instruction. Place your cursor over the sub_1000018b0 symbol - the address of the first function called, which presumably is main - and press enter to jump to it.
We are now faced with a wall of inscrutable assembly, but there is no need to panic. In the upper right hand corner of Hopper is the magical Pseudo Code button. It will pop up a C-like reconstruction of the function.
External library functions will show their names just like in source code, and are your waypoints that will make it much easier to understand what’s going on. Internal functions will be named “sub” for “subroutine” followed by their address. Variables are “var” followed by a number. Hopper does not attempt to do much type guessing, nor does it completely write out the use of registers, although it collapses as much as it can manage. In a 32-bit program you will see eax, ebx, ecx, edx, esi, edi, ebp, esp, and in a 64-bit one you will also see rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp and some unnamed registers that look like r<num>. In optimized code, you may also see partial registers ax, bx, cx, dx (16 bits) and ah/al, bh/bl, ch/cl, dh/dl (8 bits).
If you have never looked at assembly before, just know that the registers are named storage locations inside the processor itself, and compilers try to place variables into registers as much as possible because it is much faster than reading and writing RAM. esp/rsp refer to the top of the stack, ebp/rbp to the base of the current stack frame, and eax/rax is normally where a function’s return value is received. In the disassembly view of a function, you will normally see it move the stack pointer. This creates the storage area for the local variables. Rewriting these from relative offsets to the stack base to a name like var_40 is a nice thing that the decompiler view does for us.
The decompiler view does not support in-place text editing, and disassembly-view comments do not carry over to the decompilation view, so in the following screenshots I have copied the decompiled C to a code editor for annotating with explanatory comments. (You can also directly rename symbols in the disassembler view with ‘n’ when you have an idea what they’re for, but we won’t be doing that, to keep the symbol names consistent across screenshots.) Let’s focus on this block for now (first screenshot is of the disassembly, second is of the decompilation):

Note that var_144 is set to a function pointer. Switching back to disassembly view, we see that the sub_100001620 pointer is first loaded into rcx and then copied from there to var_144 and then from there to rsi. See how the decompiler abstracted that for us? It’s then passed as the second argument to signal so that it will be set up as the handler for signal 0x2 (stored in var_156), which according to the documentation is SIGINT. Place your cursor on the sub_100001620 reference and press enter to see what function this is. Go ahead and annotate that it’s a signal handler.
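For comparison, here is roughly what that call could have looked like before compilation. This is a hand-written sketch, not metamail's actual source: the names cleanup_handler, last_signal and install_sigint_handler are made up for illustration, and the handler body is a placeholder.

```c
#include <signal.h>

/* Records the last signal seen, only so the sketch has something observable. */
static volatile sig_atomic_t last_signal = 0;

/* Hypothetical stand-in for sub_100001620. */
static void cleanup_handler(int signum)
{
    last_signal = signum;
}

/* Returns 0 on success, -1 if signal() reported an error. */
int install_sigint_handler(void)
{
    /* In the unoptimized binary, this address travels through rcx and a
       stack slot (var_144) before landing in rsi, the second argument. */
    void (*handler)(int) = cleanup_handler;

    /* SIGINT is the 0x2 that the decompiler shows as a bare number. */
    return signal(SIGINT, handler) == SIG_ERR ? -1 : 0;
}
```

Calling install_sigint_handler() once at startup and then pressing Ctrl-C would run cleanup_handler with signum set to SIGINT.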

Before worrying about anything else, we see several standard functions: another call to signal, followed by getpid and kill on the result of getpid. So we already know that the process sends itself a signal from within its own signal handler.
Hopper does not yet support decoding function arguments (it currently shows “void” as a placeholder), but you can see by inspecting the function body that it assumes there is a value in rdi (in the x86-64 calling convention we are using, the first six integer arguments are passed in rdi, rsi, rdx, rcx, r8, r9, in that order). We know from signaling documentation that handlers receive the integer value of the signal they are handling. We already know that it at the very least will be processing SIGINT, but there’s no reason this handler can’t be attached to other signals as well (spoiler alert: this will happen). Work through the rest of the function assuming var_24 has the value of 0x2.
We see that the handler calls another subroutine. Follow through in disassembler view to sub_100009c50 with enter (backspace will take you to your previous function).

Uh-oh. Things are getting uglier.
What on earth is 0x10000F988? It is a value with a fixed address, i.e. a global variable. If you hover over it in disassembly view, you will see that it has a default value of zero. With it highlighted, press ‘x’ to bring up the cross-reference window. Aside from the current spot, where it is being checked for zero, it is set to 1 in sub_100009bd0, which is… well, let us not worry about it for now, as we will quickly get into a mess of function calls. We now know that 0x10000F988 is a boolean flag. If it’s set, we are doing an ioctl with standard output and standard input. Don’t worry too much about what exactly it’s doing, as ioctl takes arbitrary values and does arbitrary things pretty much by definition. (One would have to dig into the macros for the platform it was compiled against and do some bitwise math, which isn’t a productive use of our time until we’re sure that it’s important to the application logic and not just boilerplate, which, spoilers, it is.) What we should note, though, is that ioctl is a variadic function, and as such will confuse the decompiler and create orphan variables in the disassembly view that may not even show up in the decompiler view. In this case, it loses sight of the fact that 0x10000F83E and 0x10000F838 are global variables (each moved into rdx before its call), each being passed as the third argument to ioctl. If we cross-reference them, we see that they are set in another small function almost identical to this one except it passes a different ioctl value… and sets the boolean flag we found earlier. It is reasonable at this point to assume it is obtaining and later pushing back at exit time the properties of standard input and output to reset the terminal. (If we cheat and check the source code, this is exactly what it is doing, in function RestoreTtyState.)
After annotating the function with what we’ve learned, it is much easier to understand.
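If the save-then-restore pattern is unfamiliar, here is a minimal sketch of it. Note the swap: the binary uses raw ioctl requests, whereas this sketch uses the POSIX termios wrappers tcgetattr and tcsetattr, because they are portable and easier to read. All names are hypothetical, and setting the flag only when both reads succeed is my own choice, not something confirmed from the binary.

```c
#include <termios.h>
#include <unistd.h>

/* These two buffers play the role of the pair of globals passed to ioctl. */
static struct termios saved_stdin;
static struct termios saved_stdout;

/* Plays the role of the boolean flag at 0x10000F988. */
static int tty_state_saved = 0;

/* Counterpart of the "other small function": grab the current settings. */
void save_tty_state(void)
{
    if (tcgetattr(STDIN_FILENO, &saved_stdin) == 0 &&
        tcgetattr(STDOUT_FILENO, &saved_stdout) == 0) {
        tty_state_saved = 1;
    }
}

/* Counterpart of sub_100009c50: push the settings back if we have them. */
void restore_tty_state(void)
{
    if (tty_state_saved) {
        tcsetattr(STDOUT_FILENO, TCSANOW, &saved_stdout);
        tcsetattr(STDIN_FILENO, TCSANOW, &saved_stdin);
    }
}
```

The flag matters because the restore routine can be reached from a signal handler at any time, including before anything has been saved.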

Press backspace until we’re back in the signal handler at sub_100001620. We went on quite the little excursion just to determine that it’s pushing a stored terminal state. Picking up where we left off, we see that the handler takes the received signal number and reassigns it to a null handler pointer with signal(var_24, 0x0). This will cause the next raising of that signal to go to the default handler instead of a custom one. We can see it store the return value in var_8 and proceed to do exactly nothing with it. (This would probably disappear altogether on a high optimization setting.) Having changed the signal handler after calling the terminal reset function, it then resends the same signal to itself by calling kill on the value of getpid. The default handler will be raised with the propagated signal, which generally will result in process termination.
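Putting the pieces together, the handler can be reconstructed by hand as something like the following. This is a sketch written from the decompilation, not copied from anywhere: the function names are hypothetical, and restore_tty_state is reduced to a stub that just sets a flag.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes kill() under strict -std= modes */
#include <signal.h>
#include <unistd.h>

/* Stub standing in for the terminal reset routine at sub_100009c50. */
static volatile sig_atomic_t tty_restored = 0;

static void restore_tty_state(void)
{
    tty_restored = 1;
}

/* Hand reconstruction of the handler at sub_100001620. */
void cleanup_and_reraise(int signum)
{
    restore_tty_state();

    /* SIG_DFL is what shows up as 0x0 in signal(var_24, 0x0).
       The return value (var_8 in the decompilation) is ignored. */
    signal(signum, SIG_DFL);

    /* Resend the same signal to ourselves; the default action now applies. */
    kill(getpid(), signum);
}
```

Re-raising after restoring the default handler is a common idiom: the process still dies "from" the signal, so a parent shell sees the correct termination status.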

Isn’t that quite easy to understand? And we haven’t taken a single peek at any source code!
Alright, we will back up one more level back into main(). The next several lines are quite obvious: it’s just repeatedly setting up our signal handler for several different signals. Bla bla bla, skip down to the next chunk, which is more interesting:

getenv! The program checks for the METAMAIL_TMPDIR variable and, if found, points the global variable 0x10000f278 to the returned buffer. Otherwise, it points it to the constant string "/tmp". Next it checks MM_HEADERS. It gets the strlen of the buffer if it exists; note that in high optimization settings, strlen or strcpy is often inlined and you will not see an explicit function call to it, just a few lines of byte-copying right in the middle of the surrounding code. It mallocs a new buffer into var_160 that is the length of the MM_HEADERS string plus fifteen (0xf) extra bytes. Why? We don’t know yet.
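The first of those two checks is a very common C idiom, and it helps to recognize it on sight. A sketch of what it plausibly looked like, with tmp_root and init_tmp_root as made-up names for the global at 0x10000f278 and the code that sets it:

```c
#include <stdlib.h>

/* Hypothetical name for the global pointer at 0x10000f278. */
static const char *tmp_root;

void init_tmp_root(void)
{
    /* getenv returns NULL when the variable is not set. */
    const char *dir = getenv("METAMAIL_TMPDIR");

    /* Fall back to the constant string when there is no override. */
    tmp_root = dir ? dir : "/tmp";
}
```

In the decompilation, the conditional shows up as a test of the getenv result against zero followed by two different stores to the same global address.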
If the malloc fails and returns zero, the program goes to sub_100006130 passing the global pointer variable 0x10000f270. Follow the global pointer and it dumps you to another pointer; follow it again and you get the string "Out of memory!". It is not hard to guess what this function will do.

It flips out and quits.
Backing up one: assuming the malloc was successful, we are going to sprintf something, and hence it becomes obvious that the extra fifteen bytes were to make room for the static string "MM_HEADERS=" and the trailing null (twelve bytes in all, so there is a little slack). You may have noticed we are actually calling a wrapper that will land you in dynamic library purgatory if you try to follow it. We are not calling the real true sprintf; a theoretically more secure variation (int __sprintf_chk(char * str, int flag, size_t strlen, const char * format, ...);) has automatically been substituted by the compiler. The seemingly mangled call is actually intentional. However… it specifies a security flag of zero and a max string length of MAXINT, so I am not sure how that is supposed to help anyone. Just ignore these flags as noise. That being said, since sprintf is a variadic function, the variadic part (in this case, the buffer stored at 0x10000f930) fell off the end. If you are not intimately familiar with printf format strings, it’s time to brush up: you need to understand them to correctly reconstruct calls and/or spot vulnerabilities in incorrect printf usage.
We haven’t gotten very far into the program at all, but we’ve covered quite a few different C paradigms and how they appear in a disassembly. Hopefully binaries now appear less inscrutable and less magical, and you understand why reverse engineers laugh in the face of programmers who think no one will ever understand their awesome secret keygen without the source code.
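To make the arithmetic concrete, here is a sketch of the allocate-and-format step. Several things are assumptions on my part: the function name make_headers_env is invented, the "%s" in the format string is my guess at the part the decompiler dropped, and I have swapped in snprintf where the binary uses sprintf so that the sketch cannot overrun its buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "MM_HEADERS=<value>" string, or NULL if
   malloc fails. The caller is responsible for freeing the result. */
char *make_headers_env(const char *value)
{
    assert(value != NULL);

    /* 0xf extra bytes: 11 for "MM_HEADERS=", 1 for the NUL, 3 to spare. */
    size_t size = strlen(value) + 15;

    char *buf = malloc(size);
    if (buf == NULL) {
        return NULL;  /* the binary instead calls its fatal-error routine */
    }

    /* Assumed format string; the binary passes no explicit bound. */
    snprintf(buf, size, "MM_HEADERS=%s", value);
    return buf;
}
```

Calling make_headers_env("abc") yields the fourteen-character string MM_HEADERS=abc in an eighteen-byte buffer.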
I may do some followup articles examining particularly interesting pieces of this moderately large program. With some practice, you can learn to read decompiled code very quickly and learn to spot boilerplate that can be skipped over. Thanks for reading :)