
Explore the internals of applications built using native ahead-of-time (AOT) compilation.
The .NET 7 platform debuted a new deployment model: ahead-of-time (AOT) native compilation. When a .NET application is compiled natively using AOT, it becomes a self-contained native executable, equipped with its own minimal runtime to manage code execution.
The runtime is very small, and in .NET 8 you can create standalone C# applications that are less than 1 MB in size. For comparison, the size of a native AOT Hello World application in C# is closer to the size of a similar application in Rust than in Golang, while being one-sixth the size of a similar application in Java.
In addition, for the first time in history, .NET programs are distributed in a file format other than the one defined in ECMA-335 (i.e., instructions and metadata for the virtual machine): they ship as native code (PE/ELF/Mach-O file formats) with native data structures, just like, for example, C++ programs. This means that none of the .NET reverse engineering tools created in the last 20 years work with native AOT compilation.
Unfortunately, due to these two aspects (compactness and complexity of reverse engineering), native AOT compilation is popular among malware authors, as evidenced, for example, by these articles:
Here we will try to talk a little about how to adapt reverse engineering to new conditions.
Preparing Ghidra and native debuggers
I will repeat the idea from the introduction: native AOT compilation does not use the file formats that the CLR virtual machine uses to store a program and its metadata. Tools that read the VM file format are useless against native AOT executables. What remains are tools designed for reverse engineering arbitrary native code, in particular native debuggers (WinDbg/VS/x64dbg on Windows, lldb/gdb on Unix-like systems) and code analysis frameworks (Ghidra, IDA, Binary Ninja, etc.).
Since in native AOT mode the program is compiled into a single executable file without dependencies, the amount of available metadata is significantly reduced, but some metadata still remains (as, for example, in C++).
Consider a binary file
If you want to follow along, install the .NET 8 SDK (I’m using RC1, the latest version available at the time of this writing). You can skip the installer and simply download the ZIP archive, extract it, and add the location of the extracted files to your PATH.
Let’s start with the Hello World application with native AOT:
$ dotnet new console --aot -o TestApp
This creates a new TestApp directory and places a Hello World console application project there, configured for ahead-of-time compilation.
$ cd TestApp
$ dotnet publish
Once the publishing process is complete, you should see the binary in the bin\Release\net8.0\win-x64\publish folder (I did this on Windows, but it works the same way on Linux/Mac). The binary file is about 1.2 MB in size, and next to it is a file with native debugging information (PDB on Windows, DBG on Linux, and the platform equivalent on Mac). Let’s take a look at what we got.
$ dumpbin bin\Release\net8.0\win-x64\publish\TestApp.exe
Microsoft (R) COFF/PE Dumper Version 14.37.32824.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file bin\Release\net8.0\win-x64\publish\TestApp.exe
File Type: EXECUTABLE IMAGE
Summary
D000 .data
5E000 .managed
B000 .pdata
60000 .rdata
1000 .reloc
1000 .rsrc
64000 .text
1000 _RDATA
31000 hydrated
Nothing looks unusual at first glance. The .managed section contains managed code (in this case, “native code whose memory is managed by the garbage collector”). The hydrated section is uninitialized in the file, but it is populated with runtime data structures early in startup.
The remaining sections also look pretty standard: .text contains unmanaged code, in particular the garbage collector itself and any other native code that the user linked into the executable.
Running the strings command on the executable turns up some interesting things, in particular:
8.0.23.41904v8.0.0-rc.1.23419.4+92959931a32a37a19d8e1b1684edc6db0857d7de
(This is the version and commit hash of the dotnet/runtime repository from which the runtime in this executable was built; it may be useful to us later.)
Also pay attention to strings such as DivideByZeroException or get_CanWrite: if we’re lucky, we can recover useful information about types and methods from them.
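If the strings utility isn't at hand, the same scan is easy to approximate. The following is a minimal sketch, assuming we only care about runs of printable ASCII at least four characters long (the usual strings default); ExtractAsciiStrings is a name made up for this example, and the sample buffer stands in for the bytes of a real executable.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Returns every run of printable ASCII characters that is at least minLength long.
// Hypothetical helper for illustration; real tools also handle UTF-16 and other encodings.
static List<string> ExtractAsciiStrings(ReadOnlySpan<byte> data, int minLength = 4)
{
    var results = new List<string>();
    int start = -1; // start of the current printable run, or -1 if not inside one
    for (int i = 0; i <= data.Length; i++)
    {
        bool printable = i < data.Length && data[i] >= 0x20 && data[i] <= 0x7E;
        if (printable)
        {
            if (start < 0) start = i;
        }
        else if (start >= 0)
        {
            if (i - start >= minLength)
                results.Add(Encoding.ASCII.GetString(data.Slice(start, i - start)));
            start = -1;
        }
    }
    return results;
}

// Stand-in for File.ReadAllBytes("TestApp.exe"): two names separated by junk bytes.
byte[] sample = Encoding.ASCII.GetBytes("\0\0get_CanWrite\0ab\0DivideByZeroException\0");
foreach (string s in ExtractAsciiStrings(sample))
    Console.WriteLine(s); // prints get_CanWrite, then DivideByZeroException
```

On a real binary you would pass the file contents instead of the sample buffer and filter the results for names that look like types or property accessors.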
Debug memory allocation and virtual call
An interesting experiment to understand how everything works is to run a small piece of code. Let’s replace Program.cs with the following listing:
using System.Runtime.CompilerServices;
class Program
{
    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    static void Main() => Console.WriteLine(new Program().ToString());
    public override string ToString() => "Hello World!";
}
We run dotnet publish again and run the program under a debugger. Here we have the luxury of debugging symbols for the application; when researching malware, the chances of getting a PDB/DBG are very low. Let’s set a breakpoint on Main and see what the disassembly gives us:
00007FF730B8FD50 push rbp
00007FF730B8FD51 sub rsp,30h
00007FF730B8FD55 lea rbp,[rsp+30h]
00007FF730B8FD5A xor eax,eax
00007FF730B8FD5C mov qword ptr [rbp-8],rax
00007FF730B8FD60 lea rcx,[TestApp_Program::`vftable' (07FF730BCC688h)]
00007FF730B8FD67 call RhpNewFast (07FF730AF1DE0h)
00007FF730B8FD6C mov qword ptr [rbp-8],rax
00007FF730B8FD70 mov rcx,qword ptr [rbp-8]
00007FF730B8FD74 call TestApp_Program___ctor (07FF730B8FDB0h)
00007FF730B8FD79 mov rcx,qword ptr [rbp-8]
00007FF730B8FD7D mov rax,qword ptr [rbp-8]
00007FF730B8FD81 mov rax,qword ptr [rax]
00007FF730B8FD84 call qword ptr [rax+18h]
00007FF730B8FD87 mov rcx,rax
00007FF730B8FD8A call System_Console_System_Console__WriteLine_12 (07FF730B56190h)
00007FF730B8FD8F nop
00007FF730B8FD90 add rsp,30h
00007FF730B8FD94 pop rbp
00007FF730B8FD95 ret
The code looks pretty standard. There are some extra register/stack shuffles because we turned off optimizations for clarity. Symbolic names are visible only because we have debugging information. Without it, TestApp_Program::`vftable' would show up as just the bare address 07FF730BCC688h.
Let’s take a closer look:
00007FF730B8FD60 lea rcx,[TestApp_Program::`vftable' (07FF730BCC688h)]
00007FF730B8FD67 call RhpNewFast (07FF730AF1DE0h)
Here you can see how memory allocation works: we load the address of the vftable structure that describes the Program class and call the RhpNewFast helper to allocate an instance of the class on the garbage collector heap. Since .NET is open source, we can look at the details, but essentially the helper reads a field from the vftable structure to determine how much memory to allocate (the size of a Program instance). It carves that amount out of a pre-zeroed allocation buffer (“bump allocation”) and writes the vftable address into the first field of the newly allocated instance, thus “identifying” the piece of memory. If the allocation buffer runs out of space, a slower path is taken, but that case is less interesting here.
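To make the fast path concrete, here is a toy model of bump allocation. It is a sketch under loud assumptions, not the real RhpNewFast: the heap is a plain byte array, the instance size is passed in directly instead of being read from the vftable, and the 24-byte size and 64-byte buffer are illustrative values only.

```csharp
using System;
using System.Buffers.Binary;

// Toy model only. "heap" stands in for a pre-zeroed allocation buffer.
byte[] heap = new byte[64];
int allocPtr = 0; // next free offset (the "bump pointer")

// Returns the offset of the new object, or -1 when the buffer is exhausted
// (the point where a real runtime would fall back to a slower path).
int NewFast(ulong vftableAddress, int instanceSize)
{
    if (allocPtr + instanceSize > heap.Length)
        return -1;
    int obj = allocPtr;
    allocPtr += instanceSize; // bump the pointer past the new object
    // Write the vftable address into the first field to "identify" the memory.
    BinaryPrimitives.WriteUInt64LittleEndian(heap.AsSpan(obj), vftableAddress);
    return obj;
}

int first = NewFast(0x7FF730BCC688, 24);
int second = NewFast(0x7FF730BCC688, 24);
int third = NewFast(0x7FF730BCC688, 24); // does not fit in the 64-byte buffer
Console.WriteLine($"{first} {second} {third}"); // prints "0 24 -1"
```

The useful takeaway for reverse engineering is the last step of the fast path: every object on the heap starts with a pointer that names its type.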
RhpNewFast is written in assembly language and rarely changes, so you can probably locate it yourself even without debugging symbols.
After a new object instance is allocated, the instance constructor is called:
00007FF730B8FD70 mov rcx,qword ptr [rbp-8]
00007FF730B8FD74 call TestApp_Program___ctor (07FF730B8FDB0h)
Since we have debug symbols, we see the symbol name (TestApp_Program___ctor). Without symbols, this would just be a call to 07FF730B8FDB0h.
After the constructor completes, we make the virtual call to ToString. This is another interesting detail:
00007FF730B8FD81 mov rax,qword ptr [rax]
00007FF730B8FD84 call qword ptr [rax+18h]
First we dereference the object reference. As we saw during allocation, the first field of the object holds the address of the vftable structure, so that address is now in rax. We then call the address stored 0x18 bytes into the vftable structure. Presumably, this is where the address of the Program.ToString method is stored.
The vftable structure is a virtual method table, familiar to us from C++. It lists the addresses of all virtual methods implemented by the type. It also contains additional metadata, such as the size of an object instance, whether the type is a struct or a class, and so on. In the .NET world, the first three slots of the vftable are almost always the implementations of object.ToString, object.GetHashCode, and object.Equals (however, the order of these three depends on whole-program optimization).
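The two-instruction dispatch sequence can be modeled in a few lines. This is only a sketch: memory is a dictionary from address to 8-byte value, the object address and the ToString code address are invented for the example, and only the vftable address and the 0x18 offset come from the listing.

```csharp
using System;
using System.Collections.Generic;

const ulong VftableAddress = 0x7FF730BCC688;  // from the disassembly
const ulong ObjectAddress = 0x1000;           // hypothetical heap address
const ulong ToStringAddress = 0x7FF730B8FDC0; // hypothetical code address

// Sparse model of process memory: address -> 8-byte value stored there.
var memory = new Dictionary<ulong, ulong>
{
    [ObjectAddress] = VftableAddress,          // first field of the object
    [VftableAddress + 0x18] = ToStringAddress, // the slot the call goes through
};

// Maps an object reference and a slot offset to the code address that gets called.
ulong ResolveVirtualCall(ulong objectReference, ulong slotOffset)
{
    ulong vftable = memory[objectReference]; // mov rax, qword ptr [rax]
    return memory[vftable + slotOffset];     // call qword ptr [rax+18h]
}

Console.WriteLine($"0x{ResolveVirtualCall(ObjectAddress, 0x18):X}"); // prints 0x7FF730B8FDC0
```

When analyzing a real binary, the same lookup done against a memory dump tells you which function a given call [rax+NNh] lands in for a given type.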
Native AOT calls the vtable structure MethodTable or EEType; the two names are used interchangeably. You can learn more about it by looking at the code that writes it or the code that reads it. (Keep in mind that the CoreCLR VM also has a MethodTable data structure, but its layout is different.)
Although the MethodTable data structure is informative, the most useful information, particularly type names, is not always available. Other things we also don’t have:
- A list of all methods (we can at least enumerate virtual method addresses, as in C++ reverse engineering)
- A list of all fields (however, the garbage collector information that precedes the MethodTable tells you at which offsets within the object instance the garbage collector pointers are located, which is still better than nothing)
- Contents of assembly type
- Etc.
Dehydrated data
An additional complication is that the MethodTable data structures are placed in the hydrated segment of the executable, which is zero-initialized. Early in the startup sequence, a small piece of code fills this segment with the actual data. This makes it even harder for static analysis tools to interpret the contents of MethodTables unless they are dumped from memory.
Data dehydration was discussed here, and this pull request describes what happens better than I could in this article. Essentially, the data is stored in the file in a more compact form and is expanded at runtime. Static analysis tools could perhaps emulate this by locating the blob that contains the relevant data, starting from the RTR header. However, this file format is not part of any ABI and is subject to change, so such tools might have to be updated every year for new versions of .NET.
Reflection data structures
Although type names are not so easy to get at, they are still present in the strings dump, as we saw. Reflection keeps track of all type names because in .NET you can simply call object.GetType on any object and find out its name.
The data blob that maps MethodTable data structures to metadata descriptors is referenced from the RTR header, as is the metadata blob itself. In theory, you could use the metadata reading APIs to recover the symbolic names of all MethodTables in a program. However, none of these formats or APIs are intended for general use, and they will likely change with each major .NET release.
A sophisticated malware author could also publish the application with the IlcDisableReflection property set to true, which disables reflection and stops the compiler from generating any reflection metadata. This mode is not supported or documented outside the dotnet/runtime repository.
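For reference, this is roughly what such a setting looks like in a project file. It is a sketch of a fragment, not a complete project: PublishAot is what the --aot template option sets up, and since IlcDisableReflection is unsupported, its behavior may differ between .NET versions.

```xml
<!-- Fragment of a .csproj file; other properties omitted. -->
<PropertyGroup>
  <PublishAot>true</PublishAot>
  <!-- Unsupported switch: no reflection metadata is generated. -->
  <IlcDisableReflection>true</IlcDisableReflection>
</PropertyGroup>
```

A binary built this way is a plausible explanation when the strings dump contains hardly any type names.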
Stacktrace data structures
Similarly, we saw method names in the strings dump. The only reason they are there is that this information is needed to generate stack traces. When an exception is thrown, the developer can call ToString on it or access its StackTrace property to get a textual stack trace. To make that possible, native method addresses are mapped back to metadata so that names and signatures can be reconstructed. This data is generated in much the same way as reflection data, and the file formats are the same (they are also referenced from the RTR header).
using System.Runtime.CompilerServices;
class Program
{
    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    static void Main() => Console.WriteLine(new Program().ToString());
    public override string ToString() => throw new Exception();
}
(We’ve updated the previous program so that ToString throws an exception, which is left unhandled.)
Unhandled Exception: System.Exception: Exception of type 'System.Exception' was thrown.
at Program.ToString() + 0x24
at Program.Main() + 0x37
Note that the application was able to output the names and signatures of the methods involved. This approach will work even if we get rid of the debugging information by deleting the PDB/DBG file.
However, the user can set the StackTraceSupport property to false at the time of publishing their application to disable the generation of this data (stack trace data generation is enabled by default). Then the program will instead output:
Unhandled Exception: System.Exception: Exception of type 'System.Exception' was thrown.
at TestApp!<BaseAddress>+0x9dab4
at TestApp!<BaseAddress>+0x9da77
If the application was built this way, our chances of recovering method names or signatures are almost zero. Some method names may still be available in the reflection metadata, but usually very few methods are visible through reflection: the compiler aggressively removes them unless trimming analysis shows they are needed.
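Even without names, the offsets in such a trace are still useful, because they can be turned into addresses to inspect in a debugger or disassembler. Below is a minimal sketch: ParseFrameOffset is a helper invented for this example, the regular expression only covers the frame format shown above, and the image base is a made-up value that in practice comes from the debugger or the PE header.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Extracts the offset from a frame such as "at TestApp!<BaseAddress>+0x9dab4".
// Returns null when the line is not in that form. Hypothetical helper.
static ulong? ParseFrameOffset(string line)
{
    Match m = Regex.Match(line, @"!<BaseAddress>\+0x([0-9a-fA-F]+)");
    return m.Success
        ? ulong.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture)
        : null;
}

// Hypothetical load address of the module, for illustration only.
const ulong imageBase = 0x7FF730AF0000;

string[] frames =
{
    "at TestApp!<BaseAddress>+0x9dab4",
    "at TestApp!<BaseAddress>+0x9da77",
};
foreach (string frame in frames)
{
    // Offset plus image base gives the address to look up in the disassembly.
    if (ParseFrameOffset(frame) is ulong offset)
        Console.WriteLine($"0x{imageBase + offset:X}");
}
```

Each resulting address falls inside the function that was executing, so it is a convenient starting point for labeling functions by hand.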
Summary
To summarize, reverse engineering .NET binaries compiled natively using AOT requires the same skills as reverse engineering C++, for example. Some information is easy to find (stack unwinding data, some type information, etc.), but we lose the luxury of seeing types broken down into individual fields and tracking accesses to them. Fields essentially dissolve into memory access instructions (we can guess that something is probably an int if the field is read as 4 bytes). Method names disappear if stack trace data is disabled. Type names may also disappear if reflection is turned off.