
IL2CPP Reverse Engineering Part 2: Structural Overview & Finding the Metadata

December 27, 2020

[You can use Il2CppInspector to help automate the techniques outlined in this series]

In part 1 of this series we learned what IL2CPP is, how to set up a build environment, and compared the C#, IL, C++ and disassembly of a simple function.

In this article, you will learn:

  • an overview of the key files in an IL2CPP application from a reverse-engineering perspective
  • how an IL2CPP application loads the metadata we are interested in
  • how to find the application binary’s metadata by hand in a disassembler (x64 and ARM)
  • beginner-level disassembly navigation and tidying in IDA
  • how to interpret C++ function calls in assembly language

Pre-requisites:

  • Basic knowledge of high-level programming
  • Basic knowledge of disassembly (the article uses IDA but Ghidra works equally well)
  • Basic knowledge of what IL2CPP is – I recommend that you read part 1 first if you’re new to IL2CPP

Note: I chose Unity 2019.3.1 more or less at random for this walkthrough. Different versions vary slightly although the overall principles are the same.

The meat

IL2CPP applications are forged from two key components. First, there is the application code itself.

On Windows, the main executable of an IL2CPP application is essentially just a stub that loads UnityPlayer.dll and calls UnityMain. For an IL2CPP game, this will select Unity’s IL2CPP initialization path and load the main application binary; this is usually called GameAssembly.dll in the application’s root path but it can be placed elsewhere and renamed.

On Android, the application binary is libil2cpp.so, and on iOS everything is generally wrapped up into a single executable. Other platforms use different layouts, but all of the binaries can be analyzed in the same way, so the target platform doesn’t matter too much.

The application binary (which I’ll just call “the binary” from here on) is the output created by taking the regular Mono DLLs for the application (e.g. Assembly-CSharp.dll and its dependencies, as if it were shipped without IL2CPP) and running them through the IL2CPP transpiler. It is therefore the main target for reverse engineering, since it contains the actual application code.

Besides the application code itself, the binary also contains a vast sea of binary-specific metadata such as a pointer list to every C#-equivalent function, data about every type referenced by method code and so on. Many (most) binaries also expose the IL2CPP API – a large group of exported functions allowing you to query and modify data in the application at runtime – useful for dynamic analysis with a debugger. These APIs can be found in the export table and begin with the prefix il2cpp_.

The gravy

The other main file of interest for analysts is global-metadata.dat (“the metadata”). This file is a platform-independent data file created by IL2CPP containing all of the .NET metadata for the application. This includes definitions (including symbols) for all of the types, methods, properties, fields and so on for the application. Many of the structures within are similar to those used by the actual .NET runtime, but tweaked for IL2CPP. Serge Lidin provides a thorough treatise of the metadata in the excellent book Expert .NET 2.0 IL Assembler.

The metadata file is always a little-endian 32-bit width set of data, with tables linked via indices rather than pointers. Therefore in principle, if you are compiling the same application for multiple platforms, you only need one copy of global-metadata.dat and different executable binaries for each platform. In practice, builds are often customized with platform-specific functions for Windows, Android and so on.

The metadata file format is very simple. It always starts with the signature 0xFAB11BAF (little-endian) followed by 4 bytes containing the metadata version number. This is followed by a long list of offset/length pairs for the various tables of information, directly followed by the tables themselves. Which tables are actually present depends on the version number, and there will also be corresponding changes in the binary for different versions.
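
As a quick sanity check on a file you have pulled out of an application, the first eight bytes can be validated with a few lines of C++. This is only a minimal sketch based on the description above; the function names (readU32LE, readMetadataVersion) are my own for illustration and do not come from libil2cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Signature at offset 0 of global-metadata.dat, read as a little-endian uint32.
const std::uint32_t kMetadataMagic = 0xFAB11BAF;

// Read a little-endian 32-bit value regardless of the host's byte order.
inline std::uint32_t readU32LE(const std::uint8_t* p) {
    assert(p != nullptr);
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns true and stores the version number if the buffer starts with the
// expected signature; returns false if it is too short or the magic differs.
inline bool readMetadataVersion(const std::vector<std::uint8_t>& data,
                                std::uint32_t& version) {
    if (data.size() < 8) return false;
    if (readU32LE(data.data()) != kMetadataMagic) return false;
    version = readU32LE(data.data() + 4);
    return true;
}
```

On disk the signature appears as the byte sequence AF 1B B1 FA. If the check fails on a file that is supposed to be the metadata, that is a good hint it has been obfuscated or encrypted.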

Info: The first non-beta IL2CPP version in 2015 was 15, and at the time of writing we are on version 27 (Unity 2020.2). There was a period of several years where the version number remained at 24; however, multiple changes were made to the data format during that time, and the RE community has named them 24.1 – 24.4.

Version 24.2 (Unity 2019.1) brought substantial changes to the way the data is organized – moving much of the data from global lists to per-module (per-assembly) lists instead, with an extra table pointing to these lists for each assembly. Version 27 moves more global list data to per-assembly lists, and also moves a large block of data describing which methods use which types, other methods and string literals – that were previously in the metadata file – to the binary file.


Additionally, while the metadata and binary have typically moved in lockstep with version advances, a divergence occurred in Unity 2019.3.7-2019.4.14 where the binary’s metadata was changed but the metadata file remained the same. This version is numbered 24.3, but the metadata file format of 24.2 and 24.3 is the same – only the binary changed.

You may wonder why an overzealous publisher would want to ship their product with everything required to reconstruct all of the types and method prototypes in plain sight. Ultimately, this data is required due to .NET’s heavy reliance on reflection (known in other languages as runtime type information or RTTI) and attributed programming, and cannot be easily elided. As is the case with Unity apps built with the Mono scripting backend, some developers choose to use canned obfuscation software such as the popular BeeByte to arbitrarily redefine un-exported symbol names. These tools are useful as a roadblock to thwart the casual attacker, but for anyone used to determining the meaning of code from the code itself rather than its symbols, such obfuscators have limited effect.

On its own, the metadata file can be used to re-construct the entire structure of the application as it was when it was written in C# – with more or less everything except for the actual source code to the methods themselves – however this gives us zero insight into the structure of the actual binary we’re analyzing. To do this, we need to combine the metadata file with the specific binary file we’re looking at, and to do that, we need to first find the location of the binary’s own metadata structures. This is crucial for successful reverse engineering, and is our goal for today.

Starting with a clean slate

As you work through the text below, I recommend you create an empty Unity project targeting IL2CPP and build it as described in part 1 so that you can look at each referenced function in the source code as you follow along. Don’t be afraid to open up the files and explore!

  • global-metadata.dat is usually located at <appname>_Data/il2cpp_data/Metadata/global-metadata.dat regardless of target platform – you can examine it easily with a hex editor like HxD
  • The source code for libil2cpp can be found at C:\Program Files\Unity\Hub\Editor\20xx.x.x\Editor\Data\il2cpp\libil2cpp if you have installed Unity via Unity Hub in the default location on Windows
  • il2cpp.exe – which is the transpiler itself – can be found in the build folder located above the previous folder. There is no source code for this, however it is trivially browsed with your favourite .NET Decompiler and is not obfuscated
  • The actual C++ generated by il2cpp.exe can be found in the il2cppOutput folder of your project’s build output

Tip: When you build the project, tick Copy PDB Files and Development Build. This will generate symbol files for all of the functions in the binary. IDA will automatically load these files, making it much easier to navigate the disassembly.

How metadata is loaded

The key parts of the startup sequence from a reverse-engineering standpoint are shown in Figure 1. The sequence is convoluted but not particularly difficult to trace.

Figure 1. IL2CPP startup sequence for loading metadata

IL2CPP generates two files in the root of the C++ output called Il2CppCodeRegistration.cpp and Il2CppMetadataRegistration.c. These files define the two key top-level binary metadata tables we are looking for. These tables contain pointers to all of the other binary metadata tables, and allow us to correlate the contents of the metadata file to concrete function addresses and used type references in the binary.

When a DLL (or .so) file loads, it may execute one or more startup functions before returning control to the caller. Il2CppCodeRegistration.cpp generates just such a startup function, which looks something like this:

void s_Il2CppCodegenRegistration()
{
	il2cpp_codegen_register (&g_CodeRegistration, &g_MetadataRegistration, &s_Il2CppCodeGenOptions);
}

When the binary loads, a pointer to this function is passed to il2cpp::utils::RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup() (snappy name I know) which stores it in a function table for later use.

Once control is returned to the UnityPlayer engine, it calls the API export il2cpp_init, which eventually leads to a call to il2cpp::utils::RegisterRuntimeInitializeAndCleanup::ExecuteInitializations(). This function calls every function stored in the previously mentioned function table, thereby calling s_Il2CppCodegenRegistration() in the process. Notice that this hooking mechanism also enables third-party developers to inject their own initialization – or decryption – code if they require it.
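
The register-now, run-later pattern is easier to recognize in a disassembly once you have seen it in miniature. The following is a simplified model of the idea and not the libil2cpp source: the class name RuntimeInitRegistry, the stand-in callback fakeCodegenRegistration and the counter are all made up for this sketch.

```cpp
#include <cassert>
#include <vector>

using CallbackFunction = void (*)();

class RuntimeInitRegistry {
public:
    // Runs from a static initializer, i.e. while the library is loading.
    // It only stores the callback; nothing is executed yet.
    explicit RuntimeInitRegistry(CallbackFunction initialize) {
        table().push_back(initialize);
    }

    // Called later, once the host decides to start the runtime.
    static void executeInitializations() {
        for (CallbackFunction fn : table()) fn();
    }

private:
    // A function-local static avoids depending on the initialization order
    // of globals in different translation units.
    static std::vector<CallbackFunction>& table() {
        static std::vector<CallbackFunction> callbacks;
        return callbacks;
    }
};

// Stand-in for the generated registration function.
static int g_registrationCalls = 0;
static void fakeCodegenRegistration() { ++g_registrationCalls; }

// A global object: the compiler emits a small init function that calls the
// constructor, and a pointer to that init function lands in the init table.
static RuntimeInitRegistry g_hook(fakeCodegenRegistration);
```

The key point for analysis is the two-stage split: the constructor call is what you will find referenced from the binary's startup code, while the callback itself only runs when executeInitializations() is eventually reached.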

Via a long-winded sequence of nested function calls, s_Il2CppCodegenRegistration() eventually calls il2cpp::vm::MetadataCache::Register() which actually stores the pointers to Il2CppCodeRegistration and Il2CppMetadataRegistration and performs some pre-processing.

Once this dance is completed, control returns and il2cpp::vm::MetadataCache::Initialize() is called. This function is responsible for calling the loader that fetches global-metadata.dat, however the file is not all loaded into memory at once – rather, it is mapped for demand paging via mmap.

This has a couple of consequences. First, it means you can’t just dump the memory of a running application to retrieve its entire metadata file should it be obfuscated or encrypted without some trickery. Second, it means that file accesses to the metadata may appear at seemingly nonsensical code locations if you are looking at a stack trace.
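
To make the demand-paging behaviour concrete, here is a minimal sketch of mapping a file read-only on a POSIX system (Windows uses CreateFileMapping and MapViewOfFile for the same job). It is not the libil2cpp loader; the MappedFile struct and both function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFile {
    const unsigned char* data;  // nullptr when the mapping failed
    std::size_t size;
};

// Map a whole file read-only. No file contents are read here; the kernel
// pages data in from disk the first time each page is touched.
inline MappedFile mapFileReadOnly(const char* path) {
    MappedFile result = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            result.data = static_cast<const unsigned char*>(p);
            result.size = static_cast<std::size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return result;
}

inline void unmapFile(MappedFile& f) {
    if (f.data != nullptr) {
        munmap(const_cast<unsigned char*>(f.data), f.size);
        f.data = nullptr;
        f.size = 0;
    }
}
```

Any later read through data, however innocent it looks in the source, can be the access that triggers the actual disk I/O.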

Here is a stack trace using ProcMon from when the metadata file is first memory-mapped:

Here is one from later on:

In the second screenshot, reading a string from an array causes the Windows kernel to demand page the metadata file to find the string, since it is actually in the file on disk. I’ll talk more about ProcMon’s role in IL2CPP reverse engineering in a later article.

Examining the binary

With this knowledge in hand, we should be able to fire up our favourite disassembler (or maybe just the one that isn’t grotesquely overpriced for hobbyist users), load up the binary and PDB and take a look. The object of the game here is to start by looking at our known metadata location and work our way back through the chain of references and function calls to the starting point, so that we understand what we’re looking for in a real application.

Let’s navigate to g_MetadataRegistration, which is the Il2CppMetadataRegistration table (one of the two tables we are looking for) just to see what it looks like (IDA: press G and type the symbol name then Enter).

.rdata:0000000181D9DE30 g_MetadataRegistration db 0AFh
.rdata:0000000181D9DE31                 db  21h ; !
.rdata:0000000181D9DE32                 db    0
.rdata:0000000181D9DE33                 db    0
.rdata:0000000181D9DE34                 db    0
.rdata:0000000181D9DE35                 db    0
.rdata:0000000181D9DE36                 db    0
.rdata:0000000181D9DE37                 db    0
.rdata:0000000181D9DE38                 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40                 db  97h ; —
.rdata:0000000181D9DE41                 db    7
.rdata:0000000181D9DE42                 db    0
.rdata:0000000181D9DE43                 db    0
.rdata:0000000181D9DE44                 db    0
.rdata:0000000181D9DE45                 db    0
.rdata:0000000181D9DE46                 db    0
.rdata:0000000181D9DE47                 db    0
.rdata:0000000181D9DE48                 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50                 db 0BFh ; ¿
.rdata:0000000181D9DE51                 db  2Dh ; -
.rdata:0000000181D9DE52                 db    0
.rdata:0000000181D9DE53                 db    0

We can see some pointers and counts. It’s a bit messy, but if we want to see it more clearly we can just click on a line in IDA and tap D repeatedly to toggle each item between 1, 2, 4 and 8 bytes. If we do that for each item, we get something more readable:

.rdata:0000000181D9DE30 g_MetadataRegistration dq 21AFh
.rdata:0000000181D9DE38                 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40                 dq 797h
.rdata:0000000181D9DE48                 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50                 dq 2DBFh
.rdata:0000000181D9DE58                 dq offset s_Il2CppGenericMethodFunctions
.rdata:0000000181D9DE60                 dq 4DABh
.rdata:0000000181D9DE68                 dq offset g_Il2CppTypeTable
.rdata:0000000181D9DE70                 dq 306Dh
.rdata:0000000181D9DE78                 dq offset g_Il2CppMethodSpecTable
.rdata:0000000181D9DE80                 dq 0EE8h
.rdata:0000000181D9DE88                 dq offset g_FieldOffsetTable
.rdata:0000000181D9DE90                 dq 0EE8h
.rdata:0000000181D9DE98                 dq offset g_Il2CppTypeDefinitionSizesTable
.rdata:0000000181D9DEA0                 dq 4195h
.rdata:0000000181D9DEA8                 dq offset g_MetadataUsages

If we compare this to Il2CppMetadataRegistration.c in our IL2CPP output, we see that it matches up nicely:

const Il2CppMetadataRegistration g_MetadataRegistration = 
{
	8623,
	s_Il2CppGenericTypes,
	1943,
	g_Il2CppGenericInstTable,
	11711,
	s_Il2CppGenericMethodFunctions,
	19883,
	g_Il2CppTypeTable,
	12397,
	g_Il2CppMethodSpecTable,
	3816,
	g_FieldOffsetTable,
	3816,
	g_Il2CppTypeDefinitionSizesTable,
	16789,
	g_MetadataUsages,
};

(the numbers are the same, it’s just that they are shown in hexadecimal in the disassembly and regular decimal in the C code)
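
If you want to convince yourself of both the layout and the number conversions, the compiler can do the checking. The struct below is an illustrative mirror of what the disassembly shows, assuming 32-bit counts followed by pointers; the field names are my own (derived from the table symbols) and every table pointer is flattened to const void*, so it is not the real libil2cpp declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Count/pointer pairs in the same order as the listing above.
struct MetadataRegistrationLayout {
    std::int32_t genericTypesCount;        const void* genericTypes;
    std::int32_t genericInstCount;         const void* genericInstTable;
    std::int32_t genericMethodCount;       const void* genericMethodFunctions;
    std::int32_t typeCount;                const void* typeTable;
    std::int32_t methodSpecCount;          const void* methodSpecTable;
    std::int32_t fieldOffsetCount;         const void* fieldOffsetTable;
    std::int32_t typeDefinitionSizesCount; const void* typeDefinitionSizesTable;
    std::int32_t metadataUsagesCount;      const void* metadataUsages;
};

constexpr bool kIs64Bit = sizeof(void*) == 8;

// On a 64-bit target each count is padded so the following pointer is
// 8-byte aligned, giving 16 slots of 8 bytes: 0x181D9DE30..0x181D9DEAF.
static_assert(!kIs64Bit || sizeof(MetadataRegistrationLayout) == 0x80,
              "16 fields of 8 bytes each on 64-bit targets");
static_assert(!kIs64Bit || offsetof(MetadataRegistrationLayout, genericTypes) == 8,
              "first pointer sits at offset 8, as in the listing");

// Hexadecimal values from the disassembly against decimal values from the C file.
static_assert(0x21AF == 8623 && 0x797 == 1943 && 0x2DBF == 11711 &&
              0x4DAB == 19883 && 0x306D == 12397 && 0xEE8 == 3816 &&
              0x4195 == 16789,
              "disassembly and generated source agree");

// Number of pointer-sized slots the struct occupies.
inline std::size_t slotCount() {
    return sizeof(MetadataRegistrationLayout) / sizeof(void*);
}
```

The alternating small-number, pointer, small-number, pointer rhythm is the visual fingerprint to look for when there are no symbols to help.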

So this is what we’ll be looking for in a real application, although there will be no symbols of course. If we click on the g_MetadataRegistration label and press X to open cross-references (xrefs), we can see everywhere in the binary that references this address. There is only one xref, and it takes us to:

.text:00000001803EC9D0 ?s_Il2CppCodegenRegistration@@YAXXZ proc near
.text:00000001803EC9D0                 push    rdi
.text:00000001803EC9D2                 sub     rsp, 20h
.text:00000001803EC9D6                 mov     rdi, rsp
.text:00000001803EC9D9                 mov     ecx, 8
.text:00000001803EC9DE                 mov     eax, 0CCCCCCCCh
.text:00000001803EC9E3                 rep stosd
.text:00000001803EC9E5                 lea     r8, unk_181D53A70 ; struct Il2CppCodeGenOptions *
.text:00000001803EC9EC                 lea     rdx, g_MetadataRegistration ; struct Il2CppMetadataRegistration *
.text:00000001803EC9F3                 lea     rcx, ?g_CodeRegistration@@3UIl2CppCodeRegistration@@B ; struct Il2CppCodeRegistration *
.text:00000001803EC9FA                 call    ?il2cpp_codegen_register@@YAXQEBUIl2CppCodeRegistration@@QEBUIl2CppMetadataRegistration@@QEBUIl2CppCodeGenOptions@@@Z
.text:00000001803EC9FF                 add     rsp, 20h
.text:00000001803ECA03                 pop     rdi
.text:00000001803ECA04                 retn
.text:00000001803ECA04 ?s_Il2CppCodegenRegistration@@YAXXZ endp

Here we see a function prologue from 1803EC9D0-1803EC9E4 which we ignore, three LEAs which load the addresses of our wanted structs into registers followed by a call to il2cpp_codegen_register, and finally the function epilogue – which we also ignore.

This is the compiled version of the function IL2CPP generated for us in Il2CppCodeRegistration.cpp:

void s_Il2CppCodegenRegistration()
{
	il2cpp_codegen_register (&g_CodeRegistration, &g_MetadataRegistration, &s_Il2CppCodeGenOptions);
}

64-bit binaries on Windows use the x64 calling convention, which states that the first four arguments to a function will be passed in RCX, RDX, R8 and R9. While it is obvious with our symbols which struct is which, there is no guarantee that the compiler will generate code which always loads the registers in this order, and indeed it frequently doesn’t. However, since we know the correct order of the arguments to il2cpp_codegen_register, we know that – in other applications – RCX will always be a pointer to Il2CppCodeRegistration and RDX will always be a pointer to Il2CppMetadataRegistration.

Tip: If you are disassembling ARM binaries, ARMv7’s calling convention passes the first four arguments (from left to right) in R0-R3, and ARMv8 for 64-bit platforms passes the first eight in X0-X7.
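
As a memory aid, the three conventions mentioned so far can be folded into one small lookup. This is just a sketch covering integer and pointer arguments (floating-point arguments travel in different registers); the Abi enum and the argumentRegister function are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Abi { WindowsX64, Arm32, Arm64 };

// Register carrying the Nth integer/pointer argument (0-based), or "stack"
// once the ABI has run out of argument registers.
inline std::string argumentRegister(Abi abi, std::size_t index) {
    static const char* const kWindowsX64[] = {"RCX", "RDX", "R8", "R9"};
    switch (abi) {
    case Abi::WindowsX64:
        if (index < 4) return kWindowsX64[index];
        break;
    case Abi::Arm32:
        if (index < 4) return "R" + std::to_string(index);
        break;
    case Abi::Arm64:
        if (index < 8) return "X" + std::to_string(index);
        break;
    }
    return "stack";
}
```

For il2cpp_codegen_register, index 0 is the Il2CppCodeRegistration pointer and index 1 is the Il2CppMetadataRegistration pointer, so argumentRegister(Abi::Arm64, 1) tells you to watch X1 on a 64-bit Android binary.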

Let’s step back again by looking at the xrefs to s_Il2CppCodegenRegistration (click on the label and press X). We might expect a pointer to this function to be referenced by one of the startup hooks we discussed re: Figure 1, and sure enough this is what we find:

.text:0000000180040980 code_reg_hook   proc near
.text:0000000180040980                 push    rdi
.text:0000000180040982                 sub     rsp, 20h
.text:0000000180040986                 mov     rdi, rsp
.text:0000000180040989                 mov     ecx, 8
.text:000000018004098E                 mov     eax, 0CCCCCCCCh
.text:0000000180040993                 rep stosd
.text:0000000180040995                 xor     r9d, r9d        ; int
.text:0000000180040998                 xor     r8d, r8d        ; void (*)(void)
.text:000000018004099B                 lea     rdx, ?s_Il2CppCodegenRegistration@@YAXXZ ; void (*)(void)
.text:00000001800409A2                 lea     rcx, unk_181FC626B ; this
.text:00000001800409A9                 call    ??0RegisterRuntimeInitializeAndCleanup@utils@il2cpp@@QEAA@P6AXXZ0H@Z
.text:00000001800409AE                 add     rsp, 20h
.text:00000001800409B2                 pop     rdi
.text:00000001800409B3                 retn
.text:00000001800409B3 code_reg_hook   endp

Indeed we find a function which passes the address of s_Il2CppCodegenRegistration as an argument to RegisterRuntimeInitializeAndCleanup, just as we expected!

This code snippet merits further explanation for newcomers to disassembly. First, you might notice some weird xor instructions where a register is XOR’ed with itself. This is a standard compiler idiom for setting a register to zero – if you XOR a number with itself, you always get zero. You can do mov r8d, 0 instead, but that encodes to 6 bytes (a REX prefix, the opcode and a 32-bit immediate), whereas the xor only uses 3 bytes and is never slower; most modern x64 CPUs recognize it as a special zeroing idiom.

Secondly, notice here how a this pointer is passed as the first argument in RCX. Let’s look at the function prototype from the IL2CPP source code in libil2cpp/utils/RegisterRuntimeInitializeAndCleanup.cpp:

RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup(CallbackFunction Initialize, CallbackFunction Cleanup, int order)

There are only three arguments, but the assembly code passes four. This is because in machine code, there are no classes, and all functions are global. Therefore, to know which object (class instance) is being used, every class method must receive a pointer to the instance. By convention, this is always passed as the first argument, and in C++ source code it is completely hidden from view. Therefore, we pass this in RCX and the first declared argument – Initialize – in RDX.
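
The hidden argument is easy to demonstrate in plain C++. In this sketch (Counter and counterAdd are hypothetical names, unrelated to IL2CPP), the member function and the free function do the same work; the second one simply spells out the instance pointer that the compiler adds for you in the first.

```cpp
#include <cassert>

struct Counter {
    int value;

    // What the C++ source shows: one declared parameter.
    int add(int amount) {
        value += amount;  // really this->value
        return value;
    }
};

// Roughly what the machine code looks like: a global function whose first
// argument is the instance pointer, followed by the declared arguments.
inline int counterAdd(Counter* self, int amount) {
    assert(self != nullptr);
    self->value += amount;
    return self->value;
}
```

Calling c.add(5) and counterAdd(&c, 5) therefore load the same two argument registers: the address of c first, then 5.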

To make it easier to find again, I gave this function the name code_reg_hook. To rename a function, click on its label and press N.

Finally, let’s step back one more time. This time, there are two xrefs:

The second one is a RUNTIME_FUNCTION struct in the .pdata section and you can safely ignore it. This is a list of structs Windows uses for exception handling and is not of interest to us. Clicking on the first item, we see it is part of a long list of function pointers:

; ...
.rdata:0000000181870BB0                 dq offset sub_1800407C0
.rdata:0000000181870BB8                 dq offset sub_180040800
.rdata:0000000181870BC0                 dq offset sub_1800405D0
.rdata:0000000181870BC8                 dq offset sub_180040660
.rdata:0000000181870BD0                 dq offset sub_180040840
.rdata:0000000181870BD8                 dq offset sub_1800408C0
.rdata:0000000181870BE0                 dq offset sub_180040900
.rdata:0000000181870BE8                 dq offset ??__E?wndTop@CWnd@@2V1@B@@YAXXZ
.rdata:0000000181870BF0                 dq offset sub_180040880
.rdata:0000000181870BF8                 dq offset code_reg_hook
.rdata:0000000181870C00                 dq offset sub_180040B40
.rdata:0000000181870C08                 dq offset sub_1800412C0
.rdata:0000000181870C10                 dq offset sub_180040E00
.rdata:0000000181870C18                 dq offset sub_180040C00
.rdata:0000000181870C20                 dq offset sub_180041000
.rdata:0000000181870C28                 dq offset sub_180040BC0
.rdata:0000000181870C30                 dq offset sub_180040B80
; ...

This is in fact what we hoped for. Remember how we discussed earlier that a library can execute initialization functions when it starts up? This is precisely that list! In a C++ application, this can – depending on how it has been compiled – include every static constructor and dynamic initializer in the application – including those in the standard library – which creates a very long list indeed. It’s an important list though, because almost every binary file with executable code has one, and it serves as our starting point: the first breadcrumb in the trail to the metadata.

Info: The init function table has a different location depending on what kind of files you are working with.

For PE files (Windows EXEs and DLLs), the init table is in the .rdata section right after the IAT (Import Address Table), which comes at the start of the section. An easy way to find it in some files is to search for __guard_check_icall_fptr and scroll down until you find a null (zero) pointer. The init table starts at the next address.

For ELF files (Linux, Android etc.), the table is stored in the .init_array section (and finalization functions are in the .fini_array section).

For Mach-O files (iOS), the table is stored in the __mod_init_func section.

Why are the names all messed up? The long squiggly names are the result of name mangling – a process which guarantees every symbol relating to the binary is unique, and provides additional information to a debugger. By appending the full namespace and an encoded sequence of argument types to each symbol, multiple overloads of the same method still get unique symbols, for example. Not all symbol files use name mangling, but many do. Luckily, it doesn’t have any effect on this kind of reverse engineering; you will just learn to ignore all of the extra bits after a while.
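
If you are curious what is hiding inside a mangled name, the toolchain can decode it for you. The sketch below uses abi::__cxa_demangle, which is available with GCC and Clang and understands the Itanium C++ ABI scheme used on Linux, Android and iOS; the names in the Windows listings above use Microsoft's scheme, which is a different encoding of the same idea. The wrapper function demangle is my own helper.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if it cannot be demangled.
inline std::string demangle(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return mangled;
    std::string result(text);
    std::free(text);  // the buffer is allocated with malloc by __cxa_demangle
    return result;
}
```

For example, demangle("_Z27s_Il2CppCodegenRegistrationv") gives s_Il2CppCodegenRegistration(): the 27 is the length of the name and the trailing v means no arguments. Two overloads of a hypothetical function foo, taking an int and a double, mangle to _Z3fooi and _Z3food respectively, which is how they stay unique.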

Finding binary metadata in the wild

Now we’ve examined our disassembly in easy mode, let’s turn our attention to a real application where we have no source code and no symbols, and perform the analysis the opposite way around: start from the initialization function list and drill down to the metadata structs. We can leverage what we’ve learned to have a better idea of what we’re looking for, and if we get stuck, we can turn to the IL2CPP library source code for help.

Note that this technique is not the only way to find the desired data, nor is it often the fastest. However, it is the method that gives the best understanding of the code. We briefly summarize other possible strategies below.

I’ll use a randomly chosen Android game using ARMv8-A (64-bit) for this example, Subway Surfers (this example uses v2.10.2).

Having extracted the APK with 7-Zip, we can find the binary at /lib/arm64-v8a/libil2cpp.so and load it into our disassembler.

Press Ctrl+S to open the segment list and double-click on .init_array to navigate there. We find the following list:

.init_array:0000000002ADA620 ; ELF Initialization Function Table
.init_array:0000000002ADA620 ; ===========================================================================
.init_array:0000000002ADA620
.init_array:0000000002ADA620 ; Segment type: Pure data
.init_array:0000000002ADA620                 AREA .init_array, DATA, ALIGN=3
.init_array:0000000002ADA620                 ; ORG 0x2ADA620
.init_array:0000000002ADA620 off_2ADA620     DCQ sub_B4F89C          ; DATA XREF: LOAD:off_88↑o
.init_array:0000000002ADA620                                         ; sub_1078EC4:loc_10790A0↑o ...
.init_array:0000000002ADA628                 DCQ sub_B4FC64
.init_array:0000000002ADA630                 DCQ sub_B4FD18
.init_array:0000000002ADA638                 DCQ sub_B4FD34
.init_array:0000000002ADA640                 DCQ sub_B50394
.init_array:0000000002ADA648                 DCQ sub_B504C0
.init_array:0000000002ADA650                 DCQ sub_B505B0
.init_array:0000000002ADA658                 DCQ sub_B50624
.init_array:0000000002ADA660                 DCQ sub_B50780
.init_array:0000000002ADA660 ; .init_array   ends
.init_array:0000000002ADA660

Due to the way code is compiled, it’s often best to start the search from the end of the list, though that’s not always the case. We double-click on each function in turn, starting from the end, looking for something that resembles either the init hook or s_Il2CppCodegenRegistration() itself.

Most of the functions contain calls to __cxa_atexit which registers a function to be called when the library is unloaded from memory; we can immediately discard all of these, along with anything else that calls internal compiler-related functions, typically starting with __cxa, __gxx and so on. You will soon learn to recognize these from experience.

The function at sub_B4FD18 looks interesting:

.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18                 ADRP            X0, #unk_2FDFD8B@PAGE
.text:0000000000B4FD1C                 ADRP            X1, #sub_D1DB7C@PAGE
.text:0000000000B4FD20                 ADD             X0, X0, #unk_2FDFD8B@PAGEOFF
.text:0000000000B4FD24                 ADD             X1, X1, #sub_D1DB7C@PAGEOFF
.text:0000000000B4FD28                 MOV             X2, XZR
.text:0000000000B4FD2C                 MOV             W3, WZR
.text:0000000000B4FD30                 B               loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18

If you actually click through every function in the init table you’ll see that this one both looks vastly different to all the others, and is much shorter. Furthermore, its entire behaviour is to load four arguments and jump to another function: an unknown struct pointer in X0, a function pointer in X1, and zeroes in X2 and X3. IDA helps us here by defining function names with the sub_ prefix, so we can easily see that X1 is a function pointer.

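Tip: In ARMv8, loading the address of a symbol usually takes two instructions. ADRP computes the address of the 4 KB page containing the target (the current PC with its low 12 bits cleared, plus a signed 21-bit page offset, giving a range of ±4 GB) and stores it in a register, and then ADD adds the low 12 bits (the offset within the page) to make a complete address. Note that these instructions don’t have to be paired right next to each other, as you can see above.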

XZR and WZR represent 64-bit and 32-bit “zero registers”. They are a shortcut and always contain the value zero.
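
The address arithmetic behind an ADRP/ADD pair can be sketched in a few lines: ADRP produces the base of a 4 KB page relative to the current instruction, and ADD supplies the offset inside that page. The function names are my own, and the page deltas used in the example afterwards (0x2490 and 0x1CE) are values I derived from the addresses in the listing rather than something IDA displays.

```cpp
#include <cassert>
#include <cstdint>

// ADRP: clear the low 12 bits of PC, then add a signed page count
// (the instruction's 21-bit immediate) scaled by the 4 KB page size.
inline std::uint64_t adrp(std::uint64_t pc, std::int64_t pageDelta) {
    assert(pageDelta >= -(1 << 20) && pageDelta < (1 << 20));
    const std::uint64_t pageBase = pc & ~static_cast<std::uint64_t>(0xFFF);
    // Unsigned arithmetic wraps modulo 2^64, so negative deltas work too.
    return pageBase + (static_cast<std::uint64_t>(pageDelta) << 12);
}

// ADD Xd, Xn, #lo12: supply the offset within the page (0..4095).
inline std::uint64_t addLow12(std::uint64_t page, std::uint32_t lo12) {
    assert(lo12 < 0x1000);
    return page + lo12;
}
```

Applied to the function above: the ADRP at 0xB4FD18 with a page delta of 0x2490 yields 0x2FDF000, and adding 0xD8B gives 0x2FDFD8B, the unknown struct in X0. Likewise the ADRP at 0xB4FD1C with a delta of 0x1CE yields 0xD1D000, and adding 0xB7C gives 0xD1DB7C, the function pointer in X1.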

Recall earlier that class methods require a this parameter as the first argument, so it’s a reasonable guess that X0 is an instance pointer. We tap N on the various labels and name them with their suspected meanings:

.text:0000000000B4FD18 code_reg_hook
.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18                 ADRP            X0, #this@PAGE
.text:0000000000B4FD1C                 ADRP            X1, #Il2CppCodeGenRegistration@PAGE
.text:0000000000B4FD20                 ADD             X0, X0, #this@PAGEOFF
.text:0000000000B4FD24                 ADD             X1, X1, #Il2CppCodeGenRegistration@PAGEOFF
.text:0000000000B4FD28                 MOV             X2, XZR
.text:0000000000B4FD2C                 MOV             W3, WZR
.text:0000000000B4FD30                 B               loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18

Now we double-click the unconditional branch to loc_D67FC8 and see what awaits. What we find looks scary – it could be RegisterRuntimeInitializeAndCleanup – so let’s back out and double-click on Il2CppCodeGenRegistration instead to see if it does what we expect:

.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C                 ADRP            X1, #off_2DB5048@PAGE
.text:0000000000D1DB80                 LDR             X1, [X1,#off_2DB5048@PAGEOFF]
.text:0000000000D1DB84                 ADRP            X0, #unk_2D40460@PAGE
.text:0000000000D1DB88                 ADRP            X2, #unk_24DF3DC@PAGE
.text:0000000000D1DB8C                 ADD             X0, X0, #unk_2D40460@PAGEOFF
.text:0000000000D1DB90                 ADD             X2, X2, #unk_24DF3DC@PAGEOFF
.text:0000000000D1DB94                 B               loc_D71E34

Three pointers are loaded into X0-X2 and the code branches to another function. Referring back to the C definition of Il2CppCodeGenRegistration, we see that this is exactly what it does, jumping to il2cpp_codegen_register. So we have probably found our metadata! We name the addresses once again, being careful to use the order matching the signature of il2cpp_codegen_register:

.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C                 ADRP            X1, #g_MetadataRegistration@PAGE
.text:0000000000D1DB80                 LDR             X1, [X1,#g_MetadataRegistration@PAGEOFF]
.text:0000000000D1DB84                 ADRP            X0, #g_CodeRegistration@PAGE
.text:0000000000D1DB88                 ADRP            X2, #s_Il2CppCodeGenOptions@PAGE
.text:0000000000D1DB8C                 ADD             X0, X0, #g_CodeRegistration@PAGEOFF
.text:0000000000D1DB90                 ADD             X2, X2, #s_Il2CppCodeGenOptions@PAGEOFF
.text:0000000000D1DB94                 B               il2cpp_codegen_register

We double-click on g_MetadataRegistration to find a slight hiccup:

.got:0000000002DB5038 off_2DB5038     DCQ qword_301CD18       ; DATA XREF: sub_15B58B0+E4↑o
.got:0000000002DB5038                                         ; sub_15B58B0+E8↑r
.got:0000000002DB5040 off_2DB5040     DCQ qword_301CD20       ; DATA XREF: sub_12B5E64+BC↑o
.got:0000000002DB5040                                         ; sub_12B5E64+C0↑r
.got:0000000002DB5048 g_MetadataRegistration DCQ dword_2D41320
.got:0000000002DB5048                                         ; DATA XREF: Il2CppCodeGenRegistration↑o
.got:0000000002DB5048                                         ; Il2CppCodeGenRegistration+4↑r
.got:0000000002DB5050 off_2DB5050     DCQ qword_301CD28       ; DATA XREF: sub_21374A4+88↑o
.got:0000000002DB5050                                         ; sub_21374A4+8C↑r
.got:0000000002DB5058 off_2DB5058     DCQ qword_301CD30       ; DATA XREF: sub_13CA74C+7C↑o
.got:0000000002DB5058                                         ; sub_13CA74C+80↑r
.got:0000000002DB5060 off_2DB5060     DCQ qword_301CD38       ; DATA XREF: sub_148F624+128↑o

Well, it turns out that this was not the Il2CppMetadataRegistration struct after all, but rather a pointer to it (the entry lives in the .got section, and the LDR in the function above loads the real address from it), so we rename the label pMetadataRegistration to make this clear (note the p at the start – this is a traditional naming convention, but you can use whatever naming style makes it easiest for you), and give dword_2D41320 the name g_MetadataRegistration, then double-click on it:

.data.rel.ro:0000000002D41320 g_MetadataRegistration DCD 0x89E3
.data.rel.ro:0000000002D41324                 ALIGN 8
.data.rel.ro:0000000002D41328 off_2D41328     DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 dword_2D41330   DCD 0x1B11
.data.rel.ro:0000000002D41334                 ALIGN 8
.data.rel.ro:0000000002D41338 off_2D41338     DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340                 DCB 0xBD
.data.rel.ro:0000000002D41341                 DCB 0xAC
.data.rel.ro:0000000002D41342                 DCB    0
.data.rel.ro:0000000002D41343                 DCB    0
.data.rel.ro:0000000002D41344                 DCB    0
.data.rel.ro:0000000002D41345                 DCB    0
.data.rel.ro:0000000002D41346                 DCB    0
.data.rel.ro:0000000002D41347                 DCB    0
.data.rel.ro:0000000002D41348                 DCQ unk_2327E58
.data.rel.ro:0000000002D41350                 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358                 DCQ off_2B71F20
.data.rel.ro:0000000002D41360                 DCB 0xAB
.data.rel.ro:0000000002D41361                 DCB 0xB9
.data.rel.ro:0000000002D41362                 DCB    0
.data.rel.ro:0000000002D41363                 DCB    0
.data.rel.ro:0000000002D41364                 DCB    0

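Bingo, we have found a data structure. Each count is a 4-byte value padded to 8 bytes so that the pointer after it stays aligned (note the ALIGN 8 directives), so on a 64-bit target every field of this structure occupies 8 bytes (a DCQ quad-word). Let’s tidy it up using the technique of tapping D from earlier and see what we get: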

.data.rel.ro:0000000002D41320 g_MetadataRegistration DCQ 0x89E3
.data.rel.ro:0000000002D41328 off_2D41328     DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 qword_2D41330   DCQ 0x1B11
.data.rel.ro:0000000002D41338 off_2D41338     DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340                 DCQ 0xACBD
.data.rel.ro:0000000002D41348                 DCQ unk_2327E58
.data.rel.ro:0000000002D41350                 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358                 DCQ off_2B71F20
.data.rel.ro:0000000002D41360                 DCQ 0xB9AB
.data.rel.ro:0000000002D41368                 DCQ unk_229CA54
.data.rel.ro:0000000002D41370                 DCQ 0x2942
.data.rel.ro:0000000002D41378                 DCQ unk_2F61B48
.data.rel.ro:0000000002D41380                 DCQ 0x2942
.data.rel.ro:0000000002D41388                 DCQ off_2F76558
.data.rel.ro:0000000002D41390                 DCQ 0xA9D9
.data.rel.ro:0000000002D41398                 DCQ off_2BF8380

This looks a lot like a list of counts and pointers, exactly as expected. There is a slight quirk where IDA has incorrectly mapped the count at 2D41350 to an address. You can fix this by clicking on the label, pressing U to undefine it, and then tapping D four times to turn it from bytes to a qword.
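
The same count/pointer layout can be decoded outside the disassembler. The following is a minimal sketch, assuming the structure has been copied out of a little-endian 64-bit binary as raw bytes. It uses only the first three pairs from the listing above and deliberately assigns no field names, since those depend on the IL2CPP version. The helper name read_pairs is made up for this example.

```python
import struct


def read_pairs(blob: bytes, n_pairs: int) -> list:
    """Decode n_pairs consecutive (count, pointer) little-endian qword pairs."""
    values = struct.unpack_from(f"<{2 * n_pairs}Q", blob, 0)
    # Even positions hold counts, odd positions hold pointers.
    return list(zip(values[0::2], values[1::2]))


# The first six qwords of g_MetadataRegistration, rebuilt from the listing above.
raw = struct.pack("<6Q", 0x89E3, 0x2CCDDB0, 0x1B11, 0x2D2DDD8, 0xACBD, 0x2327E58)

for count, pointer in read_pairs(raw, 3):
    print(f"count={count:#x} pointer={pointer:#x}")
```

A count that happens to fall inside the mapped address range is indistinguishable from a pointer at this level, which is exactly why IDA mislabelled the value at 2D41350.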

This exact same process can be repeated to find g_CodeRegistration, giving us the two key metadata structures we were looking for.

Note: Some compilers may merge together small functions, or functions that are only called once – which is especially applicable to initialization code that is only called one time at application startup. This process is called “inlining” and it is not uncommon to see the chain of Il2CppCodegenRegistration to il2cpp_codegen_register to il2cpp::vm::MetadataCache::Register as a single inlined function. Keep this in mind if the chain of calls in a real application doesn’t match up with what you expect.

More techniques for finding binary metadata

There are a plethora of other techniques you can use to find the binary metadata besides searching the init function table. Here is a summary of them, starting from the easiest:

  • Check the export table – some binaries define g_CodeRegistration and g_MetadataRegistration (sometimes with a leading underscore) as symbols. If this is the case, you can navigate straight to them from the export table
  • Signature search – you can take the raw hex values that make up the Il2CppCodeGenRegistration function, exclude address-specific values, and then search for them in another binary. This can be done by hand or by using FLIRT signatures in IDA. A tutorial for this can be found on this forum post at unknowncheats.me.
  • Brute-force attack – you can search the data sections for values which correlate with those in global-metadata.dat, then step backwards to find the start of each structure. The way this is done depends on the version of IL2CPP and can get a bit fiddly; check out the source code for ImageScan.cs in Il2CppInspector if you’re interested.
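
As an illustration of the signature search idea, here is a minimal sketch of a masked byte search, where ?? stands for an address-specific byte that should be ignored. The byte values are invented for the example and are not a real Il2CppCodeGenRegistration signature; the function names are also assumptions, not part of any tool mentioned here.

```python
import re


def compile_signature(signature: str) -> "re.Pattern":
    """Turn a string such as 'AA ?? BB' into a bytes regex; '??' matches any byte."""
    parts = []
    for token in signature.split():
        if token == "??":
            parts.append(b".")
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that the wildcard also matches the byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


def find_signature(image: bytes, signature: str) -> list:
    """Return the file offsets of all non-overlapping matches."""
    return [match.start() for match in compile_signature(signature).finditer(image)]


# Invented data: two copies of the pattern with different "address" bytes.
image = bytes.fromhex("00 11 22 AA 10 20 BB 33 AA 99 98 BB")
print(find_signature(image, "AA ?? ?? BB"))
```

In practice you would read the target binary with open(path, "rb").read() and build the signature from a function you have already identified in a known binary, masking every byte that encodes an address or offset.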

Metadata obfuscation

Automated tooling for IL2CPP binaries such as Il2CppInspector relies heavily on the ability to parse global-metadata.dat and find Il2CppCodeRegistration and Il2CppMetadataRegistration in the application. Therefore, these structures and the breadcrumb trails that lead to them are the prime targets for obfuscation by developers.

Typical forms of obfuscation include:

  • Stripping the export table
  • Encrypting the IL2CPP API export symbols
  • Packing or encrypting the binary
  • Encrypting global-metadata.dat
  • Embedding global-metadata.dat in the binary itself
  • Re-arranging the order of fields of structures in global-metadata.dat and/or the binary metadata
  • Obfuscating the control flow in the assembly code which accesses the binary metadata
  • Encrypting strings in global-metadata.dat
  • Applying a .NET symbol obfuscator to the C# code in the event an attacker is able to extract the metadata

The most common form of encryption is the classical single-byte XOR, for which it is trivial to resolve the key by inspection in a hex editor by looking at an area of the file that would normally contain mostly zeroes. Strings encrypted with single-byte XOR are similarly decrypted by looking at the final byte of the string, which in a null-terminated string should also always be zero – therefore the final byte is the XOR key.
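
Both observations are easy to automate. The sketch below assumes a single-byte XOR key and a region that would be mostly zeroes in plaintext. It guesses the key as the most frequent byte in that region, and separately recovers a string key from the encrypted terminator. The sample data and the key 0x5A are invented for the example.

```python
from collections import Counter


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with the same single-byte key."""
    return bytes(b ^ key for b in data)


def guess_xor_key(region: bytes) -> int:
    """In a region that should be mostly zeroes, the most common byte is the key."""
    return Counter(region).most_common(1)[0][0]


def decrypt_string(encrypted: bytes) -> bytes:
    """encrypted includes the encrypted null terminator, so its last byte is the key."""
    key = encrypted[-1]
    return xor_bytes(encrypted, key)[:-1]


# Invented sample: a mostly-zero region and a null-terminated string, key 0x5A.
plain_region = bytes(12) + b"\x01\x00\x00\x02"
encrypted_region = xor_bytes(plain_region, 0x5A)
key = guess_xor_key(encrypted_region)
print(hex(key), xor_bytes(encrypted_region, key) == plain_region)
print(decrypt_string(xor_bytes(b"Player\x00", 0x5A)))
```

If the guessed key produces garbage, the region was probably not mostly zeroes or the scheme is not a single-byte XOR, so treat the result as a hypothesis to verify against the expected file structure.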

At the time of writing (December 2020), most current obfuscation is trivially defeated by manual analysis, although writing automated tooling to handle it is substantially more difficult. Il2CppInspector can currently resolve stripped exports, encrypted IL2CPP API exports, packed PE files, XOR encryption of ELF binaries, XOR string encryption in global-metadata.dat, rearranged fields in binaries and inlined functions for x86, x64, ARMv7 and ARMv8 automatically. It will search for binary metadata using symbol tables, signatures, code disassembly and brute-force attack. Plugins can be created to add missing functionality.

It’s a jungle out there

Now you know how to find the metadata, what does it mean and what can we do with it? We’ll talk about that in part 3, where we will begin to pick apart the labyrinthine web of metadata now at our disposal and find out how it all connects together.

  1. Anish Ahir
    December 4, 2021 at 16:52

    ERROR: Metadata file supplied is not a supported version

  2. Gibbrysohn
    August 15, 2021 at 07:20

    Hello. In my case Il2CppInspector didn’t properly parse CodeRegistration. Maybe rearrangement was applied, but Il2CppInspector couldn’t resolve it. The il2cpp version is 24.4, and I finally fixed it manually with IDA and comparison with older versions of the app. The comparison went as follows: detect fields over 8 digits in decimal (a heuristic) in CodeRegistration and MetadataRegistration, arrange them from smallest to biggest, calculate the difference of consecutive fields / 8 (since I was on arm64-v8a), and see if the value is contained in CodeRegistration or MetadataRegistration. Repeat this with the older version and compare. This helped to find the correct field names for the values. There were also dropped values which Il2CppInspector couldn’t fetch. I manually calculated the values from libil2cpp.so with the aid of IDA, then hardcoded the fixed values into Il2CppInspector and compiled it. Il2CppInspector-CLI worked, but Il2CppInspector-GUI gave me an error (maybe more fixes are needed). Anyway, I now have the .cs and .dll files.

    How did Il2CppInspector fail to resolve rearrangement in this case?

    • August 15, 2021 at 18:51

      Without being able to examine the metadata or binary I cannot really have any idea why a particular analysis may succeed or fail. The best thing you can do to help is to file a detailed bug report on the GitHub issue tracker and include both the working and problematic files as an attachment so that we can look into it. Thanks 🙂

      • Gibbrysohn
        August 17, 2021 at 12:29

        Thanks for responding. I solved all the problems! The problems were an encrypted libil2cpp.so and global-metadata.dat. The decryption of libil2cpp.so occurred at runtime, but it didn’t decrypt the entire file. It first dlopened libil2cpp.so, used mprotect and decrypted the .so loaded in memory! How did I find this? Just by observing the weird partially encrypted libil2cpp.so (the code was encrypted but the binary metadata wasn’t) and the altered memory around il2cpp_init during runtime. It reminded me of Cydia Substrate, and I hooked mprotect to see where it does the trick. Anyway, as I said, the binary metadata wasn’t encrypted, so I thought there was something wrong with global-metadata.dat. It had the correct magic and version number, but Il2CppInspector failed. This resembled the partial encryption scheme of libil2cpp.so, so I hooked vm::MetadataLoader::LoadMetadataFile (the symbol was stripped, but it was easy to find since I had dumped the decrypted libil2cpp.so) and dumped it. It greatly matches the original encrypted .dat file, but also had vast differences: another partial encryption scheme was applied. One interesting thing is that the magic and version number were stripped in the decrypted one. IL2CPP_ASSERT isn’t compiled in, so I think they stripped it to trick us. After fixing the magic and version number, I ran Il2CppInspector and it succeeded! Anyway, thank you for these posts and the program.

  3. Alberto Jovito
    July 28, 2021 at 16:34

    Is it possible to port a Unity game from Mac to Linux? I see that with old versions of Unity based on Mono it is possible…

  4. June 9, 2021 at 09:39

    You are called “The woman of life” 🙂. Nice work.

  5. Test
    February 23, 2021 at 06:29

    Metadata header (1.2)

    10 01 00 00 58 7F 01 00 68 80 01 00 B4 7B 04 00 1C FC 05 00 68 CD 11 00 84 C9 17 00 40 05 00 00 C4 CE 17 00 70 D8 02 00 34 A7 1A 00 00 00 00 00 84 30 4F 00 BC 5E 00 00 40 8F 4F 00 A4 6B 02 00 E4 FA 51 00 84 9F 02 00 68 9A 54 00 AC 95 00 00 14 30 55 00 C0 D3 0B 00 D4 03 61 00 D0 A5 08 00 A4 A9 69 00 70 9B 00 00 14 45 6A 00 30 04 00 00 44 49 6A 00 20 68 00 00 64 B1 6A 00 08 41 00 00 6C F2 6A 00 08 2B 00 00 74 1D 6B 00 54 EE 05 00 C8 0B 71 00 70 18 01 00 38 24 72 00 00 00 00 00 F0 B0 82 00 40 DA 00 00 30 8B 83 00 48 08 00 00 78 93 83 00 14 0E 00 00 8C A1 83 00 58 BE 04 00 E4 5F 88 00 58 28 11 00 3C 88 99 00 40 08 00 00 7C 90 99 00 2C 04 00 00 A8 94 99 00 18 69 03 00 C0 FD 9C 00 34 86 01 00 F4 83 9E 00 AC 2E 00 00 A0 B2 9E 00 D8 20 00 00 78 D3 9E 00 00 00 00 00 78 D3 9E 00 38 1E 00 00

    Metadata header (1.21)

    10 01 00 00 60 7F 01 00 70 80 01 00 B8 7B 04 00 28 FC 05 00 80 CD 11 00 A8 C9 17 00 40 05 00 00 E8 CE 17 00 70 D8 02 00 58 A7 1A 00 00 00 00 00 DC 30 4F 00 BC 5E 00 00 98 8F 4F 00 A4 6B 02 00 3C FB 51 00 84 9F 02 00 C0 9A 54 00 AC 95 00 00 6C 30 55 00 CC D3 0B 00 38 04 61 00 DC A5 08 00 14 AA 69 00 70 9B 00 00 84 45 6A 00 30 04 00 00 B4 49 6A 00 20 68 00 00 D4 B1 6A 00 08 41 00 00 DC F2 6A 00 08 2B 00 00 E4 1D 6B 00 54 EE 05 00 38 0C 71 00 70 18 01 00 A8 24 72 00 00 00 00 00 60 B1 82 00 40 DA 00 00 A0 8B 83 00 48 08 00 00 E8 93 83 00 14 0E 00 00 FC A1 83 00 58 BE 04 00 54 60 88 00 68 28 11 00 BC 88 99 00 40 08 00 00 FC 90 99 00 2C 04 00 00 28 95 99 00 18 69 03 00 40 FE 9C 00 34 86 01 00 74 84 9E 00 AC 2E 00 00 20 B3 9E 00 D8 20 00 00 F8 D3 9E 00 00 00 00 00 F8 D3 9E 00 38 1E 00 00

  6. Test
    February 23, 2021 at 06:28

    Hey, love your blog, it’s extremely insightful.

    I’m in a situation where a game is using an encrypted Global-Metadata file, though I can successfully dump a decrypted version of the file with a memdump. However, something’s been done to the header: potentially a combination of scrambling/naive encryption. This seems to only be the case with the header, since strings and such appear perfectly in plaintext. It does mean I can’t use out of the box tools like the Il2CPP Dumper. Do you have any ideas about what I can try against this?

    I have three versions of this metadata, from three version updates of the game, and there do seem to be patterns with words MSB increasing by a bunch or so, per version, as you’d expect, and the LSB of the words are always zero, as you’d also typically expect from a proper metadata header. However, what’s weird is that I’d expect the first 8 bytes to always remain constant. Since after all, the magic number and version words should always remain constant in the metadata. However, that’s not the case, with only the first four bytes always remaining constant.

    Here’s an example of one such header in full (latest ver.):

    (Or well, I tried to post one, but Akismet seems to think it’s spam…)

    What sort of ideas would you try against this? I’ve tried searching in memdumps for magic: AF 1B B1 FA bytes to see if a proper version of the header would appear in memory, or I could find some hints to how the metadata is being read properly, but no luck. Any advice would be appreciated, hopefully I expressed myself well.

    • February 23, 2021 at 16:41

      Well it sounds like you’ve done your research! Have you tried looking at the disassembly of the game in IDA or Ghidra to see how it loads the metadata (I actually have an article on that coming very soon with some tips!)? I usually look at the metadata file as you have done to scour for obvious patterns, and if I can’t figure it out I’ll dive into the binary and try to find the deobfuscation code. If you are confident that only the header (Il2CppGlobalMetadataHeader) is encrypted, you could just try to reconstruct it from the rest of the data?

      Once you’ve deduced how it works, you can as mentioned create a plugin for Il2CppInspector – https://github.com/djkaty/Il2CppInspector/wiki/Plugins%3A-Getting-Started – so that you can load the app with Il2CppInspector’s normal out-of-box experience without having to edit the tool’s source code directly.

      • Test
        February 23, 2021 at 17:24

        Hey, thanks for the response.

        IDA doesn’t support the ARM Instruction set unfortunately, as far as I know, but I did decompile the il2cpp binary with Cutter. I struggled to find where it was doing the deobfuscation, so if you have any tips there, I’d really appreciate them.

        I have considered reconstructing the header from scratch, though I’m keeping that as a last resort due to the time commitment of reading and learning how the file is structured. So far I am only familiar with how the header itself is structured (in version 24, which is what the game uses).

        Thanks for the link to the Plugin page by the way.

        • February 23, 2021 at 21:10

          IDA definitely supports ARM – but I’m not sure about the free version. You can in any case use Ghidra which is free and includes a free decompiler.

          I figured I would finish off my article before replying so here it is, hot off the press: https://katyscode.wordpress.com/2021/02/23/il2cpp-finding-obfuscated-global-metadata/

          Hope that helps you find the code 🙂 You can see the complete definition of the file format of global-metadata.dat in IL2CPP/MetadataClasses.cs in Il2CppInspector 🙂
