Reverse Engineering Adventures: Honkai Impact 3rd (Houkai 3) (IL2CPP) (Part 1)

January 17, 2021

Of all the IL2CPP workloads that have landed on my office desk over the years, those published by miHoYo (web site in Chinese) are what I consider to be the current gold standard for IL2CPP obfuscation. miHoYo has taken aim at our beloved (and sometimes hated) IL2CPP tools and trashed them with customized metadata encryption and extensive struct reordering, encapsulated in an obfuscated UnityPlayer.dll built from a modified Unity source code base. We had a good chuckle together reverse engineering League of Legends: Wild Rift, but now it’s time to get serious.

People reverse engineer code for different reasons. If you’re a malware analyst, you don’t care how the payload is encrypted; you just want to understand what threat vectors the malware exploits, what its key behaviour is, and how to create a signature to detect it. If you’re the nefarious type who sells exploits for money, you probably don’t care how the target software works either, as long as you can sell your exploitative trash (shame on you).

Some people have a quite different motive: reverse engineering is a hobby for them; they don’t use or care about the product, they’re merely interested to learn about how different protections work – the reverse engineering is the game, so to speak. When I rewrote the disassembly of a Sky pay-TV smartcard in C in 1997 (I know, I was an unruly teenager who turned into an unruly adult, sorry) and the company who designed the smartcard (NDS – now merged with Cisco) wanted to “have a little chat” with me about this, one of the first questions they asked me was: if you wanted free TV, why didn’t you just run the smartcard code in a CPU emulator once you’d dumped the ROM instead of spending 8 months rewriting it in C? My answer was matter-of-fact: I already have a Sky subscription, I just wanted to know how the card worked and prove it could be done. They subsequently paid me to fix it for them (this is the smart play by the way: nobody can design security products as well as hackers; Sony would have done well to take this tip instead of suing GeoHot).

I’m sharing this humblebrag with you as a prelude to explaining my motivation regarding miHoYo’s games. Normally I make a point of learning how a particular protection works, but this time I had a bee in my bonnet: after going on a blitz adding various unpacking, decryption and deobfuscation functionality to Il2CppInspector, I was acutely aware that Honkai Impact was the only remaining title I knew about that my tool wouldn’t load. I also knew from earlier investigation how to make it load, that it was highly tailored to the one specific game, and so – unlike the other generalized deobfuscation code – had no place in a generic tool.

This failure to load every IL2CPP workload gnawed away at me, but as it happened I was also working on a plugin system, so by fortuitous confluence it seemed like the perfect subject material for a demo plugin. I was getting burned out on weeks of reverse engineering every day though, so I got lazy: the example we will present today demonstrates how to break protection in the “malware analyst” way: just get it to decrypt and don’t bother about the details of how it works beyond what’s necessary.

Today’s volunteer is Honkai Impact 3rd. Buckle up!

Tip: The walkthrough below uses version 4.3 but the process works in an identical fashion all the way back to 3.8, which is the earliest version I’ve tested. We demonstrate how to reverse engineer the Windows build of the game; however, the algorithms used for the Android build are the same. The Android version can therefore be decrypted merely by substituting the Android build’s global-metadata.dat for the PC build’s file examined below. How did we determine this is possible? Simply by trying it!

Info: You can view the complete, fully-commented source code of the miHoYo loader plugin here. This plugin is the result of the work described below and shows how to modify Il2CppInspector’s load pipeline to handle non-standard workloads without needing to fork or modify the original tool.

Surveying the battlefield

As usual, I start by just loading the game into Il2CppInspector to see what happens:

The supplied metadata file is not valid.

This error means the global-metadata.dat file doesn’t have the expected form. Specifically it starts with the magic bytes (signature) AF 1B B1 FA followed by a 32-bit integer containing the IL2CPP version number (at the time of writing, a value from 0F-1B). This is followed up by a long list of offset/length pairs demarcating the various metadata tables in the file – learn more in this article about IL2CPP’s load process.
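If you would rather check this programmatically than by eye, the test is short. The following is a minimal Python sketch, not Il2CppInspector's actual code; the function name is my own invention for illustration. It checks the four signature bytes as they appear on disk and reads the 32-bit little-endian version number that follows.

```python
import struct

# Signature bytes in file order; read as a little-endian integer this is 0xFAB11BAF.
METADATA_MAGIC = bytes.fromhex("AF1BB1FA")


def read_metadata_version(data: bytes):
    """Return the IL2CPP version if data starts with the expected signature, else None."""
    if len(data) < 8 or data[:4] != METADATA_MAGIC:
        return None
    # The version is a 32-bit little-endian integer directly after the signature.
    (version,) = struct.unpack_from("<i", data, 4)
    return version


# Hypothetical sample header start: valid signature followed by version 0x18.
sample = bytes.fromhex("AF1BB1FA 18000000")
print(read_metadata_version(sample))
```

Feeding it the first eight bytes of the Honkai Impact file would return None, which is the programmatic equivalent of the error message above.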

Here is an example of the start of global-metadata.dat from an empty project:

Note that an “empty” Unity project still includes a pile of DLLs like mscorlib.dll, UnityEngine.dll and so on so it’s not really empty at all. The header ends at offset 0x110 and this location is also the start of the first table.

global-metadata.dat for Honkai Impact 4.3 (PC version):

Ouch, this doesn’t look very appetizing. At a casual glance it just looks encrypted or compressed, but there are actually some nuggets of data in here.

Bytes 0x00-0x3F don’t make any obvious sense, and neither do the bytes from 0x158 onwards, but at least some of the data from 0x40-0x157 seems to mean something. We can surmise this both from the fact there is a smattering of zeroes (low-entropy data), and that it at least vaguely resembles the metadata header from the empty project. The areas around 0x60-0x6F, 0xD8-0xDF, 0xF0-0xF7, 0x100-0x10F and 0x140-0x14F seem garbled, but the rest does seem like a set of file offsets and lengths.

You essentially have to determine this by carefully reading all the hex values by eye. Values are stored little-endian, meaning that the first byte of a value is the least significant byte (LSB) (bits 0-7), and the final byte is the most significant (MSB) (bits 24-31 in the case of 32-bit values). Given that the file is 0x0353C7DC bytes long, we can try to verify that these suspected offset/length pairs do actually make sense. The final pointer at offset 0x150 is to offset 0x033AFECC, with a length specified at offset 0x154 as 0x00188908 bytes. This means the block pointed to ends at 0x035387D4, which is indeed inside the bounds of the file.
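As a worked example of that check, here is a small Python sketch. The eight bytes are reconstructed from the two values quoted above (they are what you would see at 0x150-0x157 in the hex editor), and the helper name is purely illustrative.

```python
import struct

FILE_SIZE = 0x0353C7DC  # length of global-metadata.dat, from the text


def block_end(header_bytes: bytes, pos: int = 0) -> int:
    """Decode a little-endian (offset, length) pair and return the block's end offset."""
    offset, length = struct.unpack_from("<II", header_bytes, pos)
    return offset + length


# Least significant byte first: 0x033AFECC is stored as CC FE 3A 03.
pair = bytes.fromhex("CC FE 3A 03 08 89 18 00")
end = block_end(pair)
print(hex(end), end <= FILE_SIZE)  # 0x35387d4 True
```

Any pair whose end offset lands beyond the file size is either not an offset/length pair at all, or is one of the garbled regions.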

Let’s continue our investigation by scrolling down the file to see if the whole thing is encrypted or if there is anything else in plaintext. There is a large block of garbled data starting around offset 0x158, and then around 0x14647C-0x146480, it ends and we start to see normal metadata tables again:

Scrolling further, the rest of the file appears to contain normal data, except for one curious repeating pattern:

Every so often, there is a block of 0x40 garbled bytes in the middle of other data. After skipping around the file some more, we determine this happens like clockwork every 0x353C0 bytes.

We can determine from the offset/length lists in the header that these are not separate data structures, but embedded within valid lists. Therefore we can assume we’re looking at encryption. We can rule out trivial schemes like single-byte XOR because the encrypted blocks are high entropy (the distribution of values in the blocks is statistically even; see entropic security), so we are probably looking at strong encryption or a one-time pad (OTP) – the latter could potentially be a XOR blob (a block of random bytes to be XORed with the encrypted data to decrypt it).
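To put a number on "high entropy" rather than trusting our eyes, we can compute the Shannon entropy of a block. This is a generic sketch and not part of any tool mentioned here. Bear in mind that a 0x40-byte block can score at most 6 bits per byte (64 distinct values), so compare blocks of equal size rather than expecting a perfect 8.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform data."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A single-byte XOR only relabels byte values, so it cannot raise the entropy
# of low-entropy plaintext. The plaintext below is a made-up example.
plaintext = b"\x00\x00\x00\x00\x10\x00\x00\x00" * 8
xored = bytes(b ^ 0x5A for b in plaintext)
print(shannon_entropy(plaintext), shannon_entropy(xored))
```

Both values printed are identical, which is why a block that looks statistically even cannot be the product of a single-byte XOR over typical table data.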

Is there an OTP key hiding in the file somewhere? Looking at the second screenshot above, we might surmise (looking at the right hand three bytes on each of the four encrypted lines) that a XOR blob would contain sequential values 1E AE BE, 51 6D AD, 58 7A 03 and so on. We search for other occurrences of these in the file but come up blank.

The encryption may not be a XOR blob, or the XOR blob may be stored in the binary or an asset file, or the XOR blob may be obfuscated. On this occasion we come up empty-handed, but it’s important to exclude obvious potentially easy paths before we get our hands dirty analyzing assembly code, as it could save us a lot of time. We’re out of luck in this case though.

How far back in the file does this periodic block encryption go? The first encrypted offset we found is 0x174A40 (first of the two screenshots above), the block gap is 0x353C0 bytes. These two are exactly divisible with no remainder, therefore it’s plausible to imagine the first encrypted block starts at 0x0 – ie. the very first byte in the file. This also lines up with our earlier observation that bytes 0x00-0x3F are probably encrypted.
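The divisibility check is easy to reproduce. The sketch below also lists the candidate block offsets under the assumption that the pattern repeats uniformly from offset zero to the end of the file, which is a hypothesis at this stage rather than something we have confirmed.

```python
BLOCK_SIZE = 0x40         # size of each garbled block
BLOCK_STRIDE = 0x353C0    # distance between the starts of consecutive blocks
FIRST_SEEN = 0x174A40     # first encrypted offset spotted in the hex editor
FILE_SIZE = 0x0353C7DC

quotient, remainder = divmod(FIRST_SEEN, BLOCK_STRIDE)
print(quotient, remainder)  # 7 0


def candidate_block_offsets(file_size, stride=BLOCK_STRIDE, size=BLOCK_SIZE):
    """Offsets of every block that would fit in the file if the pattern starts at 0."""
    return [off for off in range(0, file_size, stride) if off + size <= file_size]


offsets = candidate_block_offsets(FILE_SIZE)
print([hex(o) for o in offsets[:3]])
```

A remainder of zero means the block we found is the eighth in the sequence (index 7), with the first one sitting at the very start of the file.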

Let’s finish our analysis of the metadata by assessing the file’s coverage. In a normal global-metadata.dat, every byte is accounted for: that is to say, every single byte in the file is part of a header or table – there is no extraneous data. We do this by taking all of the offset/length pairs in the header and merging them together to map out all of the used regions in the file, then seeing if there is anything left over.

Why do we do this? Well, because hiding data in files is extremely common. In PE files (Windows exes and dlls), a highly common technique is to set the image size in the header to a value smaller than the true length of the file, and then add additional hidden data at the end. This data could be secret code, decryption keys or anything else.

In this case, we are aware that some of the offsets and lengths may be encrypted, but we work with what we’ve got anyway:

1C7AA8 + 1BE4E4 = 385F8C
385F8C + 4E4A8 = 3D4434
3D4434 + 382CD8 = 75710C
75710C + 9040 = 76014C

(16 bytes of unknown data)

76014C + 10C0 = 76120C
76120C + 2398 = 7635A4
7635A4 + 3F7E50 = B5B3F4
B5B3F4 + 6F50 = B62344
B62344 + CEA58 = C30D9C
C30D9C + 25044 = C55DE0
C55DE0 + 994 = C56774
C56774 + A0B60 = CF72D4
CF72D4 + 56BB8 = D4DE8C
D4DE8C + 99E8 = D57874
D57874 + 7490 = D5ED04
D5ED04 + B84 = D5F888

(8 bytes of unknown data)

15C0D8C + 3B4AA0 = 197582C
197582C + 74C = 1975F78

(8 bytes of unknown data)

1975F78 + 5C5238 = 1F3B1B0

(16 bytes of unknown data)

1F3B1B0 + 13F8 = 1F3C5A8
1F3C5A8 + 1DA4 = 1F3E34C
1F3E34C + 139200 = 207754C
207754C + 11B9A00 = 3230F4C
3230F4C + 15294 = 32461E0
32461E0 + 169CEC = 33AFECC

(16 bytes of unknown data)

33AFECC + 188908 = 35387D4

This is a breakdown of the data from 0x40-0x158.

The bytes at 0x158-0x1C7AA8, 0xD5F888-0x15C0D8C and 0x35387D4-0x353C7DC (the end of the file) are unaccounted for. We navigate to each of these offsets to see if there is anything of interest.
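The coverage check can be sketched in a few lines of Python. The offset/length pairs are the readable ones from the breakdown above; the header end (0x158) and file size (0x353C7DC) are the values we established earlier. The function name is mine, and this is only one straightforward way to merge the ranges.

```python
HEADER_END = 0x158
FILE_SIZE = 0x353C7DC

# (offset, length) pairs that were readable in the header, in file order.
PAIRS = [
    (0x1C7AA8, 0x1BE4E4), (0x385F8C, 0x4E4A8), (0x3D4434, 0x382CD8),
    (0x75710C, 0x9040), (0x76014C, 0x10C0), (0x76120C, 0x2398),
    (0x7635A4, 0x3F7E50), (0xB5B3F4, 0x6F50), (0xB62344, 0xCEA58),
    (0xC30D9C, 0x25044), (0xC55DE0, 0x994), (0xC56774, 0xA0B60),
    (0xCF72D4, 0x56BB8), (0xD4DE8C, 0x99E8), (0xD57874, 0x7490),
    (0xD5ED04, 0xB84), (0x15C0D8C, 0x3B4AA0), (0x197582C, 0x74C),
    (0x1975F78, 0x5C5238), (0x1F3B1B0, 0x13F8), (0x1F3C5A8, 0x1DA4),
    (0x1F3E34C, 0x139200), (0x207754C, 0x11B9A00), (0x3230F4C, 0x15294),
    (0x32461E0, 0x169CEC), (0x33AFECC, 0x188908),
]


def find_gaps(regions, file_size):
    """Return the [start, end) ranges of the file not covered by any region."""
    gaps, cursor = [], 0
    for start, end in sorted(regions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_size:
        gaps.append((cursor, file_size))
    return gaps


# The header itself counts as a used region.
regions = [(0, HEADER_END)] + [(off, off + length) for off, length in PAIRS]
for start, end in find_gaps(regions, FILE_SIZE):
    print(f"{start:#x}-{end:#x}")
```

Running it prints exactly three gaps, matching the three unaccounted-for ranges found by hand.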

0x158-0x146480 probably contains encrypted data, as mentioned earlier. 0x146480-0x1A7238 appears to contain a single table (we know this because it consists of a long sequence of what appear to be offsets and lengths, in ascending order). 0x1A7238-0x1AC558 contains another similar table, and so on. These look like normal metadata tables. 0xD5F888-0x15C0D8C contains the .NET symbol table (we know this because the data in this block is just human-readable strings). The most interesting block is probably the end of the file – a pointer to itself (the offset at 0x35387D4 contains the value 0x35387D4), four zeroes and then precisely 0x4000 bytes of high-entropy data – this may be encrypted data, or a decryption blob.

I haven’t included screenshots of everything here, but if your eyes are glazing over at all of these numbers right now, that’s perfectly okay: the best way to follow all of this is simply to open the metadata file into a hex editor and explore these file offsets for yourself. There is no special magic in how I determined these table boundaries: it is all determined by eye, by looking carefully and methodically for obvious patterns in the data to indicate groups of related data together in one place, and sudden changes in the data to indicate the boundaries between different kinds of data.

Let us now take a breath, step back and summarize what we’ve learned so far:

  • There are 0x40-byte blocks of unknown encryption every 0x353C0 bytes, starting most likely from the beginning of the file
  • There are some unknown pieces of data in the file header
  • A normal metadata header for this version of IL2CPP is 0x110 bytes. The header here appears to be 0x158 bytes long. The total amount of unknown data in the header is 0x40 bytes. This leaves a question mark over another 8 bytes.
  • There are three blocks of data that are unaccounted for. One contains various metadata tables and may be accounted for when we decrypt the first 0x40 bytes of the header. The second contains the string table. The third contains unknown data with a precise size of 0x4000 bytes.

Whether or not this information will actually be useful down the line is another question. As it turns out, some of it is and some of it isn’t. The key takeaway here is to just take a little bit of time to perform a superficial analysis of the data by eye and see what patterns can be spotted. Often, this insight is enough to determine a strategy to decrypt a file on its own, but in this case we’re going to need to step up our game.

Fun fact: In 2017, small indie game company Blizzard Entertainment encrypted one of its game’s main DLLs by using the standard Blowfish algorithm with the maximum 448-bit key size, and appending the key to the end of the DLL file. Variations on this kind of technique are a timeless classic – be aware of it!

To the trenches!

Clearly, we need to find out how to decrypt the metadata file. To do this, we first need to find out where in the code the decryption occurs. There are various ways of doing this, and you can certainly just trace the binary using static analysis in a disassembler, but there is an easier way.

ProcMon is an excellent piece of software to have in your arsenal. It allows you to – among other things – capture Windows API calls occurring in a target process and produce a stack trace from the call site. We’ll use this to find out where in Honkai Impact global-metadata.dat is accessed and then examine the code.

When ProcMon first loads, you’ll want to clear the default filters and create a new filter as follows:

This instructs ProcMon to capture all file accesses to global-metadata.dat coming from BH3.exe, which is Honkai Impact’s root process. Open C:\Program Files\Honkai Impact 3rd\Games in file explorer, double-click on BH3.exe, wait for the epilepsy seizure warning to appear then press Alt+F4 to kill the process. In ProcMon, you’ll see something like this:

Now we can see all of the API calls made using global-metadata.dat‘s file handle. Don’t be confused by the calls to CreateFile – this function can be used not just to create files but also to open existing files, which is the case here. We note calls to CreateFileMapping, which maps a file to a region of unallocated virtual memory without actually loading it from storage. When the application attempts to read from one of these memory addresses, the Windows kernel will read the corresponding portion of the file if necessary – this is called demand paging and reduces memory consumption at the expense of requiring an open file handle for however long the file contents are needed. It also means the file may be read out-of-order.

Note that the kernel will read the file in blocks – not just specifically the requested bytes – as an optimization. As you can see above, the page size is 32KB (each read has a length of 32,768 bytes). With this in mind, notice how the application reads the very end of the file first: the first call to ReadFile is at offset 55,791,616 (0x3535000) and has a length of 30,684 bytes (less than 32KB because the file size is not exactly divisible by 32KB); taking us to 0x353C7DC or the length of the file. The fact the kernel reads from 0x3535000 doesn’t mean the application requested precisely these bytes. It may have just wanted a portion of the data, but the kernel will always read in page-sized blocks when using demand paging. Recall that there is a blob of 0x4000 bytes of unknown data at the end of the file, beyond the metadata tables. We know that global-metadata.dat is usually read from the start, because the header at the beginning of the file contains the information needed to find everything else in the file. Reading the end of the file first is therefore highly suspicious, and lends credence to the theory that this data is needed first to be used in some kind of decryption function.
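The arithmetic behind that observation can be checked quickly. This sketch uses only numbers quoted above: the ReadFile offset and length reported by ProcMon, the file size, and the offset where the last known table ends and the unexplained trailing data begins.

```python
FILE_SIZE = 0x353C7DC
TRAILER_START = 0x35387D4   # end of the last known table, start of the unknown data
READ_OFFSET = 55_791_616    # first ReadFile offset shown in ProcMon
READ_LENGTH = 30_684

read_end = READ_OFFSET + READ_LENGTH
print(hex(READ_OFFSET), hex(read_end))  # 0x3535000 0x353c7dc

reaches_end_of_file = read_end == FILE_SIZE
covers_trailer = READ_OFFSET <= TRAILER_START and read_end >= FILE_SIZE
print(reaches_end_of_file, covers_trailer)  # True True
```

So the very first read pulls in the whole of the unexplained data at the end of the file, along with the tail of the final table that happens to share the same block.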

Let’s double-click on the ReadFile event where the data is read from offset zero – ie. the start of the file – and select the Stack tab to see the stack trace (the most recent calls appear first):

Native\UserAssembly.dll is what is normally called GameAssembly.dll in the Unity app’s root folder, but it has been moved and renamed by the developers here.

The first thing to note is that you should ignore the function names shown in the Location column: these assume the files have symbols available, so while they will be accurate for Windows DLLs like ntoskrnl.exe, they will be incorrect for our game. ProcMon just looks through the export table to find the function with the nearest starting address before the call site and assumes that is the name of the function. It is easy to tell the function names are wrong because they have massive offsets into the function start addresses: while UnityMain + 0x36 is almost certainly an instruction 0x36 bytes into UnityMain, we very much doubt that il2cpp_value_box (which converts a value type into a boxed reference type) is either 0x589113F bytes long, or would be playing any role in loading a file. This call is really being made from another, unexported function. The good news is that the absolute call addresses in the Address column will be correct in all cases, so we’ll focus on these.
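To see why the offsets give the game away, here is a Python sketch of the nearest-preceding-export lookup described above. It models the behaviour as I have described it rather than ProcMon's actual implementation, and the export addresses are made up for illustration.

```python
import bisect

# Hypothetical export table sorted by address; these addresses are invented.
EXPORTS = [
    (0x180001000, "UnityMain"),
    (0x180002000, "il2cpp_value_box"),
]


def nearest_export(address, exports=EXPORTS):
    """Return (name, offset) for the last export at or before address, or None."""
    addresses = [addr for addr, _ in exports]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = exports[index]
    return name, address - base


print(nearest_export(0x180001036))               # plausible: a few bytes into a function
print(nearest_export(0x180002000 + 0x589113F))   # implausible: tens of megabytes "into" one
```

Every address past the last export gets attributed to that export, however far away it is, so a huge offset is a strong hint that the real function is simply not in the export table.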

All of the kernel mode calls (those prefixed by a K in the Frame column) can be ignored – these all basically just deal with the file read (or other API call) requested by the application and aren’t important to us. The relevant call is the final one made by our application, which is at address 0x7FFF4E2C3E85 in UnityPlayer.dll. This is the instruction which actually triggers the kernel to read data from the underlying storage.

In a normal Unity application, global-metadata.dat is read exclusively by the main game binary and not touched by UnityPlayer.dll, so the fact that UserAssembly.dll here calls back into UnityPlayer.dll to perform a read is suspicious. It may indicate custom decryption code added to UnityPlayer.dll.

We now want to trace through the code to see exactly what is happening, so we load up both UnityPlayer.dll and UserAssembly.dll into IDA. We also want to compare the shipped UnityPlayer.dll with one from a blank Unity project. We can determine the game’s Unity version by simply looking at the EXE’s file properties, or by loading an asset file into a hex editor and looking at the version string at the top. Honkai Impact 3rd uses Unity 2017.4.18f1, which in itself is noteworthy because Windows standalone IL2CPP support was not introduced until Unity 2018.1.0 – there is a considerable amount of customization going on here. We need to work with the closest version we can to minimize the amount of code changes in UnityPlayer.dll, so we install Unity 2018.1.0 via Unity Hub, create a blank 3D template project, set the scripting backend to IL2CPP, the architecture to x64, enable PDB generation so that we can see all of the symbols (function names and so on) when we disassemble our own DLL, but disable ‘Development build’ so that it doesn’t emit lots of extra debugging code in every function that will just confuse us, leave everything else at their default settings in the hope that the developers did the same, click Build, wait a while and then open our freshly-baked UnityPlayer.dll into IDA as well. When loading three binaries into IDA, strong coffee is advised.

DLLs have a preferred image base address – commonly but not always 0x180000000 – but they are usually allocated at a non-preferred base address in memory. IDA will initially display virtual addresses relative to the DLL’s preferred image base. For example, if the preferred image base of UserAssembly.dll is 0x180000000 and the offset of the il2cpp_init function from the image base is 0x123456 bytes, IDA will display this function at virtual address 0x180123456. However, if it is loaded in memory at 0x200000000 when actually executed, the address of il2cpp_init shown in ProcMon’s stack trace will be 0x200123456. To make the stack trace line up with the disassembly, we need to fix this somehow. There are two options: subtract the difference between preferred and actual image bases from every address with a calculator while moving around in the file, or change the image base address of the file in IDA. The latter is much less error-prone, so we’ll do that. This step is called rebasing. To do it, choose Edit -> Segments -> Rebase program… from the IDA menu, and set the options as follows:

The Process tab of the event in ProcMon helpfully shows us the loaded image base of every DLL used by the application:

In the case above, we’ll rebase UserAssembly.dll to 0x7FFF3C520000 and UnityPlayer.dll to 0x7FFF4E280000. You can also do this when you first load the files by ticking Manual load and accepting all the defaults on the many dialog boxes that appear besides the image base address, which is the first dialog.
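If you do go down the calculator route instead, the conversion is a single subtraction and addition. A small Python sketch with an illustrative helper name, using the example addresses and the loaded image bases given above:

```python
def runtime_to_ida(address: int, loaded_base: int, ida_base: int) -> int:
    """Translate a runtime address into the address IDA shows for a given image base."""
    return address - loaded_base + ida_base


# The il2cpp_init example: loaded at 0x200000000, preferred base 0x180000000.
print(hex(runtime_to_ida(0x200123456, 0x200000000, 0x180000000)))  # 0x180123456

# Offset of the UnityPlayer.dll call site from its loaded image base.
UNITYPLAYER_BASE = 0x7FFF4E280000
print(hex(0x7FFF4E2C3E85 - UNITYPLAYER_BASE))  # 0x43e85
```

Once the database is rebased to the loaded image base, this translation becomes the identity and the ProcMon addresses can be pasted straight into IDA.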

If you live near a beach, now is a good time to take a midnight swim, or perhaps – as I did – just stare wistfully out of the window contemplating whether the rebase or the heat death of the Universe will win. It’s coming.

Tip: It can be hard to understand the output of ProcMon without an anchor reference. For IL2CPP games, creating a blank Unity project and watching how it behaves in ProcMon will give you an excellent baseline to help you spot sneaky changes in production code.

Tip: ProcMon captures millions of events every minute and consumes large amounts of resources. Even when you have filters enabled, all events are still captured – just not displayed. Close ProcMon as soon as you are finished using it – it will crash eventually if you don’t.

Threading the needle

We start by navigating to the top of the user mode call stack, 0x7FFF4E2C3E85 in UnityPlayer.dll:

.text:00007FFF4E2C3E6C                 mov     [rbp+0D30h+anonymous_28], rax
.text:00007FFF4E2C3E73                 mov     rax, [rbp+0D30h+anonymous_69]
.text:00007FFF4E2C3E77                 mov     rcx, [rbp+0D30h+anonymous_30]
.text:00007FFF4E2C3E7E                 mov     rdx, [rbp+0D30h+anonymous_28]
.text:00007FFF4E2C3E85                 movups  xmm0, xmmword ptr [rax+rcx]
.text:00007FFF4E2C3E89                 movups  xmmword ptr [rdx], xmm0
.text:00007FFF4E2C3E8C                 mov     rsi, [rbp+0D30h+anonymous_23]
.text:00007FFF4E2C3E90                 sub     rsp, 20h
.text:00007FFF4E2C3E94                 mov     r8d, 0B00h      ; Size

Note that the instruction pointer (EIP for x86, RIP for x64) is incremented before it’s pushed onto the stack, so the actual instruction that triggers the call to ReadFile is the previous one, at 0x7FFF4E2C3E7E. It’s just a mov, so it is likely triggering the read call by attempting to read from an uninitialized location in the demand paged memory range. Not very interesting. We scroll up and down in this function and discover it is both huge and obfuscated using a technique called control flow obfuscation. Here is the control flow graph (CFG) for this function:

Essentially this is a form of multi-level control flow flattening. In a nutshell, the function is a giant finite state machine (FSM) controlled by an arbitrarily-introduced state variable. The function loops repeatedly in its entirety, performing a very small action on each loop iteration based on the state variable, then updating the state variable. The actions are buried within many layers of if and switch statements, making it very difficult to reverse engineer by static analysis. As an analyst, I could not possibly be less excited about this diagram.
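For readers who have not met control flow flattening before, here is a toy illustration in Python. It is a deliberately tiny, single-level example of the general idea and bears no relation to miHoYo's actual code; the state numbers are arbitrary, as an obfuscator would assign them.

```python
def sum_to_n(n: int) -> int:
    """Straightforward version: 1 + 2 + ... + n."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_flattened(n: int) -> int:
    """The same logic flattened into a dispatcher loop driven by a state variable."""
    state = 0x3A
    total = i = 0
    while True:
        if state == 0x3A:      # initialisation
            total, i = 0, 1
            state = 0x91
        elif state == 0x91:    # loop condition
            state = 0x17 if i <= n else 0xC4
        elif state == 0x17:    # loop body
            total += i
            i += 1
            state = 0x91
        elif state == 0xC4:    # exit
            return total


print(sum_to_n(10), sum_to_n_flattened(10))  # 55 55
```

The original loop structure has vanished: every basic block now returns to the dispatcher, and the order of execution is only recoverable by tracking the state variable. Multiply this by thousands of blocks and several nested dispatchers and you get the graph above.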

At this juncture I should note that the object of static analysis is not to determine what every line in a program does. Disassemblies often consist of millions of lines of code, and trying to weave your way through figuring out what every instruction means is a slow laborious way to accomplish nothing. Instead, we try to judge the overall purpose of functions at a slightly higher level and only delve down into the instruction level for small snippets of code that hold the greatest relevance.

One way to do this is to look at the inputs and outputs of a function rather than its actual code. Consider a 100,000-line obfuscated function which takes two integers as its input and returns one integer. If feeding in 1 and 2 produces an output of 3 every time, and feeding in 11 and 22 produces an output of 33 every time, it’s fairly safe to assume that at least in general, the function sums its two inputs and returns the total. There is no need to reverse engineer the function’s code unless it produces something that deviates from our thesis.
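The same idea expressed as code: treat the function as a black box and look for a counterexample to our thesis. This is a generic Python sketch; `mystery` is a made-up stand-in for an obfuscated function and the helper name is mine.

```python
def first_disagreement(black_box, hypothesis, samples):
    """Return the first input where the two functions differ, or None if they agree."""
    for args in samples:
        if black_box(*args) != hypothesis(*args):
            return args
    return None


def mystery(a: int, b: int) -> int:
    # Hypothetical stand-in for a function whose code we do not want to read.
    return (a ^ b) + 2 * (a & b)


samples = [(1, 2), (11, 22), (0, 0), (255, 1)]
print(first_disagreement(mystery, lambda a, b: a + b, samples))  # None
print(first_disagreement(mystery, lambda a, b: a * b, samples))  # (1, 2)
```

A result of None does not prove the thesis, it only means we have not yet found a reason to read the code. The more varied the samples, the more confidence we can place in it.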

With this in mind, we navigate to the top of the function, give it a name like DoSomethingWithMetadata and move down in the call stack to where this function is actually called – in this case, 0x7FFF41EE076C in UserAssembly.dll:

.text:00007FFF41EE075A                 xor     edx, edx
.text:00007FFF41EE075C                 call    sub_7FFF41EDD140
.text:00007FFF41EE0761                 jmp     short loc_7FFF41EE076F
.text:00007FFF41EE0763                 mov     edx, r12d       ; _QWORD
.text:00007FFF41EE0766                 call    cs:qword_7FFF43D74F80
.text:00007FFF41EE076C                 mov     rsi, rax
.text:00007FFF41EE076C ;   } // starts at 7FFF41EE06F0

Looking again at the instruction prior to the one pointed to by the stack, this time we find an actual call, to qword_7FFF43D74F80, which is an uninitialized static value set at runtime. We know for sure this calls DoSomethingWithMetadata in UnityPlayer.dll, so we rename this address to pDoSomethingWithMetadata (the p is short for pointer), navigate to the top of the function and invoke the decompiler. The decompiled function is a couple of hundred lines long but the call to the obfuscated function is visible and looks like this:

  a6 = 0;
  v29 = sub_7FFF41EB3860(&a1, 3, 1, 1u, 0, &a6);
  v26 = v29;
  if ( !a6 )
  {
    v27 = sub_7FFF41EB36A0(v29, &a6);
    v28 = v27.LowPart;
    if ( !a6 )
    {
      v25 = (const void *)sub_7FFF41EDCFE0(v26, 0i64, 0);
      sub_7FFF41EB3170(v26, &a6);
      if ( a6 )
        sub_7FFF41EDD140(v25);
      else
        v0 = pDoSomethingWithMetadata(v25, v28);
    }
  }

Immediately we have learned something useful. Our mystery function takes two arguments and returns one. In addition, we know the first argument is a pointer because v25 has been cast to const void *. The return value is stored in v0 and not referenced again until this function ends, whereupon it is passed back to the caller as the return value.

We might be able to determine what v0 is by moving down in the stack once more, but first we want to try to determine the input arguments. Generally we do this by clicking on the functions around the call to see if we can establish some context – particularly if they use the same arguments or return values subsequently passed as arguments to the function of interest. It doesn’t really matter how you approach this too much, but remember we just want to get an overview of what’s happening without perfectly understanding every function. I start arbitrarily with the prior function call to sub_7FFF41EDD140, whose only argument is the same as the first argument to the mystery function:

void __fastcall sub_7FFF41EDD140(LPCVOID a1)
{
  LPCVOID lpBaseAddress; // rbx
  void *v2; // rcx
  _QWORD *v3; // rax

  if ( a1 )
  {
    lpBaseAddress = a1;
    sub_7FFF41EE16C0(&unk_7FFF43D7DF50);
    UnmapViewOfFile(lpBaseAddress);
    v2 = qword_7FFF43D7DF58;
    v3 = (_QWORD *)*((_QWORD *)qword_7FFF43D7DF58 + 1);
    if ( *((_BYTE *)v3 + 25) )
      goto LABEL_15;

The full function is 36 lines but all we need is line 11: this function unmaps a file from memory. By way of illustration, lines 1, 9 and 11 are the only lines I looked at and the only lines of consequence. It doesn’t matter what the rest is – it’s likely to just be error handling and other cleanup. The input argument a1 is passed to UnmapViewOfFile and that is this function’s primary purpose. In this case, IDA helps us by automatically naming the Win32 API call for us, as well as renaming v1 to lpBaseAddress – the name of the argument to UnmapViewOfFile in Microsoft’s documentation.

Experienced analysts won’t need to look this up, but if you’re not familiar with an API call, it is especially useful to refer to the official documentation. Let’s see what Microsoft says lpBaseAddress is:

A pointer to the base address of the mapped view of a file that is to be unmapped. This value must be identical to the value returned by a previous call to the MapViewOfFile or MapViewOfFileEx function.

Since this argument is the same as the first argument to the mystery function, we now know that it is a pointer to demand paged memory. The call is on the other side of the if branch to the unmap function, so a6 in the first decompilation above is likely an error flag. We rename the function, v25 and a6, as well as setting a6 to bool (we don’t bother renaming anything in the unmap function, there is no need to since we have what we needed to learn from it already and won’t be revisiting it):

  *&error = 0;
  v25 = sub_7FFF41EB3860(&v35, 3, 1i64);
  v26 = v25;
  if ( !*&error )
  {
    v27 = sub_7FFF41EB36A0(v25, &error);
    if ( !*&error )
    {
      hFile = sub_7FFF41EDCFE0(v26, 0i64, 0i64);
      sub_7FFF41EB3170(v26, &error);
      if ( *&error )
        unmapFile(hFile);
      else
        v0 = pDoSomethingWithMetadata(hFile, v27);
    }
  }

Note that IDA in its infinite wisdom unfortunately also sometimes renumbers all of the other variables when you do this.

Before we go any further, do we have any thoughts on what the second argument – now v27 – might be? Unlike in .NET, arrays in C and C++ (including blocks of bytes) do not have a convenient Length property and are actually just raw pointers to memory locations. If you want to know the size of the array, you need to pass it as a separate argument, and that is an extremely common design pattern in C++. v27 is assigned by sub_7FFF41EB36A0 so let’s examine that function:

LARGE_INTEGER __fastcall sub_7FFF41EB36A0(void *a1, DWORD *a2)
{
  DWORD *v2; // rbx
  LARGE_INTEGER result; // rax
  LARGE_INTEGER FileSize; // [rsp+38h] [rbp+10h]

  v2 = a2;
  *a2 = 0;
  if ( GetFileSizeEx(a1, &FileSize) )
  {
    result = FileSize;
  }
  else
  {
    *v2 = GetLastError();
    result.QuadPart = 0i64;
  }
  return result;
}

Very straightforward: a1 is a file handle and the function gets its size with GetFileSizeEx, returning any errors in a2. Our theory is confirmed.

You can continue to flesh this out a bit if you like, depending on how much detail you need. Here is what I ended up with:

  *&error = 0;
  hFile_1 = fileOpen(&metadataPathname, 3, 1, 1u, 0, &error);
  if ( !*&error )
  {
    v27 = getFileSize(hFile_1, &error);
    metadataSize = v27.LowPart;
    if ( !*&error )
    {
      hFile = mapFile(hFile_1, 0i64, 0);
      closeFile(hFile_1, &error);
      if ( *&error )
        unmapFile(hFile);
      else
        v0 = pDoSomethingWithMetadata(hFile, metadataSize);
    }
  }

It should be pretty clear by this point that this code checks that global-metadata.dat exists, gets its file size, maps it into memory, and – if there were no errors – calls our mystery function with a pointer to the start of the file in paged memory and its length.
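To make the sequence concrete, here is the same open, size, map, close, call flow as a rough Python analogue using the standard mmap module. It is an illustration of the pattern only, not a translation of the game's code: `process` is a hypothetical stand-in for the mystery function, and the demo writes its own small dummy file rather than touching real game data.

```python
import mmap
import os
import tempfile


def load_metadata_file(path, process):
    """Open the file, get its size, map it, close the handle, then call process."""
    try:
        f = open(path, "rb")
    except OSError:
        return None  # mirrors the early exit when the open fails
    with f:
        size = os.fstat(f.fileno()).st_size
        # Length 0 maps the whole file; an empty file would raise ValueError here.
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The mapping remains valid after the file object has been closed.
    return process(view, size)


def demo():
    def peek(view, size):
        head = view[:4]
        view.close()
        return head, size

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "global-metadata.dat")
        with open(path, "wb") as out:
            out.write(bytes.fromhex("AF1BB1FA") + bytes(12))
        return load_metadata_file(path, peek)


print(demo())
```

Note how the file handle is closed before the mapped data is ever used, just as in the decompiled code: the mapping keeps the contents accessible on its own.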

What is the result in v0, and what happens to it when the function we’re analyzing returns to the caller? Obviously the current line of thinking is that the DoSomethingWithMetadata function decrypts the metadata file, and the return value is a pointer to the decrypted data, or perhaps the number of bytes decrypted or a result or error code.

Let’s step back for a moment. In another IL2CPP article I presented this diagram illustrating the initialization process of IL2CPP as it pertains to loading the metadata:

The relevant part here is that there is a call chain that proceeds il2cpp_init() -> il2cpp::vm::Runtime::Init() -> il2cpp::vm::MetadataCache::Initialize(). There is actually one more function call before global-metadata.dat is accessed, which you can see from the source code of libil2cpp/vm/MetadataCache.cpp:

void MetadataCache::Initialize()
{
    s_GlobalMetadata = vm::MetadataLoader::LoadMetadataFile("global-metadata.dat");
    s_GlobalMetadataHeader = (const Il2CppGlobalMetadataHeader*)s_GlobalMetadata;
    IL2CPP_ASSERT(s_GlobalMetadataHeader->sanity == 0xFAB11BAF);

The function vm::MetadataLoader::LoadMetadataFile is defined in libil2cpp/vm/MetadataLoader.cpp and looks like this:

void* MetadataLoader::LoadMetadataFile(const char* fileName)
{
    std::string resourcesDirectory = utils::PathUtils::Combine(utils::Runtime::GetDataDir(), utils::StringView<char>("Metadata"));

    std::string resourceFilePath = utils::PathUtils::Combine(resourcesDirectory, utils::StringView<char>(fileName, strlen(fileName)));

    int error = 0;
    FileHandle* handle = File::Open(resourceFilePath, kFileModeOpen, kFileAccessRead, kFileShareRead, kFileOptionsNone, &error);
    if (error != 0)
        return NULL;

    void* fileBuffer = utils::MemoryMappedFile::Map(handle);

    File::Close(handle, &error);
    if (error != 0)
    {
        utils::MemoryMappedFile::Unmap(fileBuffer);
        fileBuffer = NULL;
        return NULL;
    }

    return fileBuffer;
}

This more or less resembles the decompiled code we just analyzed, except it would seem an else clause has been added to the final if to make that sneaky call into UnityPlayer.dll! Note that the return value of the original version of LoadMetadataFile is a pointer to the start of the mapped global-metadata.dat. Since our decompiled version of LoadMetadataFile returns the value returned by DoSomethingWithMetadata, it is almost a certainty that DoSomethingWithMetadata decrypts the metadata and returns a pointer to it, since the caller (il2cpp::vm::MetadataCache::Initialize()) will expect unencrypted data unless it has been modified too.
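As a rough illustration of what the caller expects, stock IL2CPP asserts that the first four bytes of the loaded buffer equal 0xFAB11BAF (the sanity field in the source above). A sketch of the same check might look like this. The helper name is hypothetical, and it assumes the value is stored little-endian, as it is on x86/x64. Bear in mind that a vendor who customizes the loader may also change the header, so a failed check does not prove the data is still encrypted.

```cpp
#include <cstddef>
#include <cstdint>

// Magic value asserted by the stock MetadataCache::Initialize().
constexpr std::uint32_t kStockMetadataSanity = 0xFAB11BAF;

// Hypothetical helper: returns true if the buffer starts with the stock
// sanity value, read as a little-endian 32-bit integer.
bool hasStockMetadataSanity(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size < 4)
        return false;
    const std::uint32_t sanity =
        static_cast<std::uint32_t>(data[0]) |
        (static_cast<std::uint32_t>(data[1]) << 8) |
        (static_cast<std::uint32_t>(data[2]) << 16) |
        (static_cast<std::uint32_t>(data[3]) << 24);
    return sanity == kStockMetadataSanity;
}
```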

We don’t normally have the source code to parts of applications we’re reverse engineering so we’re quite lucky that IL2CPP is open source, but let’s imagine we don’t have that luxury. At this point I want to pull in the UnityPlayer.dll of our blank project, which we haven’t looked at yet. All the symbols are available so we can easily navigate to il2cpp::vm::MetadataLoader::LoadMetadataFile, scroll down and compare:

  error = 0;
  v27 = il2cpp::os::File::Open(&path, 3, 1, 1, 0, &error);
  v28 = v27;
  if ( !error )
  {
    v29 = il2cpp::os::MemoryMappedFile::Map(v27, 0i64, 0i64);
    il2cpp::os::File::Close(v28, &error);
    if ( !error )
      goto LABEL_45;
    il2cpp::os::MemoryMappedFile::Unmap(v29, 0i64);
  }

(If we didn’t have the symbols, we could just run ProcMon against the project and follow the stack trace as before.)

It would indeed seem that the developers who obfuscated Honkai Impact added an extra call to fetch the file size, and an else branch to call the decryption function if the file was mapped successfully.
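Putting the two versions side by side, the modified loader plausibly has the following shape. This is a sketch reconstructed from the decompilation, not the vendor's source. The MetadataIo struct and its members are hypothetical stand-ins for the renamed functions (fileOpen, getFileSize, mapFile, closeFile, unmapFile, pDoSomethingWithMetadata). What is returned on the error paths is not visible in the excerpt, so nullptr is assumed, as in the stock code.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-ins for the functions renamed in the decompilation.
struct MetadataIo {
    std::function<void*(int* error)> open;                       // fileOpen
    std::function<std::int64_t(void* handle, int* error)> size;  // getFileSize
    std::function<void*(void* handle)> map;                      // mapFile
    std::function<void(void* handle, int* error)> close;         // closeFile
    std::function<void(void* mapped)> unmap;                     // unmapFile
    std::function<void*(void* mapped, std::uint32_t length)> decrypt;
};

// Assumed control flow: the stock LoadMetadataFile plus a size query and a
// final call into the (presumed) decryption routine.
void* loadMetadata(const MetadataIo& io) {
    int error = 0;
    void* handle = io.open(&error);
    if (error != 0)
        return nullptr;

    const std::int64_t fullSize = io.size(handle, &error);
    // Mirrors 'metadataSize = v27.LowPart': only the low 32 bits are kept.
    const auto metadataSize = static_cast<std::uint32_t>(fullSize);
    if (error != 0)
        return nullptr;   // like the decompilation, nothing else happens here

    void* mapped = io.map(handle);
    io.close(handle, &error);
    if (error != 0) {
        io.unmap(mapped);
        return nullptr;
    }
    return io.decrypt(mapped, metadataSize);   // the added else branch
}
```

Injecting the operations through std::function keeps the sketch portable and makes it easy to see which calls were added relative to the stock function.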

Tip: Mastering IDA keyboard shortcuts can dramatically improve your productivity. Here are the shortcuts I used for the session above:

Jump to virtual address: G, type the address, label or function name, Enter
Jump to start of current function: CTRL+P, Enter
Rename symbol: N, type the symbol name, Enter
Decompile current function: F5
Change variable type: place cursor on the variable, Y, input the type declaration, Enter
Navigate in visited function history: forward and back buttons on mouse
View cross-references to function: place cursor on the function name, X

Sharpen your knives

We’re quietly hopeful we’ve found the decryption function at this point, starting at 0x7FFF4E2C2110, which given the rebased image base of 0x7FFF4E280000 puts it at offset 0x42110 from the start of the loaded image (0x7FFF4E2C2110 - 0x7FFF4E280000 = 0x42110). If we can call this function in isolation, we can decrypt the metadata without needing to reverse engineer that horrendous code, although we won’t actually understand how the encryption works.
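The address arithmetic is worth spelling out, because the image base changes from run to run while the offset does not. A small sketch (the helper names toOffset and rebase are invented for illustration):

```cpp
#include <cstdint>

// Offset of an address from the base of the module that contains it.
constexpr std::uint64_t toOffset(std::uint64_t virtualAddress, std::uint64_t imageBase) {
    return virtualAddress - imageBase;
}

// The same function's address when the module is loaded somewhere else.
constexpr std::uint64_t rebase(std::uint64_t offset, std::uint64_t newImageBase) {
    return newImageBase + offset;
}

// The numbers from the debugging session above.
static_assert(toOffset(0x7FFF4E2C2110ULL, 0x7FFF4E280000ULL) == 0x42110,
              "function offset from image base");
```

In a different process the DLL will most likely land at a different base, so the address to call is rebase(0x42110, thatBase) rather than the absolute address seen in the debugger.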

Note there is no guarantee this will “just work”. There may be other initialization that needs to be performed first, but as always we try to take the path of least resistance. If it doesn’t work, we just have to go back to the disassembly and look through the rest of the call stack to find any other extra code.

It’s arguably easier to use C or C++ for this test, but I like to work in C# so I’ll demonstrate with that. The code is pretty simple:

using System;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;

public static class Test
{
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
    private static extern IntPtr LoadLibrary(string path);

    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);

    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate IntPtr DecryptMetadata(byte[] bytes, int length);

    public static void Main(string[] args) {
        IntPtr hModule = LoadLibrary("UnityPlayer.dll");

        IntPtr moduleBase = System.Diagnostics.Process.GetCurrentProcess().Modules.Cast<System.Diagnostics.ProcessModule>().First(m => m.ModuleName == "UnityPlayer.dll").BaseAddress;

        byte[] metadata = File.ReadAllBytes("global-metadata.dat");

        var pDecryptMetadata = (DecryptMetadata) Marshal.GetDelegateForFunctionPointer(moduleBase + 0x42110, typeof(DecryptMetadata));

        IntPtr pDecrypted = pDecryptMetadata(metadata, metadata.Length);

        byte[] decryptedMetadata = new byte[metadata.Length];
        Marshal.Copy(pDecrypted, decryptedMetadata, 0, metadata.Length);

        FreeLibrary(hModule);

        File.WriteAllBytes("global-metadata-decrypted.dat", decryptedMetadata);
    }
}

The Windows APIs LoadLibrary and FreeLibrary are used to dynamically load and unload DLLs at runtime. The .NET Framework base class library doesn’t expose this functionality (newer .NET versions add System.Runtime.InteropServices.NativeLibrary), so we use the DllImport attribute to import them directly from kernel32.dll where they are defined (lines 8-12). I will leave it as an exercise for the reader to figure out how kernel32.dll gets loaded 🙂

Delegates are .NET’s type-safe version of function pointers, so we define a delegate that matches the signature of the function we want to call and decorate it with UnmanagedFunctionPointer (lines 14-15). The single argument, a member of the CallingConvention enum, is extremely important to get right, as it specifies how the delegate arguments will be passed to the unmanaged function, i.e. whether they will be passed in registers, pushed onto the stack or a combination thereof. Get this wrong and the target function won’t receive its arguments correctly, and will probably crash. For 64-bit Windows applications, it’s actually not much of a problem because all of the common calling conventions (cdecl, stdcall, fastcall and thiscall) behave the same way: the first four integer or pointer arguments are passed in RCX, RDX, R8 and R9, and the rest are pushed on the stack from right-to-left; the return value is supplied in RAX. When working with 32-bit applications, however, all of these calling conventions work differently and you must look at the assembly code to determine which is in use.

Honkai Impact is shipped as a 64-bit binary so we don’t need to worry, but just for the sake of completeness let’s take a look at the call site:

.text:00007FFF41EE0720 ; 168:     v27 = getFileSize(hFile_1, &error);
.text:00007FFF41EE0720                 lea     rdx, [rbp+57h+error]
.text:00007FFF41EE0724                 mov     rcx, rax
.text:00007FFF41EE0727                 call    getFileSize
.text:00007FFF41EE072C ; 169:     metadataSize = v27.LowPart;
.text:00007FFF41EE072C                 mov     r12, rax
.text:00007FFF41EE072F ; 170:     if ( !*&error )
.text:00007FFF41EE072F                 cmp     [rbp+57h+error], 0
.text:00007FFF41EE0733                 jnz     short loc_7FFF41EE076F
.text:00007FFF41EE0735 ; 172:       hFile = mapFile(hFile_1, 0i64, 0);
.text:00007FFF41EE0735                 xor     r8d, r8d        ; dwFileOffsetLow
.text:00007FFF41EE0738                 xor     edx, edx        ; dwNumberOfBytesToMap
.text:00007FFF41EE073A                 mov     rcx, rbx        ; hFile
.text:00007FFF41EE073D                 call    mapFile
.text:00007FFF41EE0742                 mov     r14, rax
.text:00007FFF41EE0745 ; 173:       closeFile(hFile_1, &error);
.text:00007FFF41EE0745                 lea     rdx, [rbp+57h+error]
.text:00007FFF41EE0749                 mov     rcx, rbx
.text:00007FFF41EE074C                 call    closeFile
.text:00007FFF41EE0751                 mov     rcx, r14        ; lpBaseAddress
.text:00007FFF41EE0754 ; 174:       if ( *&error )
.text:00007FFF41EE0754                 cmp     [rbp+57h+error], 0
.text:00007FFF41EE0758                 jz      short loc_7FFF41EE0763
.text:00007FFF41EE075A ; 175:         unmapFile(hFile);
.text:00007FFF41EE075A                 xor     edx, edx
.text:00007FFF41EE075C                 call    unmapFile
.text:00007FFF41EE0761                 jmp     short loc_7FFF41EE076F
.text:00007FFF41EE0763 ; 177:         v0 = pDoSomethingWithMetadata(hFile, metadataSize);
.text:00007FFF41EE0763                 mov     edx, r12d       ; _QWORD
.text:00007FFF41EE0766                 call    cs:pDoSomethingWithMetadata
.text:00007FFF41EE076C                 mov     rsi, rax

On line 6, the return value from getFileSize is stored in R12. On line 15, the return value from mapFile is stored in R14. On lines 20 and 29, RCX and RDX are set to the two arguments of DoSomethingWithMetadata – the memory pointer (from R14) and the file size (from R12) respectively. The function is called on line 30, and on line 31 the return value from RAX is stored in RSI.

Going back to the C# code, we first load UnityPlayer.dll into memory (line 18) and then find its base address in memory (line 20). This code iterates through every DLL loaded in the process until it finds one called UnityPlayer.dll, then takes its base address.

Line 22 loads global-metadata.dat into an array of bytes. Line 24 is the key, and essentially the main result of our work so far: it creates a delegate which points to our DoSomethingWithMetadata function at offset 0x42110 from the loaded image base address, using the correct parameter types.

Line 26 calls the function in UnityPlayer.dll. The returned pointer is in unmanaged memory of course, so lines 28-29 copy the data it points to into a managed array. Line 31 unloads the DLL, and line 33 writes the output of the function to a file.
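For anyone who prefers the C or C++ route mentioned earlier, the delegate boils down to an ordinary function pointer built from the module base plus the offset. A hedged sketch follows. The names DecryptMetadataFn and resolveFunction are invented for this illustration, and the signature is inferred from the call site rather than confirmed. Obtaining moduleBase (for example from LoadLibrary on Windows) is left out so the snippet stays portable.

```cpp
#include <cstdint>

// Assumed signature, inferred from the call site: pointer to the file data
// and its 32-bit length in, pointer to the processed data out.
using DecryptMetadataFn = void* (*)(void* data, std::uint32_t length);

// Turn "module base + offset" into a callable function pointer. The caller is
// responsible for making sure the module is loaded and the offset is right.
DecryptMetadataFn resolveFunction(std::uintptr_t moduleBase, std::uintptr_t offset) {
    return reinterpret_cast<DecryptMetadataFn>(moduleBase + offset);
}
```

With the base address of the loaded DLL in hand, the call would then be along the lines of resolveFunction(moduleBase, 0x42110)(buffer, length). Nothing verifies the signature for us, so a wrong guess will most likely crash rather than fail gracefully.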

We run the program and open up the output in a hex editor next to the original metadata:

Well, it’s done… something. The first 0x28 (or possibly 0x24) bytes still don’t make any sense to us, but we can clearly see that some bytes have been changed into metadata header entries consistent with what appears immediately following.

The two blocks of unknown data are still as garbled as ever (we assume the one at the end of the file is a decryption key of some kind though), but what about those periodic encrypted 0x40-byte blocks scattered throughout the file?

Gottem!
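To get a quick overview of what the function actually touched without squinting at a hex editor, one option is to list every run of bytes that differs between the original and the processed file. This is just a sketch with a hypothetical helper name. Note that a byte which happens to be identical before and after will split a run in two, so treat the ranges as approximate.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the [start, end) offsets of each run of bytes that differs between
// the two buffers. Only the common prefix length is compared.
std::vector<std::pair<std::size_t, std::size_t>> changedRanges(
        const std::vector<std::uint8_t>& before,
        const std::vector<std::uint8_t>& after) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t n = std::min(before.size(), after.size());
    std::size_t i = 0;
    while (i < n) {
        if (before[i] == after[i]) {
            ++i;
            continue;
        }
        const std::size_t start = i;          // first differing byte of a run
        while (i < n && before[i] != after[i])
            ++i;
        ranges.emplace_back(start, i);        // 'i' is one past the run
    }
    return ranges;
}
```

Runs that start at regular intervals and are roughly 0x40 bytes long would correspond to the periodic blocks discussed above.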

Next time…

Clearly there is still much work to be done, but we’ve made good headway in a short amount of time. I’m sad to say that this was the easy part: in the next part of this mini-series, we’ll find out how miHoYo abused my spare time by creating a nightmare scenario of metadata reordering, what happened to the string literals (spoiler alert: the large block of encrypted data at the start of the file is the string literals), and how to reverse engineer it all. You won’t want to miss it, it’s going to be tedious as hell. Until next time…

  1. kurumi
    February 9, 2021 at 05:43

    Using your method, I tried to unpack a Genshin blk file (Genshin 1.3). I found LoadFromFileWithMiHoYoPath (UserAssembly: 0x4B97870, UnityPlayer: 0xBB1650), but it also calls a UserAssembly method at the same time, so I found that il2cpp_init(string UserAssemblyPath) at 0x0B4B5B0 is used to initialize the symbol. However, LoadFromFileWithMiHoYoPath still cannot be executed successfully. I am very confused now; I don’t know whether to analyze the assembly or keep trying to get the call to succeed.
