Introducing Lumen
Overview
IDA 7.2 was released on November 5th, 2018. One of the highlights of this release was the new experimental Lumina feature. Lumina is a service that allows its users to share metadata about functions from their IDA database (idb), so that other users can save time when they run into similar functions. This can be a very powerful, time-saving feature.
IDA’s developer, Hex-Rays, announced that they are currently not offering a private Lumina server. Users with strict data sharing policies may not be allowed to use Lumina, rendering the feature useless for them. Sometimes you will spot a Lumina function with a totally random name; I assume that such behaviour is a result of poisoning the Lumina database by sharing data by mistake.
“Just before you ask, we currently do not offer Lumina for private use. Let us wait for it to naturally grow and become mature.” [1]
Private Lumina server
I wanted to take advantage of such a feature, but also wanted to be in control of the data I share. That’s why I decided to create my own private Lumina-compatible server. Lumina’s protocol isn’t a standard protocol, nor is it publicly documented anywhere at this time. It required reverse-engineering IDA to understand Lumina’s protocol, especially since it’s a binary protocol.
A live version of Lumen is currently available at lumen.abda.nl.
From this point on, this article is about Lumina’s protocol.
Analyzing the protocol
The first step to understanding a protocol is getting sample data. Lumina traffic is encrypted using TLS by default, so I had a look at IDA’s configuration file, which contained two related configurable values: “LUMINA_HOST” & “LUMINA_PORT”. I set them to 127.0.0.1 and 1234 respectively. TLS still had to be disabled; a look into IDA’s string view revealed that yet another configurable value, “LUMINA_TLS”, is available and can be set to “NO”.
I could have used Wireshark for this, but since I was writing a server anyway, I created a proxy for Lumina. The proxy would connect to the official Lumina server using TLS, and forward everything to and from IDA. The more of the protocol I implemented, the fewer hex dumps remained.
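A minimal sketch of such a logging proxy, written in Python with asyncio. This is not the proxy I actually used (Lumen is written in Rust). The listen address matches the values set above, while the upstream host and port are assumptions that are not taken from this article and may need adjusting, as may the TLS verification settings.

```python
import asyncio
import ssl

LISTEN_HOST, LISTEN_PORT = "127.0.0.1", 1234  # the values configured in IDA above
# Assumption: official server address, not stated in this article.
UPSTREAM_HOST, UPSTREAM_PORT = "lumina.hex-rays.com", 443


def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex columns and printable ASCII."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hex_part:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)


async def pump(label, reader, writer):
    """Copy bytes one way until EOF, dumping everything that passes through."""
    try:
        while data := await reader.read(4096):
            print(f"--- {label} ({len(data)} bytes)")
            print(hexdump(data))
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def handle_client(ida_reader, ida_writer):
    """IDA connects in plaintext; the proxy speaks TLS to the upstream server."""
    ctx = ssl.create_default_context()
    up_reader, up_writer = await asyncio.open_connection(
        UPSTREAM_HOST, UPSTREAM_PORT, ssl=ctx)
    await asyncio.gather(
        pump("IDA -> server", ida_reader, up_writer),
        pump("server -> IDA", up_reader, ida_writer),
    )


async def main():
    server = await asyncio.start_server(handle_client, LISTEN_HOST, LISTEN_PORT)
    async with server:
        await server.serve_forever()

# To start the proxy: asyncio.run(main())
```

Dumping raw reads is enough to get started; once the framing is understood, the dump can be replaced with per-message decoding.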
RPC Messages
I realized that the first 4 bytes represent the frame’s length. Every request is prefixed with a 5-byte header: the first 4 bytes are the payload’s length and the 5th byte is an “RPC code”.
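The framing can be sketched in a few lines of Python. The byte order of the length field is not stated above, so big-endian is an assumption here, as is the reading that the length counts only the payload and not the RPC code byte. The helper names `build_frame` and `split_frame` are made up for this example.

```python
import struct


def build_frame(rpc_code: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with the 5-byte header: length, then RPC code."""
    # Assumption: big-endian length that excludes the RPC code byte.
    return struct.pack(">IB", len(payload), rpc_code) + payload


def split_frame(buf: bytes):
    """Return (rpc_code, payload, rest), or None if the frame is incomplete."""
    if len(buf) < 5:
        return None
    length, rpc_code = struct.unpack_from(">IB", buf)
    end = 5 + length
    if len(buf) < end:
        return None
    return rpc_code, buf[5:end], buf[end:]
```

Returning the unconsumed remainder makes it easy to call `split_frame` repeatedly on a receive buffer that may hold more than one frame, or only part of one.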
After reversing IDA it was clear that all messages are serialized and deserialized with a few basic types:
- dd - A “packed” representation of an unsigned 32-bit integer. It uses the IDA SDK’s `pack_dd` & `unpack_dd`. I will provide more details about this later in the article.
- dq - An unsigned 64-bit integer. Packed as two `dd`s, one for each half of the number (high/low).
- cstr - A null-terminated C string.
- bytes - An instance of `array<u8>`. It’s simply the length of the array in its packed form followed by the bytes.
- array<T> - Variable-sized array, `pack_dd(n)` followed by `n` times `T`.
- seq<T>(len) - A fixed-size “array”. Since the size is fixed, the length is omitted.
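To make these types concrete, here is a small Python sketch of a payload reader. The `Reader` class and its method names are hypothetical. The `dd` decoding follows the packing table given later in this article and rejects first bytes that the table does not cover. Decoding strings as UTF-8 is an assumption. `dq` is left out because the order of its two halves is not pinned down here.

```python
class Reader:
    """Cursor over a message payload, decoding the basic types listed above."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        if self.pos + n > len(self.data):
            raise ValueError("unexpected end of payload")
        out = self.data[self.pos:self.pos + n]
        self.pos += n
        return out

    def dd(self) -> int:
        first = self.take(1)[0]
        if first < 0x80:                 # 0aaaaaaa
            return first
        if first < 0xC0:                 # 10aaaaaa bbbbbbbb
            return ((first & 0x3F) << 8) | self.take(1)[0]
        if first == 0xC0:                # marker byte + 3 raw bytes
            return int.from_bytes(self.take(3), "big")
        if first == 0xFF:                # marker byte + 4 raw bytes
            return int.from_bytes(self.take(4), "big")
        raise ValueError(f"first byte {first:#04x} is not covered by the table")

    def cstr(self) -> str:
        end = self.data.index(b"\x00", self.pos)  # ValueError if unterminated
        raw = self.take(end - self.pos)
        self.pos += 1                             # skip the terminator
        return raw.decode("utf-8", errors="replace")  # assumption: UTF-8

    def bytes_(self) -> bytes:
        return self.take(self.dd())

    def array(self, read_item):
        return [read_item(self) for _ in range(self.dd())]

    def seq(self, read_item, length: int):
        return [read_item(self) for _ in range(length)]


# Example with a made-up payload: a dd followed by a cstr.
r = Reader(bytes([0x01]) + b"not found\x00")
status_code, message = r.dd(), r.cstr()
```

Passing the item reader as a callable keeps `array` and `seq` generic, so nested structures can be read with `r.array(lambda r: (r.dd(), r.bytes_()))`.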
The protocol starts with IDA sending the server an `RPC_HELO` message. If the server wishes to continue, it responds with an `RPC_OK` message. The server may respond to any request with an `RPC_FAIL` message to indicate failure.
Once the server replies with `RPC_OK`, the client may send either `RPC_PULLMD` or `RPC_PUSHMD`, to which the server should reply with an `RPC_PULLMD_RES` or `RPC_PUSHMD_RES` respectively.
This can be done many times in a loop throughout the connection’s lifetime.
There are a few more RPC commands that have serialization & deserialization implemented, but those objects are not used and change between IDA versions. The following RPCs are currently used and have existed from IDA 7.2 through IDA 7.5 SP1.
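The message flow described above can be sketched as a tiny state machine in Python. The RPC codes are the ones listed in the sections that follow. The `Session` class is hypothetical, and answering `RPC_FAIL` to anything sent before `RPC_HELO`, or to a repeated `RPC_HELO`, is my assumption rather than observed behaviour.

```python
RPC_OK, RPC_FAIL = 0x0A, 0x0B
RPC_HELO = 0x0D
RPC_PULLMD, RPC_PULLMD_RES = 0x0E, 0x0F
RPC_PUSHMD, RPC_PUSHMD_RES = 0x10, 0x11

# Each request code maps to the response code the server should use.
REPLY_FOR = {RPC_PULLMD: RPC_PULLMD_RES, RPC_PUSHMD: RPC_PUSHMD_RES}


class Session:
    """Tracks whether the client has completed the RPC_HELO handshake."""

    def __init__(self):
        self.greeted = False

    def reply_code(self, request_code: int) -> int:
        """Return the RPC code the server should answer with."""
        if not self.greeted:
            if request_code == RPC_HELO:
                self.greeted = True
                return RPC_OK
            return RPC_FAIL  # assumption: nothing is accepted before HELO
        return REPLY_FOR.get(request_code, RPC_FAIL)
```

A real server would build the response payload as well; this only shows which response type belongs to which request.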
RPC_OK
RPC Code: 0x0A
This message doesn’t have a payload.
RPC_FAIL
RPC Code: 0x0B
- dd - `status_code`
- cstr - `message`
RPC_HELO
RPC Code: 0x0D
- dd - TBD
- bytes - `license_key`
- seq<u8>(6) - `license_id`
- dd - TBD
RPC_PULLMD
RPC Code: 0x0E
- dd - `arch` (0=32 bit, 1=64 bit)
- array<…>
  - dd - TBD
- array<…>
  - dd - TBD
  - bytes - `function_hash`
RPC_PULLMD_RES
RPC Code: 0x0F
- array<…>
  - dd - `status`: 0=Found
- array<…>
  - cstr - `function_name`
  - dd - `function_length`
  - bytes - `metadata_payload`
  - dd - `popularity`
RPC_PUSHMD
RPC Code: 0x10
- dd - TBD
- cstr - `idb_path`
- cstr - `original_filepath`
- seq<u8>(16) - `md5`
- cstr - `hostname`
- array<…>
  - cstr - `function_name`
  - dd - `function_length`
  - bytes - `metadata_payload`
  - dd - TBD
  - bytes - `function_hash`
- array<…>
  - dq - TBD
I couldn’t understand why the hostname, original file path and IDB path are collected. Perhaps they will help identify peers once IDA has more support for collaboration in the future. Still, this could probably have been done using more anonymous identifiers (such as hashes of these values).
RPC_PUSHMD_RES
RPC Code: 0x11
- array<…>
  - dd - `status`: 0=Exists, 1=New
pack_dd & unpack_dd
While writing the de/serializing code, I noticed something weird with `pack_dd` & `unpack_dd`’s behavior. According to the SDK’s documentation they are supposed to be “utf-8 like encoding”. Given that they are packing functions, it would make sense for them to use a minimal amount of bytes.
This table describes how a number is packed.
Min | Max(incl.) | data bits | encoded bytes | packed data |
---|---|---|---|---|
0 | 0x7f | 7 | 1 | 0aaaaaaa |
0x80 | 0x3fff | 14 | 2 | 10aaaaaabbbbbbbb |
0x4000 | 0x1fffff | 24 | 4 | 11000000aaaaaaaabbbbbbbbcccccccc |
0x200000 | 0xffffffff | 32 | 5 | 11111111aaaaaaaabbbbbbbbccccccccdddddddd |
Notice the 3rd case? To me, it would make more sense if the behavior were like this:
Min | Max(incl.) | data bits | encoded bytes | packed data |
---|---|---|---|---|
0 | 0x7f | 7 | 1 | 0aaaaaaa |
0x80 | 0x3fff | 14 | 2 | 10aaaaaabbbbbbbb |
0x4000 | 0x1fffff | 21 | 3 | 110aaaaabbbbbbbbcccccccc |
0x200000 | 0xfffffff | 28 | 4 | 1110aaaabbbbbbbbccccccccdddddddd |
0x10000000 | 0xffffffff | 32 | 5 | 11111111aaaaaaaabbbbbbbbccccccccdddddddd |
In some cases the current behavior of `pack_dd` would waste a byte.
I don’t know why `pack_dd` works this way, but I guess it won’t be possible to fix due to backwards compatibility issues.
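The difference between the two tables can be checked with a short Python sketch. `pack_dd` below is my transcription of the first table, not the SDK’s code, and `utf8_like_size` is a hypothetical helper that only computes the size a value would take under the second table.

```python
def pack_dd(value: int) -> bytes:
    """Encode a 32-bit value following the first table above."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("dd must fit in 32 bits")
    if value <= 0x7F:
        return bytes([value])                              # 0aaaaaaa
    if value <= 0x3FFF:
        return bytes([0x80 | (value >> 8), value & 0xFF])  # 10aaaaaa bbbbbbbb
    if value <= 0x1FFFFF:
        return b"\xC0" + value.to_bytes(3, "big")          # marker + 3 bytes
    return b"\xFF" + value.to_bytes(4, "big")              # marker + 4 bytes


def utf8_like_size(value: int) -> int:
    """Encoded size in bytes under the hypothetical second table."""
    for size, limit in ((1, 0x7F), (2, 0x3FFF), (3, 0x1FFFFF), (4, 0xFFFFFFF)):
        if value <= limit:
            return size
    return 5


# Values where the observed encoding is one byte longer than the alternative.
for sample in (0x4000, 0x1FFFFF, 0x200000, 0xFFFFFFF):
    print(hex(sample), len(pack_dd(sample)), utf8_like_size(sample))
```

For every value between 0x4000 and 0xfffffff the first table needs one byte more than the second; outside that range the two agree.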
Function metadata payload
Many types of function metadata are relative to an offset in a function. For example, a function may have many comments, each of which is relevant to a specific offset in the function.
This container can be described as follows:
- dd - `start_offset`
- In a loop until end of buffer:
  - dd - `offset_diff`
  - T - Depends on the type contained in this container.
Each element in the list increases its offset by `offset_diff`. If `offset_diff` is set to zero, the current offset is reset. This allows going back to lower offsets, or storing information which doesn’t need an offset.
Function metadata is stored as chunks; each chunk has a length and a tag. Parsing the metadata depends on the tag, and unrecognized tags can be skipped since the length of the chunk is known.
Metadata Chunk:
- dd - `metadata_tag`
- array<u8> - `data`
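Splitting a metadata payload into chunks can be sketched as follows in Python. The function names are hypothetical, and the sketch assumes that chunks are simply stored back to back until the end of the buffer. The `dd` decoder follows the packing table above and is included only so that the example stands alone.

```python
def unpack_dd(buf: bytes, pos: int):
    """Decode one dd at pos per the packing table; return (value, new_pos)."""
    first = buf[pos]
    if first < 0x80:
        size = 1
    elif first < 0xC0:
        size = 2
    elif first == 0xC0:
        size = 4
    elif first == 0xFF:
        size = 5
    else:
        raise ValueError(f"first byte {first:#04x} is not covered by the table")
    if pos + size > len(buf):
        raise ValueError("truncated dd")
    if size == 1:
        return first, pos + 1
    if size == 2:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    return int.from_bytes(buf[pos + 1:pos + size], "big"), pos + size


def iter_chunks(payload: bytes):
    """Yield (tag, data) pairs; assumes chunks are concatenated until the end."""
    pos = 0
    while pos < len(payload):
        tag, pos = unpack_dd(payload, pos)
        length, pos = unpack_dd(payload, pos)
        if pos + length > len(payload):
            raise ValueError("truncated chunk")
        yield tag, payload[pos:pos + length]
        pos += length


# Made-up payload: tag 0x03 with "hi", an unknown tag 0x7e, and an empty tag 0x04.
sample = bytes([0x03, 0x02]) + b"hi" + bytes([0x7E, 0x01, 0xAA]) + bytes([0x04, 0x00])
chunks = list(iter_chunks(sample))
```

Because every chunk carries its own length, a parser can dispatch on the tags it knows and ignore the rest without losing its place in the buffer.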
type_info
Tag: 0x01
This contains a serialized `tinfo` that can be deserialized using the IDA SDK’s `deserialize_tinfo`. Lumen doesn’t parse this tag at the moment.
Function Comment
Tag: 0x03, and 0x04 (repeatable comment)
The chunk’s payload is encoded as bytes and is a string that is not null-terminated.
Comments
Tag: 0x05, and 0x06 (repeatable comment)
Comments are stored in the container that was described above. The container’s T type is encoded as bytes, and can be interpreted as a string that is not null-terminated.
Extra Comments
Tag: 0x07
This type stores anterior and posterior comments. The data is encoded in the container described above, where T is the following structure:
- bytes - `anterior_comment`
- bytes - `posterior_comment`
Writing the server
I decided to use Rust to implement the server for a few reasons:
- I enjoy developing in it, and this project is for fun.
- It’s a compiled language and it’s strict, there shouldn’t be any `RuntimeException`s or `NullPointerException`s in the future.
- Rust has powerful networking and serialization libraries.
The server currently uses PostgreSQL via `tokio-postgres` for storage.
The POC version of the server would just overwrite any record in the database. Such behavior would be bad if two users overwrote each other’s work with IDA’s PushAllMetaData: if the last user didn’t add any useful comments, the server could end up overwriting useful information. My solution was to rank every push that was part of a batch. The server would detect common string formats, and would only give the metadata points if it contained information that wasn’t automatically generated. The version with more points would replace the database’s entry. If a user decided to push a specific function, it would overwrite the database value regardless of the function’s ranking.
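The ranking idea can be sketched roughly like this in Python. It is not Lumen’s actual scoring: the pattern for auto-generated names, the one-point-per-item weights, and the function names are all assumptions made for illustration, as is requiring a strictly higher score before a batch push replaces an entry.

```python
import re

# Assumption: names such as sub_401000 or nullsub_1 count as auto-generated.
AUTO_NAME = re.compile(r"^(sub_[0-9A-Fa-f]+|nullsub_\d+)$")


def score(function_name: str, comments) -> int:
    """Give one point for a meaningful name and one per non-empty comment."""
    points = 0
    if not AUTO_NAME.match(function_name):
        points += 1
    points += sum(1 for comment in comments if comment.strip())
    return points


def should_replace(stored_score: int, pushed_score: int, batch_push: bool) -> bool:
    """Pushing one specific function always wins; batch pushes must score higher."""
    return (not batch_push) or pushed_score > stored_score


# A batch push with an auto-generated name does not replace an annotated entry.
stored = score("parse_header", ["checks the magic value"])
pushed = score("sub_401000", [])
replace = should_replace(stored, pushed, batch_push=True)
```

Keeping the score separate from the replacement rule makes it easy to change what counts as useful metadata without touching the overwrite logic.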
Thank you for reading. Please tell me what you think, @naim94a.
If you're interested how IDA Pro's #Lumina's protocol & metadata works, here's my write-up! https://t.co/78E6GiwFbo #IDAPro #Lumina https://t.co/ECn4MYF7Ft
— Naim A. (@naim94a) December 15, 2020
1. https://www.hex-rays.com/products/ida/lumina/; Retrieved September 13th, 2020