Introducing Lumen

Overview

IDA 7.2 was released on November 5th, 2018. One of the highlights of this release was the new experimental Lumina feature. Lumina is a service that allows its users to share metadata about functions from their IDA database (idb), so that other users can save time on similar functions. This can be a very powerful, time-saving feature.

IDA’s developer, Hex-Rays, announced that they are currently not offering a private Lumina server. Users with strict data sharing policies may not be allowed to use Lumina - rendering the feature useless for them. Sometimes you will spot a Lumina function with a totally random name - I assume that such behaviour is a result of poisoning the Lumina database by sharing data by mistake.

Just before you ask, we currently do not offer Lumina for private use. Let us wait for it to naturally grow and become mature.¹

Private Lumina server

I wanted to take advantage of such a feature, but also wanted to be in control of the data I share - that’s why I decided to create my own private Lumina-compatible server. Lumina’s protocol isn’t a standard protocol, nor is it documented anywhere publicly at this time. Building the server required reverse-engineering IDA and understanding Lumina’s protocol, which took some effort since it’s a binary protocol.

A live version of Lumen is currently available at lumen.abda.nl.

From this point on this article will be about Lumina’s protocol.

Analyzing the protocol

The first step to understanding a protocol is getting sample data. Lumina traffic is encrypted using TLS by default, so I had a look at IDA’s configuration file, which contained two related configurable values: “LUMINA_HOST” & “LUMINA_PORT”. I set them to 127.0.0.1 and 1234 respectively. TLS still had to be disabled; a look into IDA’s string view revealed that yet another configurable value, “LUMINA_TLS”, is available and can be set to “NO”.

I could have used Wireshark for this, but since I was writing a server anyway, I created a proxy for Lumina. The proxy would connect to the official Lumina server using TLS, and forward everything to and from IDA. The more of the protocol I implemented, the fewer hex dumps I had to read.

RPC Messages

I realized that every message starts with a 5-byte header: the first 4 bytes are the payload’s length and the 5th byte is an “RPC code”. The payload follows the header.
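As a rough illustration, the 5-byte header could be handled like this in Python. The byte order of the length field is not stated above; this sketch assumes big-endian, and the names `encode_frame` and `decode_frame` are made up for the example.

```python
import struct

# Assumption: the 4-byte length is big-endian (network byte order).
HEADER = struct.Struct(">IB")  # u32 payload length, u8 RPC code


def encode_frame(code: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with its length and RPC code."""
    return HEADER.pack(len(payload), code) + payload


def decode_frame(buf: bytes):
    """Return (code, payload, rest), or None if buf holds no complete frame yet."""
    if len(buf) < HEADER.size:
        return None
    length, code = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return code, buf[HEADER.size:end], buf[end:]
```

Returning the unconsumed `rest` lets a caller keep appending socket reads to one buffer and peel frames off the front as they complete.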

After reversing IDA it was clear that all messages are serialized and deserialized with a few basic types:

dd - A “packed” representation of an unsigned 32-bit integer. It uses IDA SDK’s pack_dd & unpack_dd. I will provide more details about this later in the article.
dq - An unsigned 64-bit integer. Packed as two dds, one for each half of the number (high/low).
cstr - A null-terminated C string.
bytes - An instance of array<u8>. It’s simply the length of the array in its packed form followed by the bytes.
array<T> - Variable-sized array, pack_dd(n) followed by n times T.
seq<T>(len) - A fixed-size “array”. Since the size is fixed, the length is omitted.
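A minimal Python sketch of a reader for these types might look like the following. It is not Lumen's code: the `Reader` class and its method names are hypothetical, `unpack_dd` follows the packing table given in the pack_dd section of this article, strings are assumed to be UTF-8, and dq is left out because the order of its two halves is not pinned down here.

```python
def unpack_dd(buf, pos=0):
    """Decode one packed dd at buf[pos]; return (value, next_pos). No bounds checks."""
    b0 = buf[pos]
    if (b0 & 0x80) == 0:                       # 0aaaaaaa
        return b0, pos + 1
    if (b0 & 0xC0) == 0x80:                    # 10aaaaaa bbbbbbbb
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if (b0 & 0xE0) == 0xC0:                    # 110xxxxx + 3 more bytes
        # The table shows the low five bits of b0 as zero; folding them in
        # is this sketch's assumption and changes nothing for such inputs.
        return ((b0 & 0x1F) << 24) | int.from_bytes(buf[pos + 1:pos + 4], "big"), pos + 4
    return int.from_bytes(buf[pos + 1:pos + 5], "big"), pos + 5   # 0xFF + 4 bytes


class Reader:
    """Cursor over a message payload (hypothetical helper)."""

    def __init__(self, buf):
        self.buf, self.pos = buf, 0

    def dd(self):
        value, self.pos = unpack_dd(self.buf, self.pos)
        return value

    def cstr(self):
        end = self.buf.index(0, self.pos)      # raises ValueError if unterminated
        text = self.buf[self.pos:end].decode("utf-8", "replace")  # encoding assumed
        self.pos = end + 1
        return text

    def seq(self, n):
        """seq<u8>(n): exactly n raw bytes, no length prefix."""
        chunk = self.buf[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("truncated buffer")
        self.pos += n
        return chunk

    def bytes_(self):
        """bytes: packed length followed by that many bytes."""
        return self.seq(self.dd())

    def array(self, read_item):
        """array<T>: packed count followed by count items read by read_item(reader)."""
        return [read_item(self) for _ in range(self.dd())]
```

For example, `Reader(payload).array(Reader.dd)` reads an `array<dd>`, and passing a small function that reads several fields handles arrays of structures.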

The protocol starts with IDA sending the server an RPC_HELO message. If the server wishes to continue, it responds with an RPC_OK message. The server may respond to any request with an RPC_FAIL message to indicate failure.

Once the server replies with RPC_OK, the client may send either RPC_PULLMD or RPC_PUSHMD, to which the server should reply with an RPC_PULLMD_RES or RPC_PUSHMD_RES respectively. This can be done many times in a loop throughout the connection’s lifetime.

There are a few more RPC commands that have serialization & deserialization implemented, but such objects are not used and change between IDA versions. The following RPCs are currently used and have existed from IDA 7.2 through IDA 7.5 SP1.
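The request/response flow can be summarized in a few lines of Python. The RPC codes are the ones listed in the sections that follow; the `Rpc` enum, `REPLY_FOR` and `next_reply` are hypothetical names, and rejecting anything sent before RPC_HELO is this sketch's own policy rather than documented behavior.

```python
from enum import IntEnum


class Rpc(IntEnum):
    OK = 0x0A
    FAIL = 0x0B
    HELO = 0x0D
    PULLMD = 0x0E
    PULLMD_RES = 0x0F
    PUSHMD = 0x10
    PUSHMD_RES = 0x11


# Which reply each client request expects on success.
REPLY_FOR = {
    Rpc.HELO: Rpc.OK,
    Rpc.PULLMD: Rpc.PULLMD_RES,
    Rpc.PUSHMD: Rpc.PUSHMD_RES,
}


def next_reply(code, greeted):
    """Pick the reply code for a request, or RPC_FAIL if it is unknown or out of order."""
    try:
        request = Rpc(code)
    except ValueError:
        return Rpc.FAIL                      # unknown RPC code
    if not greeted and request is not Rpc.HELO:
        return Rpc.FAIL                      # sketch policy: RPC_HELO must come first
    return REPLY_FOR.get(request, Rpc.FAIL)  # reply codes are not valid requests
```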

RPC_OK

RPC Code: 0x0A

This message doesn’t have a payload.

RPC_FAIL

RPC Code: 0x0B

dd - status_code
cstr - message
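To make the layout concrete, here is a small Python sketch that builds a complete RPC_FAIL message. `build_fail` is a hypothetical helper; `pack_dd` follows the packing table in the pack_dd section of this article, and the big-endian length field and UTF-8 message encoding are assumptions.

```python
import struct


def pack_dd(value):
    """Pack an unsigned 32-bit integer as described in the pack_dd table."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("dd out of range")
    if value <= 0x7F:
        return bytes([value])
    if value <= 0x3FFF:
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value <= 0x1FFFFF:
        return b"\xC0" + value.to_bytes(3, "big")
    return b"\xFF" + value.to_bytes(4, "big")


def build_fail(status_code, message):
    """Serialize RPC_FAIL (code 0x0B): dd status_code, then cstr message."""
    payload = pack_dd(status_code) + message.encode("utf-8") + b"\x00"
    # Assumption: 4-byte big-endian payload length, then the RPC code byte.
    return struct.pack(">IB", len(payload), 0x0B) + payload
```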

RPC_HELO

RPC Code: 0x0D

dd - TBD
bytes - license_key
seq<u8>(6) - license_id
dd - TBD

RPC_PULLMD

RPC Code: 0x0E

dd - arch (0=32 bit, 1=64 bit)
array<…>
- dd - TBD
array<…>
- dd - TBD
- bytes - function_hash
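Read literally, this layout is one dd followed by two arrays, which could be parsed along these lines. The function and dictionary key names are hypothetical, the TBD fields are returned untouched, and `unpack_dd` follows the packing table in the pack_dd section of this article.

```python
def unpack_dd(buf, pos=0):
    """Decode one packed dd at buf[pos]; return (value, next_pos). No bounds checks."""
    b0 = buf[pos]
    if (b0 & 0x80) == 0:
        return b0, pos + 1
    if (b0 & 0xC0) == 0x80:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if (b0 & 0xE0) == 0xC0:
        # Low five bits of b0 are zero in the table; folding them in is an assumption.
        return ((b0 & 0x1F) << 24) | int.from_bytes(buf[pos + 1:pos + 4], "big"), pos + 4
    return int.from_bytes(buf[pos + 1:pos + 5], "big"), pos + 5


def read_bytes(buf, pos):
    """bytes: packed length followed by that many raw bytes."""
    length, pos = unpack_dd(buf, pos)
    return buf[pos:pos + length], pos + length


def parse_pullmd(payload):
    """Split an RPC_PULLMD payload into its fields."""
    arch, pos = unpack_dd(payload, 0)

    count, pos = unpack_dd(payload, pos)       # first array: dd values (TBD)
    unknown = []
    for _ in range(count):
        value, pos = unpack_dd(payload, pos)
        unknown.append(value)

    count, pos = unpack_dd(payload, pos)       # second array: (dd TBD, function_hash)
    functions = []
    for _ in range(count):
        tbd, pos = unpack_dd(payload, pos)
        function_hash, pos = read_bytes(payload, pos)
        functions.append((tbd, function_hash))

    return {"arch": arch, "unknown": unknown, "functions": functions}
```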

RPC_PULLMD_RES

RPC Code: 0x0F

array<…>
- dd - status: 0=Found
array<…>
- cstr - function_name
- dd - function_length
- bytes - metadata_payload
- dd - popularity

RPC_PUSHMD

RPC Code: 0x10

dd - TBD
cstr - idb_path
cstr - original_filepath
seq<u8>(16) - md5
cstr - hostname
array<…>
- cstr - function_name
- dd - function_length
- bytes - metadata_payload
- dd - TBD
- bytes - function_hash
array<…>
- dq - TBD

I couldn’t understand why the hostname, original file and IDB paths are collected - perhaps it would help identify peers when IDA has more support for collaboration in the future. Still, this could probably have been done using more anonymous identifiers (such as hashes of these values…)

RPC_PUSHMD_RES

RPC Code: 0x11

array<…>
- dd - status: 0=Exists, 1=New

`pack_dd` & `unpack_dd`

While writing the de/serializing code, I noticed something weird with pack_dd & unpack_dd’s behavior. According to the SDK’s documentation they are supposed to use a “utf-8 like encoding”. Given that they are packing functions, it would make sense for them to use a minimal amount of bytes.

This table describes how a number is packed.

Min	Max(incl.)	data bits	encoded bytes	packed data
0	0x7f	7	1	`0aaaaaaa`
0x80	0x3fff	14	2	`10aaaaaabbbbbbbb`
0x4000	0x1fffff	24	4	`11000000aaaaaaaabbbbbbbbcccccccc`
0x200000	0xffffffff	32	5	`11111111aaaaaaaabbbbbbbbccccccccdddddddd`

Notice the 3rd case? To me, it would make more sense if the behavior was like this:

Min	Max(incl.)	data bits	encoded bytes	packed data
0	0x7f	7	1	`0aaaaaaa`
0x80	0x3fff	14	2	`10aaaaaabbbbbbbb`
0x4000	0x1fffff	21	3	`110aaaaabbbbbbbbcccccccc`
0x200000	0xfffffff	28	4	`1110aaaabbbbbbbbccccccccdddddddd`
0x10000000	0xffffffff	32	5	`11111111aaaaaaaabbbbbbbbccccccccdddddddd`

In some cases the current behavior of pack_dd would waste a byte. I don’t know why pack_dd works this way, but I guess it won’t be possible to fix due to backwards compatibility issues.
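The difference between the two tables is easy to see in code. The Python sketch below implements both as written above: `pack_dd` mirrors the first table and `pack_dd_proposed` (a hypothetical name) mirrors the second. Neither is taken from the IDA SDK source.

```python
def pack_dd(value):
    """Encode per the first table: 1, 2, 4 or 5 bytes (there is no 3-byte form)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("dd out of range")
    if value <= 0x7F:
        return bytes([value])                          # 0aaaaaaa
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")     # 10aaaaaa bbbbbbbb
    if value <= 0x1FFFFF:
        return b"\xC0" + value.to_bytes(3, "big")      # 11000000 + 3 bytes
    return b"\xFF" + value.to_bytes(4, "big")          # 11111111 + 4 bytes


def pack_dd_proposed(value):
    """Encode per the second table: 1, 2, 3, 4 or 5 bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("dd out of range")
    if value <= 0x7F:
        return bytes([value])
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")
    if value <= 0x1FFFFF:
        return (0xC00000 | value).to_bytes(3, "big")   # 110aaaaa + 2 bytes
    if value <= 0xFFFFFFF:
        return (0xE0000000 | value).to_bytes(4, "big") # 1110aaaa + 3 bytes
    return b"\xFF" + value.to_bytes(4, "big")


# Compare encoded sizes at the boundaries where the tables disagree.
for sample in (0x4000, 0x1FFFFF, 0x200000, 0xFFFFFFF):
    print(hex(sample), len(pack_dd(sample)), len(pack_dd_proposed(sample)))
```

For every value from 0x4000 up to 0xfffffff, the second scheme is one byte shorter than the first.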

Function metadata payload

Many types of function metadata are relative to an offset in a function. Example: a function has many comments, each comment is relevant to a specific offset in the function.

This container can be described as follows:

dd - start_offset
In a loop until end of buffer:
- dd - offset_diff
- T - Depends on the type contained in this container.

Each element in the list increases its offset by offset_diff. If offset_diff is set to zero, the current offset is reset - this allows going back to lower offsets, or storing information which doesn’t need an offset.

Function metadata is stored as chunks; each chunk has a length and a tag. Parsing the metadata depends on the tag. Unrecognized tags can be skipped since the length of the chunk is known.

Metadata Chunk:

dd - metadata_tag
array<u8> - data
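Walking the chunks could look like this in Python. The sketch assumes chunks simply follow one another until the end of the metadata payload; `iter_chunks` and `TAG_NAMES` are hypothetical names, the tag values are the ones listed in the sections below, and `unpack_dd` follows the packing table from the pack_dd section.

```python
def unpack_dd(buf, pos=0):
    """Decode one packed dd at buf[pos]; return (value, next_pos). No bounds checks."""
    b0 = buf[pos]
    if (b0 & 0x80) == 0:
        return b0, pos + 1
    if (b0 & 0xC0) == 0x80:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if (b0 & 0xE0) == 0xC0:
        # Low five bits of b0 are zero in the table; folding them in is an assumption.
        return ((b0 & 0x1F) << 24) | int.from_bytes(buf[pos + 1:pos + 4], "big"), pos + 4
    return int.from_bytes(buf[pos + 1:pos + 5], "big"), pos + 5


TAG_NAMES = {
    0x01: "type_info",
    0x03: "function comment",
    0x04: "function comment (repeatable)",
    0x05: "comments",
    0x06: "comments (repeatable)",
    0x07: "extra comments",
}


def iter_chunks(payload):
    """Yield (tag, data) for each chunk; unknown tags are returned like any other."""
    pos = 0
    while pos < len(payload):
        tag, pos = unpack_dd(payload, pos)
        length, pos = unpack_dd(payload, pos)
        data = payload[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated chunk")
        pos += length
        yield tag, data
```

A caller can look each tag up in `TAG_NAMES` and ignore the ones it does not handle, since the length prefix already moved the cursor past their data.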

`type_info`

Tag: 0x01

This contains a serialized tinfo that can be deserialized using IDA SDK’s deserialize_tinfo. Lumen doesn’t parse this tag at the moment.

Function Comment

Tag: 0x03, and 0x04 (repeatable comment)

The chunk’s payload is encoded as bytes and is a string that is not null-terminated.

Comments

Tag: 0x05, and 0x06 (repeatable comment)

Comments are stored in the container that was described above. The container’s T type is encoded as bytes, and can be interpreted as a string that is not null-terminated.

Extra Comments

Tag: 0x07

This type stores anterior and posterior comments. The data is encoded in the container described above, where T is the following structure:

bytes - anterior_comment
bytes - posterior_comment

Writing the server

I decided to use Rust to implement the server for a few reasons:

I enjoy developing in it, and this project is for fun.
It’s a compiled language and it’s strict, so there shouldn’t be any RuntimeExceptions or NullPointerExceptions in the future.
Rust has powerful networking and serialization libraries.

The server currently uses PostgreSQL via tokio-postgres for storage.

The POC version of the server would just overwrite any record in the database. Such behavior would be bad if two users overwrote each other’s work with IDA’s PushAllMetaData: if the last user didn’t add any useful comments, the server could overwrite useful information. My solution for this was to rank every function that was pushed as part of a batch. The server detects common string formats, and only gives the metadata points if it contains information that wasn’t automatically generated. The version with more points replaces the database’s entry. If a user decides to push a specific function, it overwrites the database value regardless of the function’s ranking.
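The ranking idea can be sketched in Python as follows. This is an illustration of the rule described above, not Lumen's actual scoring: the name pattern, the point values and the function names are all hypothetical.

```python
import re

# Hypothetical pattern for auto-generated names such as "sub_401000".
AUTO_NAME = re.compile(r"^(j_)?(sub|nullsub)_[0-9A-Fa-f]+$")


def score(function_name, comments):
    """Give points only for information that does not look auto-generated."""
    points = 0
    if not AUTO_NAME.match(function_name):
        points += 1                                    # user-chosen name
    points += sum(1 for c in comments if c.strip())    # non-empty comments
    return points


def should_replace(stored_score, pushed_score, single_function_push):
    """A single-function push always wins; a batch push wins only with more points."""
    return single_function_push or pushed_score > stored_score
```

With this rule, a batch push of untouched `sub_...` functions scores zero and can never replace an entry that someone has already named or commented.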

Thank you for reading. Please tell me what you think, @naim94a.

If you're interested how IDA Pro's #Lumina's protocol & metadata works, here's my write-up!https://t.co/78E6GiwFbo #IDAPro #Lumina https://t.co/ECn4MYF7Ft
— Naim A. (@naim94a) December 15, 2020

https://www.hex-rays.com/products/ida/lumina/; Retrieved September 13th, 2020 ↩︎