Introducing Lumen

Overview

IDA 7.2 was released on November 5th, 2018. One of the highlights of this release was the new experimental Lumina feature. Lumina is a service that allows its users to share metadata about functions from their IDA database (idb), so that other users can save time when they run into similar functions. This can be a very powerful, time-saving feature.

IDA’s developer, Hex-Rays, announced that they are currently not offering a private Lumina server. Users with strict data sharing policies may not be allowed to use Lumina, which renders the feature useless to them. Sometimes you will spot a Lumina function with a totally random name - I assume that such behaviour is a result of the Lumina database being poisoned by data that was shared by mistake.

“Just before you ask, we currently do not offer Lumina for private use. Let us wait for it to naturally grow and become mature.” [1]

Private Lumina server

I wanted to take advantage of such a feature, but also wanted to be in control of the data I share - that’s why I decided to create my own private Lumina-compatible server. Lumina’s protocol isn’t a standard protocol, nor is it documented anywhere publicly at this time. Building a server required reverse-engineering IDA to understand the protocol, all the more so because it’s a binary protocol.

A live version of Lumen is currently available at lumen.abda.nl.

From this point on, this article is about Lumina’s protocol.

Analyzing the protocol

The first step to understanding a protocol is getting sample data. Lumina is encrypted using TLS by default, so I had a look at IDA’s configuration file, which contained two related configurable values: “LUMINA_HOST” & “LUMINA_PORT”, which I set to 127.0.0.1 and 1234 respectively. TLS still had to be disabled; a look into IDA’s string view revealed that yet another configurable value, “LUMINA_TLS”, is available and can be set to “NO”.

I could have used Wireshark for this, but since I was writing a server anyway, I created a proxy for Lumina. The proxy would connect to the official Lumina server using TLS, and forward everything to and from IDA. The more of the protocol I implemented, the fewer hex dumps I had to read.

RPC Messages

I realized that the first 4 bytes represent the frame’s length. All requests are prefixed with a 5-byte header: the first 4 bytes are the payload’s length, and the 5th byte is an “RPC code”.
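To make the framing concrete, here is a minimal Python sketch of building and splitting frames. The article does not say which byte order the length uses, so big-endian is an assumption here, and the names build_frame and split_frame are mine, not Lumina’s or Lumen’s.

```python
import struct

# 4-byte payload length followed by a 1-byte RPC code.
# ASSUMPTION: the length is big-endian; the article does not state the byte order.
HEADER = struct.Struct(">IB")


def build_frame(rpc_code: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with its length and RPC code."""
    return HEADER.pack(len(payload), rpc_code) + payload


def split_frame(buf: bytes):
    """Return (rpc_code, payload, rest), or None if buf holds no complete frame yet."""
    if len(buf) < HEADER.size:
        return None
    length, rpc_code = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return rpc_code, buf[HEADER.size:end], buf[end:]
```

A message without a payload, such as RPC_OK (code 0x0A), is just the 5-byte header: build_frame(0x0A) gives four zero bytes followed by 0x0A. Returning None for an incomplete buffer lets a proxy keep reading from the socket until a whole frame has arrived.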

After reversing IDA it was clear that all messages are serialized and deserialized with a few basic types:

  • dd - A “packed” representation of an unsigned 32-bit integer. It uses IDA SDK’s pack_dd & unpack_dd. I will provide more details about this later in the article.
  • dq - An unsigned 64-bit integer. Packed as two dds, one for each half of the number (high/low).
  • cstr - A null-terminated C string.
  • bytes - An instance of array<u8>. It’s simply the length of the array in its packed form followed by the bytes.
  • array<T> - Variable sized array, pack_dd(n) followed by n times T.
  • seq<T>(len) - A fixed size “array”. Since the size is fixed, the length is omitted.
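The types above map naturally onto a small cursor-style reader. The following Python sketch is only an illustration: the Reader class and its method names are hypothetical, dd decoding follows the packing table in the pack_dd section literally (prefix bytes the table does not list are rejected rather than guessed), and reading the high half of a dq first is an assumption, since the article only says “high/low”.

```python
class Reader:
    """Cursor over a payload, with one method per basic type listed above."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        """Consume exactly n raw bytes."""
        if self.pos + n > len(self.data):
            raise ValueError("unexpected end of payload")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def dd(self) -> int:
        first = self.take(1)[0]
        if first < 0x80:                       # 0aaaaaaa
            return first
        if first < 0xC0:                       # 10aaaaaa bbbbbbbb
            return ((first & 0x3F) << 8) | self.take(1)[0]
        if first == 0xC0:                      # 11000000 + 3 data bytes
            return int.from_bytes(self.take(3), "big")
        if first == 0xFF:                      # 11111111 + 4 data bytes
            return int.from_bytes(self.take(4), "big")
        raise ValueError(f"prefix byte {first:#x} is not covered by the table")

    def dq(self) -> int:
        high = self.dd()                       # ASSUMPTION: high half comes first
        low = self.dd()
        return (high << 32) | low

    def cstr(self) -> bytes:
        end = self.data.index(b"\x00", self.pos)   # ValueError if unterminated
        value = self.data[self.pos:end]
        self.pos = end + 1                     # skip the terminator
        return value

    def bytes_(self) -> bytes:
        return self.take(self.dd())

    def array(self, read_item) -> list:
        """array<T>: a packed count followed by that many items."""
        return [read_item() for _ in range(self.dd())]

    def seq(self, read_item, length: int) -> list:
        """seq<T>(len): the count is known up front, so none is read."""
        return [read_item() for _ in range(length)]
```

For instance, Reader(b"\x02\x81\x00\x07").array(reader.dd) style calls read a count of 2 and then two dds; composite messages are decoded by calling these methods in the order the fields are listed.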

The protocol starts with IDA sending the server an RPC_HELO message. If the server wishes to continue, it responds with an RPC_OK message. The server may respond to any request with an RPC_FAIL message to indicate failure.

Once the server replies with RPC_OK, the client may send either RPC_PULLMD or RPC_PUSHMD, to which the server should reply with an RPC_PULLMD_RES or RPC_PUSHMD_RES respectively. This can be done many times in a loop throughout the connection’s lifetime.

There are a few more RPC commands that have serialization & deserialization implemented, but such objects are not used and change between IDA versions. The following RPCs are currently used and have existed from IDA 7.2 through IDA 7.5 SP1.
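The message flow can be sketched as a tiny state machine in Python, using the RPC codes listed in the sections that follow. The Session class is a hypothetical helper, and answering RPC_FAIL to requests that arrive before RPC_HELO is my assumption about sensible server behavior, not something the protocol analysis confirms.

```python
from enum import IntEnum


class Rpc(IntEnum):
    OK = 0x0A
    FAIL = 0x0B
    HELO = 0x0D
    PULLMD = 0x0E
    PULLMD_RES = 0x0F
    PUSHMD = 0x10
    PUSHMD_RES = 0x11


# Successful reply for each request the client may send after the handshake.
REPLY_FOR = {Rpc.PULLMD: Rpc.PULLMD_RES, Rpc.PUSHMD: Rpc.PUSHMD_RES}


class Session:
    """Tracks whether the client has completed the RPC_HELO handshake."""

    def __init__(self):
        self.greeted = False

    def reply_code(self, request_code: int) -> Rpc:
        """Pick the reply code for a request; anything unexpected gets RPC_FAIL."""
        try:
            request = Rpc(request_code)
        except ValueError:
            return Rpc.FAIL                 # unknown RPC code
        if request is Rpc.HELO:
            self.greeted = True
            return Rpc.OK
        # ASSUMPTION: requests sent before RPC_HELO are rejected.
        if not self.greeted or request not in REPLY_FOR:
            return Rpc.FAIL
        return REPLY_FOR[request]
```

A real server would also have to fill in each reply’s payload; this only shows which reply type goes with which request.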

RPC_OK

RPC Code: 0x0A

This message doesn’t have a payload.

RPC_FAIL

RPC Code: 0x0B

  • dd - status_code
  • cstr - message

RPC_HELO

RPC Code: 0x0D

  • dd - TBD
  • bytes - license_key
  • seq<u8>(6) - license_id
  • dd - TBD

RPC_PULLMD

RPC Code: 0x0E

  • dd - arch (0=32 bit, 1=64 bit)
  • array<…>
    • dd - TBD
  • array<…>
    • dd - TBD
    • bytes - function_hash

RPC_PULLMD_RES

RPC Code: 0x0F

  • array<…>
    • dd - status: 0=Found
  • array<…>
    • cstr - function_name
    • dd - function_length
    • bytes - metadata_payload
    • dd - popularity

RPC_PUSHMD

RPC Code: 0x10

  • dd - TBD
  • cstr - idb_path
  • cstr - original_filepath
  • seq<u8>(16) - md5
  • cstr - hostname
  • array<…>
    • cstr - function_name
    • dd - function_length
    • bytes - metadata_payload
    • dd - TBD
    • bytes - function_hash
  • array<…>
    • dq - TBD

I couldn’t understand why the hostname, original file and IDB paths are collected - perhaps they would help identify peers when IDA has more support for collaboration in the future. Still, this could probably have been done using more anonymous identifiers (such as hashes of these values…)

RPC_PUSHMD_RES

RPC Code: 0x11

  • array<…>
    • dd - status: 0=Exists, 1=New

pack_dd & unpack_dd

While writing the de/serializing code, I noticed something weird with pack_dd & unpack_dd's behavior. According to the SDK’s documentation they are supposed to be “utf-8 like encoding”. Given that they are packing functions, it would make sense for them to use a minimal amount of bytes.

This table describes how a number is packed.

Min        Max (incl.)   data bits   encoded bytes   data
0          0x7f          7           1               0aaaaaaa
0x80       0x3fff        14          2               10aaaaaa bbbbbbbb
0x4000     0x1fffff      24          4               11000000 aaaaaaaa bbbbbbbb cccccccc
0x200000   0xffffffff    32          5               11111111 aaaaaaaa bbbbbbbb cccccccc dddddddd

Notice the 3rd case? To me it would make more sense if the behavior were like this:

Min          Max (incl.)   data bits   encoded bytes   packed data
0            0x7f          7           1               0aaaaaaa
0x80         0x3fff        14          2               10aaaaaa bbbbbbbb
0x4000       0x1fffff      21          3               110aaaaa bbbbbbbb cccccccc
0x200000     0xfffffff     28          4               1110aaaa bbbbbbbb cccccccc dddddddd
0x10000000   0xffffffff    32          5               11111111 aaaaaaaa bbbbbbbb cccccccc dddddddd

In some cases the current behavior of pack_dd would waste a byte. I don’t know why pack_dd works this way, but I guess it won’t be possible to fix due to backwards compatibility issues.
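To see the wasted byte, here is a Python sketch of both schemes written straight from the two tables above. It is not the SDK’s code: pack_dd here only mirrors the first table, pack_dd_proposed is a hypothetical name for the second, and placing the most significant data byte first is an assumption based on the a, b, c, d ordering in the tables.

```python
def _check(value: int) -> None:
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("dd must fit in an unsigned 32-bit integer")


def pack_dd(value: int) -> bytes:
    """Pack according to the first table (the behavior observed in the article)."""
    _check(value)
    if value <= 0x7F:                                   # 0aaaaaaa
        return bytes([value])
    if value <= 0x3FFF:                                 # 10aaaaaa bbbbbbbb
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value <= 0x1FFFFF:                               # 11000000 + 3 bytes
        return b"\xC0" + value.to_bytes(3, "big")
    return b"\xFF" + value.to_bytes(4, "big")           # 11111111 + 4 bytes


def pack_dd_proposed(value: int) -> bytes:
    """Pack according to the second table (the layout the article suggests)."""
    _check(value)
    if value <= 0x3FFF:                                 # first two rows are identical
        return pack_dd(value)
    if value <= 0x1FFFFF:                               # 110aaaaa bbbbbbbb cccccccc
        return bytes([0xC0 | (value >> 16)]) + (value & 0xFFFF).to_bytes(2, "big")
    if value <= 0xFFFFFFF:                              # 1110aaaa + 3 bytes
        return bytes([0xE0 | (value >> 24)]) + (value & 0xFFFFFF).to_bytes(3, "big")
    return b"\xFF" + value.to_bytes(4, "big")           # 11111111 + 4 bytes
```

With these definitions, 0x4000 takes 4 bytes in the first scheme and 3 in the second, and 0x200000 takes 5 bytes versus 4.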

Function metadata payload

Many types of function metadata are relative to an offset in a function. Example: a function has many comments, each comment is relevant to a specific offset in the function.

This container can be described as follows:

  • dd - start_offset
  • In a loop until end of buffer:
    • dd - offset_diff
    • T - Depends on the type contained in this container.

Each element in the list increases its offset by offset_diff. If offset_diff is set to zero, the current offset is reset - this allows going back to lower offsets, or storing information which doesn’t need an offset.
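As a rough illustration of the offset bookkeeping, the Python sketch below works on entries that have already been decoded into (offset_diff, item) pairs. The function name is hypothetical, and treating “reset” as “go back to offset zero” is my reading of the description above, not a confirmed detail of the format.

```python
def absolute_offsets(start_offset: int, entries):
    """Turn (offset_diff, item) pairs into (offset, item) pairs."""
    current = start_offset
    result = []
    for offset_diff, item in entries:
        if offset_diff == 0:
            # ASSUMPTION: a zero diff resets the running offset to zero.
            current = 0
        else:
            current += offset_diff
        result.append((current, item))
    return result
```

Under that reading, a container with start_offset 0x10 and diffs 4, 8, 0, 2 places its items at offsets 0x14, 0x1c, 0 and 2.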

Function metadata is stored as chunks; each chunk has a length and a tag. How a chunk is parsed depends on its tag, and unrecognized tags can be skipped since the length of the chunk is known.

Metadata Chunk:

  • dd - metadata_tag
  • array<u8> - data
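Skipping unknown chunks might look like the following Python sketch. The dd decoder is passed in as a callable so the example stays short; read_small_dd is a stand-in that only handles values below 0x80 (which pack to a single byte), and the function and dictionary names are hypothetical. The tag values are the ones described in the sections that follow.

```python
# Tags described in this article; anything else is skipped.
KNOWN_TAGS = {
    0x01: "type_info",
    0x03: "function_comment",
    0x04: "function_comment_repeatable",
    0x05: "comments",
    0x06: "comments_repeatable",
    0x07: "extra_comments",
}


def iter_chunks(buf: bytes, read_dd):
    """Yield (tag, data) per chunk; read_dd(buf, pos) returns (value, new_pos)."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_dd(buf, pos)
        length, pos = read_dd(buf, pos)
        if pos + length > len(buf):
            raise ValueError("chunk runs past the end of the buffer")
        yield tag, buf[pos:pos + length]
        pos += length          # the known length lets us step over any chunk


def known_chunks(buf: bytes, read_dd):
    """Keep only chunks whose tag is recognized, labeled by name."""
    return [(KNOWN_TAGS[tag], data)
            for tag, data in iter_chunks(buf, read_dd)
            if tag in KNOWN_TAGS]


def read_small_dd(buf: bytes, pos: int):
    """Stand-in dd decoder: only single-byte values (below 0x80)."""
    if buf[pos] >= 0x80:
        raise ValueError("multi-byte dd not handled by this stand-in")
    return buf[pos], pos + 1
```

A full implementation would pass a complete unpack_dd equivalent instead of the stand-in; the chunk data is returned raw here, leaving per-tag parsing to the caller.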

type_info

Tag: 0x01

This contains serialized tinfo that can be deserialized using IDA SDK’s deserialize_tinfo. Lumen doesn’t parse this tag at the moment.

Function Comment

Tag: 0x03, and 0x04 (repeatable comment)

The chunk’s payload is encoded as bytes and is a string that is not null-terminated.

Comments

Tag: 0x05, and 0x06 (repeatable comment)

Comments are stored in the container that was described above. The container’s T type is encoded as bytes, and can be interpreted as a string that is not null-terminated.

Extra Comments

Tag: 0x07

This type stores anterior and posterior comments. The data is encoded in the container described above, where T is the following structure:

  • bytes - anterior_comment
  • bytes - posterior_comment

Writing the server

I decided to use Rust to implement the server for a few reasons:

  1. I enjoy developing in it, and this project is for fun.
  2. It’s a compiled language and it’s strict, so there shouldn’t be any RuntimeExceptions or NullPointerExceptions in the future.
  3. Rust has powerful networking and serialization libraries.

The server currently uses PostgreSQL via tokio-postgres for storage.

The POC version of the server would just overwrite any record in the database. Such behavior would be bad if two users overwrote each other’s work with IDA’s PushAllMetaData: if the last user didn’t add any useful comments, the server could end up overwriting useful information. My solution for this was to rank every push that arrives in a batch. The server detects common string formats, and only gives the metadata points if it contains information that wasn’t automatically generated. The version with more points replaces the database’s entry. If a user decides to push a specific function, it overwrites the database value regardless of the function’s ranking.
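The ranking idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the article does not list which string formats Lumen detects or how points are counted, so the pattern below only covers IDA’s sub_ and nullsub_ default names, the one-point-per-item scoring is made up, and keeping the existing entry on a tie is my choice.

```python
import re

# Default names IDA assigns on its own, e.g. sub_401000 or nullsub_1.
# ASSUMPTION: the real server may detect more (or different) formats.
AUTO_NAME = re.compile(r"^(?:sub|nullsub)_[0-9A-Fa-f]+$")


def score(name: str, comments: list) -> int:
    """Hypothetical scoring: 1 point for a user-chosen name, 1 per non-empty comment."""
    points = 0 if AUTO_NAME.match(name) else 1
    points += sum(1 for comment in comments if comment.strip())
    return points


def should_replace(existing_score: int, new_score: int, batch_push: bool) -> bool:
    """Decide whether an incoming push replaces the stored entry."""
    if not batch_push:
        return True                       # pushing one specific function always wins
    # ASSUMPTION: on a tie the existing entry is kept.
    return new_score > existing_score
```

With this scheme, a batch push of an unnamed, uncommented function scores 0 and cannot replace an entry that someone took the time to name and annotate.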


Thank you for reading. Please tell me what you think, @naim94a.


  1. https://www.hex-rays.com/products/ida/lumina/; Retrieved September 13th, 2020