Anubis – Managing a growing, distributed infrastructure for Binary Analysis [Part II]

In part I of this blog post, I summarized how and why the Anubis system has grown into an analysis service that is distributed not only virtually, but also physically. In part II, I will explain how we tackled the problem of frequent updates from the distributed workers to the central database. Part III will describe the other issues mentioned in the first part in more detail.

Database Access

While analyzing an unknown binary, the Anubis system does more than generate HTML, text, and PDF reports. Every network connection that is established, every process spawned, and any other interesting activity performed by potential malware under analysis is entered into our central database. This allows us to conveniently gather statistics on recent malware activity, and gives us at least a chance to keep an overview of the latest trends in malware development. Further, all details on pending tasks that an analysis worker is allowed to process are stored inside the database. Thus, when a worker is about to generate a new result, it queries the DB for any additional information (such as the analysis timeout, additionally provided auxiliary files, etc.) before starting the analysis of the sample, and later stores the gathered results there.

Storing results in the database is very convenient, but it also required considerable effort: while plainly inserting every kind of activity into the DB would have been straightforward, meaningful statistics and trends only become visible when the data is linked during insertion. While generating database content, our system checks whether other results have exhibited similar activity (e.g., started a process with the same name or modified the same registry key) and links the individual analysis results to this shared content. This, in turn, requires a large number of database queries during insertion. Because every query incurs a round trip between the analysis workers physically running in one location and the database running in another, the accumulated delay became unbearable.
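To make the cost of this linking concrete, here is a minimal sketch of a lookup-or-insert step for process names. It uses an in-memory SQLite database purely for illustration; the table names, column names, and the helper `link_process` are hypothetical and are not the actual Anubis schema.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE process_name (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE result_process (
            result_id  INTEGER NOT NULL,
            process_id INTEGER NOT NULL REFERENCES process_name(id),
            PRIMARY KEY (result_id, process_id)
        );
        """
    )
    return conn


def link_process(conn, result_id, name):
    """Link an analysis result to the shared row for a process name.

    The shared row is created only if no earlier result produced it.
    """
    # Statement 1: has any earlier result started a process with this name?
    row = conn.execute(
        "SELECT id FROM process_name WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        # Statement 2 (only for unseen names): create the shared row.
        cur = conn.execute(
            "INSERT INTO process_name (name) VALUES (?)", (name,)
        )
        process_id = cur.lastrowid
    else:
        process_id = row[0]
    # Final statement: link this result to the shared row.
    conn.execute(
        "INSERT OR IGNORE INTO result_process (result_id, process_id) "
        "VALUES (?, ?)",
        (result_id, process_id),
    )
    return process_id
```

Even in this toy version, every observed activity costs two or three statements. Against a remote database, each of them is a separate network round trip, and a single analysis can report a large number of such activities.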

One way of dealing with this situation would have been a local DB instance for each geographic location, with periodic synchronization between them. While this may appear appealing at first sight, concurrent insertions could have caused inconsistency problems. In fact, given the huge number of database insertions per minute, such inconsistencies would inevitably have damaged the inter-result linking to a large degree within a very short amount of time.

For this reason, we decided to keep only one DB instance that the workers update with analysis results (all other, slave DBs are read-only copies of this central master instance). We introduced a new client-server middle layer in charge of mediating database access, called the worker-proxy, whose technical details are explained below.

The server part of this proxy system, which runs physically close to the database server and has a direct connection to it, is built on top of a standard HTTPS service implemented in Python. On top of the encrypted, multithreaded HTTP service, the server offers various additional methods over RPC to connecting workers by subclassing Python’s DocXMLRPCRequestHandler (a good introduction into implementing such a service can be found here).
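A server along these lines could be sketched as follows. This is not the actual worker-proxy code: it is written against Python 3, where `DocXMLRPCRequestHandler` lives in `xmlrpc.server`, and the class names, the `ping` method, and the `build_server` helper are assumptions made for illustration. TLS is only enabled when a certificate is passed in, so the sketch can also be tried over plain HTTP on localhost.

```python
import socketserver
import ssl
from xmlrpc.server import DocXMLRPCRequestHandler, DocXMLRPCServer


class ProxyRequestHandler(DocXMLRPCRequestHandler):
    """Handler subclass that only accepts RPC calls on a single URL path."""

    rpc_paths = ("/RPC2",)


class ThreadedProxyServer(socketserver.ThreadingMixIn, DocXMLRPCServer):
    """XML-RPC server that handles each worker connection in its own thread."""

    daemon_threads = True


def ping():
    """Hypothetical health-check method offered to workers."""
    return "pong"


def build_server(host, port, certfile=None, keyfile=None):
    """Create the RPC server; wrap its socket in TLS if a certificate is given."""
    server = ThreadedProxyServer(
        (host, port), requestHandler=ProxyRequestHandler, logRequests=False
    )
    server.set_server_title("worker-proxy (sketch)")
    server.register_function(ping)
    if certfile is not None:
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        context.load_cert_chain(certfile, keyfile)
        # Wrapping the listening socket makes every accepted connection TLS.
        server.socket = context.wrap_socket(server.socket, server_side=True)
    return server
```

Calling `build_server("0.0.0.0", 8443, "proxy.pem", "proxy.key").serve_forever()` would start the service; the port and file names here are placeholders. A production setup would probably perform the TLS handshake inside the per-connection thread rather than in the accept loop, which this sketch does not attempt.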

One of the methods offered via RPC can be used by an idle worker to query the system for open analysis tasks. Thus, instead of querying the DB directly, a worker connects to the proxy over HTTPS and asks for work. In terms of implementation, this is pretty straightforward: using Python’s xmlrpclib, the worker instantiates a stub object that supports a method for fetching a task from the DB. This method returns a dictionary containing any information the worker requires for analyzing a sample (such as the sample filename, analysis timeout, additional parameters, etc.).
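On the worker side, fetching a task might look roughly like the sketch below. It uses `xmlrpc.client`, the Python 3 name of xmlrpclib. The remote method name `get_task`, the `worker_id` argument, and the dictionary keys are assumptions chosen for illustration, not the real interface.

```python
import xmlrpc.client

# Hypothetical fields the worker expects in every task dictionary.
REQUIRED_KEYS = ("task_id", "sample_filename", "timeout")


def connect(proxy_url):
    """Create the stub object; an https:// URL makes the client use TLS."""
    return xmlrpc.client.ServerProxy(proxy_url, allow_none=True)


def fetch_task(stub, worker_id):
    """Ask the proxy for an open task.

    Returns the task dictionary, or None if the proxy has no work.
    """
    task = stub.get_task(worker_id)  # a single round trip to the proxy
    if not task:
        # An empty reply (empty dict, False, or None) means "no open tasks".
        return None
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise ValueError("task is missing fields: %s" % ", ".join(missing))
    return task
```

Creating the stub does not open a connection yet; the request is only sent once a method such as `get_task` is invoked on it. Checking the returned dictionary before starting an analysis keeps a malformed reply from surfacing halfway through a run.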

Likewise, instead of pushing results into the database directly, the worker can reuse the stub object to insert the analysis data. For this, the Anubis core-analysis binary (i.e., the enhanced QEMU emulator described in various publications and in a planned blog post within this series) first stores all observed behavior inside an XML result file (similar to the files that can be fetched through the web interface). This XML file, together with the pcap dump of the generated network traffic, is then uploaded to the proxy server (see section ‘File Access’ in a later part of this series). By invoking the RPC for data insertion on the stub object, the worker completes its part of the analysis and can query the proxy for the next open task.
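Put together, one iteration of a worker could be sketched like this. The RPC names `get_task` and `insert_results` are hypothetical, and the analysis and upload steps are passed in as plain callables because their details are not covered in this part of the series.

```python
import time


def process_one_task(stub, worker_id, analyze, upload):
    """Fetch one task, analyze it, upload the files, then trigger insertion.

    Returns True if a task was processed and False if none was available.
    """
    task = stub.get_task(worker_id)
    if not task:
        return False
    # analyze() stands in for the core-analysis run; it is assumed to return
    # the paths of the XML result file and the pcap dump.
    xml_path, pcap_path = analyze(task)
    # The files must be on the proxy before insertion is requested.
    upload(task["task_id"], [xml_path, pcap_path])
    # One RPC; all the DB queries happen on the proxy, close to the database.
    stub.insert_results(task["task_id"])
    return True


def run_worker(stub, worker_id, analyze, upload, idle_sleep=30.0, max_tasks=None):
    """Process tasks in a loop, sleeping whenever the proxy has no work."""
    done = 0
    while max_tasks is None or done < max_tasks:
        if process_one_task(stub, worker_id, analyze, upload):
            done += 1
        else:
            time.sleep(idle_sleep)
    return done
```

The point of this shape is that the worker pays for only a few long-distance requests per sample (fetch, upload, insert), no matter how many database queries the insertion needs on the proxy side.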

As mentioned above, the proxy service runs close to the database. Thus, it can run the same code for inserting results into the database as we used back when the whole Anubis infrastructure was still located inside the same room. In fact, during the infrastructure improvement, we only moved the Python code related to DB insertions from the worker into the proxy. All that was necessary to make this work was to insert the transparent RPC mechanism between the binary analysis and the DB insertions, and to add code for uploading the generated result files. As we will see throughout the next part of this blog post, adapting to the distributed nature of the new system was often facilitated by the decision to use an RPC-style, middle-layer proxy.

Looking Ahead

In the next parts of this series of posts, we will go into the technical details of

  • File Access,
  • Software Updates,
  • Configuration Updates,
  • Service Monitoring,
  • Data Confidentiality and Integrity, and
  • Service Availability.