Identifying Bugs in Router Firmware at Scale with Taint Analysis

In the past few months, Akash (@enigmatrix) and I (@daniellimws) worked on developing a taint analysis tool to find bugs in routers, with the guidance of Shi Ji (@puzzor) and Thach (@d4rkn3ss). We had developed a tool based on CVE-2019-8312 to CVE-2019-8319, which are command injection vulnerabilities on the D-Link DIR-878 router with firmware version 1.12A1. The goal was to automate the detection of such bugs. Ideally, the tool should be faster than finding the bugs manually.

This article will share the approaches used and the results this tool yielded on the PROLiNK PRC2402M, D-Link DIR-1960, and D-Link DIR-X1560 routers.

The Tool

Existing Tools

There are many existing taint analysis tools out there. Two that caught my interest were Triton and bincat, both of which are quite mature. However, we were not able to use them because they did not support the MIPS architecture, which was used by our target.

Using angr symbolic execution

Moving on, we focused on building our tool on angr, a binary analysis framework written in Python. We chose angr because it supports most architectures, including MIPS and ARM, which we are targeting. Earlier, @puzzor had made some custom changes to angr for static taint analysis, using the VEX IR program trace that angr generates after simulating the program with symbolic execution. It successfully found command injection bugs in our test firmware.

However, we quickly faced a roadblock. To generate the program trace, we needed angr to simulate each function by emulating every instruction and using symbolic execution to decide whether to follow a branch instruction.

In more detail, angr maintains a stack of states. A state contains information like register values and memory contents. Naturally, when simulating a function, it starts with only one state. When a branch instruction is encountered and angr is unsure whether to take the branch, angr duplicates the state: one copy takes the branch, while the other does not.

Most functions contain loops. If a loop condition depends on user input, the stack of states explodes, because angr is always unsure whether to continue or break out of the loop and keeps duplicating states. Note also that these states are not simulated concurrently: only one state is simulated at a time. In such cases, it takes a very long time for any state to reach a vulnerable piece of code, and if the function is not vulnerable at all, the simulation may never terminate.
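
The growth can be illustrated with a toy model that has nothing to do with angr's internals. It is only a sketch, under the assumption that every input-dependent branch (or loop iteration) forks the current state into two, and that states are popped from a stack one at a time. The function name count_paths is made up for this illustration.

```python
def count_paths(symbolic_branches: int) -> int:
    """Count the paths a naive explorer must finish when every branch forks.

    Each stack entry is the number of input-dependent branches that the
    state has already passed. Only one state is stepped at a time.
    """
    stack = [0]
    completed = 0
    while stack:
        depth = stack.pop()
        if depth == symbolic_branches:
            # This state reached the end of the function.
            completed += 1
            continue
        # The explorer cannot decide the branch, so it keeps both outcomes.
        stack.append(depth + 1)  # branch taken
        stack.append(depth + 1)  # branch not taken
    return completed
```

Ten undecidable branches already give 1024 paths in this model, and each extra branch doubles that number, which is why a per-function timeout is hit so easily.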

As a symbolic execution framework, angr has customisable settings (called exploration techniques) that decide which state to simulate first and which states to keep. But after trying out many different techniques, we still weren’t able to improve the execution time.

To show some numbers, with a timeout of 2 minutes set for analysing each function in a binary, we were not able to finish analysing a binary even after 2 hours (because if a function is not vulnerable, it will keep simulating until timeout). Not to mention, there’s an unknown memory leak issue in angr, so after 2 hours, the computer will run out of RAM :(

Referring back to our goal earlier, we wanted the tool to be faster than manual work. So this was a no-go, and we continued to search for improvements or alternatives.

Using angr’s Reaching Definition analysis

Eventually, we stumbled upon this issue, which led us to read up more about angr’s Reaching Definitions analysis through the following resources:

use-def relationships

To summarize, the analysis generates use-def relationships between atoms in a function. An atom is analogous to a variable, and there are multiple types of atoms - register, stack variable, heap variable. Just think of atoms as variables, and it should be clear. Take the following code for example:

char* get_querystring_value(char* querystring, char* name)
{
    ...

    return ...
}

void vuln(char* querystring)
{
   // extract the "name" parameter from a querystring
   // e.g. ?name=$(echo gg)
   char* name = get_querystring_value(querystring, "name");
   char command[200];
   sprintf(command, "echo %s >> /tmp/log", name);
   system(command);
}

There is an obvious command injection vulnerability in the function above, at the system(command), from the name parameter of the querystring. If we want to model the use-def relationship between querystring and other atoms in this function, it will look like the following.

use-def0

First, we see that querystring, defined as an argument of vuln, is used by get_querystring_value as its querystring argument. Besides that, the name argument of get_querystring_value is defined as well. In the end, the return value of get_querystring_value is defined, and it is considered to have used the 2 arguments given.

Moving on, we see sprintf called with the name variable (the return value of get_querystring_value) and the string echo %s >> /tmp/log. This time, it is slightly different. Since the first argument of sprintf is the destination, it is command, rather than just the return value, that must be defined as having used the 2 arguments given to sprintf. The generated use-def relationship is as follows:

use-def1

With the same concept, this analysis generates a use-def relationship for all atoms in the function. As we can see above, the relationship can be modelled as a graph. The uses are edges and definitions are nodes. So, we can convert this into a graph analysis problem instead.
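
As a rough illustration, the relationships described for vuln can be written down as a small graph. This is only a sketch: the node labels (arg:querystring, buf:command and so on) are invented for readability and are not the atom objects that angr produces.

```python
from collections import defaultdict

# Each entry is (definition, [definitions it uses]) for the vuln() example.
# The labels are illustrative names, not angr objects.
USE_DEF = [
    ("arg:querystring", []),
    ("const:name", []),
    ("ret:get_querystring_value", ["arg:querystring", "const:name"]),
    ("const:format_string", []),
    ("buf:command", ["const:format_string", "ret:get_querystring_value"]),
    ("arg0:system", ["buf:command"]),
]


def build_graph(use_def):
    """Return adjacency lists with an edge from each used definition to its user."""
    graph = defaultdict(list)
    for definition, uses in use_def:
        graph[definition]  # touch the key so that every definition is a node
        for used in uses:
            graph[used].append(definition)
    return dict(graph)


GRAPH = build_graph(USE_DEF)
```

Here GRAPH["buf:command"] contains "arg0:system", which is the edge saying that the argument of system uses the command buffer.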

In taint analysis terms, a source is where the user controls the data in a program, and a sink is where the data from the source may or may not reach. Taint analysis is to determine whether data from a source reaches a sink. In the example above, get_querystring_value is the source, since it extracts some value from user input, whereas system is the sink. In this case, the data from the source does reach the sink.

With that said, in our use-def graph, we can identify the source and sink definitions (nodes), then traverse the graph with some heuristics to determine whether data from the source is used by the sink. If yes, then we mark the source as vulnerable and proceed to triage it.
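
A minimal sketch of such a traversal is a plain breadth-first search from the source node to the sink node. The heuristics the tool actually uses are not described here, so this shows only the reachability part, on a hand-written graph with made-up node labels.

```python
from collections import deque


def reaches(graph, source, sink):
    """Breadth-first search; return the path from source to sink, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Edges point from a definition to the definitions that use it.
EXAMPLE_GRAPH = {
    "ret:get_querystring_value": ["buf:command"],
    "const:format_string": ["buf:command"],
    "buf:command": ["arg0:system"],
    "arg0:system": [],
}

TAINT_PATH = reaches(EXAMPLE_GRAPH, "ret:get_querystring_value", "arg0:system")
```

Returning the whole path rather than a boolean is a deliberate choice in this sketch, because the intermediate nodes are what one would look at during triage.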

Tool summary

To summarize, our tool utilizes angr’s Reaching Definitions analysis to generate a use-def relationship graph of functions in a router firmware. Then, it analyses the graph to detect possible vulnerabilities where user input (from source) reaches a dangerous function like system (to sink).

If you are familiar with CodeQL or Joern, yea it does something similar, except that our tool does not have such a robust query interface.

Results

Earlier, we had mentioned that the symbolic execution approach took longer than 2 hours. With this approach, we managed to finish the analysis in ~2 minutes! Certainly this is the right way to go.

After polishing the tool to remove false positives and cover more false negatives, we tested it on the D-Link and PROLiNK routers.

Immediately, with the tool, we found close to 20 command injection vulnerabilities, 10 of which do not require authentication, accessible via the WAN interface. We quickly reported them to PROLiNK, and they responded very promptly as well. Once the bugs were fixed, we filed for CVEs and MITRE gave us CVE-2021-35400 to CVE-2021-35409.

Here are some snippets of the vulnerable code. The source and sinks are:

  • source - web_get
  • sink - system, do_system, popen

void sys_login1(undefined4 request)
{
    ...
    ipaddr = (char *)web_get("ipaddr",request,0);
    ipaddr = strdup(ipaddr);
    password = (char *)web_get("password",request,0);
    password = strdup(password);
    lang = (char *)web_get("lang",request,0);
    lang = strdup(lang);

    ...
    if (strncmp(password, correct_password_md5, 0x20)) {
        ...
        sprintf(command,"echo %s,%s, > /tmp/language &",ipaddr,lang);
        do_system(command);
        ...
    }
    ...
}

void qos_sta_settings(undefined4 body)
{
    char *cli_list;
    char *cli_num;
    char command [2048];

    cli_list = web_get("cli_list", body, 0);
    cli_list = strdup(cli_list);
    cli_num = web_get("cli_num", body, 0);
    cli_num = strdup(cli_num);
    ...
    memset(command,0,0x800);
    sprintf(command, "/sbin/sta_qos.sh setup %s %s", cli_list, cli_num);
    ...
    system(command);
    ...
}

void setNightLed(char* querystring)
{
    start_hour = web_get("start_hour");
    start_hour = strdup(start_hour);
    start_min = web_get("start_min");
    start_min = strdup(start_min);
    end_hour = web_get("end_hour");
    end_hour = strdup(end_hour);
    end_min = web_get("end_min");
    end_min = strdup(end_min);
    ...
    sprintf(command,"echo -n %s %s %s %s > /tmp/scheduleSet &",start_hour,start_min,end_hour,end_min);
    do_system(command);
}
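
To see why these snippets are injectable, it helps to look at the string that ends up being handed to the shell. The sketch below only mirrors the format string from sys_login1 and builds the command text; nothing is executed, and the parameter values are made up (the payload is the same harmless $(echo gg) used in the earlier example).

```python
def build_command(ipaddr: str, lang: str) -> str:
    """Mirror the sprintf format string from sys_login1 (no command is run)."""
    return "echo %s,%s, > /tmp/language &" % (ipaddr, lang)


benign = build_command("192.168.1.2", "en")
injected = build_command("192.168.1.2", "$(echo gg)")
```

In the second string, the shell would evaluate the $(...) part before running echo, which is the command injection.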

Bonus: Hardcoded Password or Backdoor? 😕

In the process, I also found some other bugs. It appears that there is a hardcoded or backdoor password that can be used to log in to the router admin panel. The admin page sends the md5 hash of the user-supplied password to login.cgi for verification. The pseudocode is as follows:

correct_password = nvram_bufget("Password");

strcpy(salted_password, key);
strcat(salted_password, correct_password);
md5_sum(salted_password, salted_hash);

input_hash = strdup(web_get(querystring, "password"));
if (strncmp(input_hash, salted_hash, 0x20) == 0)
    ...

However, there is then a suspicious piece of code following it:

strcpy(salted_password, key);
strcat(salted_password, "user");
md5_sum(salted_password, salted_hash);

if (strncmp(input_hash, salted_hash, 0x20) == 0)
    ...

I tried user as the password, and successfully logged into the admin page. Logging in this way shows a slightly different-looking dashboard, which seems to provide fewer capabilities than a user authenticated with the actual password gets. However, http://prc2402m.setup/setting.shtml can be accessed, which gives control over the router settings.
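
The two checks can be modelled in a few lines. This is a sketch under assumptions: md5_sum is assumed to produce a 32-character lowercase hex digest (which would match the 0x20 length given to strncmp), and the key and stored password values used with it are placeholders, since the real ones are not shown here.

```python
import hashlib


def salted_hash(key: str, password: str) -> str:
    """md5(key + password) as a hex string, mirroring strcpy/strcat/md5_sum."""
    return hashlib.md5((key + password).encode()).hexdigest()


def check_login(input_hash: str, key: str, stored_password: str) -> bool:
    """Accept the hash of the stored password or of the hardcoded "user"."""
    for candidate in (stored_password, "user"):
        # strncmp(..., 0x20) compares the first 32 characters.
        if input_hash[:0x20] == salted_hash(key, candidate)[:0x20]:
            return True
    return False
```

In this model, a hash of "user" is accepted regardless of what the stored password is, which matches what was observed on the device.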

We reported this to the vendor, and they responded with new firmware. To make sure the backdoor was gone, I opened up the same function again. I didn’t see strcat(salted_password, "user") anymore. But I saw the following:

password_backup = (char *)nvram_bufget(0,"Password_backup");
strcat(salted_password, password_backup);

🤨

And it wasn’t too hard to find the value of Password_backup in the nvram.

➜  rootfs rg Password_backup
etc_ro/Wireless/RT2860AP/RT2860_default_vlan
11:Password_backup=nE7n$8q%5m

We quickly reported this to the vendor again. Fortunately, in the second fix we received, there is no hardcoded or backdoor password anymore.

Bonus: Stack-based Buffer Overflow

There are also many stack-based buffer overflow vulnerabilities due to the lack of bounds checking, which can be used to overwrite the return address on the stack, and gain control over the program execution. This can be seen in some of the command injection examples above, where user input is copied to a string through sprintf instead of snprintf.

Bonus: Denial of Service

While testing for a buffer overflow proof-of-concept, I also found a vulnerability that causes the router to stop responding to requests, until it is manually restarted with the power button. In the pseudocode below, a parameter cli_num is passed as an argument to the /sbin/sta_qos.sh script.

void qos_sta_settings(undefined4 body)
{
    char *cli_list;
    char *cli_num;
    char command [2048];

    cli_list = web_get("cli_list", body, 0);
    cli_list = strdup(cli_list);
    cli_num = web_get("cli_num", body, 0);
    cli_num = strdup(cli_num);
    ...
    memset(command,0,0x800);
    sprintf(command, "/sbin/sta_qos.sh setup %s %s", cli_list, cli_num);
    ...
    system(command);
    ...
}

Inspecting the script contents, I see the following for-loop, where $sta_num holds the value of cli_num.

...
sta_setup() {
    ...
    for i in `seq 1 $sta_num`
    do
    ...

With cli_num being a huge value, e.g. 9999999999, the script will stay in the loop almost forever, effectively stuck in an infinite loop. By spamming the router with such requests, there will be many instances of this script being executed and stuck in the loop.

After a while, the router stops responding to any requests, and needs to be manually rebooted to work properly again.
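
For illustration, this is the kind of check that would stop both the oversized loop count and shell metacharacters before a value is placed into a command. It is a hypothetical sketch and not the vendor's actual fix: the allowed character set, the length limit and the upper bound of 64 are assumptions chosen for the example.

```python
import re

# Hypothetical allowlist: letters, digits and a few separators, at most 64 chars.
_SAFE_PARAM = re.compile(r"[A-Za-z0-9_.:,-]{1,64}")


def is_safe_param(value: str) -> bool:
    """True only if the value contains nothing the shell would interpret."""
    return _SAFE_PARAM.fullmatch(value) is not None


def parse_count(value: str, upper: int = 64) -> int:
    """Parse a count such as cli_num and refuse values outside 1..upper."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError("count must be a plain decimal number")
    count = int(value)
    if not 1 <= count <= upper:
        raise ValueError("count out of range")
    return count
```

With a bound like this, a cli_num of 9999999999 is rejected before the script's for-loop is ever reached.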

Timeline

  • Jun 9 - Reported 10 command injection vulnerabilities to the vendor.
  • Jun 11 - Vendor responded with fixes.
  • Jun 11 - Suggested some additional filters to the vendor to prevent such vulnerabilities.
  • Jun 28 - Vendor responded with fixes according to our suggestions.
  • Jul 9 - Reported 3 more vulnerabilities to the vendor (backdoor, buffer overflow, DoS).
  • Jul 23 - Vendor responded with fixes.

Other than the PROLiNK router, we also ran the tool on the DIR-1960 firmware. This time, we got back close to 200 results. However, after triaging them, there were only 4 command injection vulnerabilities (which we had already reported earlier) through the HNAP API, all of which require authentication. (A lot of room for improvement to eliminate false positives!)

If you’re curious about what HNAP is, it stands for Home Network Administration Protocol, which is a SOAP-based protocol that is used to communicate with the router admin panel.

Moving on, I decided to also run the tool on the DIR-X1560 firmware. The previous 2 routers above were running on MIPS, but this one runs on ARM. With not too much tweaking, I got the tool to properly analyse the ARM-based firmware. It was heartwarming to know that this tool is architecture-agnostic.

However, it was not so straightforward to identify the vulnerabilities on this firmware, due to the many layers of abstraction. In view of this, the tool helped massively in reverse engineering the firmware. I’m not sure exactly what the framework that the firmware is based on is called, but I managed to find some source code on GitHub, which is very helpful because of all the comments in it. The closest terms I can find in the source code are CMS (CPE Management System), CPE (Customer-Premises Equipment) and TR-069. However, note that this repo does not contain any D-Link-specific code, so some reversing needs to be done.

From my perspective, it is similar to an MVC (Model-View-Controller) architecture, although it might not be.

+-----------+        +-----+
| HNAP1 API +--------> DAL |
+-----------+        +--+--+
                        |
             cmsObj_get | cmsObj_set   +-----+
                        +--------------> RCL |
                        |              | RUT |
                     +--v--+           +-----+
                     | MDM |
                     | ODL |
                     +-----+

More terms and acronyms to explain here.

The DAL (Data Aggregation Layer) API, as the name suggests, is for interacting with data, which mainly consists of the router configuration. But the actual storing of the data is done by the MDM (Memory Data Model) and ODL (Object Dispatch Layer) APIs. The DAL uses the cmsObj_get and cmsObj_set functions (or their variations) as an interface with the MDM/ODL to get or set the value of certain objects. For example, to get the IP_PING_DIAG MDM object and store it in ipPingObj, then save it back after modifying it:

    cmsObj_get(MDMOID_DEV2_IP_PING_DIAG, &iidStack, 0, (void **) &ipPingObj);
    // some modifications ...
    cmsObj_set(ipPingObj, &iidStack);

Here we see the arguments used are:

  • MDMOID_DEV2_IP_PING_DIAG - An enum that specifies to access the IP_PING_DIAG object
  • iidStack - Some internal data that we don’t need to care much about
  • ipPingObj - Contents of the IP_PING_DIAG object

Besides that, there are also the RCL (Runtime Config Layer) and RUT (Run-time UTility) APIs. Each MDM object (e.g. MDMOID_DEV2_IP_PING_DIAG) has a corresponding RCL handler function (rcl_dev2IpPingDiagObject). Every time cmsObj_set is called, ODL will call the object’s RCL handler, which in turn calls some RUT utility functions.

There’s a lot going on @.@ which is exactly the problem with reverse engineering this firmware. Anyways, the flow is as follows:

  1. User makes a POST request to interact with the HNAP API (e.g. SetTimeSettings)
  2. HNAP API handler calls the DAL API (e.g. cmsDal_setNtpCfgDLink_dev2)
  3. DAL API calls the MDM/ODL API (cmsObj_set) to set the MDM object (e.g. Dev2TimeDlinkObject)
    • i.e. cmsObj_set(timeDlinkObj, &iidStack), where timeDlinkObj is the MDMOID_DEV2_TIME_DLINK object
  4. ODL API calls the RCL handler (e.g. rcl_dev2TimeDlinkObject)
  5. RCL handler calls the RUT API (e.g. rut_TZ_Nvram_update)

Not so confusing now, I hope.
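
The same chain can be mimicked with a few Python functions to make the layering concrete. This is a simplified model, not the firmware's code: the function names are the ones from the steps above, but the signatures, the dictionary-based object store and the calls list are assumptions made for the sketch, and the command string is only built, never executed.

```python
MDMOID_DEV2_TIME_DLINK = 0x416

calls = []  # records which layer ran, in order


def rut_TZ_Nvram_update(new_obj):
    calls.append("RUT")
    # The firmware passes this string to system(); the model only builds it.
    return "nvram set ntp_server=%s" % new_obj["ntp_server"]


def rcl_dev2TimeDlinkObject(new_obj, curr_obj):
    calls.append("RCL")
    return rut_TZ_Nvram_update(new_obj)


# ODL side: which RCL handler belongs to which object ID, plus the stored objects.
OID_TABLE = {MDMOID_DEV2_TIME_DLINK: rcl_dev2TimeDlinkObject}
MDM = {MDMOID_DEV2_TIME_DLINK: {"oid": MDMOID_DEV2_TIME_DLINK, "ntp_server": ""}}


def cmsObj_get(oid):
    return dict(MDM[oid])  # hand out a copy, like a freshly fetched object


def cmsObj_set(obj):
    calls.append("ODL")
    oid = obj["oid"]
    result = OID_TABLE[oid](obj, MDM[oid])  # dispatch to the RCL handler
    MDM[oid] = obj
    return result


def cmsDal_setNtpCfgDLink_dev2(ntp, ntp_server, tzlocation):
    calls.append("DAL")
    obj = cmsObj_get(MDMOID_DEV2_TIME_DLINK)
    obj["ntp_server"] = ntp_server
    return cmsObj_set(obj)


def SetTimeSettings(request):
    calls.append("HNAP")
    return cmsDal_setNtpCfgDLink_dev2(
        request.get("/SetTimeSettings/NTP"),
        request.get("/SetTimeSettings/NTPServer"),
        request.get("/SetTimeSettings/TZLocation"),
    )
```

Calling SetTimeSettings with an NTPServer value fills calls with HNAP, DAL, ODL, RCL, RUT in that order, and the returned string shows the user-supplied value inside the nvram command.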

If we were to look at the HNAP and RUT functions mentioned above, we see:

void SetTimeSettings(void **request)
{
    char* tzlocation;
    char* ntp;
    char* ntp_server;

    tzlocation = websGetVar(request, "/SetTimeSettings/TZLocation");
    ntp = websGetVar(request, "/SetTimeSettings/NTP");
    ntp_server = websGetVar(request, "/SetTimeSettings/NTPServer");

    ...

    cmsDal_setNtpCfgDLink_dev2(ntp, ntp_server, tzlocation);

    ...

    return;
}

int rut_TZ_Nvram_update(Dev2TimeDlinkObject *time_object)
{
    char command [132];
    memset(command,0,0x80);

    ...

    if (time_object->ntp_server != (char *)0x0) {
        snprintf(command,0x80,"nvram set ntp_server=%s", time_object->ntp_server);
        system(command);
    }

    ...

    return 0;
}

After a long journey, the NTPServer parameter ends up in a command passed to system.

As we saw above, a user input string (from HNAP) is passed through many functions before it finally ends up in a call to system (in RUT), resulting in a command injection vulnerability. If I were to look through the firmware manually, it would have taken me almost forever to find this unless I was lucky. Although the tool was not able to directly make the connection from HNAP to RUT, it at least let me shortlist the relevant DAL functions to look at, saving me a lot of time.

Connection between DAL and RCL/RUT

Here, I shall share more about how the DAL API is related to the RCL/RUT API. The pseudocode of cmsDal_setNtpCfgDLink_dev2 (DAL API called by HNAP API as mentioned earlier) looks like the following:

int cmsDal_setNtpCfgDLink_dev2(char* ntp, char* ntpserver, char* tzlocation)
{
    ...

    // get the MDM Dev2TimeDlinkObject through the ODL API
    // 0x416 = MDMOID_DEV2_TIME_DLINK
    res = cmsObj_get(0x416, &iidStack, 0, &timeDlinkObj);
    if (res == 0)
    {
        // some checks
        ...

        // set the ntp_server field
        if (timeDlinkObj->ntp_server != 0) {
            cmsMem_free(timeDlinkObj->ntp_server);
        }
        timeDlinkObj->ntp_server = cmsMem_strdup(ntpserver);

        // set some other fields
        ...

        // save timeDlinkObj back to the MDM/ODL layer
        cmsObj_set(timeDlinkObj, &iidStack);
    }

    ...
}

The code snippet above shows the typical process that a DAL function takes to set/update an MDM object. Note that cmsObj_get is called with 0x416 as the MDMOID (MDM Object ID). Since there is no source code, I only see the value (0x416) and not the enum name (MDMOID_DEV2_TIME_DLINK), which I inferred from the function names and some strings in the firmware.

As mentioned earlier, when cmsObj_set is called, the ODL API will call the corresponding RCL handler, in this case rcl_dev2TimeDlinkObject. I won’t go into the implementation details of cmsObj_set because it is quite complicated - there are many checks and function calls done. If you are interested, this line calls the RCL handler function.

Obtaining the mapping between MDMOID and RCL handler is not difficult, as it is stored in the OID table in the firmware, which looks like this:

...
000fd998        0x415
000fd99c        s_Device.Time._000c5213                          = "Device.Time."
000fd9a0        00000000
000fd9a4        rcl_dev2TimeObject
000fd9a8        00000000
000fd9ac        stl_dev2TimeObject
000fd9b0        00000000

000fd9b4        0x416
000fd9b8        s_Device.X_BROADCOM_COM_TimeDlink._000c5220      = "Device.X_BROADCOM_COM_TimeDlink"
000fd9bc        00000000
000fd9c0        rcl_dev2TimeDlinkObject
000fd9c4        00000000
000fd9c8        stl_dev2TimeDlinkObject
000fd9cc        00000000
...

In this table, it is easy to see that 0x416 is the MDMOID for the TimeDlink MDM object, and rcl_dev2TimeDlinkObject is the RCL handler. Here, we also see something called the STL handler but it does not do much.
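
Going by the addresses in the listing, each entry spans 0x1c bytes, that is, seven 32-bit words. Under that reading, the table can be parsed mechanically. The sketch below assumes little-endian words and the field order seen above (OID, name pointer, zero, RCL handler, zero, STL handler, zero); the handler addresses in the sample data are invented, and the symbols dictionary stands in for whatever symbol information the disassembler provides.

```python
import struct

# One entry: 7 little-endian 32-bit words = 0x1c bytes (assumed layout).
RECORD = struct.Struct("<7I")


def parse_oid_table(blob: bytes, symbols: dict) -> dict:
    """Map each MDMOID to the names of its (RCL handler, STL handler)."""
    table = {}
    for oid, _name, _z0, rcl, _z1, stl, _z2 in RECORD.iter_unpack(blob):
        table[oid] = (symbols.get(rcl, hex(rcl)), symbols.get(stl, hex(stl)))
    return table


# Sample data: the OIDs and name pointers follow the listing, the handler
# addresses (0x1000, 0x2000, ...) are placeholders.
SAMPLE = RECORD.pack(0x415, 0xC5213, 0, 0x1000, 0, 0x2000, 0) + RECORD.pack(
    0x416, 0xC5220, 0, 0x1100, 0, 0x2100, 0
)
SYMBOLS = {
    0x1000: "rcl_dev2TimeObject",
    0x2000: "stl_dev2TimeObject",
    0x1100: "rcl_dev2TimeDlinkObject",
    0x2100: "stl_dev2TimeDlinkObject",
}
TABLE = parse_oid_table(SAMPLE, SYMBOLS)
```

Looking up TABLE[0x416] then gives the handler pair for the TimeDlink object without reading the listing by hand.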

Now, the RCL handler rcl_dev2TimeDlinkObject looks like the following:

int rcl_dev2TimeDlinkObject(Dev2TimeDlinkObject* newMdmObj, Dev2TimeDlinkObject* currMdmObj)
{
    // there is some validation being done, which is not so interesting
    ...
    rut_TZ_Nvram_update(newMdmObj);
    ...
}

We see that a newMdmObj is passed into the vulnerable function rut_TZ_Nvram_update (covered earlier). This newMdmObj is exactly the timeDlinkObj that was passed to cmsObj_set by the DAL function just now. So, now we can visualize that the DAL and RCL are connected as follows:

                +---------+
                |   DAL   |
                +--+---+--+
                   |   |
                   |   |
+------------+-----v---v--+
| cmsObj_set | MDMOID obj |
+------------+--+---+-----+
                |   |
         +------v---v---+
         | rcl_...(...) |
         +--------------+

DAL gives cmsObj_set an MDMOID and an object, then

  • MDMOID determines which RCL handler to call
  • The object is given to the RCL handler

We see that from a DAL function, it is not so hard to find out which RCL function is called, because we have the MDMOID, and can refer to the OID table seen above. But when looking for command injection vulnerabilities, the steps are reversed.

First, with the tool, I find RCL/RUT functions that might be vulnerable, with source being the function arguments and sink being system (or its variants). Nothing new here. But now, I need to find the DAL functions that access the relevant MDM object. In other words, like above, I know the MDMOID, but this time, instead of finding the RCL handler, I ask the question: which DAL functions call cmsObj_set with this MDMOID? Now it feels like this:

                +---------+
                | DAL ??? |
                +----^----+
                     |
                     |
+------------+-------+----+
| cmsObj_set | MDMOID obj |
+------------+--^-----^---+
                |     |
         +------+-----+-+
         | rcl_...(...) |
         +--------------+

At first, I tried the naive method: going through the xrefs to cmsObj_set one by one until I found one that is called with the correct MDMOID. There are over 200 xrefs, and after a few minutes I felt that I would go crazy if I continued. So, I decided to use the tool to help me pick out the functions that use a certain MDMOID. In particular, I am only concerned about MDM object fields that are a string/buffer. If a field holds an integer value, it is not really useful for command injection or buffer overflow.

Recall that a string field of an MDM object is set like this:

    // get the MDM Dev2TimeDlinkObject through the ODL API
    // 0x416 = MDMOID_DEV2_TIME_DLINK
    res = cmsObj_get(0x416, &iidStack, 0, &timeDlinkObj);

    ...

        // set the ntp_server field
        if (timeDlinkObj->ntp_server != 0) {
            cmsMem_free(timeDlinkObj->ntp_server);
        }
        timeDlinkObj->ntp_server = cmsMem_strdup(ntpserver);

So, without needing to modify the tool by much, I set the source to be cmsObj_get and the sink to be cmsMem_free, since the object obtained from cmsObj_get has its old string field freed before the new value is assigned. And it worked. For each MDMOID, I shortlisted the few DAL functions that modify the relevant MDM object. Then, I checked the xrefs of these DAL functions to see how they are called by the HNAP API, to find out how user input is passed into the MDM object.
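
Once the tool has produced one finding per analysed function, the shortlisting step itself is a simple filter. The sketch below assumes a made-up result format (the CallFact tuple) and made-up findings, apart from cmsDal_setNtpCfgDLink_dev2 and 0x416, which come from the example above. It is not the tool's real output format.

```python
from typing import NamedTuple


class CallFact(NamedTuple):
    """One hypothetical finding per analysed function."""
    function: str
    oid: int            # first argument given to cmsObj_get
    reaches_free: bool  # whether the fetched object flows into cmsMem_free


def shortlist(facts, oid):
    """Names of functions that fetch the given MDMOID and free one of its fields."""
    return sorted({f.function for f in facts if f.oid == oid and f.reaches_free})


FACTS = [
    CallFact("cmsDal_setNtpCfgDLink_dev2", 0x416, True),
    CallFact("hypothetical_dal_reader", 0x416, False),   # reads only, no free
    CallFact("hypothetical_dal_other", 0x415, True),     # different object
]
CANDIDATES = shortlist(FACTS, 0x416)
```

Only the functions left in CANDIDATES need their xrefs checked by hand, which is far fewer than the 200-odd callers of cmsObj_set.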

Huge boost in efficiency, and in the end, I managed to find 4 command injection vulnerabilities in this firmware.

Demo

It’s Demo time.

Demo

Remarks

Currently, the tool is still at an early stage of development, and has only been successful in finding command injection vulnerabilities. Besides, some manual work still had to be done when analysing complicated firmware like the DIR-X1560, as the tool does not automatically tell which HNAP function is vulnerable. I look forward to polishing it, and hopefully some day it can assist with finding other vulnerabilities like buffer overflows, use-after-free, or double free in firmware.