Attack Surface Management (ASM) tools find quite a lot of vulnerabilities on the Web. This really isn’t surprising, given that HTTP/S is by far the most common and broadest of all the services comprising the Internet. In fact, Web-based issues represent the majority of the findings about which our Managed Service Providers (MSPs) inform our Chariot customers. Therefore, understanding how Chariot approaches the Web is an important part of understanding its approach to ASM overall.
If you think about a single website, it’s a bit like an iceberg in that much of the content an attacker might care about is not sitting on the front page of the site, but is buried more deeply (check out our white paper on this exact topic). If an attacker were to look only at the immediate landing page of each Web server within an attack surface, they would certainly find some vulnerabilities. They might find exposed S3 buckets or Grafana sitting out on the open Internet; it’s a potentially long list. However, when we dive beneath the surface, life gets even more interesting.
To get to the next level of potential vulnerabilities, one needs to explore the target website much more deeply. That brings us to the topic of this blog series: content discovery. In this first installment, we’ll look at the actual process of content discovery. In the second, we’ll discuss some examples of the kinds of things users can find in the content they download. The final installment will introduce some of the new tooling we’ve been working on that builds the “content discovery engine of your dreams”. More on that in a few months, though. For now, let’s dig into how an attacker or penetration tester finds content today.
Crawl, Walk, or Run?
Crawling is the first thing you might reasonably do when encountering a website on a penetration test, and is a pretty familiar technique for content discovery. Put simply, you follow each link on the site until you have created a complete listing of all the content.
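To make the idea concrete, here is a minimal sketch of that link-following loop. It assumes a hypothetical `get_links` callable that returns the links found on a page (a real crawler would download and parse the page), and it stays on the starting host so the crawl does not wander off the target.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links):
    """Breadth-first crawl of one host.

    get_links(url) is a hypothetical helper that returns the absolute
    URLs linked from the page at url.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            # Skip links that leave the target host or were already queued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Passing the link source in as a parameter keeps the sketch testable without touching the network; in practice it would wrap an HTTP client and an HTML parser.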
Crawling Deeper…
Even something as simple as crawling, however, has levels of complexity. If I were to click around on a website, I’d end up with a list of web pages that, while useful, leaves a lot of value on the table. A more complete approach would be to look at the actual downloaded HTML. This gives an attacker a full list of all the files downloaded and linked from every page. It is a much longer list than just the URLs of the full pages one visits.
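As a rough illustration of reading the downloaded HTML rather than clicking around, the sketch below collects every URL referenced from a page using only the Python standard library. The set of attributes it inspects (`href`, `src`, `action`) is an assumption made for brevity, not a complete list of where URLs can appear.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly reference other resources; not exhaustive.
LINK_ATTRS = {"href", "src", "action"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs referenced by the tags of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                # Resolve relative references against the page URL.
                self.links.add(urljoin(self.base_url, value))


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)
```

Note that this only sees what is written in the markup; anything the page requests from script at run time is invisible to it.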
However, even this list is woefully inadequate for the purposes of penetration testing. What about all the JavaScript that runs on these pages? With the scriptable DOM that modern HTML (the so-called “HTML Living Standard”) supports, the idea of a page being the “unit” of web content is a little broken. Content is much more dynamic, so cataloging what a website holds requires a more dynamic approach. This is particularly true in the case of web applications, which often make use of RESTful interfaces or GraphQL endpoints to provide information to the front end.
…Still Misses Information
Even if we had a fantastic crawler that excelled at dealing with complex techniques such as Ajax, that crawler would need to exercise the target site to really get a good idea of what information the site is hosting. There are some pretty good crawlers out there (such as katana, https://github.com/projectdiscovery/katana) that do a lot of this, albeit imperfectly. Unfortunately, even a great crawler cannot provide the whole story, and we can illustrate this with an example.
Imagine we see the following web pages in our crawler log:
https://www.praetorian.com/content/1.html 200
https://www.praetorian.com/content/2.html 200
https://www.praetorian.com/content/3.html 200
https://www.praetorian.com/content/5.html 200
https://www.praetorian.com/content/6.html 200
The first thing that stands out is the missing 4.html. Let’s say we’re an inquisitive kind of person (aka an adversary). We could hazard a guess that 4.html is worth trying, and that it might not be linked into the main site for a reason. And that’s where brute forcing comes into the story: content discovery without it most definitely leaves some of the most interesting information on the table.
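The same guess can be automated. The sketch below is one possible heuristic, not a feature of any particular tool: it groups crawled URLs whose final path segment is a number plus an extension, then proposes the numbers missing from each run.

```python
import re
from collections import defaultdict

# Matches URLs ending in "<digits>.<extension>", e.g. ".../content/4.html".
NUMBERED = re.compile(r"^(?P<prefix>.*/)(?P<num>\d+)(?P<suffix>\.[A-Za-z0-9]+)$")


def missing_numbered_urls(urls):
    """Return candidate URLs that fill gaps in numbered sequences."""
    groups = defaultdict(set)
    for url in urls:
        match = NUMBERED.match(url)
        if match:
            groups[(match["prefix"], match["suffix"])].add(int(match["num"]))

    guesses = []
    for (prefix, suffix), seen in sorted(groups.items()):
        for n in range(min(seen), max(seen) + 1):
            if n not in seen:
                guesses.append(f"{prefix}{n}{suffix}")
    return guesses
```

Run against the five URLs in the crawler log above, this proposes exactly one candidate: the 4.html page.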
Brute Forcing is the Answer
Brute forcing (or forced browsing) is a standard content discovery technique for evaluating web applications for security vulnerabilities. An attacker will request a resource from the target host and check the server’s response for the requested content.
The “brute force workflow” involves sending a large number (often thousands or more) of requests to a webserver, normally looking for files that might be of particular security interest. When a website returns a valid response to the query, the attacker can then evaluate the content to determine if there is anything “security relevant.” A typical workflow might look something like the following:
- The attacker targets a web application hosted at “example.com.”
- The attacker uses a content discovery tool and a wordlist to brute force potential resources the server might unintentionally expose.
- The tool reports that `example.com/.gitconfig` returns a successful content response.
- The attacker checks the response and validates that it points to an exposed Git repository.
- The attacker then uses a tool like GitTools (`https://github.com/internetwache/GitTools`) to download the content, thereby exploiting the brute forced endpoint.
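The request-and-check loop at the heart of this workflow can be sketched in a few lines. This is a simplified illustration rather than how any of the tools mentioned here are implemented: the function names are hypothetical, requests are sent one at a time, and only the status code is inspected. It should only ever be pointed at systems you are authorized to test.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth recording.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def brute_force(base_url, wordlist, fetch=http_status, interesting=(200,)):
    """Request base_url/<word> for each word and keep interesting statuses."""
    hits = []
    for word in wordlist:
        url = base_url.rstrip("/") + "/" + word.lstrip("/")
        status = fetch(url)
        if status in interesting:
            hits.append((url, status))
    return hits
```

The `fetch` parameter exists so the loop can be exercised against a stub; the default `http_status` uses `urllib` and follows redirects, which a real tool would usually make configurable.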
Use Case-Dependent
Content discovery brute forcing takes the process a step further, and requires the attacker to make a few different decisions to effectively target the webserver depending on the use case of content discovery. Let’s consider the following two examples:
Red Team Engagement
The primary motivation for performing the content discovery attack would be leveraging it to exploit the application to gain access to the host server and private network. Stealth is a requirement, and the amount of noise (number of network requests and raw traffic volume) a brute-forcing tool generates could alert the target that scanning activity is occurring, thereby undermining the goal of the attack. Time constraints might further impact the approach, as a smaller brute force run is better able to return results within a fixed period. Therefore, an engineer in a Red Team use case might use a very targeted, minimal wordlist against one server while rate limiting the outgoing requests.
Web Application Penetration Test
The primary motivation for performing a content discovery attack would be to deliver a report on the security posture of the web app. Detection is not a consideration at all as the client knows the test is ongoing. The desired coverage level might increase significantly as delivering a report at the end requires thoroughness of coverage. The time constraint might be even further compressed if the schedule allows only a few days to evaluate the application. Consequently, a security engineer might decide to run larger wordlists at a high throughput.
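One way to picture the difference between the two use cases is as two scan profiles feeding the same request loop. The sketch below is illustrative only: the `ScanProfile` type and the numbers in it are assumptions chosen to show the contrast, not recommended settings.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProfile:
    name: str
    max_words: int              # cap on wordlist entries to try
    requests_per_second: float  # outgoing request rate


# Illustrative values only; real engagements would tune these per target.
RED_TEAM = ScanProfile("red-team", max_words=200, requests_per_second=0.5)
PEN_TEST = ScanProfile("pen-test", max_words=100_000, requests_per_second=50.0)


def paced(words, profile, sleep=time.sleep):
    """Yield at most profile.max_words words, pausing between them."""
    delay = 1.0 / profile.requests_per_second
    for i, word in enumerate(words):
        if i >= profile.max_words:
            break
        if i > 0:
            sleep(delay)  # throttle everything after the first request
        yield word
```

A request loop would iterate over `paced(wordlist, RED_TEAM)` or `paced(wordlist, PEN_TEST)`; the `sleep` parameter is injectable so the pacing can be checked without actually waiting.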
Open Source Tools
Multiple open source tools exist to perform brute forcing, and a lot of presentations, tools, and blog posts already discuss the subject at length. A few good references (in no particular order) include:
- https://blog.assetnote.io/2021/04/05/contextual-content-discovery/
- https://github.com/ffuf/ffuf
- https://github.com/maurosoria/dirsearch
- https://github.com/OJ/gobuster
- https://media.defcon.org/DEF%20CON%2023/DEF%20CON%2023%20presentations/DEF%20CON%2023%20-%20Brent-White-Hacking-Web-Apps-WP.pdf
Wordlist Approach
The primary way most content discovery tools work is by taking a wordlist, generating requests for each entry in that wordlist, sending them to the server, and finally outputting the results. A user might optionally be able to configure modifications to the request generation process to obtain better results from the target server. For example, one might specify “.js” and “.aspx” so that the tool also requests each wordlist entry with “.js” and with “.aspx” appended, essentially tripling the coverage surface. Of course, this also comes at the cost of sending triple the requests.
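The arithmetic behind that tripling is easy to see in code. The helper below is a hypothetical stand-in for the extension option such tools offer: each word is emitted once as-is and once per extension.

```python
def expand_with_extensions(words, extensions):
    """Yield each word bare, then once more with every extension appended."""
    for word in words:
        yield word
        for ext in extensions:
            yield word + ext
```

With two words and the extensions “.js” and “.aspx”, the generator produces six candidates: three per original word, and therefore three times the requests.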
A user normally “brings their own list” and needs to select a wordlist when running the tool. A common source for the wordlists used for these applications is https://github.com/danielmiessler/SecLists/. The user’s wordlists might vary per target, and the decision on which to use will impact the value of the results. Many content discovery projects exist, and typically each new tool release iterates on the workflow to add improvements in speed, customization, or ease of use. The ffuf, dirsearch, and gobuster projects referenced above are examples of this approach.
Contextual Approach
In an ideal situation (from a tester’s perspective), we could just throw a wordlist of every single potentially interesting file at the server and get a response, but that’s not possible. We don’t have enough time, the server might ban requests after a certain amount, we might divert unnecessary server resources if we increase request throughput, etc. Contextual tools attempt to refine the process by using information from the server to then change their scanning behavior. Tools in this category might include:
Chameleon can optionally fingerprint the technology stack on the target and add appropriate words to the wordlist depending on the technology it identifies. An engineer provides a starting wordlist, and if “javascript” is identified in a server response, the tool adds a JavaScript-specific wordlist to the queue. The goal, of course, is to discover more content using the dynamic, targeted wordlist.
Kiterunner’s developers collected a dataset of typical API schemas and incorporated these into its request sending logic, which means it can obtain better discovery results when targeting APIs than a standard wordlist. These contextual approaches help increase coverage in a smarter way than selecting a larger wordlist at the start.
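The general idea of letting server responses steer the wordlist can be sketched as a work queue that grows when a technology is recognized. To be clear, this is not how Chameleon or Kiterunner are implemented; the fingerprint markers, the follow-up words, and the `probe` callable are all hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical fingerprint markers and follow-up words, for illustration only.
TECH_WORDLISTS = {
    "php": ["index.php", "config.php"],
    "asp.net": ["web.config", "default.aspx"],
}


def detect_technologies(headers):
    """Guess technologies from response header values (naive substring match)."""
    blob = " ".join(headers.values()).lower()
    return {tech for tech in TECH_WORDLISTS if tech in blob}


def contextual_scan(start_words, probe):
    """Try each word; probe(word) returns that response's headers as a dict.

    When a new technology is detected, its wordlist joins the queue.
    Returns the words in the order they were tried.
    """
    queue = deque(start_words)
    queued = set(start_words)
    known_tech = set()
    tried = []
    while queue:
        word = queue.popleft()
        tried.append(word)
        for tech in sorted(detect_technologies(probe(word)) - known_tech):
            known_tech.add(tech)
            for extra in TECH_WORDLISTS[tech]:
                if extra not in queued:
                    queued.add(extra)
                    queue.append(extra)
    return tried
```

Starting from a small generic list, a single response carrying a PHP marker is enough to pull the PHP-specific words into the scan, which is the kind of targeted growth the contextual approach is after.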
Common Issues in Brute Force Tooling
Engineers using content discovery tools occasionally run into common issues, particularly the following three:
- Server rate limiting – The server or third-party CDN host rate-limits or bans the originating IP after noting too many network requests. Because many of the tools are built to maximize speed, sending thousands of requests in a short period might result in all further requests getting blocked.
- Misleading responses – Servers might return a success status code such as “200 OK” for every single item requested, even if it’s not there. Or they might return a “302” redirect for all the requests. When behavior like this occurs, tools might output lots of false positives. Many of them have options for excluding certain types of server responses when configuring the scan, but these issues typically are unknown before scanning the target. The engineer needs to be involved for the iterative feedback loop to work.
- Coverage – Choosing the wordlist with the optimal coverage is difficult to do in advance. If an engineer runs a wordlist targeting php files and the server only serves aspx files, the engineer might conclude that there is nothing to be gained from content discovery because the php file wordlist returned no results. But the server might return lots of interesting aspx files.
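The misleading-response problem in particular lends itself to a small pre-scan check. The sketch below shows one common heuristic under simplifying assumptions: request a few random paths that almost certainly do not exist, record what the server returns for them, and discard later results that look the same. The `fetch` callable and its `(status, body_length)` return value are hypothetical; real responses often need fuzzier comparison than an exact length match.

```python
import secrets


def wildcard_signatures(fetch, samples=3):
    """Record how the server answers paths that should not exist.

    fetch(path) is a hypothetical helper returning (status, body_length).
    """
    return {fetch("/" + secrets.token_hex(16)) for _ in range(samples)}


def real_hits(paths, fetch, baseline):
    """Keep only paths whose response differs from the not-found baseline."""
    hits = []
    for path in paths:
        if fetch(path) not in baseline:
            hits.append(path)
    return hits
```

Against a server that answers every request with the same 200 response, the baseline captures that response and only paths that return something different survive the filter.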
Closing Thoughts
As we have seen, “content discovery” is much more complex than simply browsing a website. Instead, we need to think about crawling, JavaScript execution, and brute forcing. Moreover, figuring out how deep to go, and how to balance cost (and risk of detection, when carrying out a Red Team exercise) against return, is complex. Despite this, not having a concrete and well-considered approach to content discovery would be reckless.
Part of good Attack Surface Management is really approaching your site like an attacker would, so automating this collection process is an important step in risk reduction. However, having the content is not sufficient: once we have a “treasure map” of the website, we have to decide where the real valuables lie. That is, we need to decide what to do with the content we’ve gathered. That’s our topic for the next installment in this series.

