Automating Web Content Discovery (Alerting)

Automating content discovery to get alerts when new content is pushed to a website.

The Motivation

I was browsing my Twitter feed when I came across this tweet by @nnwakelam, who succinctly gives us an example of why web content discovery is important.


Additionally, in the wonderful blog post by @infosec_au, he mentioned that he would get alerts of new assets or content and start hacking on them.

Although that post was about DNS asset identification, you can see that he has set up web content discovery as well.

This was something that I was planning to develop anyway, so I might as well get started.

The tools

I created a post on Reddit hoping that what I wanted had already been created, but it appears it has not. Of course, many people suggested the usual tooling that we all know, but I didn’t see anything that fit all my needs, so it’s time to hack something together.

The bruteforcer

When looking for a tool to do the bruteforcing, we all know about the classic DirBuster, but I was having problems getting it to do what I wanted. To test its functionality, I simply created a file called ‘secret’ and pushed it to a webroot. However, DirBuster is mostly good for directories and not files, as it did not report that it found the ‘secret’ document (even though it was in the wordlist). You could say it was a (dir)bust.

DirBuster failing to find the ‘secret’ file

Therefore, I had to expand my options and used wfuzz.

Using the same wordlist that I used with DirBuster, I ran wfuzz, and it correctly found the file and reported it. Based on this test, I decided that I would be going forward with wfuzz, as it appeared to have all the functionality I needed.

wfuzz finding the ‘secret’ file

The wordlist

Previously I had used my own wordlist just to test that the tool would find the files that were there. Now we need a bigger wordlist that contains common files and directories. This is a pretty well-documented space, as the SecLists GitHub repository contains a ton of wordlists. To choose which wordlists to use from this repository, I looked mainly at the opinions of professionals.


A tweet asking a similar question

Looks like the RAFT wordlists are recommended, as is pretty much anything else in the SecLists repository.

Automate it

Here is the methodology we will follow:

  1. Profile the website
  2. Log everything that was found
  3. Diff what we found now vs what was found previously
  4. Notify if there are any diffs

An interesting thought (which I did not come up with) was to use git to do this work for us. If you think about it, what we are doing is actually version control. The website was at version 1, and now it is at version 2 with new content. How can we easily see the new content? Why not just look at the differences in git?
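As a minimal sketch of this idea (the directory and file names below are made up for the demo), git will surface newly discovered content as an added line in a diff:

```shell
# Hypothetical demo: track the log of discovered files with git
mkdir demo-site && cd demo-site
git init -q
git config user.email demo@example.com && git config user.name demo
printf 'index.html\nrobots.txt\n' > files            # "version 1" of the site
git add files && git commit -qm "initial profile"
printf 'index.html\nrobots.txt\nbackup.zip\n' > files  # "version 2": new content appears
git diff files                                       # the diff contains "+backup.zip"
```

The only line prefixed with a plus sign is the newly discovered path, which is exactly the alert we care about.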

From testing, I already know how to do the profiling with the following command:

python -w test.txt --hc 404 http://website/FUZZ

Now we have to figure out how to do step 2 without all the crap that the tool outputs (i.e. get the output in something that can be easily parsed by a program). There is a printer option for wfuzz, but it is poorly documented, and I didn’t know which ‘printers’ were supported. I did find a pull request that adds a csv printer, so I will be using that.

Let’s see the output of wfuzz using the csv option:

wfuzz output with the csv printer

Great! The output is easily parseable by any programming language. Since this is the initial profiling, we stop before step 3. On every run after that, however, we must diff and notify if there are any changes.
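To show what "easily parseable" means in practice, here is a short shell snippet that pulls just the discovered paths out of the csv. Note that the sample data and the column layout are assumptions made for illustration; adjust the field number to match what your wfuzz version actually emits:

```shell
# Hypothetical sample of the csv printer output (real columns may differ)
cat > files <<'EOF'
id,response,lines,words,chars,request,success
0,200,10,50,312,secret,1
1,200,4,20,101,robots.txt,1
EOF
# Skip the header row and print the request column (assumed to be field 6)
awk -F',' 'NR > 1 { print $6 }' files
```

Anything more complicated, like filtering on response codes, is just another awk condition away.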

So, make a new directory for the website and do a git init in it. Then run the profiler and log the files it found to a file in the directory. Add this file to git for version tracking.

mkdir website
cd website
git init
python -w test.txt -o csv --hc 404 http://website/FUZZ > files
git add files
git commit -m "initial profile"

Now, assume the website changes, and run the profiler again, outputting to the same file as before. Add the file to git and commit the changes. If there is a difference between this commit and the previous one, then new content has been discovered. Note: the commit only succeeds if the file differs from the previous version.

python -w test.txt -o csv --hc 404 http://website/FUZZ > files
git add files
git commit -m "new profile" # only commits if the file is different!
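The reason the commit acts as a change detector is that git commit exits with a non-zero status when nothing is staged. A minimal demonstration (directory and file names are illustrative):

```shell
# Show that a no-change commit fails, which is what lets us gate the alert on it
mkdir site && cd site && git init -q
git config user.email demo@example.com && git config user.name demo
echo robots.txt > files
git add files && git commit -qm "initial profile"
git add files                     # the file is unchanged, so nothing is staged
git commit -qm "rescan" || echo "no new content, skipping notification"
```

That exit status is exactly what the wrapper script later tests with an if statement.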

To see the changes between the file now and the file in the previous version, use the command:

git diff HEAD^ HEAD files

For me, emailing this diff is enough. When you receive the email, look at the minus signs (removed content) and the plus signs (new content).

To email the diff, I will be using the mail command (replace you@example.com with your own address):

git diff HEAD^ HEAD files | mail -s "Content Changed" you@example.com

Wrap it up

Finally, make a bash script that performs all of these steps using the commands explained previously, and run it in a cron job. The bash script should be placed in the git repository.


#!/bin/bash
cd /path/to/repository
python /path/to/wfuzz/ -w /path/to/wordlist.txt -o csv --hc 404 http://website/FUZZ > /path/to/repository/files
git add files
if git commit -m "scan"; then
    git diff HEAD^ HEAD files | mail -s "Content Changed" you@example.com
fi

Now, create a crontab entry to run this script however often you would like. Here is the entry for every hour:

0 * * * * /path/to/




This took a lot longer than I expected, even though the process seems pretty simple. There is definitely room for improvement, as not everything is as automated as it should be (I need to be able to monitor multiple sites without the manual labor of initializing repositories and setting crontabs).

The next post should cover automatically downloading the newly found files; that way, if an update gets retracted, you will still have the content.

Written on January 19, 2017