Automating Web Content Downloading

Automatically downloading web content after it has been found.

The Approach

Whenever I tackle automation challenges, I prefer to start with the easiest approach possible and then iterate on what we have until the solution works.

In the last post on web content discovery, we had wfuzz create a CSV file containing the directories and files that we want to download. The easiest method to download these files is as follows:

  1. We get an alert that new content was discovered
  2. Split the CSV file to obtain the field with the directory or file
  3. With the directory or file name, we can then use wget to recursively retrieve the data that we need
  4. Add the downloaded data to a git repository and commit it

Overall, you should be alerted whenever new files are found, and they should all be downloaded for you.

Automate it

With our steps in mind, let's set out to actually get this working. The whole process is kicked off by our previously created web discovery script: we will simply modify it to call a new script that does all the work with wget.

Step 0

Here is the previous script:

#!/bin/bash

python /path/to/wfuzz/wfuzz.py -w /path/to/wordlist.txt -o csv --hc 404 http://website/FUZZ > /path/to/repository/files
git add files
if git commit -m "scan"; then
    git diff HEAD^ HEAD files | cat | mail -s "Content Changed" email@gmail.com
fi

We simply add a line inside the if block, just before the git diff, that looks like the following:

/bin/bash /path/to/downloading/script

Easy enough: our new downloading script will be called whenever new content is discovered.
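
For reference, the modified discovery script ends up looking like this (same placeholder paths as before, with the new line marked by a comment):

#!/bin/bash

python /path/to/wfuzz/wfuzz.py -w /path/to/wordlist.txt -o csv --hc 404 http://website/FUZZ > /path/to/repository/files
git add files
if git commit -m "scan"; then
    # new content was committed, so run the downloader before mailing the diff
    /bin/bash /path/to/downloading/script
    git diff HEAD^ HEAD files | cat | mail -s "Content Changed" email@gmail.com
fi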

Step 1

Now we must get the file or directory names from our wfuzz CSV output. Our CSV looks like this:

id,response,lines,word,chars,request,success
0,200,1,1,8,secret,1
1,301,7,13,194,.git,1
3,200,0,0,0,script_secret,1
4,200,0,0,0,new_secret,1

So we cut on ',' and take the 6th field with cat files | cut -d "," -f 6. The only catch is the header row: its 6th field is the literal word "request", so we skip the first line (for example with tail -n +2) before cutting. Pretty simple.
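
Against the sample CSV above, a quick sanity check looks something like this (tail -n +2 just skips the header row):

# print only the discovered paths from the wfuzz CSV
tail -n +2 files | cut -d "," -f 6
# -> secret
# -> .git
# -> script_secret
# -> new_secret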

Step 2

We must now use wget to download the files and directories we need. There are plenty of posts covering recursive downloading with wget, so we will borrow those methods.

For simplicity, let us assume we are only tracking one website and that we already know the domain. Therefore, any changes that are found apply only to this domain.

Then we adapt the wget command from those posts to recursively get each 'secret' file:

wget --mirror --no-parent --convert-links --wait=2 http://domain/$file

Here, $file is the value that we cut from our CSV file and domain is the website we are trying to profile. Run from /path/to/domain/repo, this command downloads everything it can find links to under the $file path into the repository. When everything is done, we can use version control to see the changes.
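
For reference, here is what each of those options does (domain and $file are placeholders, as above):

#   --mirror         turn on recursion with timestamping, so files that have not changed are not re-downloaded on later runs
#   --no-parent      never ascend above the starting path when following links
#   --convert-links  rewrite links in the downloaded pages so they work for local viewing
#   --wait=2         wait two seconds between requests to go easy on the server
wget --mirror --no-parent --convert-links --wait=2 http://domain/$file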

Note that wget writes its mirrored output relative to the current working directory, so we have to cd to the right spot to keep our wget downloads structured how we want. The download script below does exactly that, and the crontab entry that kicks everything off should likewise run the discovery script from its repository so the relative git commands work.
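
A crontab entry for the discovery script might look something like the following (the schedule and paths here are placeholders, not from the original setup):

# run the discovery script every six hours, from inside its repository so the
# relative git commands in discover.sh work as written
0 */6 * * * cd /path/to/repository && /bin/bash /path/to/discover.sh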

Step 3

In /path/to/domain/repo we want to add all of the downloaded files to the git staging area and commit them so we can get a diff of the changes. This looks the same as in our previous script:

git add .
git commit -m "download"
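
To review what the most recent download actually changed, the same kind of git command from the discovery script works here too, for example:

# summarize which mirrored files changed in the latest "download" commit
git diff HEAD^ HEAD --stat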

Wrap it up

Now we just need to take all of the steps we have here and put them into a script that will be run whenever the content discovery script finds new content.

download.sh:

#!/bin/bash

# work from the repository so wget's mirrored tree and the git history live here
cd /path/to/domain/repo

# 'files' is the wfuzz CSV from the discovery script (adjust the path if it lives elsewhere);
# tail -n +2 skips the CSV header so "request" is not treated as a path
for file in $(tail -n +2 files | cut -d "," -f 6)
do
    wget --mirror --no-parent --convert-links --wait=2 http://domain/$file
    git add .
    git commit -m "download"
done

Note that this script can download A LOT of content. The mirror option for wget will infinitely recurse if it is possible. The --no-parent option should help prevent large downloads but will not prevent all of them.
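
If the unbounded recursion becomes a problem, one option (not used in the script above) is to cap the depth with wget's --level flag, which overrides the infinite depth implied by --mirror:

# only follow links up to 3 levels below each starting path
wget --mirror --level=3 --no-parent --convert-links --wait=2 http://domain/$file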

I recommend reading through this and the previous post to fully understand how these scripts interact with each other, and how your file structure should be set up to function properly.

Files attached for download: discover.sh, download.sh

Thoughts

Originally I thought this solution was hacky and looked into using a different tool for the job. Specifically, I was interested in HTTrack, but it would always overwrite the website folder whenever it ran with --update. After a couple of hours I gave up and used the simple wget command instead. There is definitely room for improvement here.

Written on February 3, 2017