Archiving Websites for OSINT Investigators

Adapted from a presentation given to the National Child Protection Task Force in June 2021

Why Archive?

It is a challenge for investigators to properly capture the content and context of information from websites. Modern websites rarely consist of simple HTML content hosted in a single location (I realise the irony of this appearing on a website built in exactly this way using Hugo). Most websites are more like a restaurant meal, assembled from a variety of ingredients sourced from different locations and brought together according to a specific set of instructions to ensure consistent presentation and experience. On top of this potentially complex construction, modern website content is often dynamic, reacting to user interaction, and this can be hard for archiving and page capture tools to replicate.

Despite this, a surprising number of internet investigations rely on simple screenshots of websites as evidential material.

Whilst there is no doubt that using a browser extension or the operating system's screenshot tool is convenient, IMHO trying to investigate a website purely through screenshots is like trying to review a restaurant meal by merely looking at a photo of it. Simple screenshots may suffice for very simple investigations, but any thorough investigation needs to look deeper below the surface, both to verify the content and to develop further lines of enquiry.

In these cases investigators should aim to capture the true experience of navigating a website, with all the local content, remote content, underlying scripts and frameworks preserved, and, crucially, the archived copy should be interactive so that steps can be replicated and results verified. If we can achieve this objective while retaining the convenience of a screenshot tool, all the better.

Properly archiving a website offers a number of additional advantages:

  1. We only have to visit the site on a single occasion, leaving the smallest possible footprint on the server logs and avoiding the dreaded pitfall of multiple website visits being interpreted as surveillance;

  2. Content will be captured in a format that allows advanced investigative techniques to be conducted without alerting the suspect or website operator;

  3. Anyone can replicate the steps we have taken to identify relevant information in the website content and (hopefully) come to the same conclusions;

  4. The entire website archive can be presented as an evidential product to a Prosecutor;

  5. The content, as seen at the time of the capture, is preserved, so any subsequent changes made by the website operator will not impact our investigation.

There are many techniques an investigator can use to develop lines of enquiry from their target website which can also be conducted on an archived copy; the consistently excellent osintcurio.us has published several examples.

I’m going to focus on a couple of open source tools (this is an OSINT blog after all) that I have found to be simple yet effective options for archiving target websites.

SingleFile

My first choice is not strictly an ‘archiving tool’, but it is certainly a significant upgrade on the standard screenshot program. SingleFile (and SingleFileZ) allow users to convert a web page into a single HTML file that faithfully reproduces the appearance of the page. They are available as browser extensions for Firefox and Chromium-based browsers, or as a CLI tool. SingleFile captures not only the appearance of a website but also, importantly for us, the underlying source code and scripts.

Advantages

  • Single click to capture (with browser extension);
  • Webpage preserved as an HTML file (or a compressed version of the same if using SingleFileZ), which can be opened in any web browser.
  • Does a great job of reproducing the visuals of the original website, even with modern dynamically built websites.
  • The page source is also captured, with an additional note recording the local system time of the capture and that it was made using SingleFile, which is excellent for auditing purposes.

[Image: SingleFile capture stamp]

  • A command-line version is available for automating page captures (see the sketch below).
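As a rough illustration of that command-line option, here is a minimal Python sketch that shells out to the SingleFile CLI for a batch of URLs. It assumes the CLI has been installed separately (for example via the single-file-cli npm package) and that a `single-file` binary is on the PATH; the URLs, directory and file names are purely illustrative.

```python
# Batch capture a list of URLs with the SingleFile CLI (assumed to be on PATH).
import subprocess
from datetime import datetime, timezone
from pathlib import Path

urls = [
    "https://example.com/",          # placeholder targets
    "https://example.org/about",
]

out_dir = Path("captures")
out_dir.mkdir(exist_ok=True)

for url in urls:
    # Timestamped filename so repeated captures never overwrite each other.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe_name = url.split("//", 1)[1].strip("/").replace("/", "_")
    out_file = out_dir / f"{stamp}_{safe_name}.html"
    subprocess.run(["single-file", url, str(out_file)], check=True)
    print(f"Saved {url} -> {out_file}")
```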

Drawbacks

  • Cannot reproduce full behaviour of all interactive elements such as drop-down menus.
  • Internal and external hyperlinks are preserved, so too much clicking about in the preserved copy will land you back on the original site or a linked site. BEWARE (the sketch after this list shows one way to enumerate the domains a capture links to).
  • Struggles with some infinite scrolling sites such as Instagram and Facebook;
  • Does not capture embedded video content.
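On the hyperlink point, a simple precaution is to list which external domains a capture links to before you start clicking around in it. The sketch below uses only the Python standard library; the capture filename is a placeholder.

```python
# List the external domains linked from a SingleFile capture, so you know
# where an accidental click inside the saved copy could take you.
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the host names of all absolute links in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            if host:                      # ignore relative and fragment links
                self.hosts.add(host)

collector = LinkCollector()
with open("captures/example.html", encoding="utf-8", errors="ignore") as f:
    collector.feed(f.read())

for host in sorted(collector.hosts):
    print(host)
```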

Webrecorder Project


In my opinion this is a hidden gem among open-source tools. The project includes an archiving tool [archiveweb.page] and an archive replay tool [replayweb.page]. ArchiveWeb.page is available as a browser extension for all Chromium-based browsers and as a standalone desktop app. Webrecorder is a fully functional archiving tool with the convenience of a standard screenshot tool.

Advantages

  • This is by far the best tool I have ever used for capturing dynamic websites and interactive content; it reproduces content extremely accurately.
  • Ease of use is tremendous: simply set the browser extension going and browse as normal, and visual feedback will tell you when the page has been fully captured.
  • The nifty autopilot feature will handle infinite-scroll websites and embedded videos (including YouTube);
  • Archives are saved in the industry-standard WARC or WACZ format, which can be viewed via replayweb.page or any other web archive player (see the sketch after this list).
  • When reviewing the archive the user will be warned if they attempt to visit any external resources, a truly useful function for those looking to minimise their footprint on the subjects of their investigation.
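Because a WACZ file is just a ZIP container, a capture can also be inspected programmatically. The sketch below assumes the archive follows the published WACZ layout, with a pages/pages.jsonl index inside the ZIP, and uses a placeholder filename; treat it as a starting point rather than a guaranteed structure.

```python
# Peek inside a .wacz archive (a ZIP container) and list the captured pages.
import json
import zipfile

with zipfile.ZipFile("my-capture.wacz") as wacz:
    with wacz.open("pages/pages.jsonl") as pages:
        for line in pages:
            record = json.loads(line)
            if "url" in record:          # skip the header record
                print(record.get("ts", "?"), record["url"], record.get("title", ""))
```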

Drawbacks

  • Archives can get pretty big pretty quickly, especially if there is embedded video in the web pages.
  • Automating Webrecorder is not straightforward, but it can be achieved using the pywb toolkit (also available from webrecorder.net); a minimal sketch follows this list.
  • Loading and reviewing the archives can be a little slow, but I’m really splitting hairs here.
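To expand on the automation point above, here is a minimal sketch of loading a WARC capture into a local pywb collection and serving it for review. It assumes pywb has been installed (for example with pip install pywb) and that its wb-manager and wayback commands are on the PATH; the collection name and WARC path are placeholders.

```python
# Load a WARC capture into a local pywb collection and start the replay server.
import subprocess

collection = "investigation-001"
warc_path = "captures/target-site.warc.gz"

# Create the collection (only needed once) and add the capture to it.
subprocess.run(["wb-manager", "init", collection], check=True)
subprocess.run(["wb-manager", "add", collection, warc_path], check=True)

# Start the replay server; by default the collection is browsable at
# http://localhost:8080/investigation-001/
subprocess.run(["wayback"], check=True)
```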

Summary

I see SingleFile as a direct replacement for your standard screenshot tool: it is just as convenient to use, but you get a big increase in usability and auditability. For any complex or sensitive investigation, Webrecorder is a fantastic option. In my experience the only thing that comes close is the mighty Hunchly, but even that struggles to replicate website content as accurately as Webrecorder. For a free tool I really don’t think there is any competition if you are looking for an archiving tool to add to your OSINT investigation toolkit.