Thursday, August 23, 2012

Safari History without History.plist

I came back from 7 days of training to a nightmare: impatient criminal investigators booted an Apple Macbook to "see" what the owner had been doing on the Internet in the days leading up to a crime.  The laptop had been in evidence for over a year but never examined (no request for examination was made), and trial was looming.  A change in the defendants plea less than two weeks before trial necessitated an examination, but no qualified examiners were immediately available.  So, someone thought it was a good idea just to boot the Macbook and take a look around .

Safari, decided that the History.plist was tool old and created a new one with the websites that the investigators tried to visit--apparently from bookmarks.  So, I was faced with trying to determine a browsing history absent the relevant History.plist.

Assuming the old History.plist was not overwritten, carving 500gb for a plist was not a good option in the amount of time I had to do the work.  I was aware from past examinations that Safari keeps web page elements in a sqlite database called Cache.db.  It is found in "/Users/<username>/Library/Caches/com.apple.Safari".  You may be familiar with that location, because it is the home of a directory called "Webpage Previews" in which are stored jpeg images of visited webpages.

The Cache.db contains four tables: cfurl_cache_schema_version, cfurl_cache_response, sqlite_sequence, cfurl_cache_blob_data.  The relevant tables to this discussion are cfurl_cache_response and cfurl_cache_blob_data, which I will refer to as Response and Blob for ease of discussion.  I will not try to describe every element of these tables, just those I chose to implement my solution.

The Response Table

The Response table has the following schema:
CREATE TABLE cfurl_cache_response(
  entry_ID INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
  version INTEGER, hash_value INTEGER,
  storage_policy INTEGER,
  request_key TEXT UNIQUE,
  time_stamp NOT NULL DEFAULT CURRENT_TIMESTAMP)
The request_key field contains the URL of the webpage element (html, images, flash, javascript, css, etc.).  The time_stamp field is in plain language (yyyy-mm-dd hh:mm:ss) and does not require conversion from an epoch time.  The entry_ID is a unique integer that relates directly to the data in the Blob table with a corresponding entry_ID.

The BLOB Table

The Blob table has the following schema:
CREATE TABLE cfurl_cache_blob_data(
  entry_ID INTEGER PRIMARY KEY,
  response_object BLOB, request_object BLOB,
  receiver_data BLOB,
  proto_props BLOB,
  user_info BLOB)
The receiver_data field contains the object downloaded from the URL in the Response table request_key. The response_object and receiver_data fields contain XML that provide metadata about the receiver_data content.  And, again, the entry_ID is a unique integer for this table that corresponds directly an entry_ID in the Response Table.

Making History

The Response table and the Blob table have a 1 to 1 correspondence, i.e., there are equal numbers of records in each table.  Response entry_ID 1 corresponds to Blob entry_ID 1, and so on.  However, not all the records a relevant to a history of visited webpages.  Most are elements of webpages, not the HTML that represents the webpage.  

The URLs in the Response table request_key field do not help us much here: many webpages do not end with ".html".  A quick look at Google search results will bear this out.  Therefore, we can't filter the URLs in any meaningful way.  We could try to exclude certain objects, like .jpg, .png, .js, .css, .swf, etc., but this is difficult and not reliable.

I settled on filtering the Blob table receiver_data content for the HTML tag that represents a webpage (or at least HTML content).  The query uses a "natural join" syntax that marries the tables on the entry_ID field and then only returns the records that have the leading "" tag:
sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '<html>%'" 
The query above reduced a database containing 6700 elements to 187 webpages.  I expanded this to any item with an HTML tag, in the event of malformed pages, by placing a leading wildcard before the tag:
# sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '%<html>%'"
By adding the leading wildcard, my history increased to 261 records.  I can redirect the output to a file for analysis, and I can modify the query to actually export the data in the  receiver_data field to view the actual content of any pages of interest.

EDIT:
Better still, because file headers can very in HTML documents (think "doctype" strings), is to search for the HTML footer tag "/html" (brackets excluded intentionally):
sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '%/html%'"
By allowing for a variable header string, my HTML history increased to 328 records.  I have since written a utility to export the files in the Cache.db and have checked the files by mime-type: there were 349 html files detected.  I hope to reconcile this in the near future.

Making Sense


I can't take the time to break down the queries right now, but I would like to highlight one expression that might not be familiar to casual sqlite users:
sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '<html>%'"
In the highlighted portion above, I list the two tables that are the subject of the query.  The trailing "c" and "r" behind the full table names are aliases for the full names.  The alias save a lot of typing, and you see them employed in the select clause, which tells sqlite which entry_ID I desire (since it exists in both tables) and in the where clause.

If anyone has another idea on how to accomplish this task or is aware of any shortcomings to this approach, please comment.  After all, I'm trying to make lemonade from from the lemons I've been handed...