Linux Sleuthing: August 2012

Thursday, August 23, 2012

Safari History without History.plist

I came back from 7 days of training to a nightmare: impatient criminal investigators booted an Apple MacBook to "see" what the owner had been doing on the Internet in the days leading up to a crime. The laptop had been in evidence for over a year but never examined (no request for examination was made), and trial was looming. A change in the defendant's plea less than two weeks before trial necessitated an examination, but no qualified examiners were immediately available. So, someone thought it was a good idea just to boot the MacBook and take a look around.

Safari decided that the History.plist was too old and created a new one with the websites that the investigators tried to visit--apparently from bookmarks. So, I was faced with trying to determine a browsing history absent the relevant History.plist.

Assuming the old History.plist was not overwritten, carving 500 GB for a plist was not a good option in the amount of time I had to do the work. I was aware from past examinations that Safari keeps web page elements in a sqlite database called Cache.db. It is found in "/Users/<username>/Library/Caches/com.apple.Safari". You may be familiar with that location, because it is the home of a directory called "Webpage Previews" in which are stored jpeg images of visited webpages.

The Cache.db contains four tables: cfurl_cache_schema_version, cfurl_cache_response, sqlite_sequence, cfurl_cache_blob_data. The tables relevant to this discussion are cfurl_cache_response and cfurl_cache_blob_data, which I will refer to as Response and Blob for ease of discussion. I will not try to describe every element of these tables, just those I chose to implement my solution.

The Response Table

The Response table has the following schema:

CREATE TABLE cfurl_cache_response(
entry_ID INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
version INTEGER, hash_value INTEGER,
storage_policy INTEGER,
request_key TEXT UNIQUE,
time_stamp NOT NULL DEFAULT CURRENT_TIMESTAMP)

The request_key field contains the URL of the webpage element (html, images, flash, javascript, css, etc.). The time_stamp field is in plain language (yyyy-mm-dd hh:mm:ss) and does not require conversion from an epoch time. The entry_ID is a unique integer that relates directly to the data in the Blob table with a corresponding entry_ID.

The Blob Table

The Blob table has the following schema:

CREATE TABLE cfurl_cache_blob_data(
entry_ID INTEGER PRIMARY KEY,
response_object BLOB, request_object BLOB,
receiver_data BLOB,
proto_props BLOB,
user_info BLOB)

The receiver_data field contains the object downloaded from the URL in the Response table request_key. The response_object and request_object fields contain XML that provides metadata about the receiver_data content. And, again, the entry_ID is a unique integer for this table that corresponds directly to an entry_ID in the Response Table.

Making History

The Response table and the Blob table have a 1 to 1 correspondence, i.e., there are equal numbers of records in each table. Response entry_ID 1 corresponds to Blob entry_ID 1, and so on. However, not all the records are relevant to a history of visited webpages. Most are elements of webpages, not the HTML that represents the webpage.

The URLs in the Response table request_key field do not help us much here: many webpages do not end with ".html". A quick look at Google search results will bear this out. Therefore, we can't filter the URLs in any meaningful way. We could try to exclude certain objects, like .jpg, .png, .js, .css, .swf, etc., but this is difficult and not reliable.

I settled on filtering the Blob table receiver_data content for the HTML tag that represents a webpage (or at least HTML content). The query uses an implicit join (two tables listed in the from clause and matched in the where clause) that marries the tables on the entry_ID field and then only returns the records that have the leading "<html>" tag:

sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '<html>%'"

The query above reduced a database containing 6700 elements to 187 webpages. I expanded this to any item with an HTML tag, in the event of malformed pages, by placing a leading wildcard before the tag:

sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '%<html>%'"

By adding the leading wildcard, my history increased to 261 records. I can redirect the output to a file for analysis, and I can modify the query to actually export the data in the receiver_data field to view the actual content of any pages of interest.
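
As a rough sketch of that export step, the sqlite3 command line shell offers a writefile() function that can dump a single receiver_data blob to disk. The helper name export_entry and the output file name are placeholders of my own choosing, and the sketch assumes a sqlite3 build recent enough to include writefile():

```shell
# export_entry DB ENTRY_ID OUTFILE  (hypothetical helper, not part of sqlite3)
# Writes the receiver_data blob for one entry_ID to OUTFILE.
export_entry() {
    db=$1
    id=$2
    out=$3
    # writefile() returns the number of bytes written; discard that output.
    sqlite3 "$db" "select writefile('$out', receiver_data) from cfurl_cache_blob_data where entry_ID = $id;" > /dev/null
}
```

Something like "export_entry Cache.db 42 entry_42.html" would then write the content stored under entry_ID 42 (an arbitrary example number) to a file that can be opened in a browser or text editor.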

EDIT:
Better still, because file headers can vary in HTML documents (think "doctype" strings), is to search for the HTML footer tag "/html" (brackets excluded intentionally):

sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '%/html%'"

By allowing for a variable header string, my HTML history increased to 328 records. I have since written a utility to export the files in the Cache.db and have checked the files by mime-type: there were 349 html files detected. I hope to reconcile this in the near future.

Making Sense

I can't take the time to break down the queries right now, but I would like to highlight one expression that might not be familiar to casual sqlite users:

sqlite3 -header Cache.db "select r.entry_ID, time_stamp, request_key from cfurl_cache_blob_data c , cfurl_cache_response r where c.entry_id=r.entry_id and receiver_data like '<html>%'"

In the from clause above (from cfurl_cache_blob_data c , cfurl_cache_response r), I list the two tables that are the subject of the query. The trailing "c" and "r" behind the full table names are aliases for the full names. The aliases save a lot of typing, and you see them employed in the select clause, which tells sqlite which entry_ID I desire (since it exists in both tables), and in the where clause.
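
For comparison, here is a sketch of the same query written with an explicit "join ... on" clause, which some readers find easier to follow than the comma-separated form. The wrapper name html_history is made up for illustration; the table and column names come from the schemas above.

```shell
# html_history DB  (hypothetical wrapper)
# Lists entry_ID, time_stamp and URL for cached items whose content starts with <html>.
html_history() {
    sqlite3 -header "$1" "
        select r.entry_ID, r.time_stamp, r.request_key
        from cfurl_cache_response r
        join cfurl_cache_blob_data c on c.entry_ID = r.entry_ID
        where c.receiver_data like '<html>%';"
}
```

The aliases "r" and "c" do the same job here: they say which table each column comes from without repeating the long table names.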

If anyone has another idea on how to accomplish this task or is aware of any shortcomings to this approach, please comment. After all, I'm trying to make lemonade from the lemons I've been handed...

Thursday, August 9, 2012

File "Cliffs" Notes: Abbreviating File Reads

Sometimes a forensics task is pretty narrow. Determining file type means reading the first few tens of bytes in most cases. Luckily there are tools that do just that. But for other tasks we might face, our standard tools don't really give us a method for narrowing their scope. You might want to search files for a string that you expect to find in the first 1000 bytes of the file. Grep will allow you to stop searching a file once the string is found, but it won't stop searching until it finds the match. This might not seem like a big deal until you consider large files like videos. You really want to search 1k of data, but you end up searching 2 GB!

Head

The head command is used to output the first part of a file. By default, it outputs the first 10 lines of a file (best when applied to text files), but it can be used to output a user-defined number of bytes.

$ head -c 1000 file # Export the first 1000 bytes of "file". With GNU head, "1kB" can be used as a shortcut for "1000" (a bare "1K" means 1024). Similarly, the tail command can be used to export data from the end of a file.
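
To tie this back to the search problem, here is a minimal sketch that pipes a fixed number of leading bytes into grep so that the rest of the file is never read. The function name match_in_head is my own invention, not a standard tool.

```shell
# match_in_head BYTES PATTERN FILE  (hypothetical helper)
# Succeeds (exit status 0) if PATTERN occurs within the first BYTES bytes of FILE.
match_in_head() {
    # -q keeps grep silent; only its exit status is used.
    head -c "$1" -- "$3" | grep -q -- "$2"
}
```

For example, "match_in_head 1000 Exif video.mov && echo found" examines only the first 1000 bytes of video.mov, no matter how large the file is.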

You can stop reading here if you only need to output a fixed number of bytes from one file. But, usually the forensic examiner is concerned with searching through hundreds of thousands of files, if not millions!

A Real-World Example

Let's say we have a disk image containing hundreds of thousands of graphics, but we only want to view photographs created with a digital camera. Graphics viewing programs like Gthumb have the ability to recurse through a file system and find graphics images, and even provide for some search limits like file size, modified date, etc. But these filters don't really help us here.

We need a way to limit the graphics we view to digital photographs. The best way for us to do this is to limit our search for images based on the presence of Exif data. The term Exif stands for Exchangeable Image File Format. Exif data contains camera and setting information and was developed to encourage interoperability between digital devices (see exif.org). It is found before the image data in a jpeg file. Most digital cameras use this standard.

Exif data is preceded by the header 'Exif' found 25 bytes into a jpeg file. So, it seems logical to search graphics files for the Exif header. But how do we search for graphics files, and then search only the first 28 bytes for the term 'Exif'?

Find and File

The find command can be used to search recursively through a file system. It's a very robust tool, but I'll only focus on one of the simplest invocations here: searching a path recursively for any file:

$ find path/to/search/ -type f # The "f" argument to the "-type" option returns only files to standard output. The "d" argument would limit output to directories, etc.

We can further limit our output based on file name, if we wish:

$ find path/to/search -type f -iname "*.jp*g" # The -iname option matches the following argument with case insensitivity. Note that standard wildcards are accepted. The -name option would produce case sensitive results.

We might be looking for graphics that have been renamed, however, and limiting our search by name is too narrow. This is where the file command comes in. File reads a file's header and returns its type. We can pipe files located with find and read their type thusly:

$ find path/to/search -type f | file -i -f - # The -i option returns the file's MIME type, which is more standardized than the "plain language" descriptions otherwise provided. The -f option tells file to read the names of the files to examine from a list, and the '-' after it is a shortcut to stdin. In other words, file examines each path returned by the find command.

We now have a list of files with their types being returned to the standard output. What we want to do is filter those files for images, and then get a subset of images with exif data. To do so, insert a while loop and add a test:

$ find path/to/search -type f | while read i; do file -ib "$i" | grep "image"; [ $? = 0 ] && echo "$i"; done # Here, each file located with find is assigned to variable $i in the while loop. Inside the loop, the file type of $i is determined and if the type contains the string "image", the file path is printed. The "-b" option was added to the file command to produce "brief" output, that is output that includes the file type absent the file name (this avoids false hits where the term "image" is in the file path).

If you're not familiar with bash tests, please review this post, or on the BASH command line, enter "help test". In brief, we are testing the grep exit code, and if it is '0', which means success, we execute the echo command to print the file path.

We've now managed a list of files that are classified as images. We need to determine which have Exif data to complete our quest. We do this by integrating one more test:

$ find path/to/search -type f | while read i; do file -ib "$i" | grep "image"; [ $? = 0 ] && head -c 28 "$i" | grep "Exif"; [ $? = 0 ] && echo "$i"; done # Here, if the file is determined to be an image, the first 28 bytes are grepped for the string "Exif". If that results in a match, the filename is printed.
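
The same logic can also be split into small functions, which some may find easier to read and reuse. This is only a sketch: the function names are hypothetical, and grep -q is used so that only the file paths are printed, not the lines grep matched.

```shell
# has_exif FILE: succeed if the string "Exif" occurs in the first 28 bytes.
has_exif() {
    head -c 28 -- "$1" | grep -q 'Exif'
}

# is_image FILE: succeed if file(1) reports a MIME type beginning with "image/".
is_image() {
    file -ib -- "$1" | grep -q '^image/'
}

# list_exif_images DIR: print the path of each image under DIR that has an Exif header.
list_exif_images() {
    find "$1" -type f |
    while IFS= read -r f; do
        # Each test runs only if the one before it succeeded.
        is_image "$f" && has_exif "$f" && printf '%s\n' "$f"
    done
}
```

Calling "list_exif_images path/to/search" should then print the same list of paths as the one-liner. IFS= and the -r option keep read from trimming leading spaces or interpreting backslashes in file names.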

Okay, you caught me: I only produced a list of files with Exif headers and I can't see the content of the images by reading a list. This is easily rectified with the ln or 'link' command. We can replace the last echo with a command to create link files to the images we located allowing for all the files to be viewed from one location:

$ find path/to/search -type f | while read i; do file -ib "$i" | grep "image"; [ $? = 0 ] && head -c 28 "$i" | grep "Exif"; [ $? = 0 ] && ln -s "$i" "${i##*/}"-$(stat -c %i "$i"); done # The -s option creates a symbolic link to file $i in the current directory. The link will be named for the file's basename and have the inode number appended to prevent errors that occur if the files have the same name.

Don't worry about the "${i##*/}"-$(stat -c %i "$i") razzle-dazzle for now. I'll explain that in another post. For now, just mimic that part of the command if you are concerned about files having the same name.
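
For the curious, here is a short sketch of what that expression produces, pending the fuller explanation. The helper name link_name is hypothetical, and the stat invocation assumes GNU stat as found on most Linux systems.

```shell
# link_name PATH  (hypothetical helper)
# Prints the name the symbolic link would receive: basename, a dash, then the inode number.
link_name() {
    # ${1##*/} removes the longest leading match of "*/", leaving only the basename.
    base=${1##*/}
    # With GNU stat, the %i format prints the file's inode number.
    inode=$(stat -c %i -- "$1")
    printf '%s-%s\n' "$base" "$inode"
}
```

Two files named photo.jpg in different directories have different inode numbers on the same file system, so their link names will not collide.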

A Little More Detail for Those Still Conscious

It turns out that Exif is really not the best string on which to filter. More effective was filtering on an exif creation date. Consider a recent case where I carved jpegs from unallocated clusters. Here, I could filter on file name over file type because the carver automatically appended the .jpg file extension.

Total number of files carved:

$ find recovered/ -type f | wc -l # The wc or "word count" command can be used to count lines of output
231331

My total number of jpeg images:

$ find recovered/ -name "*.jpg" | wc -l
61432

Total number of jpeg images with Exif data:

$ find recovered/ -name "*.jpg" | while read i; do head -c 28 "$i" | grep Exif ; done | wc -l
3815

Total number of jpeg images with Exif creation dates:

$ find recovered/ -name "*.jpg" | while read i; do head -c 256 "$i" | grep -E '[0-9]{4}:[0-9]{2}:[0-9]{2}\ [0-9]{2}:[0-9]{2}:[0-9]{2}'; done | wc -l # The regular expression matches output like "2012:08:09 17:11:56"
841

As you can see, filtering on a date code was much more effective at finding files with metadata associated with cameras (file creation dates) than filtering on the Exif header alone.

I know that there is a lot of information here, but I hope by presenting it completely you are able to understand and adapt the concepts to your own uses.

Thursday, August 2, 2012

Obtaining USB Device Details

It's very often the desire of forensic investigators to determine if a particular USB flash drive has been mounted in a computer that is the subject of an examination. This brief post will not cover the various methods used to check the Windows Registry or other OSes for a history of mounted devices, but instead how to extract USB device details to match to the Windows Registry artifacts.

usbutils

The usbutils package includes lsusb, a tool to list USB devices. It can be used to find the device manufacturer (not always apparent from the exterior), the serial number, the vendor ID, and other information that can be used to identify the device in a computer system log or settings file.

The basic command, lsusb, lists all attached usb devices, including hubs:

$ lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 8087:0020 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 002: ID 8087:0020 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 003: ID 0557:8021 ATEN International Co., Ltd
Bus 002 Device 004: ID 0557:2213 ATEN International Co., Ltd CS682 2-Port USB 2.0 DVI KVM Switch
Bus 001 Device 004: ID 0930:6545 Toshiba Corp. Kingston DataTraveler 102 Flash Drive / HEMA Flash Drive 2 GB / PNY Attache 4GB Stick
Bus 001 Device 005: ID 1058:1111 Western Digital Technologies, Inc.

The Kingston device is the device of interest. As you can see, the basic output lets us determine the bus, device number, and vendor and product IDs (ID 0930:6545) but not the serial number or other device details. From the basic help, we can see how to address the specific device of interest to obtain more information:

Usage: lsusb [options]...
List USB devices
-v, --verbose
Increase verbosity (show descriptors)
-s [[bus]:][devnum]
Show only devices with specified device and/or
bus numbers (in decimal)
-d vendor:[product]
Show only devices with the specified vendor and
product ID numbers (in hexadecimal)
-D device
Selects which device lsusb will examine
-t
Dump the physical USB device hierarchy as a tree
-V, --version
Show version of program

Therefore, to obtain a full set of details of the Kingston flash drive, we can use the -D option and provide the device path as an argument. We determine the device path using the Bus/Device information from the base lsusb output. Because we are addressing a device, we need root privileges, so su to root or use sudo:

# lsusb -D /dev/bus/usb/001/004
Device: ID 0930:6545 Toshiba Corp. Kingston DataTraveler 102 Flash Drive / HEMA Flash Drive 2 GB / PNY Attache 4GB Stick
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
idVendor 0x0930 Toshiba Corp.
idProduct 0x6545 Kingston DataTraveler 102 Flash Drive / HEMA Flash Drive 2 GB / PNY Attache 4GB Stick
bcdDevice 1.00
iManufacturer 1 Kingston
iProduct 2 DT 101 G2
iSerial 3 001CC0EC346EEC11######## (redacted)
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 32
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0x80
(Bus Powered)
MaxPower 200mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 2
bInterfaceClass 8 Mass Storage
bInterfaceSubClass 6 SCSI
bInterfaceProtocol 80 Bulk-Only
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Device Qualifier (for other device speed):
bLength 10
bDescriptorType 6
bcdUSB 2.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
bNumConfigurations 1
Device Status: 0x0000
(Bus Powered)

As you can see, we get a nice, report-worthy display of the device details, including the serial number. The output can be redirected to a file with the usual redirection operators, or sent to the terminal and a file at the same time with the tee command:

# lsusb -D /dev/bus/usb/001/004 | tee device_details.txt

Finally, the verbose option can be used to obtain the same device details as the -D option and without the need to exercise root privileges or address the device, but there's a catch: you get verbose output of all USB devices. Pick your poison.
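
The step of turning the Bus/Device numbers into a device path, described earlier, can be sketched as a small filter. The function name usb_device_path is hypothetical; it only rearranges text in the format that the basic lsusb command prints.

```shell
# usb_device_path VENDOR:PRODUCT  (hypothetical helper)
# Reads lsusb-style lines on stdin and prints the /dev/bus/usb path of each matching device.
usb_device_path() {
    awk -v id="$1" '$5 == "ID" && $6 == id {
        sub(/:$/, "", $4)                      # drop the colon after the device number
        printf "/dev/bus/usb/%s/%s\n", $2, $4  # field 2 is the bus, field 4 the device
    }'
}
```

With the flash drive above attached, "lsusb | usb_device_path 0930:6545" should print /dev/bus/usb/001/004, which can then be handed to lsusb -D.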

Wednesday, August 1, 2012

Waiting on Long Processes? Don't!

The Scene

[Cue dramatic organ music] You started imaging a large hard drive, and you need to follow that with a md5sum of the device to verify the data hash in the image is the same as the device. But, it's time to go home and the image isn't complete! [dunt, Dunt, DUN!] Yes, you could have written your command to start the md5sum program upon completion of the acquisition, but you didn't. Is there anything you can do to ensure you can go home to a warm dinner AND still complete the hashing operation?

You bet!!

Much of life, computing and digital forensics included, is about processes. First "A" happens, then "B", and next "C." Forensic disk imaging is like that: First attach the device, then image the device, and next verify the image. GUI tools can be handy for this in that they can be configured to do several sequential steps for us. Take the Guymager imaging tool for example: it allows you to verify the image with a separate hash of the device, and it performs the imaging and verification, one after the other, without your intervention.

But what if you are working on the command line? How can you start one process immediately after another completes? Back to our scenario...

What You Could Have Done

The whole issue could have been solved at the BASH command line when you issued your acquisition command. BASH allows control operators to control the flow of your commands. You have the ability to run command_1 AND command_2, or alternatively, command_1 OR command_2, for example. The linchpin is the exit status (success or failure) of command_1.

With the AND operator, which is represented in BASH by "&&", command_2 will execute if command_1 was successful (exit status 0).
With the OR operator, which is represented in BASH by "||", command_2 will execute if command_1 was unsuccessful (non-zero exit status).

So, if I wanted to make an Expert Witness Format image with ewfacquire, a tool from the libewf library, and follow that automatically with a hash of the device, I could:

# ewfacquire /dev/sdd && md5sum /dev/sdd

With that command, the md5sum of /dev/sdd will be calculated if ewfacquire is successful. If ewfacquire fails for some reason, then the hashing operation will not execute.
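
A harmless way to see both operators at work is to substitute the true and false commands, which do nothing except exit with status 0 and 1 respectively. The function name report is just a placeholder for this sketch.

```shell
# report COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND, then prints which operator's right-hand side fired.
report() {
    "$@" && echo "AND branch ran" || echo "OR branch ran"
}

report true     # exit status 0, so the command after && runs
report false    # non-zero exit status, so the command after || runs
```

The first call prints "AND branch ran" and the second prints "OR branch ran".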

Woulda, Coulda, Shoulda!

But in our scenario, you are already hours into an acquisition and it would be counterproductive to stop the ewfacquire process just to use the AND operator described above. How can you cause the md5sum program to run after the acquisition in such a circumstance? Well, two ways, as a matter of fact.

Wait

The wait command is a BASH builtin that takes a process ID as an argument. When the process terminates, wait exits. It can be used much like the example above:

# wait 1972 && md5sum /dev/sdd

Thus, when wait exits successfully, the md5sum operation begins. If that seems too easy, in a sense, it is. Wait only works on processes of the same shell session. That means to use wait, you'd have to interrupt the ewfacquire process with ctrl-z, run it in the background with bg, and then determine its process ID before wait would work. You may find this hard to pull off, because ewfacquire messages will still print to stderr, making working in the shell difficult.
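
Here is a small sketch of wait in action, using a one-second sleep as a stand-in for the acquisition. The function name demo_wait is made up; the point is that the background job and the wait live in the same shell.

```shell
# demo_wait  (hypothetical demonstration)
demo_wait() {
    sleep 1 &     # stand-in for a long-running job started in THIS shell
    pid=$!        # $! holds the process ID of the most recent background job
    # wait returns the job's exit status, so && fires only if the job succeeded.
    wait "$pid" && echo "job $pid finished; the hash could start now"
}
```

Running wait against a process ID that belongs to a different shell session fails, because that process is not a child of the current shell.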

A better way

You can accomplish the same result as wait without backgrounding the ewfacquire process through the BASH test builtin command. With test, it is possible to open another shell, determine the process ID of ewfacquire, and test for the presence of the process in the second shell. Consider:

# ps -opid= -C ewfacquire
1972
# while [ -d /proc/1972 ]; do sleep 60; done && md5sum /dev/sdd

The first command, ps, shows current processes. The arguments here format the output to show only the process ID (-opid=) and search for the process by name (-C ewfacquire). There are many ways to determine the process id, and this command is borrowed from here so you can see how it could be incorporated into a script.

The while loop uses test in the form of [ -d /proc/1972 ] to check for the presence of the /proc/1972 directory. Every running process has a directory in the /proc file system which is removed when the process ends. The sleep command pauses the while loop 60 seconds at a time. The loop successfully terminates when the process directory is removed, causing md5sum to execute.
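
If you find yourself doing this often, the loop can be wrapped in a function. This is a sketch under the same assumption as above (a Linux /proc file system), and the name after_pid is hypothetical.

```shell
# after_pid PID INTERVAL COMMAND [ARGS...]  (hypothetical helper)
# Polls /proc/PID every INTERVAL seconds and runs COMMAND once the directory is gone.
after_pid() {
    pid=$1
    interval=$2
    shift 2                      # what remains in "$@" is the command to run
    while [ -d "/proc/$pid" ]; do
        sleep "$interval"
    done
    "$@"
}
```

For the scenario above, that would be "after_pid 1972 60 md5sum /dev/sdd". Keep in mind that, unlike wait, this approach cannot see the exit status of the watched process: the command runs once the process is gone, whether or not it succeeded.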

It may seem complicated at first blush, but it's really not. And, you get to go home to a warm meal instead of cold leftovers for your efforts.