Tuesday, July 26, 2011

BlackBerry Messenger / Google Talk for BlackBerry Save-files

My last post, "BlackBerry Text Message Parsing, AKA, Why I use Linux for Forensics," had unintended but pleasant consequences: it introduced me to BlackBerry forensics expert Shafik Punja.  I really wrote the post to comment on the versatility of the Linux operating system, and fortunately I erred in calling the file that was the subject of my discussion a "BlackBerry Text Message" save file, rather than the BlackBerry Messenger (BBM) save file that it was.

Shafik corrected me and taught me a bit more about the file type that I'll share here in the event that you missed his comments. BBM does not save message content by default. Saving is a user-selected option, and is only available in BBM v5 and higher. The files can be saved to the device's internal memory or an installed memory card. Google Talk has the same save options and defaults as BBM, and also saves in the same basic format.

Shafik asked if I could write a program to parse the files. I agreed and quickly converted the basic Bash script in my last post to a more complete program. However, when Shafik tried to run the program in OS X, it failed because OS X and Linux do not utilize the same 'date' command. I was faced with either writing a new program for use in OS X, which I wished to support, or making my current script intelligently determine the OS in which it was running and apply the correct date command syntax.

Nothing seems like a bigger waste to me than to write essentially the same thing twice.  Isn't there a way to write a program that doesn't have to be altered based on the operating system in which it is run, I asked?  Then a little Australian-accented voice of reason reached my conscious thought: why don't you learn python?  For the millionth time: Thank you Michael, good idea.

Michael Cohen, a software engineer at Google and author/maintainer of pyFLAG, AFF4, and other tools soon to be released, has suggested many times over that I learn python, but I only recently started to do so. Python is an excellent solution for cross platform programs, and I decided it was time to start.

The BBM / GTalk file format

BBM and GTalk save files are CSV text files with the following basic structure:
Application Name, version[newline]
DateCode, "Sender_ID", "Receiver_ID", Message[newline]
...

The first line of the file is the source application and the application version, separated by a comma. The version is followed by a newline control code.

The remaining lines are formatted thus: the line starts with a DateCode consisting of a 21-digit number. The first 8 digits are the date of the sent message rendered as YYYYMMDD. The remaining 13 digits are Unix epoch time, rendered in milliseconds. In the case of BBM, the sender Hex_ID follows the DateCode, separated by a comma and encapsulated in quotation marks. Similarly, the receiver's Hex_ID follows. For GTalk, the sender and receiver are identified by Google Account ID. Finally, the message, in text, is stored followed by a newline control code.
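To make the DateCode layout concrete, here is a minimal python3 sketch that splits one into its two parts. The helper name split_datecode is my own invention for illustration, and the sample value is the one from the July 21 post. I render the epoch portion in UTC here so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone

def split_datecode(datecode):
    """Split a 21-digit DateCode into its plain date and epoch parts."""
    if len(datecode) != 21 or not datecode.isdigit():
        raise ValueError('expected a 21-digit DateCode: {!r}'.format(datecode))
    plain_date = datecode[:8]        # first 8 digits: YYYYMMDD
    epoch_ms = int(datecode[8:])     # last 13 digits: milliseconds since 1970-01-01 UTC
    # Convert whole seconds first, then add the milliseconds back as microseconds.
    seconds, millis = divmod(epoch_ms, 1000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    stamp = stamp.replace(microsecond=millis * 1000)
    return plain_date, epoch_ms, stamp

plain_date, epoch_ms, stamp = split_datecode('201010181287467321760')
print(plain_date, epoch_ms, stamp.isoformat())
# prints: 20101018 1287467321760 2010-10-19T05:48:41.760000+00:00
```

Note that in UTC this sample falls on October 19, while the plain-text prefix reads 20101018. In this one sample, at least, the prefix agrees with the local (PDT) time rather than UTC.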

One minor issue in rendering the file in a spreadsheet centers on the message field. Messages are not encapsulated in quotes; therefore, messages with commas will not format well in spreadsheets without additional spreadsheet configuration. This might not be a problem for one file, but would be a major annoyance for many files (one of my colleagues found 23 BBM save files on an SD card). I determined to encapsulate the message field in quotes to overcome this issue.
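As an aside, python's standard csv module can do this kind of quoting, and it also doubles any quote characters that appear inside a message. The following is only a sketch of that alternative, not the script I settled on; quote_rows is a hypothetical helper name, and it assumes BBM-style ID fields that are already wrapped in quotes.

```python
import csv
import io

def quote_rows(lines):
    """Re-emit save-file message rows with every field quoted."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for line in lines:
        # Split on the first three commas only; the message keeps its own commas.
        datecode, sender, receiver, message = line.rstrip('\r\n').split(',', 3)
        # Drop the quotes already around the IDs so the writer does not double them.
        writer.writerow([datecode, sender.strip('"'), receiver.strip('"'), message])
    return out.getvalue()

sample = ['201010191287534602157,"0F315216","6C31FB2C",Yeah, let\'s meet.\n']
print(quote_rows(sample), end='')
# prints: "201010191287534602157","0F315216","6C31FB2C","Yeah, let's meet."
```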

The python3 code I settled on follows:
import sys
from time import strftime, localtime

def main(csv):
    line_no = 0
    with open(csv) as db_file:
        for line in db_file:
            if line_no == 0:
                print(line.strip())
                print('Date,DateCode,Sender,Receiver,Message')
                line_no += 1
            else:
                datecode, sender, receiver, message = line.split(',', 3)
                date = int(datecode[8:18])
                date = strftime('%Y-%m-%d %H:%M:%S (%Z)', localtime(date))
                print('"{}",{},"{}","{}","{}"'.format(date, datecode, sender, receiver, message.strip()))

if __name__ == '__main__':
    for file in sys.argv[1:]:
        main(file)

I won't try to explain all the code here, because with my rudimentary knowledge of python, I'm likely to misspeak. I will highlight a couple parts of the script and a couple of nice python features, however. First, the "if" condition in line 8 is present to skip over date decoding in the first line of the file. Instead, it prints the line, and adds a line of column headers to help with interpreting the remaining message data.

Line 13 demonstrates a very nice python feature: multiple value assignment. Here, I assign data from line 2 and higher, one line at a time (note the 'for' loop) to the datecode, sender, receiver, and message objects. To work correctly, however, there needs to be four values to assign to the four objects. The split method divides a string object into a list of strings based on a delimiter. Here, I configure split to divide on commas a maximum of three times. This creates a list of four strings. Remaining commas are ignored and included in the last string, which here is assigned to 'message'.
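The effect of the split limit described above is easy to see on a sample line. This is a small sketch using one of the sample rows from the July 21 post; the header string in the second half is a made-up example, not copied from a real save file.

```python
line = '201010191287534602157,"0F315216","6C31FB2C",Yeah, let\'s meet.'

# Without a limit, the comma inside the message produces a fifth piece.
print(len(line.split(',')))   # prints: 5

# With a limit of three splits, the message stays whole.
datecode, sender, receiver, message = line.split(',', 3)
print(message)                # prints: Yeah, let's meet.

# A line with fewer than four fields cannot be unpacked into four names.
header = 'Example Application,1.0'   # hypothetical header line
try:
    datecode, sender, receiver, message = header.split(',', 3)
except ValueError as err:
    print('unpack failed:', err)
```

The last case is why the first line of the file has to be handled separately from the message rows.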

Line 14 converts a 10-digit slice of the 'datecode' string to an integer, and line 15 uses the 'strftime' and 'localtime' functions from the 'time' module to convert the Unix epoch time embedded in the datecode to a human-readable format. I use string slicing, that is, selecting a portion of the datecode string by its position in the string (also called indexing, where the starting position is 0), to extract the 10 digits that represent the Unix epoch value in seconds. The last three digits of the datecode are ignored since they represent milliseconds.

The final line of which to take note is Line 16, where a few things of import are occurring. First, the interpreted datecode is printed. The original datecode follows for reference and validation purposes, and then the remainder of the line. A format method is applied that replaces the brace placeholders in the print string with the objects that are arguments of the format method. Here, the message object is encapsulated in quotes. Also, the message object has the strip method applied to remove the newline control code at the end of each message to improve formatting.

Though I won't comment on it, I'll include the bash script I wrote first for comparison:
line_no=0 #set counter

cat $1 | while read line
do
    if [ $line_no -eq 0 ]
    then
        echo "$line"  #print first line unaltered
        echo "Date,DateCode,Sender,Receiver,Message"
        line_no=1
    else
        for var in date sent rcvd message
        do
            eval $var="\${line%%,*}"
            [[ "$var" = "message" ]] && message=$line #ignore commas in message
            line="${line#*,}"
        done
        datecode=$date #keep the original date code so each row matches the header
        date=$(date +"%Y-%m-%d %H:%M:%S (%Z)" -d @${datecode:8:10}) #convert date code and store
        len=$((${#message}-1)) #subtract 1 from remainder length for stripping newline control code
        echo "\"$date\",$datecode,\"$sent\",\"$rcvd\",\"${message:0:$len}\""
    fi
done

Thursday, July 21, 2011

BlackBerry Text Message Parsing, AKA, Why I use Linux for Forensics

A little detour from my usual posts to explain why I use Linux for forensics, though my upbringing was in Windows-based tools like EnCase. A colleague contacted me today with a little issue: He had found a BlackBerry text messaging backup file in CSV format (EDIT: This was actually a BlackBerry Messenger save file) on an external memory card, but the date code for each message was perplexing. He asked me if I could help interpret the code. It looked like the following:
201010181287467321760
The full format of the CSV was "date, from(hexID), to(hexID), message." It was obvious to my colleague and me that the first 8 digits of the date code were the date of the message in plain text, i.e., "20101018" or "2010-10-18." My Unix roots made the remaining digits of the numeric string easy to identify: Unix epoch time in milliseconds, i.e., 1287467321760 milliseconds since 1-1-1970 00:00:00.

A quick verification with the date command, truncating the value to seconds (i.e., dividing by 1000 and dropping the remainder):
$ date -d @1287467321
Mon Oct 18 22:48:41 PDT 2010

The converted Unix epoch date matches the plain text date. We seem to have interpreted the code correctly, and now we know the local time of the message as well.
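The same check can be sketched in python3 for readers without a GNU date command at hand. The fixed UTC-7 offset labeled "PDT" is an assumption taken from the date output above (a real tool would need the device's actual time zone), and check_datecode is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the device was on Pacific Daylight Time (UTC-7), as in the output above.
PDT = timezone(timedelta(hours=-7), 'PDT')

def check_datecode(datecode, tz=PDT):
    """Convert the epoch seconds and compare them to the plain-text date prefix."""
    stamp = datetime.fromtimestamp(int(datecode[8:18]), tz=tz)
    return stamp, stamp.strftime('%Y%m%d') == datecode[:8]

stamp, matches = check_datecode('201010181287467321760')
print(stamp.strftime('%Y-%m-%d %H:%M:%S %Z'), matches)
# prints: 2010-10-18 22:48:41 PDT True
```

Run with UTC instead of the assumed offset, the same date code converts to October 19 and the comparison fails, which suggests the plain-text prefix follows local time, at least in this sample.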

But of course, using the date command is not why I find Linux so valuable. It is because of the ease with which I was able to convert the whole file, having discovered the meaning of the date code. Remember the format of the file? It was "date, from(hexID), to(hexID), message."

Consider three lines from the file as an example:
201010181287467321760,"6C31FB2C","0F315216",Hey
201010191287534544913,"6C31FB2C","0F315216",Hey, you there?
201010191287534602157,"0F315216","6C31FB2C",Yeah, let's meet.

A quick, simple while loop reads each line of the Messenger save file and converts the date code:
$ cat backup.csv | while read line; do date=${line%%,*}; remainder=${line#*,}; echo "$(date -d @${date:8:10}),$remainder"; done
Mon Oct 18 22:48:41 PDT 2010,"6C31FB2C","0F315216",Hey
Tue Oct 19 17:29:04 PDT 2010,"6C31FB2C","0F315216",Hey, you there?
Tue Oct 19 17:30:02 PDT 2010,"0F315216","6C31FB2C",Yeah, let's meet.
...

Let me break that down:
cat backup.csv | while read line  #read a line of the file, assign to variable 'line'
do
  date=${line%%,*}  #read everything up to the first comma, assign to variable 'date'
  remainder=${line#*,}  #read everything beyond the first comma, assign to variable 'remainder'
  echo "$(date -d @${date:8:10}),$remainder" #convert digits 9-18 to local time, print localtime and the remainder of the line to stdout
done

The key to this solution is something I didn't learn in my initial studies of Bash, but I make extensive use of it now: variable expansion. I use the various expansions available in Bash 4 to assign portions of each line to variables that I can then operate on and print the result. I won't discuss all available expansions here, but I will explain those I used:

Removing the longest match from the end:

Consider: Each line contained comma separated values. I really only needed to operate on the first value -- the date code. I prefer, as much as possible, to not call external tools, such as cut or awk, so as to not unnecessarily start external processes. Bash variable expansion makes this possible. The syntax ${var%%PATTERN} will remove the longest match to PATTERN from the end of the value of the variable 'var'.

So, in line 3, date=${line%%,*} assigns to the 'date' variable all of the contents of the 'line' variable up to the first comma. Thus, in the case of the first line, date="201010181287467321760".

Removing the shortest match from the beginning:

With the date code isolated in the 'date' variable, we still need to print the rest of the line once we convert the date. The syntax ${var#PATTERN} will remove the shortest match to PATTERN from the beginning of the value of the variable 'var'.

So, in line 4, remainder=${line#*,} assigns to the 'remainder' variable all of the contents of the 'line' variable after the first comma. Thus, in the case of the first line, remainder=""6C31FB2C","0F315216",Hey"

Returning a substring of a variable:

Finally, we need to isolate the unix epoch time from the plain text date in the date code now stored in the variable 'date'. We do this by indexing. The syntax ${var:OFFSET:LENGTH} will return a substring of the variable 'var' starting at OFFSET for the specified LENGTH. The first character in a variable is indexed at offset 0.

So, in line 5, ${date:8:10} returns 10 characters of the variable 'date' starting at the 9th character (remember, indexing starts at 0). Thus, we have now fed the unix epoch date string incorporated in the Messenger date code to the unix date command to be converted to local time in a human readable format.

Line five is a complex command that indexes the 'date' variable, converts the result in a sub-process with the date command, and then echoes the converted date with the contents of the 'remainder' variable appended.
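For readers who think in python rather than Bash, the three expansions above have close counterparts in ordinary string methods. This is only a rough side-by-side sketch using the first sample line; it is not a replacement for the loop.

```python
line = '201010181287467321760,"6C31FB2C","0F315216",Hey'

# ${line%%,*} keeps everything before the first comma;
# ${line#*,}  keeps everything after the first comma.
# partition() splits at the first comma and returns both sides at once.
date, _, remainder = line.partition(',')

# ${date:8:10} returns 10 characters starting at offset 8 (offsets start at 0).
epoch = date[8:8 + 10]

print(date)        # prints: 201010181287467321760
print(remainder)   # prints: "6C31FB2C","0F315216",Hey
print(epoch)       # prints: 1287467321
```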

Where to go from here:

If you have some Bash skills, but want to advance them, I recommend the book "Pro Bash Programming: Scripting the GNU/Linux Shell" by Chris F.A. Johnson from Apress.

Tuesday, July 19, 2011

Mounting Split Raw Images

A raw image, made with dd or a variant, is still a common image format, and will not go away soon even as many argue the benefits of forensic image formats such as the Expert Witness Format (supplied through libewf) and the Advanced Forensic Format (supplied through afflib). But raw images can be difficult to tote around because they are bit-for-bit copies, which makes the copy as large as the original. As such, the images are often split to fit on external media such as DVD.

But splitting, while solving storage problems, creates a new problem.  What if you want to mount the image for examination?  True, Sleuthkit can handle the examination of split raw images, but sometimes there is no equal to simply mounting an image during an examination.

Let me illustrate using a situation I encountered yesterday.  A colleague had a split raw image of over 200 segments that he wished to mount and then boot in a virtual machine.  He tried to follow my tutorial  but was unsuccessful, uncertain as to why.  When I looked into the situation with him, the issue became clear: xmount, the tool used to create a virtual disk from a disk image, was only mounting the first segment of the split raw image, despite being given all the segments as arguments as is required with Expert Witness Format images.  More simply put, xmount does not handle split raw images.  It will handle a single raw image file just fine, however.

What to do?  One could simply cat the files together, but that means doubling storage requirements, at least until the concatenation operation is concluded.  That might not be feasible or desirable, and it can be very time consuming.  In this case, we were talking 300 GB of data.  It would be great to be able to treat the segments as one file, and pass that file to xmount to accomplish the purpose.

Affuse to the rescue!  Affuse is part of the afflib tool suite.  It creates a virtual file system using fuse and mounts it to a location you specify.  You only pass the first segment of the split image as an argument.  The command takes the form:

# affuse image mount_point

Affuse creates an image.raw file (that is, the name of the segment with '.raw' appended) in the mount point along with a log file. Yes, it's that easy.
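The idea of presenting many segments as one file without copying them can be illustrated in a few lines of python3. To be clear, this is not how affuse is implemented and it is no substitute for it; SplitImage and read_at are hypothetical names, and the demo uses three tiny throwaway files in place of real image segments.

```python
import os
import tempfile

class SplitImage:
    """Read-only view over ordered segment files as one logical byte stream."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.sizes = [os.path.getsize(p) for p in self.paths]
        self.size = sum(self.sizes)

    def read_at(self, offset, length):
        """Return up to `length` bytes starting at logical `offset`."""
        if offset < 0:
            raise ValueError('offset must not be negative')
        chunks = []
        start = 0  # logical offset at which the current segment begins
        for path, seg_size in zip(self.paths, self.sizes):
            end = start + seg_size
            if length > 0 and offset < end:
                with open(path, 'rb') as seg:
                    seg.seek(offset - start)
                    data = seg.read(min(length, end - offset))
                chunks.append(data)
                offset += len(data)
                length -= len(data)
            start = end
        return b''.join(chunks)

# Demo with three small stand-in segments.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, payload in enumerate([b'ABCD', b'EFGH', b'IJ']):
        path = os.path.join(tmp, 'image.{:03d}'.format(i + 1))
        with open(path, 'wb') as seg:
            seg.write(payload)
        paths.append(path)
    image = SplitImage(paths)
    print(image.size, image.read_at(2, 7))
    # prints: 10 b'CDEFGHI'
```

A read that crosses segment boundaries is stitched together from the underlying files on demand, so no second copy of the data is ever written to disk.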

To finish the scenario, xmount can then take the image.raw file as an argument to create the virtual disk, thusly:

# xmount --in dd --out vdi --cache image.cache mount_point/image.raw new_mount_point/

This command tells xmount that the input file, image.raw, is raw data, that the output desired is VirtualBox vdi format, and that a cache file called "image.cache" is desired to store system changes when the virtual machine is running. The .vdi file will be mounted in the "new_mount_point" directory. If xmount is unfamiliar to you, I recommend you read my previous post.

Like affuse, xmount utilizes the fuse file system.  Both utilities accept fuse file system arguments as well as tool specific arguments, so read only mounting and permissions options exist (type "man fuse" at the command line for more details).  As always, practice on non-case data to become familiar with the tools.