Tuesday, July 26, 2011

BlackBerry Messenger / Google Talk for BlackBerry Save-files

My last post, "BlackBerry Text Message Parsing, AKA, Why I use Linux for Forensics," had unintended but pleasant consequences: it introduced me to BlackBerry forensics expert Shafik Punja.  I really wrote the post to comment on the versatility of the Linux operating system, and fortunately I erred in calling the file that was the subject of my discussion a "BlackBerry Text Message" save file, rather than the BlackBerry Messenger (BBM) save file that it was.

Shafik corrected me and taught me a bit more about the file type, which I'll share here in case you missed his comments.  BBM does not save message content by default.  Saving is a user-selected option, available only in BBM v5 and higher.  The files can be saved to the device's internal memory or to an installed memory card.  Google Talk has the same save options and defaults as BBM, and also saves in the same basic format.

Shafik asked if I could write a program to parse the files.  I agreed and quickly converted the basic Bash script in my last post to a more complete program.  However, when Shafik tried to run the program in OS X, it failed because OS X and Linux do not use the same 'date' command.  I was faced with either writing a new program for OS X, which I wished to support, or making my current script determine the OS in which it was running and apply the correct date command syntax.

Nothing seems like a bigger waste to me than to write essentially the same thing twice.  Isn't there a way, I asked, to write a program that doesn't have to be altered for the operating system on which it runs?  Then a little Australian-accented voice of reason reached my conscious thought: why don't you learn Python?  For the millionth time: Thank you Michael, good idea.

Michael Cohen, a software engineer at Google and author/maintainer of pyFLAG, AFF4, and other tools soon to be released, has suggested many times over that I learn Python, but I only recently started to do so.  Python is an excellent solution for cross-platform programs, and I decided it was time to start.

The BBM / GTalk file format

BBM and GTalk save files are CSV text files with the following basic structure:
Application Name, version[newline]
DateCode, "Sender_ID", "Receiver_ID", Message[newline]
...

The first line of the file is the source application and the application version, separated by a comma. The version is followed by a newline control code.

The remaining lines are formatted thus: the line starts with a DateCode consisting of a 21-digit number. The first 8 digits are the date of the sent message rendered as YYYYMMDD. The remaining 13 digits are Unix epoch time, rendered in milliseconds. In the case of BBM, the sender Hex_ID follows the DateCode, separated by a comma and encapsulated in quotation marks, as shown in the structure above. Similarly, the receiver's Hex_ID follows. For GTalk, the sender and receiver are identified by Google Account ID. Finally, the message, in text, is stored, followed by a newline control code.
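To make the DateCode layout concrete, here is a minimal sketch that splits one into its two parts. The function name decode_datecode and the sample value are made up for illustration, and the sketch renders the time in UTC so the result does not depend on the machine's time zone (the script later in this post uses local time instead).

```python
from datetime import datetime, timedelta, timezone

def decode_datecode(datecode):
    """Split a 21-digit DateCode into its YYYYMMDD prefix and a UTC timestamp."""
    if len(datecode) != 21 or not datecode.isdigit():
        raise ValueError('expected a 21-digit DateCode, got {!r}'.format(datecode))
    date_part = datecode[:8]          # first 8 digits: YYYYMMDD
    epoch_ms = int(datecode[8:])      # remaining 13 digits: Unix epoch in milliseconds
    seconds, millis = divmod(epoch_ms, 1000)
    # Integer arithmetic avoids float rounding of the millisecond part.
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)
    return date_part, stamp

# Hypothetical DateCode, not taken from a real save file.
date_part, stamp = decode_datecode('201107261311638400123')
print(date_part, stamp.isoformat())
# 20110726 2011-07-26T00:00:00.123000+00:00
```

Comparing the YYYYMMDD prefix against the decoded timestamp is a cheap sanity check that a line was parsed from the right position.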

One minor issue in rendering the file in a spreadsheet centers on the message field. Messages are not encapsulated in quotes; therefore, messages containing commas will not format well in spreadsheets without additional spreadsheet configuration. This might not be a problem for one file, but would be a major annoyance for many files (one of my colleagues found 23 BBM save files on an SD card). I decided to encapsulate the message field in quotes to overcome this issue.
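As an aside, the quoting problem can also be handed to Python's standard csv module, which quotes every field and doubles any quotation marks inside a message. This is a sketch of an alternative, not the approach used in the script below; the helper name to_csv_row and the sample values are hypothetical.

```python
import csv
import io

def to_csv_row(fields):
    """Return one CSV line with every field wrapped in quotes."""
    buffer = io.StringIO()
    # QUOTE_ALL quotes each field; embedded quotes are doubled automatically.
    csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(fields)
    return buffer.getvalue()

# Hypothetical record: the message contains both a comma and quotes.
row = to_csv_row(['201107261311638400123', 'SENDER', 'RECEIVER',
                  'Lunch at 12, "usual place"?'])
print(row, end='')
# "201107261311638400123","SENDER","RECEIVER","Lunch at 12, ""usual place""?"
```

The writer ends each row with a carriage return and line feed by default, which spreadsheets generally accept.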

The Python 3 code I settled on follows:
import sys
from time import strftime, localtime

def main(csv):
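    line_no = 0  # stays 0 until the first (application, version) line is handled
    with open(csv) as db_file:
        for line in db_file:
            if line_no == 0:
                print(line.strip())  # strip the newline so print() does not emit a blank line
                print('Date,DateCode,Sender,Receiver,Message')
                line_no += 1
            else:
                datecode, sender, receiver, message = line.split(',', 3)  # at most 3 splits
                date = int(datecode[8:18])  # 10 digits of epoch seconds; milliseconds dropped
                date = strftime('%Y-%m-%d %H:%M:%S (%Z)', localtime(date))  # local time zone
                print('"{}",{},"{}","{}","{}"'.format(date, datecode, sender, receiver, message.strip()))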

if __name__ == '__main__':
    for file in sys.argv[1:]:
        main(file)

I won't try to explain all the code here, because with my rudimentary knowledge of Python, I'm likely to misspeak. I will highlight a couple of parts of the script and a couple of nice Python features, however. First, the "if" condition in line 8 is present to skip over date decoding in the first line of the file. Instead, it prints the line, and adds a line of column headers to help with interpreting the remaining message data.

Line 13 demonstrates a very nice Python feature: multiple value assignment. Here, I assign data from line 2 and higher, one line at a time (note the 'for' loop), to the datecode, sender, receiver, and message objects. To work correctly, however, there need to be exactly four values to assign to the four objects. The split method divides a string object into a list of strings based on a delimiter. Here, I configure split to divide on commas a maximum of three times. This creates a list of four strings. Remaining commas are ignored and included in the last string, which here is assigned to 'message'.
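A short sketch of that behavior on a single made-up line follows. The helper name split_record and the sample IDs are hypothetical, and the sketch assumes the IDs may arrive wrapped in quotation marks, which it strips.

```python
def split_record(line):
    """Split one message line into datecode, sender, receiver, and message."""
    # maxsplit=3 yields exactly four pieces; later commas stay in the message.
    datecode, sender, receiver, message = line.split(',', 3)

    def clean(value):
        # Drop surrounding whitespace and any enclosing quotation marks.
        return value.strip().strip('"')

    return datecode.strip(), clean(sender), clean(receiver), message.strip()

# Hypothetical line: the message itself contains two commas.
sample = '201107261311638400123,"SENDER","RECEIVER",Yes, no, maybe\n'
print(split_record(sample))
# ('201107261311638400123', 'SENDER', 'RECEIVER', 'Yes, no, maybe')
```

A line with fewer than three commas would raise a ValueError at the assignment, so blank or truncated lines need to be skipped or handled before unpacking.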

Line 14 converts the 'datecode' string to an integer, and line 15 uses the 'strftime' and 'localtime' functions from the 'time' module to convert the Unix epoch time embedded in the datecode to a human-readable format. I use string slicing, that is, selecting a portion of the datecode string by its position in the string (also called indexing where the starting position is 0), to extract the 10 digits that represent the Unix epoch value. The last three digits of the datecode are ignored since they represent milliseconds.

The final line of which to take note is Line 16, where a few things of import are occurring. First, the interpreted datecode is printed. The original datecode follows for reference and validation purposes, and then the remainder of the line. A format method is applied that replaces the brace placeholders in the print string with the objects that are arguments of the format method. Here, the message object is encapsulated in quotes. Also, the message object has the strip method applied to remove the newline control code at the end of each message to improve formatting.
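The placeholder substitution can be tried in isolation. The values below are invented for illustration; only the format string matches the one in the script.

```python
# Hypothetical field values standing in for one parsed line.
date = '2011-07-26 00:00:00 (UTC)'
datecode = '201107261311638400123'
sender, receiver = 'SENDER', 'RECEIVER'
message = 'See you at 5, ok?\r\n'   # raw message still carries its line ending

# Each {} is filled, in order, by the matching argument to format().
row = '"{}",{},"{}","{}","{}"'.format(date, datecode, sender, receiver, message.strip())
print(row)
# "2011-07-26 00:00:00 (UTC)",201107261311638400123,"SENDER","RECEIVER","See you at 5, ok?"
```

Note that strip removes whitespace from both ends of the message, so any leading spaces are lost as well as the trailing line ending.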

Though I won't comment on it, I'll include the bash script I wrote first for comparison:
line_no=0 #set counter

cat "$1" | while read -r line #quote the file name; -r keeps backslashes intact
do
    if [ $line_no -eq 0 ]
    then
        echo "$line"  #print first line unaltered
        echo "Date,DateCode,Sender,Receiver,Message"
        line_no=1
    else
        for var in date sent rcvd message
        do
            eval $var="\${line%%,*}"
            [[ "$var" = "message" ]] && message=$line #ignore commas in message
            line="${line#*,}"
        done
        datecode=$date #keep the original date code so the output matches the header
        date=$(date +"%Y-%m-%d %H:%M:%S (%Z)" -d @${date:8:10}) #convert date code and store
        len=$((${#message}-1)) #subtract 1 from remainder length for stripping newline control code
        echo "\"$date\",$datecode,\"$sent\",\"$rcvd\",\"${message:0:$len}\""
    fi
done