Tuesday, September 29, 2015

Compression and Android Gmail

Every registered Android mobile device has an associated Google account. Google accounts usually mean Gmail. And, for investigators interested in the Gmail content stored on Androids, that content can be found in the /data/com.google.android.gm/databases directory in a database named in the following format:

mailstore.[GoogleAccount]@gmail.com.db

The database contains 23 tables (at least at the time of this writing), the most interesting of which is messages.

The messages table has 41 fields (or columns). To obtain the basic email content (say, for keyword searching), an investigator would likely want to export the sender’s and receiver’s addresses, the date sent or received, and the subject line, and the message body, at the very least. There is plenty more to be gleaned from the database, but your investigation will dictate the investigative needs.

Caution
Automated tools do not provide the full weath of data to be found in the mailstore database. It is always a good idea to become familiar with the database schema to learn the full potential for your investigation.

The Big Squeeze

If you have experience searching SQLite databases, you might be thinking, "Why go to the trouble of exporting messages from the database? SQLite strings are usually UTF-8, so I can just search the database with regular expressions or plain keywords." Well, there is a catch when it comes to email content in the Gmail mailstore database: zlib compression.

Short length message bodies are written to the body field in the messages table as a plain text string. In a recent exam, the longest message I found in this field in a recent exam was 98 bytes, however. Longer message bodies are compressed using the zlib algorythm and stored in the bodyCompressed field. While SQLite supports compressed databases, it has no function to decompress fields within databases. Instead, it stores such data as a blob type, and it is up to the database user to decompress the data.

Note
The SQLite blob type is sort of a catch-all for any type or data. Data is stored in the format in which it was input.

Extracting Messages

Python is a good option for exporting messages from the Gmail messagestore database. It can both open and query databases, and it can decompress the long message bodies.

Exporting Gmail Messages
import sqlite3
import zlib

# open and query the database
conn = sqlite3.connect('messagestore.db') # database name abbreviated
c = conn.cursor
c.execute("select _id, fromAddress, datetime(dateSentMS/1000,
    'unixepoch', 'localtime'), datetime(dateReceivedMS/1000,
    'unixepoch', 'localtime'), case when body not Null then body
    else bodycompressed end from messages ")
rows = c.fetchall()

# interate through the rows and decompress the long messages
for row in rows:
    id, _from, sent, recv, body = row
    try:
        body = zlib.decompress(body)
    except:
        pass
    print('{}|{}|{}|{}|{}'.format(id, _from, sent, recv, body))
Note
The final line can be adapted to your own needs, i.e., writing the content to a new file or database, or use python regular expressions to search the content, etc.

Some Explaination

The example above is just that: an example. It is intended, like all my posts, to remind me how to process the data and demonstrate how just a few lines of python can be leveraged to extract data. The script could have been shorter, but it would have come at the cost of clarity. That said, there is still some explanation to be had:

The SQLite query in the c.execute method might need some dissecting for you to understand what I did.

select
    _id,
    fromAddress,
    datetime(dateSentMS/1000, 'unixepoch',   'localtime'),
    datetime(dateReceivedMS/1000, 'unixepoch', 'localtime'),
    case
        when body not Null
        then body
        else bodyCompressed
    end
from messages

The dateSentMS and dateReceivedMS fields are recorded in milliseconds since 1/1/1970 (Unix epoch). I let SQLite do the date converstion for me, rather and doing it python, and I converted from Unix epoch to local time. The case statement pulls a little trickery to select the body field or bodyCompressed field. Basically, the row’s body field is checked to see if it is populated. If so, it is returned. If not, the contents of the bodyCompressed field are returned.

In the body decompression section of the script, the contents of the each row are assigned to variables id, _from, sent, recv and body. The try/except clause attempts attempts to decompress the body. If it fails, as it will on the short message bodies, it just uses the contents of the body variable as is. Finally, the row is printed in a pipe-delimited fashion.