Pages

Wednesday, October 31, 2012

Android Pin/Password Cracking: Halloween isn't the Only Scary Thing in October

CCL Forensics did the mobile forensics world a great service when it released several python scripts for cracking Android gesture, pin, and password locks. I have mostly encountered gesture locks in my examinations, and I have successfully cracked them each time with CCL’s Tools. But recently, I had a chance to take a look at the pin/password cracking tool.

A colleague contacted me and said he’d been running the CCL BruteForceAndroidPin.py tool for more than two weeks but had not cracked a password. I was pretty naive when I thought, "That’s ridiculous, you must be doing something wrong," but I’m glad I had the thought none the less. I’m glad because the thought made me investigate. And the investigation led to some pretty surprising facts and an improvement in the brute force attack.


What’s In an Android Password?

First, let’s look at what an Android password can look like: * 4-16 Characters long * 94 possible characters per space: Uppercase and lowercase letters, digits, and punctuation (known as keyspace) * Spaces are not allowed * Alphanumeric passwords MUST have at least one letter * The password is salted with a random integer and stored in a text file as a concatenated SHA-1 and MD5 value while the salt is stored in a SQLite Database.

So, how many password possibilites are there in a 4-16 character length password where each place value can be anyone of 94 characters? I mean, what exactly are we asking our cpu’s to do here? The table below provides the answer.

Table 1. Android Password Possibilities
Length Number of Possibilities

4

78,074,896

5

7,339,040,224

6

689,869,781,056

7

64,847,759,419,264

8

6,095,689,385,410,816

9

572,994,802,228,616,704

10

53,861,511,409,489,970,176

11

5,062,982,072,492,057,196,544

12

475,920,314,814,253,376,475,136

13

44,736,509,592,539,817,388,662,784

14

4,205,231,901,698,742,834,534,301,696

15

395,291,798,759,681,826,446,224,359,424

16

37,157,429,083,410,091,685,945,089,785,856

Note The figures in each row in the table are independent of the other rows

Yes, the chart above totals over 37.5 trillion quadrillion possibilities! And totaling the columns—which is the true effect of searching all 4-16 length password combinations, by the way—is a whopping 3.7557e+31!

It can be difficult to understand what large numbers mean. So, lets assume a worst case scenario: a password of length of 16 characters. Since we can’t know the password length from examining the hash values, we have to start our attack at a length of 4. Now, Assuming we have a reasonably good processor, we can make about 200,000 password attempts a second with the CCL brute force script. That means it will take us about 1.87784856658094e+26 seconds to complete the task. There are 86400 seconds in a day, and 365 days in a year (yes, I know you knew that one). So, that’s a mere: 5,954,618,742,329,212,000 years! Let me try to say that in English, "That’s 5.9 quintilian years!"

"Alright," you say, "I understand that I’ve no hope of cracking a password of 16 characters in length. But, how long of a password can I reasonably expect to crack with this script?" Not too many, I’m afraid. Let’s continue to assume 94 characters are possible per place value, and we can generate and test passwords at a rate of 200,000 per second.

Table 2. Time to Complete, 200K/sec
Length Number of Possibilities Time to complete

4

7.80749e+07

6 minutes

5

7.33904e+09

10 hours

6

6.8987e+11

39 days

7

6.48478e+13

1 decades

8

6.09569e+15

96 decades

9

5.72995e+17

90 millennia

10

5.38615e+19

8,539 millennia

11

5.06298e+21

802,730 millennia

12

4.7592e+23

75,456,670 millennia

13

4.47365e+25

7,092,927,066 millennia

14

4.20523e+27

666,735,144,231 millennia

15

3.95292e+29

62,673,103,557,788 millennia

16

3.71574e+31

5,891,271,734,432,093 millennia

Note The figures in each row in the table are independent of the other rows

The stats above don’t really offer much hope if the password has a length of seven or more. But, since this is simple math, it becomes readily apparent how to speed our results: reduce the keyspace, or speed the number of calculations per second. Better yet, do both!

==Reducing the Keyspace

The CCL brute force tool, as written, includes 95 ASCII characters. One of these, the space character, is illegal in Android passwords, so, we can edit the CHAR_LIST variable a the start of the code to 94 characters. But if we consider what characters are visible on the default android keyboard at the lock screen, we see that there are 26 lowercase letters and 5 characters of punctuation accessible with a single, regular key press. Human nature suggests that most passwords would be composed of these keys alone, and since words are easiest to remember, probably the 26 lowercase letters would suffice.

So, what happens if we reduce the CHAR_LIST variable to just lower case letters, thus reducing our keyspace to 26, but are still calculating at a rate of 200K/sec? Let’s take a look:

Table 3. Time to Complete, keyspace = 26, 200k/sec
Length Number of Possibilities Time to complete

4

456976

2 seconds

5

1.18814e+07

59 seconds

6

3.08916e+08

25 minutes

7

8.03181e+09

11 hours

8

2.08827e+11

12 days

9

5.4295e+12

314 days

10

1.41167e+14

2 decades

11

3.67034e+15

58 decades

12

9.5429e+16

15 millennia

13

2.48115e+18

393 millennia

14

6.451e+19

102,278 millennia

15

1.67726e+21

265,927 millennia

16

4.36087e+22

6,914,120 millennia

Note The figures in each row in the table are independent of the other rows

Hey, now we’re cooking! Unless the password is 9 or more characters, we have some hope of cracking it. And since most Android devices are smart phones, users are probably not creating ultra-long passwords because they want quick access to their devices.

This modification to the CCL script might be sufficient for most attacks. And, we can improve the script by making different keyspaces selectable through options, such as -l for lower case, -u for upper case, -d for digits, and -s for special characters (punctuation, etc.). In fact, I went to the effort to do just that, but before you get too excited and ask me to send the modified tool to you, please read on.

Obviously, a keyspace of only lowercase letters will fail to crack a password that includes any characters not included that keyspace, and you won’t know definitively if it does or does not for 6.9 billion years! We need to do more than just reduce the keyspace, we need to increase the calculation rate, too!


Increasing the Calculation Rate

I hope you realize from the previous section that an intelligent approach to the keyspace can significantly improve your search times… to a point. But if we are going to make real progress on password cracking, we are going to need to thank CCL Forensics mightily for their contribution, but politely move on to other tools. Python simply isn’t the best platform for this type of process.

On the other hand, OCL is apparently an excellent language for this type of processing, and the specialized Graphical Processing Unit is more adept and this kind of calculation than the more generally capable CPU. And it just so happens that the hashcat tool uses both.

CPU Version

Hashcat comes in several flavors. The original hashcat is coded to use the cpu for password cracking, and unlike the CCL tool, can use multiple cores. As and example of what the OCL programming language and multi-core processing can do for you, I achieved 25 million/sec on my core i5 processor with hashcat compared to 200k/sec with python script. Let’s see what that does for our processing times, and couple that with the lower-case letters keyspace we used previously:

Table 4. Time to Complete, keyspace=26, 25M/sec
Length Number of Possibilities Time to complete

4

456976

0 seconds

5

1.18814e+07

0 seconds

6

3.08916e+08

12 seconds

7

8.03181e+09

5 minutes

8

2.08827e+11

2 hours

9

5.4295e+12

2 days

10

1.41167e+14

65 days

11

3.67034e+15

4 years

12

9.5429e+16

12 decades

13

2.48115e+18

3 millennia

14

6.451e+19

81 millennia

15

1.67726e+21

2,127 millennia

16

4.36087e+22

55,312 millennia

Note The figures in each row in the table are independent of the other rows

Wow! Cracking passwords of 9 characters just became reasonable (recall that initially, 6 character-length password took us about 40 days). With a little luck, we’ll be examining the device’s contents in a fairly short time. And we didn’t spend a dime to improve our systems, we just found a tool that is more adept at the task.

GPU Version

But, I mentioned GPU, right? Hashcat has two other versions, called oclHashcat-plus and ocl-Hashcat-lite, both of which are coded to run on Nvidia and ATI GPUs (currently, you must select the right tool for the processor type, but a future version is planned with multi-processor support). Because the GPU is better suited for this type of calculation, the GPU versions of hashcat are preferred.

As it turns out, I have an inexpensive but supported NVidia GeForce 9500 graphics card. I spent $50 on it about three years ago. This is not a high-end card by any stretch (1023MB, 1375Mhz). This is a good place to mention that while NVidia graphics cards are generally thought to perform better rendering 3D graphics, the ATI cards are better for password cracking.

I used the cudaHashcat-lite version of the tool. CUDA refers to NVidia’s parallel processing platform. I achieved a significant performance increase over my quad-core Intel processor, reaching 35 million calculations/sec. This had a significant effect on my cracking times:

Table 5. Time to Complete, keyspace=26, 35M/sec
Length Number of Possibilities Time to complete

4

456976

0 seconds

5

1.18814e+07

0 seconds

6

3.08916e+08

8 seconds

7

8.03181e+09

3 minutes

8

2.08827e+11

1 hours

9

5.4295e+12

1 days

10

1.41167e+14

46 days

11

3.67034e+15

3 years

12

9.5429e+16

8 decades

13

2.48115e+18

2 millennia

14

6.451e+19

58 millennia

15

1.67726e+21

1,519 millennia

16

4.36087e+22

39,509 millennia

Note The figures in each row in the table are independent of the other rows

We cut our 9 character length processing time in half and put 10 characters in reach. That’s not bad for an low end graphics card, and it was substantially less expensive than the quad-core processor I purchased at the same time.


Fine Tuning the Attack

Other than getting a better graphics cards or more of them (yes, hashcat can make parallel use of up to 128 GPUs!), you might think we’re done with this topic. But I had one more idea that I thought worth checking. The password.key file contains two abutting hash values, a 40 byte SHA-1 hash and a 32 byte MD5 hash. The CCL brute force attack is on the SHA-1 value, but what if we targeted the MD5 value? It can’t make that much difference, right?

Wrong! It makes a three-fold difference for the better! Utilizing the MD5 hash and settings.db salt value, I achieved 107 million calculations/sec with the same GPU as before! (Yes, I know that was three exclamations in a row, but darn it, it deserves three exclamations! Ok, that’s four…) The effect on our calculation times is dramatic:

Table 6. Time to Complete, keyspace=26, 107M/sec
Length Number of Possibilities Time to complete

4

456976

0 seconds

5

1.18814e+07

0 seconds

6

3.08916e+08

2 seconds

7

8.03181e+09

1 minutes

8

2.08827e+11

32 minutes

9

5.4295e+12

14 hours

10

1.41167e+14

15 days

11

3.67034e+15

1 years

12

9.5429e+16

2 decades

13

2.48115e+18

73 decades

14

6.451e+19

19 millennia

15

1.67726e+21

497 millennia

16

4.36087e+22

12,923 millennia

Note The figures in each row in the table are independent of the other rows

We’re cracking 8 character passwords in half and hour! What if we bought one of those fancy $400 ATI 4790 cards that they brag about on the hashcat forums, you know, the ones that calculate at a rate of 9 billion a second?

Table 7. Time to Complete, keyspace=26, 9B/sec
Length Number of Possibilities Time to complete

4

456976

0 seconds

5

1.18814e+07

0 seconds

6

3.08916e+08

0 seconds

7

8.03181e+09

0 seconds

8

2.08827e+11

23 seconds

9

5.4295e+12

10 minutes

10

1.41167e+14

4 hours

11

3.67034e+15

4 days

12

9.5429e+16

122 days

13

2.48115e+18

8 years

14

6.451e+19

22 decades

15

1.67726e+21

5 millennia

16

4.36087e+22

153 millennia

Note The figures in each row in the table are independent of the other rows

And what of those (apparently wealthy) guys who are stringing 8 of those cards together and getting speeds of 72 billion calculations/sec?

Table 8. Time to Complete, keyspace=26, 72B/sec
Length Number of Possibilities Time to complete

4

456976

0 seconds

5

1.18814e+07

0 seconds

6

3.08916e+08

0 seconds

7

8.03181e+09

0 seconds

8

2.08827e+11

2 seconds

9

5.4295e+12

1 minutes

10

1.41167e+14

32 minutes

11

3.67034e+15

14 hours

12

9.5429e+16

15 days

13

2.48115e+18

1 years

14

6.451e+19

2 decades

15

1.67726e+21

73 decades

16

4.36087e+22

19 millennia

Note The figures in each row in the table are independent of the other rows

And what if we’re a three-letter agency that could maximize hashcat with 128 GPUs? …Alright, alright, I’ll stop. But I actually posted those last two tables for a reason. Notice that 8 high-end cards only put one more password level within realistic reach. Most of us, if properly motivated and only modestly funded, could manage one of those cards and achieve impressive results. But do the math before you drop nearly $4000 on graphics cards, and consider waiting for the next version of hashcat to put your collection of older cards to work for you.


A HashCat Tip

You won’t find a lot of helpful documentation on using hashcat, unless you already understand cryptography and cryptographic principals, that is. In order to use the salt with hashcat, you need to convert it from an integer to hexadecimal notation. In python3, this can be accomplished as follows:

 >>> import binascii, struct
 >>> binascii.hexlify(struct.pack('>q', salt )) #where 'salt' is an integer

There is quite a bit more to know about hashcat, including password masks and keyspace definitions, but I’ll cover that in a future post. If you decide to give hashcat a whirl before I blog any further, make sure you test on known data: I had to roll back a version to get cudaHashcat-lite and cudaHashcat-plus to work properly.

Thursday, October 18, 2012

Christmas Comes Early: HFS/HFS+ Mounting

Note This is my first attempt at using asciidoc to create a more legible document for the reader. My thanks go out to Dr. Michael Cohen at Google for pointing out the formatting tool.

The Ghost of HFS Past

It is no great revelation that Apple favors the HFS/HFS+ file systems. But, when I started my adventures in Linux-based forensics, my chief forensics tool, The Sleuthkit, did not support HFS file systems. The solution was to mount the file systems read-only and examine the allocated files with standard Linux tools and applications. Even when The Sleuthkit added HFS support, it was still useful to mount the partition, especially when it came to examining media files.

Note

Microsoft Windows

Windows users have been doing pretty much the same using tools like HFSExplorer.

The process in Linux was pretty straight forward:

  1. Image the device to be examined.

  2. Read the partition table to identify the byte-offset to the HFS partition in the image.

  3. Use the mount command with the byte-offset to mount the partition.

For example

# mmls macbookpro320.dd -a
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

     Slot    Start        End          Length       Description
04:  00      0000000040   0000409639   0000409600   EFI system partition
05:  01      0000409640   0624880263   0624470624   Customer

# mount -o ro,loop,offset=$((409640*512)) macbookpro320.dd /mnt/analysis

The mount command attaches file systems on storage devices to the GNU Linux file tree. Conversely, the umount (not ‘unmount’ as you might expect) command detaches the device file system from the GNU Linux file system. The mount command above worked with the macbookpro320.dd disk image because of the loop option.

Loop Device

In Unix-like operating systems, a loop device, vnd (vnode disk), or lofi (loopback file interface) is a pseudo-device that makes a file accessible as a block device.

http://en.wikipedia.org/wiki/Loop_device
— Wikipedia


The Ghost of HFS Present

But, somewhere around the 2.6.34 Linux kernel, the mount command stopped working with HFS loop devices in the manner demonstrated above. The mount command can still be used to mount a partition addressed block device, but not addressed by image by offset like the illustration above.

By Partition Block Device File (verbose mode for illustration)

# blkid
/dev/sdd1: LABEL="EFI" UUID="2860-11F4" TYPE="vfat"
/dev/sdd2: UUID="f2477645-5489-3419-b477-d504574057e3" LABEL="Macintosh 
HD" TYPE="hfsplus"

# mount -o ro -v /dev/sdd2 /mnt/analysis/
mount: you didn't specify a filesystem type for /dev/sdd2
       I will try type hfsplus
/dev/sdd2 on /mnt/analysis type hfsplus (ro)

# umount -v /mnt/analysis/
/dev/sdd2 has been unmounted

As you can see, the mount command works as expected when the partition is addressed by the special block device file. But, when the partition is addressed by byte-offset from the raw device or disk image, it fails:

By Byte-Offset (block device shown)

# mmls -a /dev/sdd
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

     Slot    Start        End          Length       Description
04:  00      0000000040   0000409639   0000409600   EFI system partition
05:  01      0000409640   0624880263   0624470624   Customer

# mount -o ro,offset=$((409640*512)) /dev/sdd /mnt/analysis
mount: wrong fs type, bad option, bad superblock on /dev/loop2,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
Note This bug only applies to HFS formatted partitions. Other common file systems—FAT, NFTS, EXT2/3/4—do not seem to be affected.

The Ghost of HFS Future

The work around for storage devices is obvious: point mount at the partition’s block device file. But in forensics, we work with disk images to avoid making changes to the original storage medium. How do we address HFS partitions in disk images if we can’t address them by offset?

Enter kpartx. The kpartx command reads partition tables and maps partitions to device files. It works on devices and disk images. This means we can map HFS partitions in a disk image to a special block device file and mount those partitions by addressing that block device file as if it were part of an attached device!

kpartx

# kpartx
usage : kpartx [-a|-d|-l] [-f] [-v] wholedisk
    -a add partition devmappings
    -r devmappings will be readonly
    -d del partition devmappings
    -u update partition devmappings
    -l list partitions devmappings that would be added by -a
    -p set device name-partition number delimiter
    -g force GUID partition table (GPT)
    -f force devmap create
    -v verbose
    -s sync mode. Don't return until the partitions are created

To demonstrate its use, we will use a raw (dd) image from a MacBook Pro. Other disk image formats will work too, so long as they can be exposed to GNU Linux as raw devices (think xmount).

Image Information

# img_stat macbookpro320.dd
IMAGE FILE INFORMATION
 --------------------------------------------
Image Type: raw

Size in bytes: 320072933376

We can read the partitions in the disk image with kpartx.

kpartx: Detected Partitions

# kpartx -l macbookpro320.dd
loop1p1 : 0 409600 /dev/loop1 40
loop1p2 : 0 624470624 /dev/loop1 409640

# kpartx -lg macbookpro320.dd
loop1p1 : 0 409600 /dev/loop1 40
loop1p2 : 0 624470624 /dev/loop1 409640
Note

The GUID Partition Table

kpartx has a option to force a GUID partition table, which the MacBook uses, but as demonstrated, it wasn’t necessary in this case.

All that remains to do is to map the partitions so that we can mount them. We can use the -r option to create read-only devices.

kpartx: Mapping Partitions

# kpartx -av macbookpro320.dd
add map loop1p1 (254:0): 0 409600 linear /dev/loop1 40
add map loop1p2 (254:1): 0 624470624 linear /dev/loop1 409640

Now, mounting is as simple as pointing at the new block device files, which are found in the /dev/mapper directory.

Mounting special HFS block devices

# mount -o ro -v /dev/mapper/loop1p2 /mnt/analysis
mount: you didn't specify a filesystem type for /dev/mapper/loop1p2
       I will try type hfsplus
/dev/mapper/loop1p2 on /mnt/analysis type hfsplus (ro)

# mount
...
/dev/mapper/loop1p2 on /mnt/analysis type hfsplus (ro,relatime,umask=22,
uid=0,gid=0,nls=utf8)

After you have completed your analysis, unmount the partition and remove the special block devices.

Cleaning up

# umount -v /mnt/analysis/
/dev/mapper/loop1p2 has been unmounted

# kpartx -dv macbookpro320.dd
del devmap : loop1p2
del devmap : loop1p1
loop deleted : /dev/loop1
Note

Accessing Unallocated File System Space

Unallocated space in HFS partitions is addressable through The Sleuthkit with the blkls tool.

Now, go forth and conquer!

Tuesday, October 9, 2012

Addressing the iOS6 Address Book and SQLite Pitfalls

Now that I have a basic handle on the iOS6 sms.db, it's time to look at the iOS6 Address Book.  After all, I now know what's being said when I examine the sms.db, but I don't have a real good picture, other than the phone number, of who's sending the message.  The iOS AddressBook.sqlitedb is the place to look for data related to the phone number.

A brief look at AddressBook.sqlitedb

The iOS6 AddressBook.sqlitedb database has 29 tables:


ABAccount                        ABPersonFullTextSearch_segdir  
ABGroup                          ABPersonFullTextSearch_segments
ABGroupChanges                   ABPersonFullTextSearch_stat    
ABGroupMembers                   ABPersonLink                   
ABMultiValue                     ABPersonMultiValueDeletes      
ABMultiValueEntry                ABPersonSearchKey              
ABMultiValueEntryKey             ABPhoneLastFour                
ABMultiValueLabel                ABRecent                       
ABPerson                         ABStore                        
ABPersonBasicChanges             FirstSortSectionCount          
ABPersonChanges                  FirstSortSectionCountTotal     
ABPersonFullTextSearch           LastSortSectionCount           
ABPersonFullTextSearch_content   LastSortSectionCountTotal      
ABPersonFullTextSearch_docsize   _SqliteDatabaseProperties  

The abperson table seems to be the obvious target for the data we want.  Here's its schema:


CREATE TABLE ABPerson (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, First TEXT, Last TEXT, Middle TEXT, FirstPhonetic TEXT, MiddlePhonetic TEXT, LastPhonetic TEXT, Organization TEXT, Department TEXT, Note TEXT, Kind INTEGER, Birthday TEXT, JobTitle TEXT, Nickname TEXT, Prefix TEXT, Suffix TEXT, FirstSort TEXT, LastSort TEXT, CreationDate INTEGER, ModificationDate INTEGER, CompositeNameFallback TEXT, ExternalIdentifier TEXT, ExternalModificationTag TEXT, ExternalUUID TEXT, StoreID INTEGER, DisplayName TEXT, ExternalRepresentation BLOB, FirstSortSection TEXT, LastSortSection TEXT, FirstSortLanguageIndex INTEGER DEFAULT 2147483647, LastSortLanguageIndex INTEGER DEFAULT 2147483647, PersonLink INTEGER DEFAULT -1, ImageURI TEXT, IsPreferredName INTEGER DEFAULT 1)

If you bothered to read through that, you'll notice there is nothing about phone numbers or email addresses.  We need to find the table containing the phone numbers, and somehow join that to this table that has the name data.

Take a look at a sample (fictitious) row:

                  ROWID = 1
                  First = Some
                   Last = Name
                 Middle = 
          FirstPhonetic = 
         MiddlePhonetic = 
           LastPhonetic = 
           Organization = SomeCompany
             Department = 
                   Note = 
                   Kind = 0
               Birthday = 
               JobTitle = 
               Nickname = 
                 Prefix = 
                 Suffix = 
              FirstSort = -'- 1'?7=W +'= 17I/ )'MM'=7CA +57/1
               LastSort = 1'?7=W -'- +'= 17I/ )'MM'=7CA +57/1
           CreationDate = 350880583
       ModificationDate = 369882045
  CompositeNameFallback = 
     ExternalIdentifier = 
ExternalModificationTag = 
           ExternalUUID = 68516E99-7C39-4C4D-8871-BDE114EDD6B4
                StoreID = 0
            DisplayName = 
 ExternalRepresentation = 
       FirstSortSection = -
        LastSortSection = 1
 FirstSortLanguageIndex = 0
  LastSortLanguageIndex = 0
             PersonLink = -1
               ImageURI = 
        IsPreferredName = 1

A little more digging, we find the ABMultivalue table contains phone numbers, email addresses, URLs to social networking sites, and potentially more.  There are only six fields (making it a little easier on the eyes), but there's a minor issue.  Reading the table flat (that is, not related to any other tables) we don't know whom the numbers and email addresses belong to, nor to we know what type of data it is.  For example, is it a home, work, mobile, or FAX phone number? Take a look at its makeup and some sample rows and you'll see what I mean:

CREATE TABLE ABMultiValue (UID INTEGER PRIMARY KEY, record_id INTEGER, property INTEGER, identifier INTEGER, label INTEGER, value TEXT)

       UID = 643
 record_id = 1
  property = 4
identifier = 0
     label = 3
     value = someguy@somedomain.com

       UID = 1026
 record_id = 1
  property = 3
identifier = 0
     label = 1
     value = (###) ###-####

That's a little easier on the eyes.  You can see that the 'value' field at the end of the row contains the phone numbers and email addresses we're looking for.  The problem comes from the fact its the only text field in the table.  All other values are integers.  So, how do we relate the names in the ABPerson table to the numbers in the ABMultivalue?  I made it pretty obvious here, but ABperson.ROWID = ABMultivalue.record_id.

So, what about that other problem: what kind of data is that?  Here comes the ABMultiValueLabel table to the rescue!  Take a look at this:

value = iPhone

value = _$![MOBILE]!$_

value = _$![HOME]!$_

value = mobile1

value = mobile2

value = DAD

value = MOM

If that looks a bit strange, its because the iOS user can create their own labels.  Some, like mobile1 and mobile2, are going to be default.  Others like Mom and DAD are custom.  But something else should catch your eye, too: there are no values in the ABMultiValueLabel table with which to match the label integer in the  ABMultivalue table!  Now what?

I introduced the SQLite rowid field in an earlier article.  In sum, every SQLite table has a rowid field, which is a unique integer, whether or not it was specified when the table was created.  In the case of the ABMultiValueLabel table, the rowid is the relation to the label integer in the ABMultivalue table to inform us of the data type.  Observe:


$ sqlite3 -line AddressBook.sqlitedb 'select rowid, value from abmultivaluelabel where rowid = 1 or rowid = 3'

rowid = 1
value = iPhone

rowid = 3
value = _$![Home]!$_

Now that we know how these tables interrelate, we can compose a query to generate a simple contacts list.

A Simple Contact List

To quickly recap, the ABPerson, ABMultivalue, and ABMultivalueLabel tables are interrelated: ABPerson contains the contact names, ABMultivalue houses the phone numbers, email addresses, and potentially other values (e.g. URLs), and ABMultivalueLabel defines the values.  We can create a list of contacts, sorted by contact identification number, thusly:

$ sqlite3 -header AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label order by p.rowid;'

ROWID|First|Last|Organization|Type|value

1|Some|Guy|SomeCompany|_$![Home]!$_|someguy@somedomain.com
1|Some|Guy|SomeCompany|iPhone|(###) ###-####

(Note: Keep reading! Though it might not be apparent, this query above is flawed.)

Checking our work

While the query above works nicely, it does not return all the data available in the tables.  Let's take a look some database stats and you'll see what I mean:

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson;' 
542


$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue;'
1588

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivaluelabel;'
14


The ABPerson table has 542 entries.  We'd expect at least 542 lines in our query, assuming one phone number, email or other value in the ABMultivalue table.  Considering there are 1588 ABMultivalue rows, we actually expect an average of three rows per contact.  But, how many rows are returned with our "Simple Contact List" query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label;'
305

"Ruh-row, Raggy!"  What happened?  Maybe there are contacts in ABPerson not found in ABMultivalue?  

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson where rowid not in (select record_id from abmultivalue);' 
0

No, all 542 rowids from ABPerson are found in the record_ids of ABMultivalue.  We certainly want all the ABPerson records represented in our output, and clearly we're not getting them with the query as written.  But how many of the ABMultivalue records relate to the ABPerson records?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue where record_id in (select rowid from abperson);' 
1588

 All of them.  So, now we know the score: our query won't be correct until we have 1588 rows of data.  So, where's the breakdown?  To find out, we need to look carefully at our original query:

select              # here we a just selecting the columns we desire
    p.rowid, 
    p.first, 
    p.last, 
    p.organization, 
    l.value as Type, 
    m.value 
from                # here we list the source tables, and assign an alias
    abperson p, 
    abmultivalue m, 
    abmultivaluelabel l 
where               # here we filter the data, and this is our danger area
    p.rowid=m.record_id 
    and l.rowid=m.label 
order by            # here we are simply ordering the results
    p.rowid;'

So, it would appear we need to concentrate on the where clause which has the effect of filtering our data.  We are doing two things here:
  1. p.rowid=m.record_id - ABPerson's rowid must match a ABMultivalue record_id, which we have already shown occurs in all cases
  2. l.rowid=m.label - There must be a matching rowid in ABMultivalueLabel for each ABMultivalue label.
It appears that the second statement in where clause is our culprit.  Can we demonstrate this to be the case and verify our assessment of the original query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson, abmultivalue where rowid=record_id;' 
1588

Replacing our count function with the original field names (replacing the label name with the integer because we don't call to the ABMultivalueLabel table in this instance) doesn't change the count because the select statement is the location where we define the fields to be displayed, not filter the results.  I'll demonstrate:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, m.label as Type, m.value from abperson p, abmultivalue m where p.rowid = m.record_id;' | wc -l
1588

So, it was the final filter that resulted in the reduced data set.  That means there must be undefined labels in the ABMultivalue table (and in fact, 1283 of the records have NULL label values)!  So, can we just delete that filter from our original query and call it good?

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id order by p.rowid;' | wc -l
22233

What?!  Over 22,000 rows are returned?  How does that happen?  It's simple, but its also easily misunderstood.  Without our last filter, our select statement says to show the ABMultivaluelabel value field.  There are 14 rows in that table, so we get 14 rows for each of the 1588 records in the ABMultivalue table.  I'm not a math genius, but 542+1588+14= ... two plus eight plus four is fourteen, carry the one... no where near 22K!

So, we need a way to display label name if it exists in the ABMultivaluelabel table, otherwise display the ABMultivalue label integer.  Even if we don't know what it means, we don't want to ignore an ABMutlivalue entry because we may come to understand its meaning through another phase of our investigation.  We can accomplish this with a pair of nested select statements in the primary select statement.

select 
    p.rowid, 
    p.first, p.last, 
    p.organization, 
    case 
        when m.label in (select rowid from abmultivaluelabel) 
        then (select value from abmultivaluelabel where m.label = rowid) 
        else m.label 
    end as Type, 
    m.value  
from 
    abperson p, 
    abmultivalue m, 
where 
    p.rowid=m.record_id 
order by 
    p.rowid;

What we have going on in the highlighted portion is a case statement.  Case is the if/then/else of SQLite.  What we are saying above is, if the ABMultivalue label is in the ABMultivalueLabel table, then print the english translation from the ABMultivalueLabel table.  Otherwise, just print the label integer.  The 'as Type' suffix just labels the column 'Type' in the output.

Let's see what this does to our results:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id order by p.rowid;' | wc -l
1588

Bingo!  We now have our expected results.  One more tiny addition to our query cleans up our output.  It turns out, that a lot of the entries in the ABMultivalue table have NULL values and are useless for our purpose.  We can remove them with a filter in the where clause and get a final product:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id and m.value not null order by p.rowid;' | wc -l
725

So, we improved our results greatly over the initial 305 records that "seemed right".  

The Larger Lesson Learned

You likely tuned into this article because of the iOS6 AdressBook.sqlitedb topic.  And I hope I've helped you extract the results you sought.  But the bigger takeaway comes from the analysis after the initial query that resulted in 305 records.  And, it is this: SQLite will do what you tell it, so long as you use proper syntax, no matter how stupid your query is.  In otherwords, you ask the question, it answers that question, not the one you were thinking, the one you typed.  Make sure you really understand the question before you rely on the answer!



Addendum: What About Street Addresses?

It's great that we've figured out the phone numbers email addresses, etc. of our contacts, but what if we want to find their physical addresses?  The ABMultivalue table I examined did not have street addresses in its values.  Instead, I found Street addresses in the ABMultivalueEntry and ABPersonFullTextSearch_content tables.  I would not preclude addresses from existing ABMultivalue, however, since we have seen there can be custom labels and the value field can contain any text data.  But, I digress.

The ABPersonFullTextSearch_content has the following structure:

CREATE TABLE 'ABPersonFullTextSearch_content'(docid INTEGER PRIMARY KEY, 'c0First', 'c1Last', 'c2Middle', 'c3FirstPhonetic', 'c4MiddlePhonetic', 'c5LastPhonetic', 'c6Organization', 'c7Department', 'c8Note', 'c9Birthday', 'c10JobTitle', 'c11Nickname', 'c12Prefix', 'c13Suffix', 'c14DisplayName', 'c15Phone', 'c16Email', 'c17Address', 'c18SocialProfile', 'c19URL', 'c20RelatedNames', 'c21IM', 'c22Date')

At first glance, the structure of this table might cause you to think you can simply dump the data from this table for a very complete list of contacts.  But this table was intended for searching if you take its name at face value.  Looking at the data, this assertion is supported by the fact that phone numbers are merged into one field and then recorded in a variety of formats (i.e., international, with/without area code, with/without punctuation, etc).  Thus, it is difficult to read and there is no distinction in the type of data (i.e, home/work phone number, etc).

EDIT:
I have discovered how the ABMultivalueEntry relates to the ABMultivalue table and solved the mystery of the blank ABMultivalue "values" fields.

ABMultivalueEntry contains physical addresses which are distinguished a "key" integer.  The key is is given its meaning, such as "Country," "Street," "Zip," "City," etc., by the ABMultivalueEntryKey table.  The key in ABMultivalueEntry relates to the rowid in ABMultivalueEntryKey.  The address data itself is stored in the "value" field of the ABMultivalueEntry table, which in turn relates by the "parent_id" field to the ABmultivalue table by a matching UID field.  Confused?

Let me try it this way:  Contact names are in the ABperson table.  Contact phone numbers, emails addresses, and URLs are located in the ABmultivalue table.  Physical addresses are located in the ABMultivalueEntry table.  The definitions of the values in the '*value*' tables are located in the "*label" and "*key" tables discussed.

What is the query to produce a complete contact list?  Still working on that one...