Tuesday, October 9, 2012

Addressing the iOS6 Address Book and SQLite Pitfalls

Now that I have a basic handle on the iOS6 sms.db, it's time to look at the iOS6 Address Book.  After all, I now know what's being said when I examine the sms.db, but I don't have a real good picture, other than the phone number, of who's sending the message.  The iOS AddressBook.sqlitedb is the place to look for data related to the phone number.

A brief look at AddressBook.sqlitedb

The iOS6 AddressBook.sqlitedb database has 29 tables:


ABAccount                        ABPersonFullTextSearch_segdir  
ABGroup                          ABPersonFullTextSearch_segments
ABGroupChanges                   ABPersonFullTextSearch_stat    
ABGroupMembers                   ABPersonLink                   
ABMultiValue                     ABPersonMultiValueDeletes      
ABMultiValueEntry                ABPersonSearchKey              
ABMultiValueEntryKey             ABPhoneLastFour                
ABMultiValueLabel                ABRecent                       
ABPerson                         ABStore                        
ABPersonBasicChanges             FirstSortSectionCount          
ABPersonChanges                  FirstSortSectionCountTotal     
ABPersonFullTextSearch           LastSortSectionCount           
ABPersonFullTextSearch_content   LastSortSectionCountTotal      
ABPersonFullTextSearch_docsize   _SqliteDatabaseProperties  

The abperson table seems to be the obvious target for the data we want.  Here's its schema:


CREATE TABLE ABPerson (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, First TEXT, Last TEXT, Middle TEXT, FirstPhonetic TEXT, MiddlePhonetic TEXT, LastPhonetic TEXT, Organization TEXT, Department TEXT, Note TEXT, Kind INTEGER, Birthday TEXT, JobTitle TEXT, Nickname TEXT, Prefix TEXT, Suffix TEXT, FirstSort TEXT, LastSort TEXT, CreationDate INTEGER, ModificationDate INTEGER, CompositeNameFallback TEXT, ExternalIdentifier TEXT, ExternalModificationTag TEXT, ExternalUUID TEXT, StoreID INTEGER, DisplayName TEXT, ExternalRepresentation BLOB, FirstSortSection TEXT, LastSortSection TEXT, FirstSortLanguageIndex INTEGER DEFAULT 2147483647, LastSortLanguageIndex INTEGER DEFAULT 2147483647, PersonLink INTEGER DEFAULT -1, ImageURI TEXT, IsPreferredName INTEGER DEFAULT 1)

If you bothered to read through that, you'll notice there is nothing about phone numbers or email addresses.  We need to find the table containing the phone numbers, and somehow join that to this table that has the name data.

Take a look at a sample (fictitious) row:

                  ROWID = 1
                  First = Some
                   Last = Name
                 Middle = 
          FirstPhonetic = 
         MiddlePhonetic = 
           LastPhonetic = 
           Organization = SomeCompany
             Department = 
                   Note = 
                   Kind = 0
               Birthday = 
               JobTitle = 
               Nickname = 
                 Prefix = 
                 Suffix = 
              FirstSort = -'- 1'?7=W +'= 17I/ )'MM'=7CA +57/1
               LastSort = 1'?7=W -'- +'= 17I/ )'MM'=7CA +57/1
           CreationDate = 350880583
       ModificationDate = 369882045
  CompositeNameFallback = 
     ExternalIdentifier = 
ExternalModificationTag = 
           ExternalUUID = 68516E99-7C39-4C4D-8871-BDE114EDD6B4
                StoreID = 0
            DisplayName = 
 ExternalRepresentation = 
       FirstSortSection = -
        LastSortSection = 1
 FirstSortLanguageIndex = 0
  LastSortLanguageIndex = 0
             PersonLink = -1
               ImageURI = 
        IsPreferredName = 1

A little more digging, we find the ABMultivalue table contains phone numbers, email addresses, URLs to social networking sites, and potentially more.  There are only six fields (making it a little easier on the eyes), but there's a minor issue.  Reading the table flat (that is, not related to any other tables) we don't know whom the numbers and email addresses belong to, nor to we know what type of data it is.  For example, is it a home, work, mobile, or FAX phone number? Take a look at its makeup and some sample rows and you'll see what I mean:

CREATE TABLE ABMultiValue (UID INTEGER PRIMARY KEY, record_id INTEGER, property INTEGER, identifier INTEGER, label INTEGER, value TEXT)

       UID = 643
 record_id = 1
  property = 4
identifier = 0
     label = 3
     value = someguy@somedomain.com

       UID = 1026
 record_id = 1
  property = 3
identifier = 0
     label = 1
     value = (###) ###-####

That's a little easier on the eyes.  You can see that the 'value' field at the end of the row contains the phone numbers and email addresses we're looking for.  The problem comes from the fact its the only text field in the table.  All other values are integers.  So, how do we relate the names in the ABPerson table to the numbers in the ABMultivalue?  I made it pretty obvious here, but ABperson.ROWID = ABMultivalue.record_id.

So, what about that other problem: what kind of data is that?  Here comes the ABMultiValueLabel table to the rescue!  Take a look at this:

value = iPhone

value = _$![MOBILE]!$_

value = _$![HOME]!$_

value = mobile1

value = mobile2

value = DAD

value = MOM

If that looks a bit strange, its because the iOS user can create their own labels.  Some, like mobile1 and mobile2, are going to be default.  Others like Mom and DAD are custom.  But something else should catch your eye, too: there are no values in the ABMultiValueLabel table with which to match the label integer in the  ABMultivalue table!  Now what?

I introduced the SQLite rowid field in an earlier article.  In sum, every SQLite table has a rowid field, which is a unique integer, whether or not it was specified when the table was created.  In the case of the ABMultiValueLabel table, the rowid is the relation to the label integer in the ABMultivalue table to inform us of the data type.  Observe:


$ sqlite3 -line AddressBook.sqlitedb 'select rowid, value from abmultivaluelabel where rowid = 1 or rowid = 3'

rowid = 1
value = iPhone

rowid = 3
value = _$![Home]!$_

Now that we know how these tables interrelate, we can compose a query to generate a simple contacts list.

A Simple Contact List

To quickly recap, the ABPerson, ABMultivalue, and ABMultivalueLabel tables are interrelated: ABPerson contains the contact names, ABMultivalue houses the phone numbers, email addresses, and potentially other values (e.g. URLs), and ABMultivalueLabel defines the values.  We can create a list of contacts, sorted by contact identification number, thusly:

$ sqlite3 -header AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label order by p.rowid;'

ROWID|First|Last|Organization|Type|value

1|Some|Guy|SomeCompany|_$![Home]!$_|someguy@somedomain.com
1|Some|Guy|SomeCompany|iPhone|(###) ###-####

(Note: Keep reading! Though it might not be apparent, this query above is flawed.)

Checking our work

While the query above works nicely, it does not return all the data available in the tables.  Let's take a look some database stats and you'll see what I mean:

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson;' 
542


$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue;'
1588

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivaluelabel;'
14


The ABPerson table has 542 entries.  We'd expect at least 542 lines in our query, assuming one phone number, email or other value in the ABMultivalue table.  Considering there are 1588 ABMultivalue rows, we actually expect an average of three rows per contact.  But, how many rows are returned with our "Simple Contact List" query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label;'
305

"Ruh-row, Raggy!"  What happened?  Maybe there are contacts in ABPerson not found in ABMultivalue?  

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson where rowid not in (select record_id from abmultivalue);' 
0

No, all 542 rowids from ABPerson are found in the record_ids of ABMultivalue.  We certainly want all the ABPerson records represented in our output, and clearly we're not getting them with the query as written.  But how many of the ABMultivalue records relate to the ABPerson records?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue where record_id in (select rowid from abperson);' 
1588

 All of them.  So, now we know the score: our query won't be correct until we have 1588 rows of data.  So, where's the breakdown?  To find out, we need to look carefully at our original query:

select              # here we a just selecting the columns we desire
    p.rowid, 
    p.first, 
    p.last, 
    p.organization, 
    l.value as Type, 
    m.value 
from                # here we list the source tables, and assign an alias
    abperson p, 
    abmultivalue m, 
    abmultivaluelabel l 
where               # here we filter the data, and this is our danger area
    p.rowid=m.record_id 
    and l.rowid=m.label 
order by            # here we are simply ordering the results
    p.rowid;'

So, it would appear we need to concentrate on the where clause which has the effect of filtering our data.  We are doing two things here:
  1. p.rowid=m.record_id - ABPerson's rowid must match a ABMultivalue record_id, which we have already shown occurs in all cases
  2. l.rowid=m.label - There must be a matching rowid in ABMultivalueLabel for each ABMultivalue label.
It appears that the second statement in where clause is our culprit.  Can we demonstrate this to be the case and verify our assessment of the original query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson, abmultivalue where rowid=record_id;' 
1588

Replacing our count function with the original field names (replacing the label name with the integer because we don't call to the ABMultivalueLabel table in this instance) doesn't change the count because the select statement is the location where we define the fields to be displayed, not filter the results.  I'll demonstrate:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, m.label as Type, m.value from abperson p, abmultivalue m where p.rowid = m.record_id;' | wc -l
1588

So, it was the final filter that resulted in the reduced data set.  That means there must be undefined labels in the ABMultivalue table (and in fact, 1283 of the records have NULL label values)!  So, can we just delete that filter from our original query and call it good?

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id order by p.rowid;' | wc -l
22233

What?!  Over 22,000 rows are returned?  How does that happen?  It's simple, but its also easily misunderstood.  Without our last filter, our select statement says to show the ABMultivaluelabel value field.  There are 14 rows in that table, so we get 14 rows for each of the 1588 records in the ABMultivalue table.  I'm not a math genius, but 542+1588+14= ... two plus eight plus four is fourteen, carry the one... no where near 22K!

So, we need a way to display label name if it exists in the ABMultivaluelabel table, otherwise display the ABMultivalue label integer.  Even if we don't know what it means, we don't want to ignore an ABMutlivalue entry because we may come to understand its meaning through another phase of our investigation.  We can accomplish this with a pair of nested select statements in the primary select statement.

select 
    p.rowid, 
    p.first, p.last, 
    p.organization, 
    case 
        when m.label in (select rowid from abmultivaluelabel) 
        then (select value from abmultivaluelabel where m.label = rowid) 
        else m.label 
    end as Type, 
    m.value  
from 
    abperson p, 
    abmultivalue m, 
where 
    p.rowid=m.record_id 
order by 
    p.rowid;

What we have going on in the highlighted portion is a case statement.  Case is the if/then/else of SQLite.  What we are saying above is, if the ABMultivalue label is in the ABMultivalueLabel table, then print the english translation from the ABMultivalueLabel table.  Otherwise, just print the label integer.  The 'as Type' suffix just labels the column 'Type' in the output.

Let's see what this does to our results:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id order by p.rowid;' | wc -l
1588

Bingo!  We now have our expected results.  One more tiny addition to our query cleans up our output.  It turns out, that a lot of the entries in the ABMultivalue table have NULL values and are useless for our purpose.  We can remove them with a filter in the where clause and get a final product:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id and m.value not null order by p.rowid;' | wc -l
725

So, we improved our results greatly over the initial 305 records that "seemed right".  

The Larger Lesson Learned

You likely tuned into this article because of the iOS6 AdressBook.sqlitedb topic.  And I hope I've helped you extract the results you sought.  But the bigger takeaway comes from the analysis after the initial query that resulted in 305 records.  And, it is this: SQLite will do what you tell it, so long as you use proper syntax, no matter how stupid your query is.  In otherwords, you ask the question, it answers that question, not the one you were thinking, the one you typed.  Make sure you really understand the question before you rely on the answer!



Addendum: What About Street Addresses?

It's great that we've figured out the phone numbers email addresses, etc. of our contacts, but what if we want to find their physical addresses?  The ABMultivalue table I examined did not have street addresses in its values.  Instead, I found Street addresses in the ABMultivalueEntry and ABPersonFullTextSearch_content tables.  I would not preclude addresses from existing ABMultivalue, however, since we have seen there can be custom labels and the value field can contain any text data.  But, I digress.

The ABPersonFullTextSearch_content has the following structure:

CREATE TABLE 'ABPersonFullTextSearch_content'(docid INTEGER PRIMARY KEY, 'c0First', 'c1Last', 'c2Middle', 'c3FirstPhonetic', 'c4MiddlePhonetic', 'c5LastPhonetic', 'c6Organization', 'c7Department', 'c8Note', 'c9Birthday', 'c10JobTitle', 'c11Nickname', 'c12Prefix', 'c13Suffix', 'c14DisplayName', 'c15Phone', 'c16Email', 'c17Address', 'c18SocialProfile', 'c19URL', 'c20RelatedNames', 'c21IM', 'c22Date')

At first glance, the structure of this table might cause you to think you can simply dump the data from this table for a very complete list of contacts.  But this table was intended for searching if you take its name at face value.  Looking at the data, this assertion is supported by the fact that phone numbers are merged into one field and then recorded in a variety of formats (i.e., international, with/without area code, with/without punctuation, etc).  Thus, it is difficult to read and there is no distinction in the type of data (i.e, home/work phone number, etc).

EDIT:
I have discovered how the ABMultivalueEntry relates to the ABMultivalue table and solved the mystery of the blank ABMultivalue "values" fields.

ABMultivalueEntry contains physical addresses which are distinguished a "key" integer.  The key is is given its meaning, such as "Country," "Street," "Zip," "City," etc., by the ABMultivalueEntryKey table.  The key in ABMultivalueEntry relates to the rowid in ABMultivalueEntryKey.  The address data itself is stored in the "value" field of the ABMultivalueEntry table, which in turn relates by the "parent_id" field to the ABmultivalue table by a matching UID field.  Confused?

Let me try it this way:  Contact names are in the ABperson table.  Contact phone numbers, emails addresses, and URLs are located in the ABmultivalue table.  Physical addresses are located in the ABMultivalueEntry table.  The definitions of the values in the '*value*' tables are located in the "*label" and "*key" tables discussed.

What is the query to produce a complete contact list?  Still working on that one...