Tuesday, May 6, 2014

Searching for Searches

In a recent examination of smart phone content, it became necessary to know the personal interests of the device's owner.  You can browse internet and app history, but that can be extensive to review every URLs to every clicked link and served page.  To get directly to the point, I decided to search for his browser/app search query history.  I was hoping to craft a regular expression (or several) that would assist in giving me a good idea of the person's interests.

I studied some top search engine results and reviewed some browser history and crafted the following GNU extended regular expression:

[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+

This search, run against strings output of files, found search queries for Google, Yahoo!, Bing, Ask, Aol, Faceboot, YouTube, Vimeo and some x-rated sites as well as app content such as Twitter.  Search results appear (depending on what you feed and how you configure GNU grep) similar to:

https://www.google.com/search?q=you+found+me
http://m.youtube.com/results?q=some%20video%20i%20like
https://m.facebook.com/search/?query=that%20guy%20

An added benefit to this expression is that it also hits on additional page results, Google images page refreshes, etc.   With little command line wiz-bangery, it's even possible to sort and count the results to get a histogram of searches:

strings History.plist | egrep -o '[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+' | sed 's/.*=//' | sort | uniq -c | sort -nr

I'll explain the command above:
  1. strings History.plist # extract ascii strings from the iPhone Safari History.plist
  2. egrep -o '[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+' # grep for the regular expression described above
  3. sed 's/.*=//' # strip off the query tag at the front of the user typed query
  4. sort # sort the results alphabetically
  5. uniq -c  # count the matching lines
  6. sort -nr # reverse sort, placing the most frequent query terms first.
Results of the command look similar to the following:

   21    I+search+for+this+a+lot
   11    this%20one%20a%20little%less
    2    why+would+anyone+read+linux+sleuthing
    1    testing%20one%20two

The expression could be run against all logical files in a device and against unallocated space, if applicable.  I only demonstrate it using the History.plist because it's easy illustrate.

I post this short article both because I want to remember this regular expression (the whole reason for my blog in the first place) and to solicit favorite search box/engine regular expressions you might have. Please share them in a comment if you get a chance.  Happy searching!