Pages

Tuesday, February 24, 2015

URLs : U R Loaded with Information

In my early days of forensics, I considered URLs in web histories as nothing more than addresses to websites, and strictly speaking, that’s true. But URLs often contain form information supplied by the user and other artifacts that can be relevant to an investigation, too. Most of us in the business know this already, at least it concerns one commonly sought after ingot: the web search term.

Consider the following URL:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

Most examiners would key in on the domain google.com and the end of the url, q=linuxsleuthing, and conclude this was a Google search for the term "linuxsleuthing", and they’d be right. But is there anything else to be gleaned from the URL? Just what do all those strings and punctuation mean, anyway?

What’s in a URL

Let’s use the URL above as our discussion focus. I’ll break down each element, and I’ll mention at least one value of the element to the forensic investigator (you may find others). Finally, I’ll identify and demonstrate a Python library to quickly dissect a URL into its constituent parts.

Protocol

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The URL starts with the protocol, the "language" the browser must speak to communicate with the resource. In the Python urllib module that I will introduce later, the protocol is referred to as the "scheme".

Examples:

  • http: - Internet surfing

  • https: - Secure Internet surfing

  • ftp: - File transfer operations

  • file: - Local file operations

  • mailto: - Email operations

The forensics value of a protocol is that it clues you into the nature of the activity occurring at that moment with the web browser.

Domain

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The domain can be thought of as the place "where the resource lives." Technically, it can consist of three parts: the top-level domain (TLD), second-level domain, and the host name (or subdomain). If you are more interested in those terms, I’ll leave it to you to research. Suffice it to say that we think of it as the "name" of the website, and with good reason. The names exist in this form because they can be easily memorized and recognized by humans. You may also encounter the domains evil twin in a URL, the Internet Protocol (IP) address, which domain names represent.

The Python urllib module referes to the domain as the "netloc" and identifies it by the leading "//", which is the proper introduction according to RFC 1808.

The forensic value of a domain is that you know where the resource defined in the remainder of the URL can be found or was located in the past.

Port

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The port is not listed in this url, nor is it often included in URLs intended for human consumption. However, if you see something like www.google.com:80, the ":80" indicates communication is occurring across port 80. You’ll often see port numbers for URLs to video servers, but port numbers are by no means limited to such uses. The Python urllib module incorporates the port in the "netloc" attribute.

The chief forensic value of a port is that it can clue you into the type of activity occurring on the domain because many port numbers are well known and commonly used for specific tasks.

Path

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

In terms of a web server, the path indicates the path to the resource on the server. If the "file:" protocol is seen in the URL, then the path signifies the logical location of the file on the local machine. In fact, there will not be a domain, though the domain preamble is present, which is why you see three forward slashes for a file:

file:///path.

The Python urllib module also uses the name "path" to describe this hierarchal path on the server. Please understand that both hard paths and relative paths are possible. In addition, Python describes "params" for the last path element which are introduced by a semicolon. This should not be confused with the parameters I describe in the next section.

The principle forensic value of the path is the same as the over riding principle of real estate: location, location, location.

Parameters

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

Parameters are information passed to the web server by the browser. They are also referred to as "query strings". Parameters can include environment information, web form data, window size, and anything else the web site is coded to pass on. Parameter strings are indicated by a leading "?" followed by key:value pairs. Multiple parameters are separated by "&". Python calls parameters the "query."

Consider our sample URL. It can be seen to have four parameters:

  • sourceid=chrome-instant

  • ion=1

  • espv=2

  • ie=UTF-8

Parameters are really the meat and potatoes of URL analysis, in my opinion. It is here I find the most interesting details: the user name entered on the previous web page; in the case of mobile devices, the location of the device (lat/lon) when the Facebook post was made; the query on the search engine, etc.

Despite what I said in the preceding paragraph, note that query string is not present the case of our sample URL. The search was conducted through the Google Chrome browser address bar (sourceid=chrome-instant). Thus, it is not safe to assume that all search engine search terms or web form data are to be found in the URL parameters.

To throw a little more mud on the matter, consider that the entry point of the search and the browser make a difference in the URL:

Search for linuxsleuthing from the Ubuntu start page, FireFox
https://www.google.com/search?q=linuxsleuthing&ie=UTF-8&sa=Search&channel=fe&client=browser-ubuntu&hl=en&gws_rd=ssl

Here, we see the same search, but different parameters:

  • q=linuxsleuthing

  • ie=UTF-8

  • sa=Search

  • channel=fe

  • client=browser-ubuntu

  • hl=en

  • gws_rd=ssl

Caution
Parameters will mean different things to different sites. There is no "one-definition fits all" here, even if there be obvious commonality. It will take research and testing to know the particular meaning of any given parameter even though it may appear obvious on its face.

Anchor

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The anchor links to some location within the web page document itself. If you’ve ever clicked a link and found yourself halfway down a page, then you understand the purpose of the anchor. Somewhere in the html code of that page is a bookmark of sorts to which that anchor points. Python calls the anchor a "fragment."

In the case of our sample URL, the anchor is the search term I entered in the address bar of the Google Chrome browser.

The forensics value of an anchor is that you know what the user saw or should have seen when at that site. It might demonstrate a user interest or that they had knowledge of a fact, depending on your particular circumstances, of course.

Making Short Work of URL Parsing

Python includes a library for manipulating URLs named, appropriately enough, urllib. The python library identifies the components of a URL a little more precisely than I described above, which was only intended as an introduction. By way of quick demonstration, we’ll let Python address our sample URL

iPython Interative Session, Demonstrating urllib
In [1]: import urllib

In [2]: result = urllib.parse.urlparse('https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing')

In [3]: print(result)
ParseResult(scheme='https', netloc='www.google.com', path='/webhp', params='', query='sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8', fragment='q=linuxsleuthing')

In [4]: result.query
Out[4]: 'sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8'

In [5]: result.query.split('&')
Out[5]: ['sourceid=chrome-instant', 'ion=1', 'espv=2', 'ie=UTF-8']

In [6]: result.fragment
Out[6]: 'q=linuxsleuthing'
Note
The Python urllib calls the parameters I discussed a query and the anchor a fragment.

If you have a little Python knowledge, then you can see how readily you could parse a large list of urls. If not, it is not much more difficult to parse a url using BASH.

Parsing URLs using BASH variable substitution

$ url="https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing"
$ anchor=${url##*\#}
$ parameters=${url##*\?}
$ parameters=${parameters//#$anchor/}
$ echo ${parameters//&/ }
sourceid=chrome-instant ion=1 espv=2 ie=UTF-8
$ echo $anchor
q=linuxsleuthing

Finding Parameters

If you want to narrow your search for URLs containing parameters and anchors, you need only grep your list for the "&" or "#" characters. If you are processing a history database such as the Google Chrome History SQLite database, you can export the relevant urls with the following query:

SQLite query for Google Chrome History
select * from urls where url like "%?%" or url like "%#%";

What’s All the Fuss?

So, why go to all this length to study a URL? I’ll give two simple illustrations:

In the first case, I had the computer of a person suspected of drug dealing. I found little relevant data on his computer doing basic analysis, including an analysis of search engine search terms. When I examined URL parameters, however, I found searches at website vendors that demonstrated the purchase of materials for growing marijuana.

In the second case, a stolen computer was recovered in close proximity to a suspect who claimed to have no knowledge of the device. The Google Chrome browser in the guest account was used since the date of the theft, so analysis was in order. URL parameters showed a login to the suspect’s Apple account 12 hours after the left. There was no useful data in the cache, only the URL history.

Finally, bear in mind that the URL history is the only artifact you may have of secure website activity. Browsers, by default, do not cache secure elements. Understanding the contents of a URL can clue you into activity for which may find no other artifacts.

It is good to know what’s in a URL!


No comments:

Post a Comment