
Thursday, August 9, 2012

File "Cliffs" Notes: Abbreviating File Reads

Sometimes a forensics task is pretty narrow.  Determining file type, for example, usually means reading only the first few tens of bytes.  Luckily, there are tools that do just that.  But for other tasks we might face, our standard tools don't really give us a method for narrowing their scope.  You might want to search files for a string that you expect to find in the first 1000 bytes of the file.  Grep will stop searching a file once the string is found, but until it finds a match, it keeps reading.  This might not seem like a big deal until you consider large files like videos: you really want to search 1 KB of data, but you end up searching 2 GB!
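To make the problem concrete, here is a quick sketch (the file name is hypothetical):
$ grep "string" bigvideo.avi # May read all 2 GB before finding a match (or failing to find one).
$ head -c 1000 bigvideo.avi | grep "string" # Reads at most 1000 bytes, no matter how large the file is.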

Head

The head command is used to output the first part of a file.  By default, it outputs the first 10 lines of a file (best when applied to text files), but it can be told to output a user-defined number of bytes.
$ head -c 1000 file # Output the first 1000 bytes of "file".  With GNU head, suffixes like "1K" (1024 bytes) are accepted as shorthand.  Similarly, the tail command can be used to output data from the end of a file.
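For completeness, the tail equivalent looks like this:
$ tail -c 1000 file # Output the last 1000 bytes of "file".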
You can stop reading here if you only need to output a fixed number of bytes from one file.  But usually the forensic examiner is concerned with searching through hundreds of thousands of files, if not millions!

A Real-World Example

Let's say we have a disk image containing hundreds of thousands of graphics files, but we only want to view photographs created with a digital camera.  Graphics viewing programs like gThumb have the ability to recurse through a file system and find graphics images, and even provide some search limits like file size, modified date, etc.  But these filters don't really help us here.

We need a way to limit the graphics we view to digital photographs.  The best way for us to do this is to limit our search to images containing Exif data.  The term Exif stands for Exchangeable Image File Format.  Exif data contains camera and settings information and was developed to encourage interoperability between digital devices (see exif.org).  It is found before the image data in a JPEG file, and most digital cameras use the standard.

Exif data is preceded by the header 'Exif', found within the first 25 bytes or so of a JPEG file.  So, it seems logical to search graphics files for the Exif header.  But how do we find graphics files, and then search only their first 28 bytes for the string 'Exif'?
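To preview the test on a single file (photo.jpg is a hypothetical name), we can combine head and grep:
$ head -c 28 photo.jpg | grep -q "Exif" && echo "Exif header present" # The -q option silences grep's output; its exit code alone signals a match.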

Find and File

The find command can be used to search recursively through a file system.  It's a very robust tool, but I'll only focus on one of the simplest invocations here: searching a path recursively for any file:
$ find path/to/search/ -type f  # The "f" argument to the "-type" option returns only regular files to standard output.  The "d" argument would limit output to directories, etc.
We can further limit our output based on file name, if we wish:
$ find path/to/search -type f -iname "*.jp*g" # The -iname option matches its argument case-insensitively, and standard wildcards are accepted, so this catches .jpg, .JPG, .jpeg, and so on.  The -name option would produce case-sensitive results.
We might be looking for graphics that have been renamed, however, and limiting our search by name would be too narrow.  This is where the file command comes in.  File reads a file's header and returns its type.  We can pipe the paths located by find to file and read their types like so:
$ find path/to/search -type f | file -i -f -  # The -i option returns the file's MIME type, which is more standardized than the "plain language" description otherwise provided.  The -f option tells file to read the names of the files to examine from a list, and the '-' argument means that list comes from stdin.  In other words, file examines each path printed by the find command.
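An equivalent construction, and one that survives odd characters in file names, uses xargs (assuming GNU find and xargs):
$ find path/to/search -type f -print0 | xargs -0 file -i # The -print0 and -0 options delimit paths with NUL bytes, so spaces and newlines in file names are handled safely.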
We now have a list of files with their types being returned to standard output.  What we want to do is filter that list for images, and then get the subset of images with Exif data.  To do so, we insert a while loop and add a test:
$ find path/to/search -type f | while read i; do file -ib "$i" | grep -q "image"; [ $? = 0 ] && echo "$i"; done # Here, each path located by find is assigned to the variable $i in the while loop.  Inside the loop, the file type of $i is determined, and if the type contains the string "image", the file path is printed.  The "-b" option makes file's output "brief", that is, the file type without the file name (this avoids false hits where the term "image" is in the file path).  The "-q" option silences grep so that only the file path appears in the output.
If you're not familiar with bash tests, please review this post, or enter "help test" at the bash command line.  In brief, we are testing grep's exit code, and if it is '0', which means success, we execute the echo command to print the file path.
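A minimal, self-contained illustration of the pattern:
$ echo "image/jpeg" | grep -q "image"; [ $? = 0 ] && echo "matched" # grep exits 0 on a match, so the test succeeds and echo runs.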

We've now managed to list the files that are classified as images.  To complete our quest, we need to determine which of them contain Exif data.  We do this by integrating one more test:
$ find path/to/search -type f | while read i; do file -ib "$i" | grep -q "image"; [ $? = 0 ] && head -c 28 "$i" | grep -q "Exif"; [ $? = 0 ] && echo "$i"; done  # Here, if the file is determined to be an image, its first 28 bytes are grepped for the string "Exif".  If that results in a match, the file path is printed.
Okay, you caught me: I've only produced a list of files with Exif headers, and I can't see the content of the images by reading a list.  This is easily rectified with the ln or 'link' command.  We can replace the last echo with a command that creates links to the images we located, allowing all the files to be viewed from one location:
$ find path/to/search -type f | while read i; do file -ib "$i" | grep -q "image"; [ $? = 0 ] && head -c 28 "$i" | grep -q "Exif"; [ $? = 0 ] && ln -s "$i" "${i##*/}"-$(stat -c %i "$i"); done # The -s option creates a symbolic link to file $i in the current directory.  The link is named for the file's basename with its inode number appended, which prevents the errors that occur when two files share the same name.
Don't worry about the "${i##*/}"-$(stat -c %i "$i") razzle-dazzle for now.  I'll explain that in another post.  For now, just mimic that part of the command if you are concerned about files having the same name.
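If you're curious in the meantime, here is a quick sketch of what each piece produces (the path is hypothetical):
$ i=path/to/search/photo.jpg; echo "${i##*/}" # The ##*/ expansion strips everything through the last '/', printing the basename "photo.jpg".
$ stat -c %i path/to/search/photo.jpg # The %i format prints the file's inode number.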

A Little More Detail for Those Still Conscious

It turns out that 'Exif' is really not the best string on which to filter; filtering on an Exif creation date proved more effective.  Consider a recent case where I carved jpegs from unallocated clusters.  Here, I could filter on file name rather than file type, because the carver automatically appended the .jpg file extension.

Total number of files carved:

$ find recovered/ -type f | wc -l # The wc or "word count" command counts lines of output when given the -l option.
231331

My total number of jpeg images:
$ find recovered/ -name "*.jpg" | wc
61432
Total number of jpeg images with Exif data:

$ find recovered/ -name "*.jpg" | while read i; do head -c 28 $i | grep Exif ; done | wc -l
3815
Total number of jpeg images with Exif creation dates:

$ find recovered/ -name "*.jpg" | while read i; do head -c 256 $i | grep -E '[0-9]{4}:[0-9]{2}:[0-9]{2}\ [0-9]{2}:[0-9]{2}:[0-9]{2}'; done | wc -l # The regular expression matches output like "2012:08:09 17:11:56"
841
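To verify the expression behaves as intended, you can feed it a sample date string:
$ echo "2012:08:09 17:11:56" | grep -E '[0-9]{4}:[0-9]{2}:[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' # Prints the line, confirming the pattern matches the Exif date format.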


As you can see, filtering on a date code was much more effective at finding files with metadata associated with cameras (Exif creation dates) than filtering on the Exif header alone.

I know that there is a lot of information here, but I hope that by presenting it completely, you are able to understand and adapt these concepts to your own uses.

4 comments:

  1. John, I'm curious why you decided to use "file" over exiftool for this?

  2. Hal, I'm aware of exiftool and find it excellent. The main premise was working with subsets of data, and I wanted to also highlight branching with exit codes.

    I probably should have chosen a different data set as an example, but it's what I faced over the past couple of days and is what came to mind. There is certainly more than one way to skin this cat!

  3. I think that instead of:
    find . -type f | while read i; do file -ib "$i" | grep "image"; [ $? = 0 ] && echo "$i"; done

    you could write:
    find path/to/search -type f | while read i; do file -ib "$i" | grep "image" && echo "$i"; done

    Is there any reason for using [ $? = 0 ]?

  4. You're absolutely correct, rweglarz: the control operator '&&' can replace the exit code test '[ $? = 0 ]' and, frankly, would be preferred, as it is more concise and saves a separate test invocation!
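    For reference, here is the final filtering command rewritten with '&&' throughout (same behavior as the version in the post):
    $ find path/to/search -type f | while read i; do file -ib "$i" | grep -q "image" && head -c 28 "$i" | grep -q "Exif" && echo "$i"; done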
