Tuesday, March 8, 2011

Taming the NSRL Hash DB Beast


I've been trying to extend Sleuthkit in a variety of ways, such as by adding a table to the Sleuthit databese to capture file/mime types and improve strings extraction over that of the traditional 'blkls | strings > file' variety.  I came to the realization that elimination of files through hashing could speed these processes.  This is not a new concept to me or anyone else in data forensics, but I do believe hash elimination is under used.  I think this is due in large part to the unwieldy size of the National Software Reference Library (NSRL) hash database.

The Problem

As of January 2011, The NSRL data is distributed in 4 .iso files and contains a whopping  62,294,036 file entries.  The NSRL tries to help you manage the data volume by directing you to concatenate the NRSLFile.txt files (each containing over 4.7 million rows) found in each .iso and then export only rows with unique hash values.  Such an action reduces the total number of rows to a paltry 18,840,521. However, the unique operation (easily performed with 'uniq' on the command line) hampers the ability of knowing the application or operating system from which the file came.

Size is not the only difficulty, however.  After all, in most cases, large size just means longer processing time.  But the CSV format of the database is a major obstacle.  The field separator used in the database is the comma (',').  But, many of the filenames listed in the db contain commas, and while, text fields are quoted, numeric fields are not.  These conditions are not an obstacle for spread sheets, but keep in mind that OpenOffice Calc can handle only 1,000,000 rows.  And, these conditions make it difficult to work with traditional command line text tools like 'cut', 'sed', and 'awk'.

Ideally, it would be best to have the NSRL db in a database like SQLite, where hash sets could be created based on products and operating systems.  In this way, one could easily export a hash set for Windows XP if that was the type of system being investigated.  However, SQLite splits on all field separators it encounters regardless of what comes before them (meaning they cannot be escaped).  Therefore, trying to import the NSRLFile.txt data fails because there are filenames with commas in them.

The Solution

I am happy to report that there is a solution to the problem with SQLite, however.  A CSV module has been written to overcome this difficulty.  It allows SQLite to query a CSV file directly Using the module is not immediately apparent, so I document the process here:

  1. Decompress the csvfile.zip containing the module source code.
  2. Change to the csvfile/ directory and issue the 'make' command.  If you get an error, it is probably because you don't have the SQLite headers installed (libsqlite3-dev package in Debian/Ubuntu).
  3. Copy the resulting virtcsv.so module to somewhere common, like /usr/local/lib because you will need to load the module each time you query the database you create.
  4. Create a new database with 'sqlite3 nrsl.db'.
  5. Load the CSV module at the SQLite prompt with '.load /usr/local/lib/virtcsv.so'
  6. Create a virtual table to connect to the NSRLFile.txt at the SQLite prompt with 'CREATE VIRTUAL TABLE hashes USING CSVFILE(NSRLFile.txt, CP1252, ',', '"');' where:
    • hashes is the table name you are creating
    • NSRLFile.txt is the CSV file you are attaching
    • CP1252 is the code page Windows Latin 1 (default)
    • , (comma) is the field separator in the CSV file (default is ';' [semicolon])
    • " (double quote) is the string delimiter (default)
  7. View a sample of the data with 'select * from hashes limit 10;'
The process above shows the process for importing one file.  The fact of the matter is, that if you concatenate the four NSRLFile.txt files provided by the NSRL, the file is too large for SQLite to handle.  The files need to be imported to individual tables and queried together to produce hash sets.  Further, Indexes should be created to speed queries (Edit: SQLite will not allow virtual tables to be indexed, so the data must be imported to regular table.  I have written a script to automate this, see below.).  I won't go into those process here, but a good online tutorial for using SQL can be found at W3Schools.  Keep in mind that SQL (shown at W3Schools) and SQLite have some differences, but there are few inconsistencies for basic database query.

Edit:  I've written two scripts to automate the process.  The first, 'getnsrl', will mount the four RDS .iso files and extract/rename the NSRLFile.txt csv files to ready them for processing.  The second, 'mknsrl', will create virtual tables from each csv file and import them into a 'hashes' table.  It will also index the table based on the operating system the files come from to speed searches and hash table exports.