Friday, March 11, 2011

More NSRL Talk

I've spent a bit more time study the NSRL problem.  It is a fantastic but unwieldy database of hash values that is under utilized (at least in my experience).  I spoke in my last post about importing the NSRL database into SQLite so the data can be better managed.  I added at the end of the post that I had created a pair of shell scripts, 'getnsrl' and 'mknsrl' to automate SQLite creation.

I am happy to report that I have extended mknsrl to fully handle the NSRL database.  My first version only imported the file table which contains the hash values.  I thought, incorrectly, that this would be sufficient for generating has sets.  I had not fully understood the NSRL Reference Data Set (RDS) construction, which actually consists of five CSV files:

  • NSRLFile.txt
  • NSRLMfg.txt
  • NSRLOS.txt
  • NSRLProd.txt
  • version.txt

The NSRLFile.txt is broken into four parts that need to be concatenated together.  Currently, this database contains over 62 million lines, far exceeding the capabilities of the spreadsheet programs for which it is formatted.  The database consists of 'known' files, not be confused with 'known good' files.  The database contains hash values for well know hacker programs, for example, and eliminating them through hashing may be problematic to the investigation.  There needed to be a way to export specific hash sets from the RDS for the data to be truly useful.

getnsrl

I wrote getnsrl to make processing the four iso files containing the NSRL RDS easy.  The files must be mounted, and the database files extracted from a zip file.  getnsrl does this automatically, appending the names of the NSRLFile.txt CSV files to distinguish them and prevent overwriting.  On my system, this takes about four minutes.  The user need only put the iso files in a common directory (not required, but easier on the fingers) and issue 'getnsrl RDS*'.  The tool provides feedback and start/stop times.

mknsrl


I wrote mknsrl to allow for easy importation of the CSV files to a standalone SQLite database.  It creates virtual tables from the CSV files and then imports them into regular tables.  The NSRLFile.txt files are not (and should not be) concatenated before processing as this is done by mknsrl while inserting the data into the database.  Though importing the files to a standard table is not required to use the CSV files with SQLite, I favored this approach because the database can be used independently of the CSV files.  I just found it simpler to manage, and it has the added benefit of indexing if the user so chooses, since indexing virtual tables is not possible with SQLite.  mknsrl reports its progress, start/stop times, and some database statistics.  It no longer does any indexing (as in the first version).  I leave that to the user to decide what to index if indexing is desired.  On my system, it took about 12 minutes to import the NSRL RDS data.  The statistical analysis can be commented out to save a few minutes, if desired.

Why use mknsrl?

The chief advantage of mknsrl is database control.  Hash sets can be produced of any product in the database, which is not easy to do with the traditional methods for handling the RDS data.  NSRL has recommended that the four NSRLFile.txt be concatenated and then reduced to uniq hash values, leaving about 18 million records.  The problem is, you can't reliably know the source of the file this way, and you may hash out a program file of interest.

mknsrl allows any hash set or group of hash sets to be extracted from the database with a sqlite query in the following form:

$ sqlite3 -csv -header nsrl.db "select * from file where [expression];"

I used the following expression to create a hash set of all "Windows" operating systems in about 5 minutes (note the nested SELECT statement that makes this possible):

$ sqlite3 -csv -header nsrl.db "select * from file where opsystemcode in (select opsystemcode from os where opsystemname like '%Windows%'"

Your ability to create hash sets is limited only by your ability to craft SQLite queries.

If you would like to look at or use 'getnsrl' or 'mknsrl', you can find them at https://sites.google.com/site/slosleuth/.

Tuesday, March 8, 2011

Taming the NSRL Hash DB Beast

Background

I've been trying to extend Sleuthkit in a variety of ways, such as by adding a table to the Sleuthit databese to capture file/mime types and improve strings extraction over that of the traditional 'blkls | strings > file' variety.  I came to the realization that elimination of files through hashing could speed these processes.  This is not a new concept to me or anyone else in data forensics, but I do believe hash elimination is under used.  I think this is due in large part to the unwieldy size of the National Software Reference Library (NSRL) hash database.

The Problem

As of January 2011, The NSRL data is distributed in 4 .iso files and contains a whopping  62,294,036 file entries.  The NSRL tries to help you manage the data volume by directing you to concatenate the NRSLFile.txt files (each containing over 4.7 million rows) found in each .iso and then export only rows with unique hash values.  Such an action reduces the total number of rows to a paltry 18,840,521. However, the unique operation (easily performed with 'uniq' on the command line) hampers the ability of knowing the application or operating system from which the file came.

Size is not the only difficulty, however.  After all, in most cases, large size just means longer processing time.  But the CSV format of the database is a major obstacle.  The field separator used in the database is the comma (',').  But, many of the filenames listed in the db contain commas, and while, text fields are quoted, numeric fields are not.  These conditions are not an obstacle for spread sheets, but keep in mind that OpenOffice Calc can handle only 1,000,000 rows.  And, these conditions make it difficult to work with traditional command line text tools like 'cut', 'sed', and 'awk'.

Ideally, it would be best to have the NSRL db in a database like SQLite, where hash sets could be created based on products and operating systems.  In this way, one could easily export a hash set for Windows XP if that was the type of system being investigated.  However, SQLite splits on all field separators it encounters regardless of what comes before them (meaning they cannot be escaped).  Therefore, trying to import the NSRLFile.txt data fails because there are filenames with commas in them.

The Solution

I am happy to report that there is a solution to the problem with SQLite, however.  A CSV module has been written to overcome this difficulty.  It allows SQLite to query a CSV file directly Using the module is not immediately apparent, so I document the process here:

  1. Decompress the csvfile.zip containing the module source code.
  2. Change to the csvfile/ directory and issue the 'make' command.  If you get an error, it is probably because you don't have the SQLite headers installed (libsqlite3-dev package in Debian/Ubuntu).
  3. Copy the resulting virtcsv.so module to somewhere common, like /usr/local/lib because you will need to load the module each time you query the database you create.
  4. Create a new database with 'sqlite3 nrsl.db'.
  5. Load the CSV module at the SQLite prompt with '.load /usr/local/lib/virtcsv.so'
  6. Create a virtual table to connect to the NSRLFile.txt at the SQLite prompt with 'CREATE VIRTUAL TABLE hashes USING CSVFILE(NSRLFile.txt, CP1252, ',', '"');' where:
    • hashes is the table name you are creating
    • NSRLFile.txt is the CSV file you are attaching
    • CP1252 is the code page Windows Latin 1 (default)
    • , (comma) is the field separator in the CSV file (default is ';' [semicolon])
    • " (double quote) is the string delimiter (default)
  7. View a sample of the data with 'select * from hashes limit 10;'
The process above shows the process for importing one file.  The fact of the matter is, that if you concatenate the four NSRLFile.txt files provided by the NSRL, the file is too large for SQLite to handle.  The files need to be imported to individual tables and queried together to produce hash sets.  Further, Indexes should be created to speed queries (Edit: SQLite will not allow virtual tables to be indexed, so the data must be imported to regular table.  I have written a script to automate this, see below.).  I won't go into those process here, but a good online tutorial for using SQL can be found at W3Schools.  Keep in mind that SQL (shown at W3Schools) and SQLite have some differences, but there are few inconsistencies for basic database query.

Edit:  I've written two scripts to automate the process.  The first, 'getnsrl', will mount the four RDS .iso files and extract/rename the NSRLFile.txt csv files to ready them for processing.  The second, 'mknsrl', will create virtual tables from each csv file and import them into a 'hashes' table.  It will also index the table based on the operating system the files come from to speed searches and hash table exports.