Thursday, January 19, 2012

Unoconv is number one!

I discussed a recent case where I was seeking Enhanced Metafiles and discovered that LibreOffice could be used to open and view them (see, LibreOffice: An Unlikely Image Viewer).  I mentioned at the end of that post that unoconv could be used to automate the process of taking the difficult to view EMF files and convert them something very portable, like PDF.

Today, I actually undertook that task when carving form EMF files produced over 1000 files to be examined.  Doing so exposed me to the full capabilities of unoconv, and I'm quite excited about the possibilities.

What, exactly is unoconv?

From the man page: "unoconv is a command line utility that can convert any file format that OpenOffice can import, to any file format that OpenOffice is capable of exporting."  This begs the question: what can OpenOffice (or LibreOffice) import and export?  Glad you asked:

$ unoconv --show
The following list of document formats are currently available:

  bib      - BibTeX [.bib]
  doc      - Microsoft Word 97/2000/XP [.doc]
  doc6     - Microsoft Word 6.0 [.doc]
  doc95    - Microsoft Word 95 [.doc]
  docbook  - DocBook [.xml]
  html     - HTML Document (OpenOffice.org Writer) [.html]
  odt      - Open Document Text [.odt]
  ott      - Open Document Text [.ott]
  ooxml    - Microsoft Office Open XML [.xml]
  pdb      - AportisDoc (Palm) [.pdb]
  pdf      - Portable Document Format [.pdf]
  psw      - Pocket Word [.psw]
  rtf      - Rich Text Format [.rtf]
  latex    - LaTeX 2e [.ltx]
  sdw      - StarWriter 5.0 [.sdw]
  sdw4     - StarWriter 4.0 [.sdw]
  sdw3     - StarWriter 3.0 [.sdw]
  stw      - Open Office.org 1.0 Text Document Template [.stw]
  sxw      - Open Office.org 1.0 Text Document [.sxw]
  text     - Text Encoded [.txt]
  txt      - Plain Text [.txt]
  vor      - StarWriter 5.0 Template [.vor]
  vor4     - StarWriter 4.0 Template [.vor]
  vor3     - StarWriter 3.0 Template [.vor]
  xhtml    - XHTML Document [.html]

The following list of graphics formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Draw) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odd      - OpenDocument Drawing [.odd]
  otg      - OpenDocument Drawing Template [.otg]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  ppm      - Portable Pixelmap [.ppm]
  ras      - Sun Raster Image [.ras]
  std      - OpenOffice.org 1.0 Drawing Template [.std]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxd      - OpenOffice.org 1.0 Drawing [.sxd]
  sxd3     - StarDraw 3.0 [.sxd]
  sxd5     - StarDraw 5.0 [.sxd]
  tiff     - Tagged Image File Format [.tiff]
  vor      - StarDraw 5.0 Template [.vor]
  vor3     - StarDraw 3.0 Template [.vor]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xhtml]
  xpm      - X PixMap [.xpm]

The following list of presentation formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Impress) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odd      - OpenDocument Drawing (Impress) [.odd]
  odg      - OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.odg]
  odp      - OpenDocument Presentation [.odp]
  otp      - OpenDocument Presentation Template [.otp]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  pot      - Microsoft PowerPoint 97/2000/XP Template [.pot]
  ppm      - Portable Pixelmap [.ppm]
  ppt      - Microsoft PowerPoint 97/2000/XP [.ppt]
  pwp      - PlaceWare [.pwp]
  ras      - Sun Raster Image [.ras]
  sda      - StarDraw 5.0 (OpenOffice.org Impress) [.sda]
  sdd      - StarImpress 5.0 [.sdd]
  sdd3     - StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
  sdd4     - StarImpress 4.0 [.sdd]
  sti      - OpenOffice.org 1.0 Presentation Template [.sti]
  stp      - OpenDocument Presentation Template [.stp]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxi      - OpenOffice.org 1.0 Presentation [.sxi]
  tiff     - Tagged Image File Format [.tiff]
  vor      - StarImpress 5.0 Template [.vor]
  vor3     - StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
  vor4     - StarImpress 4.0 Template [.vor]
  vor5     - StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xml]
  xpm      - X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

  csv      - Text CSV [.csv]
  dbf      - dBase [.dbf]
  dif      - Data Interchange Format [.dif]
  html     - HTML Document (OpenOffice.org Calc) [.html]
  ods      - Open Document Spreadsheet [.ods]
  ooxml    - Microsoft Excel 2003 XML [.xml]
  pdf      - Portable Document Format [.pdf]
  pts      - OpenDocument Spreadsheet Template [.pts]
  pxl      - Pocket Excel [.pxl]
  sdc      - StarCalc 5.0 [.sdc]
  sdc4     - StarCalc 4.0 [.sdc]
  sdc3     - StarCalc 3.0 [.sdc]
  slk      - SYLK [.slk]
  stc      - OpenOffice.org 1.0 Spreadsheet Template [.stc]
  sxc      - OpenOffice.org 1.0 Spreadsheet [.sxc]
  vor3     - StarCalc 3.0 Template [.vor]
  vor4     - StarCalc 4.0 Template [.vor]
  vor      - StarCalc 5.0 Template [.vor]
  xhtml    - XHTML [.xhtml]
  xls      - Microsoft Excel 97/2000/XP [.xls]
  xls5     - Microsoft Excel 5.0 [.xls]
  xls95    - Microsoft Excel 95 [.xls]
  xlt      - Microsoft Excel 97/2000/XP Template [.xlt]
  xlt5     - Microsoft Excel 5.0 Template [.xlt]
  xlt95    - Microsoft Excel 95 Template [.xlt]
So, what's missing?  The newer Microsoft 'x' formats: docx, xlsx, etc. (Microsoft Office XML) are not listed, but conversion is still possible!  Let you mantra be  "unoconv is a command line utility that can convert any file format that OpenOffice can import, to any file format that OpenOffice is capable of exporting."

Using unoconv

To use unoconv, you first have to start the listener:

$ unoconv -l &  #the '&' backgrounds the process and returns control of the 
                 terminal winodow to your
[1] 9998        #9998 is the process number of the listener.

We can see that the listener is a python program and the killall command to cancel the listener would have to be directed at python.  To avoid killing other processes, 'kill 9998' should be used rather than 'killall python':

$ ps 9998
  PID TTY      STAT   TIME COMMAND
 9998 pts/0    Sl     0:00 /usr/bin/python /usr/bin/unoconv -l


With the listener running, conversion of documents is straight forward, as we can see from the help:

$ unoconv -h
usage: unoconv [options] file [file2 ..]

Convert from and to any format supported by OpenOffice

unoconv options:
  -c, --connection=string  use a custom connection string
  -d, --doctype=type       specify document type
                             (document, graphics, presentation, spreadsheet)
  -e, --export=name=value  set export filter options
                             eg. -e PageRange=1-2
  -f, --format=format      specify the output format
  -i, --import=string      set import filter option string
                             eg. -i utf8
  -l, --listener           start a listener to use by unoconv clients
  -o, --outputpath=name    output directory
      --pipe=name          alternative method of connection using a pipe
  -p, --port=port          specify the port (default: 2002)
                             to be used by client or listener
  -s, --server=server      specify the server address (default: localhost)
                             to be used by client or listener
  -t, --template=file      import the styles from template (.ott)
  -T, --timeout=secs       timeout after secs if connections to OpenOffice fail
      --show               list the available output formats
      --stdout             write output to stdout
  -v, --verbose            be more and more verbose


 So, in its simplest form, conversion takes the following form:

$ unoconv test.docx

The command will finish silently if successful.  It creates a .pdf by default in the same directory as the document.  Add the -f [fmt] option to convert to a different format, for example:

$ unoconv -f txt test.docx

When your conversion work is done, close the listener with:

$ kill 9998

Now you see why unoconv is number one!