Monday, June 15, 2009

Creating a multi-page PDF from images

It is often convenient to pour a series of JPEG (or PNG, or GIF) files into a PDF, for example for printing or for e-mailing. Given the power of the Linux command line, this is surprisingly difficult, but I found a fairly straightforward way to do it. Skip to the bottom if you just want the oneliner.

Many websites will tell you the following:
convert *.jpg output.pdf
Easy, no? Don't do this. Why? Look at this:

-rw-r--r--  1 thomas thomas 129826204 2009-06-15 15:29 output.pdf
-rw-r--r--  1 thomas thomas    947022 2009-06-15 15:04 page1.jpg
-rw-r--r--  1 thomas thomas    962956 2009-06-15 15:05 page2.jpg
-rw-r--r--  1 thomas thomas    925291 2009-06-15 12:54 page3.jpg
-rw-r--r--  1 thomas thomas    952717 2009-06-15 12:54 page4.jpg
-rw-r--r--  1 thomas thomas    642471 2009-06-15 15:08 page5.jpg

The original JPG files are less than 5 MB altogether (about 4.4 MB), but the resulting PDF is a whopping 124 MB, almost 30 times larger! Clearly, convert (from the otherwise excellent ImageMagick bundle) re-encodes the images somehow, instead of embedding them straight into the PDF file.
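As a quick check on those numbers, the byte counts from the listing above can be added up in the shell. This is only arithmetic on the sizes shown; the variable names are mine.

```shell
# Sizes in bytes, copied from the directory listing above.
jpeg_total=$((947022 + 962956 + 925291 + 952717 + 642471))
pdf_size=129826204

echo "JPEG total: $jpeg_total bytes"              # 4430457
echo "PDF size:   $pdf_size bytes"
echo "Growth:     $((pdf_size / jpeg_total))x"    # integer division: 29
```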

Enter the little-known utility sam2p. It comes in an Ubuntu package of the same name. In its simplest form, it converts a single image file into a PDF by embedding the image file into the PDF file. For example:
sam2p page1.jpg page1.pdf
One of the shortcomings of sam2p is that it does not allow you to set the page size directly, so you'll end up with PDFs that exactly fit the original images.
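To run this on every page, a plain shell loop is enough. The sketch below assumes the images are named page*.jpg (or another extension passed as the first argument) and sit in the current directory. convert_pages is a name I made up for illustration, not something sam2p provides.

```shell
# convert_pages EXT: run sam2p once per page*.EXT in the current directory.
# Hypothetical helper; it only uses the "sam2p input output" form shown above.
convert_pages() {
    local ext="${1:-jpg}" f
    for f in page*."$ext"; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        sam2p "$f" "${f%.*}.pdf" || return 1     # page1.jpg -> page1.pdf
    done
}
```

Call it as convert_pages jpg (or convert_pages png). Each PDF ends up next to its source image, with the image extension replaced by .pdf.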

Now we can generate all the pages as separate PDFs, but sam2p cannot create a PDF with multiple pages. Enter pdfjoin from the pdfjam package (available in Ubuntu under that name). It is simple to use:
pdfjoin page*.pdf --outfile output.pdf
This will use a consistent page size, so it is no problem that sam2p spit out pages of arbitrary size. It defaults to A4 paper; specify --paper letterpaper to use the Letter format.
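For reference, here is the join step with the paper size made optional. It is a minimal sketch that uses only the two pdfjoin options mentioned above; join_pages is a hypothetical wrapper name, and it assumes the per-page PDFs are named page*.pdf.

```shell
# join_pages [OUTFILE] [PAPER]: join page*.pdf into OUTFILE.
# PAPER is passed to --paper only when given, e.g. "letterpaper";
# otherwise pdfjoin's own default (A4, as noted above) applies.
join_pages() {
    local out="${1:-output.pdf}" paper="${2:-}"
    if [ -n "$paper" ]; then
        pdfjoin page*.pdf --paper "$paper" --outfile "$out"
    else
        pdfjoin page*.pdf --outfile "$out"
    fi
}
```

So join_pages output.pdf letterpaper produces a Letter-sized document, and join_pages on its own sticks to the default.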

Because I'm lazy, I wrote a little bash oneliner to do the trick, then let my readers improve upon it (thanks Mark, thanks Eamon!). It is now a twoliner, but who cares:
find . -maxdepth 1 -iname 'page*.jpg' -exec sam2p '{}' '{}'.pdf \;
pdfjoin page*.pdf --outfile output.pdf
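This assumes that your input images are named page1.jpg, page2.jpg, etcetera, and that there are no files matching page*.pdf in the current directory before you start. If you have more than 9 pages, zero-pad the numbers (page01.jpg, page02.jpg, and so on), because the shell expands page*.pdf in alphabetical order and would otherwise put page10 before page2. If you want to do this for PNG or other images, change the extension in the find pattern; the intermediate files still end in .pdf (page1.png.pdf, for instance), so the pdfjoin line can stay as it is.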
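If the pages are already numbered without leading zeros, renaming them by hand gets tedious. The helper below is a small sketch for that. pad_pages is a hypothetical name, it only handles single-digit page numbers (so up to 99 pages in total), and unlike the -iname pattern above it is case-sensitive.

```shell
# pad_pages EXT: rename pageN.EXT to page0N.EXT for single-digit N
# in the current directory. Files such as page10.EXT are left alone.
pad_pages() {
    local ext="${1:-jpg}" f
    for f in page[0-9]."$ext"; do
        [ -e "$f" ] || continue            # the glob matched nothing
        mv -- "$f" "page0${f#page}"        # page7.jpg -> page07.jpg
    done
}
```

Run pad_pages jpg before the conversion step, so that the intermediate PDFs inherit the padded names and sort correctly.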