Monday, June 15, 2009

Creating a multi-page PDF from images

It is often convenient to pour a series of JPEG (or PNG, or GIF) files into a PDF, for example for printing or for e-mailing. Given the power of the Linux command line, this is surprisingly difficult, but I found a fairly straightforward way to do it. Skip to the bottom if you just want the oneliner.

Many websites will tell you the following:
convert *.jpg output.pdf
Easy, no? Don't do this. Why? Look at this:

-rw-r--r--  1 thomas thomas 129826204 2009-06-15 15:29 output.pdf
-rw-r--r--  1 thomas thomas    947022 2009-06-15 15:04 page1.jpg
-rw-r--r--  1 thomas thomas    962956 2009-06-15 15:05 page2.jpg
-rw-r--r--  1 thomas thomas    925291 2009-06-15 12:54 page3.jpg
-rw-r--r--  1 thomas thomas    952717 2009-06-15 12:54 page4.jpg
-rw-r--r--  1 thomas thomas    642471 2009-06-15 15:08 page5.jpg

The original JPG files are less than 5 MB altogether, but the resulting PDF is a whopping 124 MB! Clearly, convert (from the otherwise excellent ImageMagick bundle) re-encodes the images somehow, instead of embedding them straight into the PDF file.
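
To check how bad the blow-up is on your own files, you can compare the byte counts directly. This is a minimal sketch: the function names total_bytes and blowup_factor are made up for this illustration, and it relies only on cat and wc.

```shell
# Print the combined size in bytes of all files given as arguments.
total_bytes() {
    # Concatenate and count; $(( ... )) strips any padding wc adds.
    echo $(( $(cat -- "$@" | wc -c) ))
}

# Print how many times larger the first file is than the rest combined
# (integer division, rounded down). Runs in a subshell, so the helper
# variable does not leak into the calling shell.
blowup_factor() (
    out=$1
    shift
    echo $(( $(total_bytes "$out") / $(total_bytes "$@") ))
)
```

For the listing above, blowup_factor output.pdf page*.jpg would print 29: the five JPEGs add up to 4430457 bytes, and 129826204 / 4430457 rounds down to 29.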

Enter the little-known utility sam2p. It comes in an Ubuntu package of the same name. In its simplest form, it converts a single image file into a PDF by embedding the image file into the PDF file. For example:
sam2p page1.jpg page1.pdf
One of the shortcomings of sam2p is that it does not allow you to set the page size directly, so you'll end up with PDFs that exactly fit the original images.

Now we can generate all the pages as separate PDFs, but sam2p cannot create a PDF with multiple pages. Enter pdfjoin from the pdfjam package (available in Ubuntu under that name). It is simple to use:
pdfjoin page*.pdf --outfile output.pdf
This will use a consistent page size, so it is no problem that sam2p spit out pages of arbitrary size. It defaults to A4 paper; specify --paper letterpaper to use the Letter format.

Because I'm lazy, I wrote a little bash oneliner to do the trick, then let my readers improve upon it (thanks Mark, thanks Eamon!). It is now a twoliner, but who cares:
find . -maxdepth 1 -iname 'page*.jpg' -exec sam2p '{}' '{}'.pdf \;
pdfjoin page*.pdf --outfile output.pdf
This assumes that your input images are named page1.jpg, page2.jpg, and so on, and that the current directory contains no other files matching page*.pdf. If you have more than 9 pages, zero-pad the numbers (page01.jpg, page02.jpg, ...) so that the shell sorts them in the right order. For PNG or other image formats, change the extension in the find pattern.
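
If the pages are already numbered without leading zeros, the renaming can be scripted. The following is a sketch under a few assumptions: the function name pad_page_names is invented for this example, files are named page<number>.jpg as above, and anything that does not fit that pattern is left alone.

```shell
# Rename pageN.jpg in directory $1 (default: .) so that N is zero-padded
# to $2 digits (default: 2). The function body runs in a subshell, so its
# variables stay local.
pad_page_names() (
    dir=${1:-.}
    width=${2:-2}
    for f in "$dir"/page*.jpg; do
        [ -e "$f" ] || continue            # the glob matched nothing
        num=${f##*/page}                   # drop the directory and "page"
        num=${num%.jpg}                    # drop the extension
        case $num in
            ''|*[!0-9]*) continue ;;       # skip names like pageA.jpg
        esac
        # Strip leading zeros so printf does not read "08" as octal.
        while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
            num=${num#0}
        done
        new=$dir/$(printf "page%0${width}d.jpg" "$num")
        # Rename only if the name changes and the target is still free.
        [ "$f" = "$new" ] || [ -e "$new" ] || mv -- "$f" "$new"
    done
)
```

Running pad_page_names in a directory with page1.jpg and page10.jpg renames the first to page01.jpg and leaves the second untouched. For 100 pages or more, pass a width of 3: pad_page_names . 3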

10 comments:

Mark IJbema said...

A bit cleaner:

ls *.jpg | sed -r 's/(.*)\.(\w{3,4})/\1.\2 \1.pdf/' | xargs -n2 sam2p 2>&1 | grep OutputFile | perl -pe 's/.*: //' | xargs pdfjoin --outfile out.pdf

This way you only have to change the filenames in one place, and it supports multiple extensions.

Thomas ten Cate said...

Nice! I especially like the "xargs -n2" trick :)

Georg Muntingh said...

Nice post, I tested it to generate a nice pdf from the Order of the Stick images I wgot some time ago (when it was still possible).

BTW, why not put Mark's suggestion in an "update"?

Eamon Nerbonne said...

sed's nasty + unreadable; and you may have issues with odd characters (say, spaces...) in filenames...

IME often a more robust approach is to use "find -print0" and "xargs -i -n1 -0" for any complex cases.
Or, in this simpler instance, just "find -exec" which would simply be:

find . -maxdepth 1 -iname '*.jpg' -exec sam2p '{}' '{}'.pdf \; ; pdfjoin *.pdf

Eamon Nerbonne said...

This comment has been removed by the author.

Eamon Nerbonne said...

Whoops, typo...
I'm NOT behind a unix here, so there's probably some error in the previous code, but I'm sure you get the idea - xargs/sed/find without the special nullchar break easily in the presence of filenames with spaces or other garbled characters.
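
The null-delimited approach described here could look roughly like the following. It is a sketch, not something from the original post or comments: the function name images_to_pdf is invented, it assumes find, sort and xargs versions that understand -print0, -z, -0 and -r (the GNU ones do), and it calls sam2p and pdfjoin exactly as in the commands above.

```shell
# Convert every JPEG in directory $1 to a one-page PDF with sam2p, then
# join the pages into $2 with pdfjoin. File names travel NUL-delimited,
# so spaces and other odd characters survive. Runs in a subshell.
images_to_pdf() (
    dir=$1
    out=$2
    # Step 1: one PDF per image. -print0 and -0 delimit names with NUL
    # bytes; -n1 runs the inner shell once per file, which sees it as "$1".
    find "$dir" -maxdepth 1 -iname '*.jpg' -print0 |
        xargs -0 -r -n1 sh -c 'sam2p "$1" "$1.pdf"' sh || exit 1
    # Step 2: join the pages in name order. The inner shell receives the
    # output name first, then every page.
    find "$dir" -maxdepth 1 -iname '*.jpg.pdf' -print0 | sort -z |
        xargs -0 -r sh -c 'out=$1; shift; pdfjoin "$@" --outfile "$out"' sh "$out"
)
```

A call such as images_to_pdf . output.pdf leaves intermediate files named like page1.jpg.pdf next to the images. Because of -r, nothing is run when no images are found.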

Thomas ten Cate said...

... And Eamon wins the prize for the prettiest command so far :)

mfyahya said...

How about this:

ls *.jpg | while read -r f; do sam2p "$f" "${f%jpg}pdf"; done && pdfjoin *.pdf

It's not as robust as find -print0 , but works for most filenames including those with spaces.

Thomas ten Cate said...

Wow, I learn something new every day! That ${f%jpg} construction is new to me, and is extremely useful in these sorts of common tasks. Thanks!
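
For anyone else who has not met this construction: ${f%pattern} removes the shortest match of pattern from the end of $f, and ${f##pattern} removes the longest match from the start. A small illustration follows; the helper name pdf_name is hypothetical and only serves to show the expansion at work.

```shell
# Print the name of the PDF that belongs to an image file:
# strip the last extension, whatever it is, and append ".pdf".
pdf_name() {
    printf '%s\n' "${1%.*}.pdf"
}

f='scans/my page1.jpg'
echo "${f%jpg}pdf"      # strip the suffix "jpg", append "pdf"
echo "${f##*/}"         # strip everything up to the last slash
pdf_name "$f"           # works for .png, .gif, ... as well
```

Keeping the expansion inside double quotes matters: without them, a file name containing spaces is split into several words.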

Job said...

If it's just a package of JPGs, why not simply zip it as a .cbz file?
Comic Book Archive file (Wikipedia)

Worst case scenario: one has to unzip the file