Combine multi-page PDFs into one PDF with ImageMagick

(Jessie Young) #1

This isn’t exactly a Rails question, but it is about something that I would like to do within a Rails app, so here it goes:

I am trying to use ImageMagick (6.8.0) to combine several multi-page PDFs into a single PDF. This command:

$ convert multi-page-1.pdf multi-page-2.pdf merged.pdf

Returns merged.pdf, which contains the first page of multi-page-1.pdf and the first page of multi-page-2.pdf.

This command:

$ convert multi-page-1.pdf[2] multi-page-2.pdf[2] merged.pdf

Returns merged.pdf, which contains the third page of multi-page-1.pdf and the third page of multi-page-2.pdf.

I would like merged.pdf to contain all of the pages of each multi-page PDF. I have so far not found a way of telling the convert command to use a range of pages, although I have tried adding [0-1] and [0,1] at the end of the filenames.

Interestingly, this ghostscript command (which I found via StackOverflow but cannot re-find) does work as I would like it to:

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=merged.pdf multi-page-1.pdf multi-page-2.pdf

The problem is that the ImageMagick ‘convert’ command takes URLs as inputs and Ghostscript does not, and I need my program to take URL input rather than file paths.

Is it possible to get the result of the above ghostscript command using ImageMagick?

(pat brisbin) #2

I’ve always used some form of your gs command to convert PDFs.

If all you need is to read from URLs, the following bit of Unix voodoo should work:

gs -q dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=merged.pdf \
  <(curl 'http://some/thing.pdf') \
  <(curl 'http://some/other/thing.pdf') \
  <(curl 'http://and/so/on.pdf')

(Jessie Young) #3

Thanks for helping out, Pat!

Here is what I ran:

gs -q dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=merged.pdf \
<(curl "http://s3.amazonaws.com/arthritis-foundation/documents/documents/000/000/007/original/blank_2.pdf?1371677496") \
<(curl "http://s3.amazonaws.com/arthritis-foundation/documents/documents/000/000/008/original/blank_1.pdf?1371677510")

This is the output:

Error: /undefinedfilename in (dNOPAUSE)
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push
Dictionary stack:
   --dict:1159/1684(ro)(G)--   --dict:0/20(G)--   --dict:77/200(L)--
Current allocation mode is local
Last OS error: No such file or directory
GPL Ghostscript 9.07: Unrecoverable error, exit code 1
[~/Downloads/toolkit_docs]
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7461  100  7461    0     0   8590      0 --:--:-- --:--:-- --:--:-- 16543
100  7628  100  7628    0     0   8360      0 --:--:-- --:--:-- --:--:-- 15225

So not working for me quite yet, but I can feel that we’re getting closer!

(pat brisbin) #4

I made a typo when copying your original. Did you try it with -dNOPAUSE as you had it, but with the <(...)s that I added?

(pat brisbin) #5

Hmm, I tried it here now, and it fails anyway.

I’d guess it’s due to this:

Why won’t Ghostscript read piped PDF files?

Portable Document Format is a file format which contains forward
and backward links. It is not a stream format like PostScript.
You cannot pipe PDF files to the stdin of Ghostscript. Instead
you must either give the PDF filename on the command line, or
use it as the argument of the Ghostscript run command.
Ghostscript 8.00 and later can read piped PDF by copying
to a temporary file.

I could easily write some bash that would download the PDFs to tempfiles and then feed them to gs, but I’m guessing that won’t work since you can’t save files on Heroku, right?
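
For anyone curious what the FAQ means in practice: a PDF reader has to jump around the file, and a pipe (which is what <(curl ...) hands to gs) cannot be seeked. The sketch below only illustrates that difference between a pipe and a regular file on a POSIX system. Whether this is the exact reason gs failed here is a guess, as above, and the helper name `seekable?` is made up for the example.

```ruby
require "tempfile"

# Returns true if the IO supports random access, false if it is a pipe-like
# stream. Seeking on a pipe raises Errno::ESPIPE on POSIX systems.
def seekable?(io)
  io.seek(0, IO::SEEK_END)
  true
rescue Errno::ESPIPE
  false
end

# A pipe, similar to what process substitution passes to a command.
reader, writer = IO.pipe
writer.write("%PDF-1.4")
writer.close
puts "pipe seekable?     #{seekable?(reader)}"
reader.close

# A regular temporary file holding the same bytes.
Tempfile.create("sample") do |file|
  file.write("%PDF-1.4")
  file.flush
  puts "tempfile seekable? #{seekable?(file)}"
end
```

If that is what is going on, writing each download to a real file first sidesteps the problem.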

(Daniel Collis-Puro) #6

Well, you can save files on Heroku, but they are only guaranteed to be around for the duration of the request and may or may not be available on subsequent requests. I believe the determining factor is going to be whether or not the dyno has been spun down.

Either way, you should be able to use locally saved files reliably for short-lived processes. docs here

(pat brisbin) #7

Good point. You could probably just download the PDFs to Tempfiles in ruby then pass their paths to system("gs ...").
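
A rough sketch of that idea, in case it helps. It reuses the gs flags from the first post; the method names `gs_merge_command` and `merge_pdfs_from_urls` are made up for illustration, and the download step is deliberately naive (no redirects, no HTTP error handling), so treat it as a starting point rather than a finished implementation.

```ruby
require "net/http"
require "tempfile"
require "uri"

# Builds the argument list for the gs merge command from the first post.
# Passing an array to system avoids shell quoting problems with odd paths.
def gs_merge_command(input_paths, output_path)
  ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output_path}", *input_paths]
end

# Downloads each URL into a Tempfile, then hands the file paths to gs.
# Returns whatever system returns (true on success).
def merge_pdfs_from_urls(urls, output_path)
  tempfiles = urls.map do |url|
    file = Tempfile.new(["source", ".pdf"])
    file.binmode
    # Assumption: a plain GET is enough; redirects and errors are not handled.
    file.write(Net::HTTP.get(URI(url)))
    file.flush
    file
  end
  system(*gs_merge_command(tempfiles.map(&:path), output_path))
ensure
  # Close and delete the temporary files once gs has finished.
  (tempfiles || []).each(&:close!)
end
```

Usage would look something like `merge_pdfs_from_urls(["http://some/thing.pdf", "http://some/other/thing.pdf"], "merged.pdf")`, with gs installed and on the PATH.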

(Jessie Young) #8

Thanks for weighing in, djcp/pat!

We were hoping to avoid saving to Heroku’s tmp directory, but that is probably what we’re going to have to do. If that is the case, we can just loop through each PDF and still use ImageMagick’s convert command.

I still wonder, though: If ghostscript can merge multi-page PDFs and ImageMagick uses ghostscript, then why can’t we use ImageMagick to convert PDFs the way that ghostscript does? (eg: no index numbers needed to select page numbers).

(Jessie Young) #9

Here is the custom paperclip processor I wrote to solve the problem: https://gist.github.com/jessieay/5832466

It still needs to be refactored a bit, but it is functional. It finds the files I would like to merge via the attachment instance (the object the Paperclip attachment is being created for). Those files are themselves Paperclip attachments. I download each of those documents into the ‘tmp’ directory, then use the ‘pdf-reader’ gem to find each document’s page count, and use that page count to iterate over each page of each PDF in order to add it to the ‘convert’ command input.
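
To give a rough idea of the page-iteration part without reading the whole gist, here is a simplified sketch. It is not the code from the gist: the method names are invented for this post, and the page counts are passed in as a plain hash so the example has no gem dependencies. In the real processor the counts come from pdf-reader (something along the lines of `PDF::Reader.new(path).page_count`).

```ruby
# Expands one PDF into per-page ImageMagick inputs using the [index] syntax
# from the first post: "doc.pdf[0]", "doc.pdf[1]", ...
def page_inputs(path, page_count)
  (0...page_count).map { |index| "#{path}[#{index}]" }
end

# Builds the full convert argument list for a hash of { path => page_count },
# keeping the documents in the order they were given.
def convert_merge_command(page_counts, output_path)
  inputs = page_counts.flat_map { |path, count| page_inputs(path, count) }
  ["convert", *inputs, output_path]
end

# Hypothetical file names and page counts, for illustration only.
command = convert_merge_command(
  { "tmp/multi-page-1.pdf" => 3, "tmp/multi-page-2.pdf" => 2 },
  "tmp/merged.pdf"
)
puts command.join(" ")
```

The resulting array can be passed to `system(*command)` once the files are in place.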

I could go into more detail on the rest, but I think I will save that for a blog post.