Embedding PDF into PowerPoint using OpenXml

So where’s the fun in working mainly in Azure if you don’t have to go back to really old-school coding with OLE now and then?

Doing it old school. Sometimes a pleasure 😉

Having worked with these old-fashioned OLE libraries makes me (again) really appreciate working in a brand-new Azure environment, where pretty much everything is more or less clear.

I currently write software for Books. Books is automation software for Office products: it takes your Excel workbooks, PDF documents and Word documents and creates PDF files or PowerPoint presentations out of them. In the end, it can insert images from those documents, link the documents, or even embed the documents in slides.

Creating PowerPoint presentations with PowerPoint Interop is problematic because it is slow and does not scale. Books is a multithreaded application. PowerPoint is an OLE server: it can queue incoming requests, but when it is used too heavily it simply rejects calls. That slows down the whole processing, because the caller has to retry the rejected calls until PowerPoint is able to react again.
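To illustrate what "retry until PowerPoint reacts again" means in practice, here is a minimal sketch of such a retry wrapper. It is not the code Books uses. The class name ComRetry, the attempt count and the delay are assumptions for illustration; the two HRESULT values are the standard codes a busy OLE server answers with.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

static class ComRetry
{
    // Standard HRESULTs returned while an OLE server is busy.
    const int RPC_E_CALL_REJECTED = unchecked((int)0x80010001);
    const int RPC_E_SERVERCALL_RETRYLATER = unchecked((int)0x8001010A);

    // Runs the call and repeats it while the server rejects it as busy.
    // maxAttempts and delayMs are illustrative defaults, not tuned values.
    public static T Invoke<T>(Func<T> call, int maxAttempts = 10, int delayMs = 200)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return call();
            }
            catch (COMException ex) when (
                (ex.ErrorCode == RPC_E_CALL_REJECTED ||
                 ex.ErrorCode == RPC_E_SERVERCALL_RETRYLATER) &&
                attempt < maxAttempts)
            {
                // Back off a little longer after every rejected call.
                Thread.Sleep(delayMs * attempt);
            }
        }
    }
}
```

A call such as ComRetry.Invoke(() => presentation.Slides.Count), with presentation being a hypothetical Interop object, would then be repeated instead of failing immediately. Every retry is time the worker thread spends waiting, which is exactly the slowdown described above.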

So I decided to implement the creation in OpenXml. This has some big advantages:

  • completely file based
  • multithreading by design
  • lightning fast

But of course it has some disadvantages:

  • pretty complex
  • not transparent at all. Yes, every single XML fragment is documented in the specification. But finding out the cross-references between the necessary parts is an adventure, even for something like date-type cell values. The 4th edition of the specification contains over 5000 pages.
  • For comparison, or to better understand what the output of OpenXml looks like, there is an SDK available. As it auto-generates code … let’s say … it is better than nothing.
  • OpenXml does not render at all. That means the hard part is up to you.

What’s exactly the hard part?

When you embed a document in a PowerPoint slide, PowerPoint opens the application in question and asks it for a “screenshot” and for the document to be embedded. When working with OpenXml, we have to do this ourselves. At the current stage of the product I work on, embedding Excel and Word is not a problem at all: these applications are opened with the corresponding documents anyway, so the “screenshot” and the documents have already been taken and are ready to be embedded.

PDF is a little different. There is no interop available. Sure, we could use pdfium to create the picture and avoid messing around with Adobe directly. But one single thing makes it really problematic: all other types of documents can be embedded “as they are”, with little to no difference between the original document and what is placed inside the slide. For PDF, the original document and the document to be embedded differ hugely. Have a look at the WinMerge diff:

Pretty different pdf files

So there is actually no way around using the OLE server to create both the picture and the document.

There are various blog posts about this topic, and one of them gives a particularly good summary. Additionally, I asked a question on StackOverflow. Typically this kind of question doesn’t get much traffic, because these kinds of problems don’t come up too often for most programmers.

All these findings did a good job of giving me an idea of what has to be accomplished, but the ugly truth is: this doesn’t work with current Adobe versions. Referring to the post above, I experienced exactly the same issue as these guys: on a 64-bit OS, this seems to work only with Adobe version 9. Higher versions fail with error code 0x8000FFFF, which translates to “Catastrophic failure”.

After a lot more searching, I found something interesting, again on StackOverflow. It uses the same procedure, with exactly one difference.

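OLE32.IStorage storage;
int result = OLE32.StgCreateStorageEx(oleOutputFileName,
    // ... access mode flags (not shown in this excerpt)
    OLE32.STGFMT.STGFMT_DOCFILE, // storage format: the one difference
    // ... remaining arguments (not shown in this excerpt)
    ref OLE32.IID_IStorage,
    out storage);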

The actual difference is OLE32.STGFMT.STGFMT_DOCFILE. The other examples use STGFMT_STORAGE. That did the trick and let the code work even with newer versions of Adobe.
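For readers who do not have the OLE32 wrapper class from that sample at hand, the following is a minimal, self-contained sketch of the same call with STGFMT_DOCFILE. It is not the StackOverflow code: the class name, the output file name and the choice of mode flags are assumptions made for illustration, and the storage is received as a raw pointer so that no IStorage interface declaration is needed. The constant values and the parameter order are those of the Win32 StgCreateStorageEx function.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class DocfileSketch
{
    // Values from the Windows SDK headers.
    const uint STGM_READWRITE       = 0x00000002;
    const uint STGM_SHARE_EXCLUSIVE = 0x00000010;
    const uint STGM_CREATE          = 0x00001000;
    // STGFMT_STORAGE would be 0; STGFMT_DOCFILE is what made it work here.
    const uint STGFMT_DOCFILE       = 5;

    // With STGFMT_DOCFILE the requested interface must be IStorage.
    static Guid IID_IStorage = new Guid("0000000B-0000-0000-C000-000000000046");

    [DllImport("ole32.dll", CharSet = CharSet.Unicode)]
    static extern int StgCreateStorageEx(
        string pwcsName, uint grfMode, uint stgfmt, uint grfAttrs,
        IntPtr pStgOptions, IntPtr pSecurityDescriptor,
        ref Guid riid, out IntPtr ppObjectOpen);

    static void Main()
    {
        // Hypothetical output path, for illustration only.
        string path = Path.Combine(Path.GetTempPath(), "embedded-pdf.bin");

        int hr = StgCreateStorageEx(
            path,
            STGM_CREATE | STGM_READWRITE | STGM_SHARE_EXCLUSIVE, // assumed mode flags
            STGFMT_DOCFILE,
            0, IntPtr.Zero, IntPtr.Zero,
            ref IID_IStorage,
            out IntPtr storage);
        Marshal.ThrowExceptionForHR(hr); // throws if the call failed

        try
        {
            Console.WriteLine("Compound file created at " + path);
            // The OLE object would be saved into the storage at this point.
        }
        finally
        {
            Marshal.Release(storage); // give the storage pointer back
        }
    }
}
```

This only covers creating the compound file. Creating the OLE object from the PDF and saving it into the storage is the part the linked samples handle.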

Another hint: none of the samples except the one from the last link closes the handles/files correctly. The last link gives a good hint:

var storagePointer = Marshal.GetIUnknownForObject(storage);
int refCount;
do
{
    // Release until no reference to the COM object is left
    refCount = Marshal.Release(storagePointer);
} while (refCount > 0);

This actually has to be done for all COM objects in question.