It’s time to write documents and of course, me being me, I can’t just open up Word and start typing. I prefer to open up Notepad++ and start writing Markdown, then convert it to my final format whatever that might be. I’ve written up directions on how to generate DOCX documents from Markdown using Pandoc, but the standard conversion from Markdown to DOCX just isn’t working for me. I need a fancier title page and a lot more control over the formatting than I can get using the standard conversion, so I’m going to use an approach that I saw in a Stackoverflow answer: convert the Markdown to LaTeX, add in a LaTeX title page and formatting, then convert both of those into a DOCX via Pandoc. The LaTeX should give me the control over the final deliverable that I want, and hopefully the conversion from LaTeX to DOCX won’t be awful. If it is, well, we’ll just have to go straight to PDF.

The purpose of this post is to capture my notes and initial outputs for generating LaTeX documents before I get all specific and fancy.

Directory Structure

The first thing I’m going to do is to create a directory for this project with the following structure:

.
├── docx
└── res
    ├── tex
    └── txt

The root directory contains subdirectories that will contain the final output. These directories are separated by output format, and currently only the docx directory is present. The res subdirectory contains all of the resources needed to generate the final output: its tex subdirectory will contain any .tex files, and its txt subdirectory will contain Markdown files that will be converted into .tex files. In the future, if necessary, there could also be an img folder or a gv folder. For now, I’m going to generate a straight LaTeX document in the tex folder and then work on turning it into a title page that can be integrated into a .tex file generated by Pandoc from the text.
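If you would rather script the layout than click it together, here is a minimal Python sketch. It assumes Python 3 is installed; the names LAYOUT and make_layout are made up for this illustration and are not part of Pandoc or any other tool. The demo runs in a temporary directory so it never touches a real project.

```python
from pathlib import Path
import tempfile

# Subdirectories from the tree above (relative to the project root).
LAYOUT = ["docx", "res/tex", "res/txt"]


def make_layout(root, layout=LAYOUT):
    """Create every directory in layout under root and return the created paths."""
    created = []
    for rel in layout:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory so nothing is written into a real project.
with tempfile.TemporaryDirectory() as tmp:
    for path in make_layout(tmp):
        print(path.relative_to(tmp).as_posix())
```

Calling make_layout(".") from the project root would create the real thing, and running it twice is harmless.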

Basic LaTeX Document

Here is a basic LaTeX document:

% Stephen Friederichs

% LaTeX Document example

% These are comments BTW (obviously)


\documentclass{report}

% Between documentclass and begin is the preamble section


\begin{document}

\title{A Basic LaTeX Document}

\section{Introduction}

Hello. Allow me to introduce myself: I am a LaTeX document.

\end{document}

I’ve saved this as basicReport.tex in the tex subdirectory. Then I open a command prompt, navigate to the root directory, and type this command to generate the DOCX:

pandoc -f latex -t docx -o docx\basicReport.docx res\tex\basicReport.tex

And… SUCCESS! I have a very simple document. Let’s regroup for a second and discuss each of the commands.

documentclass

This command defines what type of document we’re creating. Apparently most people go with article but for engineering documents with lots of content I’m going to try report. The full list of available document classes can be found here.

begin{document}

This command starts the content of the document. What happens between documentclass and begin is called the preamble and it contains document-wide settings. Currently we don’t have any such settings, but we will soon!

title

Yeah… this is the title of the document. In the DOCX it gets put in big bold letters at the start of the document.

section

section denotes a… section (man this is redundant!). The title of the section goes in the curly braces and the text goes under.

end{document}

The document is now over, done, finished.

Combining Markdown and LaTeX

The above is a nice document and all, and it demonstrates a basic ‘Hello World’ functionality for the LaTeX->DOCX workflow, but I need some specific things for this to be useful to me. The first thing I need to do is to figure out how to combine LaTeX files and Markdown files into one DOCX file. To do this, I’m going to generate a basic Markdown document and place it in the txt subdirectory, turn it into a LaTeX file, then pass those both to Pandoc to generate a DOCX.

Here’s the markdown file:

# Section the second #


This should be my second section. It *really* ought to come after the first section, all things considered. Reasons for this:

* It's section 2 and 2 comes after 1
* I'm going to pass the .tex file generated by Pandoc back to it as the second argument in the command line
* Why do I need three bullets? 

The first step is to convert this to a .tex file with Pandoc:

pandoc -f markdown -t latex -o res\tex\mdSection2.tex res\txt\section2.md

And this is the LaTeX fragment that is produced:

\hypertarget{section-the-second}{
\section{Section the second}\label{section-the-second}}

This should be my second section. It \emph{really} ought to come after
the first section, all things considered. Reasons for this:

\begin{itemize}
\tightlist
\item
  It's section 2 and 2 comes after 1
\item
  I'm going to pass the .tex file generated by Pandoc back to it as the
  second argument in the command line
\item
  Why do I need three bullets?
\end{itemize}

Looks legit, so let’s use this command line to combine the two files:

pandoc -f latex -t docx -o docx\combinedFile.docx res\tex\basicReport.tex res\tex\mdSection2.tex

This produces a combined DOCX file. We’re doing great so far! (When are we going to utterly fail?)
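The two-step conversion above can also be written down as a small Python sketch, which is handy once there are more than a couple of files. The helper name pandoc_cmd is hypothetical, and the forward-slash paths are an assumption (they should work on Windows too, but backslashes are what I typed at the prompt). The sketch only builds and prints the command lines; it does not run Pandoc.

```python
def pandoc_cmd(inputs, output, src, dst):
    """Build the argument list for: pandoc -f SRC -t DST -o OUTPUT INPUT..."""
    return ["pandoc", "-f", src, "-t", dst, "-o", output, *inputs]


# Step 1: Markdown -> LaTeX fragment.
to_tex = pandoc_cmd(["res/txt/section2.md"], "res/tex/mdSection2.tex",
                    "markdown", "latex")

# Step 2: standalone .tex plus the fragment -> one DOCX.
# Pandoc concatenates multiple inputs in the order given before parsing them.
to_docx = pandoc_cmd(["res/tex/basicReport.tex", "res/tex/mdSection2.tex"],
                     "docx/combinedFile.docx", "latex", "docx")

for cmd in (to_tex, to_docx):
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Keeping each command as a list rather than one long string means file names with spaces need no extra quoting when they are handed to subprocess.run.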

It’s worth noting at this point that the basicReport.tex file contains all the markup necessary for a standalone document, but is still capable of being combined with another .tex fragment file into a useable DOCX file by Pandoc. I do not know whether Pandoc reads the document-wide settings contained in the basicReport.tex file and applies them to the whole document or discards them and uses its own. Let’s find out!

One of the most basic settings that the documentclass command can set is a document-wide font size. 12pt is pretty standard, so to make sure the settings are reaching the final document we need to put something absurd in there and look for it in the result.

You can modify the basicReport.tex file to change the font size. It looks like this:

% Stephen Friederichs

% Customized LaTeX Document example


% Standard font size is set to 24pt to make it obvious if these settings are

% being applied to the combined final document


\documentclass[24pt,titlepage,letterpaper]{report}

% Between documentclass and begin is the preamble section


\begin{document}

\title{A Customized LaTeX Document}

\section{Introduction}

Hello. Allow me to introduce myself: I am a customized LaTeX document

\end{document}

After I save this as customReport.tex in the proper subfolder, I run this command to combine them:

pandoc -f latex -t docx -o docx\customFile.docx res\tex\customReport.tex res\tex\mdSection2.tex

Aaaaand… (this is the point I start to fail, though not necessarily utterly) It doesn’t work: the font size is 12pt in the DOCX. The whole point of this exercise is to figure out how to customize the DOCX output via LaTeX, so this is a problem that needs to be fixed NOW!

Customizing LaTeX Output

After a bit of research I’ve determined that what I want is to generate a custom LaTeX template that has all of the options that I want and then pass that to Pandoc to format my document properly.

I found a Stackoverflow answer that supposedly tells me how to do that. I’m going to document my attempt to follow its advice.

The first thing that I need to do is to generate the Pandoc LaTeX template so I can see what I’m dealing with. I’ve added a new template directory in the res directory to hold the template. You can generate the template like this:

pandoc -D latex > res\template\template.tex

I won’t paste the template here because it’s large, but looking at it tells me that there are lots of variables I can mess with to customize the output without necessarily changing the template. Let’s start with one setting and then work our way to the others.

The first thing that I want to change is the document class. The default document class for Pandoc LaTeX output is apparently the article class where I want report. Let’s figure out how to change that.

I’ve found a question on a Stackoverflow site that answered itself about how you set the various options available in the Pandoc LaTeX template. Basically, you add YAML front matter to the markdown document to specify the settings. I’ve generated a new Markdown file called customLaTeX.md and given it the following content:

---
documentclass: report
fontsize: 24pt
---


# Section the second #


This should be my second section. It *really* ought to come after the first section, all things considered. Reasons for this:

* It's section 2 and 2 comes after 1
* I'm going to pass the .tex file generated by Pandoc back to it as the second argument in the command line
* Why do I need three bullets? 

I know I only wanted to change the documentclass but the fontsize will be an obvious change in the document that I can see immediately in the DOCX output.

Below are the commands I use to generate the DOCX from the input markdown file.

The first command generates a standalone LaTeX file from the Markdown. This ensures that the front matter options are present in the LaTeX output:

pandoc -f markdown -t latex --standalone -o res\tex\mdSection2.tex res\txt\customLaTeX.md

This command line takes the standalone LaTeX and generates the DOCX.

pandoc -f latex -t docx -o docx\customReport.docx res\tex\mdSection2.tex

And the result of this is… failure (NOW I’m utterly failing)

Let’s look at some reasoning. I’ve found a Stackoverflow question asking why a LaTeX template can’t be used with DOCX output. Turns out, it’s not possible. DOCX support in Pandoc is basic, and apparently the only way to style the resulting DOCX is via the --reference-doc command-line option (called --reference-docx before Pandoc 2.0).
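For reference, here is a hedged Python sketch of what the reference-document route looks like as command lines. It assumes Pandoc 2.0 or later, where the option is spelled --reference-doc (older releases called it --reference-docx). The file names custom-reference.docx and styledReport.docx are hypothetical placeholders. The sketch only prints the commands; the restyling itself still happens by hand in Word between the two steps.

```python
REFERENCE = "res/template/custom-reference.docx"  # hypothetical location

# Step 1: export Pandoc's built-in reference document so its styles can be edited in Word.
export_reference = [
    "pandoc", "-o", REFERENCE,
    "--print-default-data-file=reference.docx",
]

# Step 2: convert Markdown to DOCX, borrowing the styles of the edited reference file.
convert = [
    "pandoc", "-f", "markdown", "-t", "docx",
    f"--reference-doc={REFERENCE}",
    "-o", "docx/styledReport.docx",  # hypothetical output name
    "res/txt/section2.md",
]

for cmd in (export_reference, convert):
    print(" ".join(cmd))
```

Only the styles of the reference file are used, not its content, so this controls fonts and heading looks but does not by itself produce a title page.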

That’s not too bad or difficult (and I’ve used the reference docx option before to okay effect) but I still need to incorporate a fancy title page with Markdown content. This should be as basic as centering and font size. Surely that works, yes?

Let’s try a simple LaTeX file:

\begin{center}
This is my centered text
\end{center}

{\huge This is big text}

And then we’ll convert this to DOCX via Pandoc:

pandoc -f latex -t docx -o docx\titlePage.docx res\tex\titlePage.tex

Aaaaaaand….

Sigh. No luck: no formatting in the DOCX. At this point the best I can hope for is PDF output that Word can read and process properly (i.e., keeping formatting).

So, let’s figure out how to get PDF output!

Markdown/LaTeX to PDF Conversion via Pandoc

As a Windows user, the first thing I need to do is install a LaTeX distribution. I’ve chosen MiKTeX for no reason other than I’ve used it before. There are instructions on how to download and install on Windows and it’s a pretty basic process so I won’t include those here. (If I run into any difficulties I’ll document them)

First thing is to get ANY PDF output, so I’m just using one of the files I generated previously as the guinea pig for conversion. Here’s the command line:

pandoc -f markdown -t latex -o pdf\test.pdf res\txt\customLatex.md

Aaaaand (man this is a recurring theme here…) it takes FOREVER. Seriously, like a minute so far. Nuts. Will it ever end? With any luck this is a first-time thing and not going to happen all the time.

So I Ctrl+C’d that and tried again - turns out there were popups for updating packages that I wasn’t seeing. The packages that were installed are:

  • upquote
  • microtype
  • url

And it keeps asking to install that. Repeatedly. Something’s wrong.

The log file shows this:

2018-01-10 12:42:44,881-0700 FATAL pdflatex - Windows API error 87: The parameter is incorrect.

What the heck is this?

A Stackoverflow answer says this happens when multiple MiKTeX processes are running. Sure enough, there are. Killing off those processes fixes the issue and I get to install more packages.

And it keeps happening. Sometimes you just have to click Install again, sometimes you have to restart the entire task after it fails. But eventually, all the packages get installed and the PDF file gets created. Finally.

But does Word properly convert the PDF to something that I can edit?

No. The answer is no. And Pandoc can’t read PDFs and convert them to DOCX.

Sigh. What a waste of a morning. Can I live with PDF output? Or can I use ODT and then convert to DOCX much easier?

What’s worse than this: pandoc ALWAYS munges LaTeX when converting it to other formats. My font sizes and centering tags are GONE from the output no matter what it is. If you do a LaTeX to LaTeX conversion, Pandoc ‘helpfully’ completely obliterates your formatting.

So, I’m stuck with PDF output or nothing. Worse, if I use Pandoc to produce that output then I lose all of my precious formatting. Of course, I can just:

  1. Generate a LaTeX template which Pandoc can ingest to roughly generate the document I want
  2. Create a hard-coded LaTeX title page which will be hard-included by the template (so that Pandoc keeps its dirty mitts off of it)
  3. Convert the Markdown to PDF through Pandoc using the LaTeX template, passing a PDF file name as the output so that Pandoc invokes pdflatex itself. The resulting PDF should have all of the styles that I want plus the custom title page.

The sad part is that this still leaves me with a PDF when I really want a DOCX.

Anyway, I figured out how to do the steps above. Here’s a rundown:

  1. Create res\tex\titlePage.tex and put something in it that you will be able to find. My content looked like this:
\begin{center}
TITLEPAGE@
\end{center}

{\huge This is big text}
  2. Edit the template.tex file that you generated and stored in res\template\template.tex: find the document begin command and add the input directive right after it (the input line is the added one):
\begin{document}
\input{res/tex/titlePage}
$if(title)$
  3. Run this command:

    pandoc -o pdf\test.pdf --template=res\template\template.tex res\txt\section2.md

This produces a PDF including the title page formatting from titlePage.tex intact with the content from the Markdown file in the PDF. This is what I wanted (other than DOCX output… and a pony…).
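Editing the template by hand is fine once, but the edit has to be redone whenever the template is regenerated. Here is a minimal Python sketch of the same insertion. It assumes the template has \begin{document} on a line of its own (the stand-in SAMPLE below does, and it is only a tiny made-up excerpt, not the real template). The function name add_title_page is hypothetical.

```python
INPUT_LINE = r"\input{res/tex/titlePage}"
MARKER = r"\begin{document}"


def add_title_page(template_text, input_line=INPUT_LINE):
    """Return the template with input_line placed right after the begin-document line."""
    if input_line in template_text:
        return template_text  # already patched, so leave it alone
    out = []
    inserted = False
    for line in template_text.splitlines():
        out.append(line)
        if not inserted and line.strip() == MARKER:
            out.append(input_line)
            inserted = True
    if not inserted:
        raise ValueError("no begin-document line found in template")
    return "\n".join(out) + "\n"


# Tiny stand-in for the real template, just enough to show the effect.
SAMPLE = "\n".join([
    r"\documentclass{article}",
    r"\begin{document}",
    "$if(title)$",
    r"\maketitle",
    "$endif$",
    "$body$",
    r"\end{document}",
]) + "\n"

patched = add_title_page(SAMPLE)
print(patched)
```

To use it for real, read res/template/template.tex, pass the text through add_title_page, and write the result back. Because of the early return, running it twice does not add a second \input line.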

LaTeX Template Variables and Modifications

Okay, so the whole goal of this is to have professional-looking documents with a minimum of effort, right? Seems to me I have to start customizing the hell out of these templates and this process. I’m going to work one setting at a time until I have something I wouldn’t mind submitting for credit in front of an auditor. I’ve done it before, I’ll do it again. Except this time I’ll document it. I’m going to start with the front matter YAML settings that I can change that work with the existing template.

Here’s a list of settings that I’ve changed:

Setting          Description   Desired Value   Other settings
--------------   -----------   -------------   --------------
fontfamily       Picks the
headerincludes

I’ve generated a ‘raw’ list of variables that can be configured in the YAML front matter by exporting the Pandoc LaTeX template and grep’ing out all of the variables. Here’s the list:

YAML LaTeX Variable   Type     Description                                 Desired Value
-------------------   ------   -----------------------------------------   -------------
abstract              Text     Document abstract
author                Text     Author name
author-meta           Text     Hyperlink author meta-information
biblatex              Yes/No   Include biblatex and produce bibliography
biblio-files          List     List of files included for bibliography
biblio-style          Text     Content of \bibliographystyle command
biblio-title          Text     Bibliography title
body                  Text     Document body
book-class            Yes/No   Document is a book-class
citecolor             Text
classoption           List     List of options for the documentclass
date                  Date     Content of the \date command
documentclass
euro
fontfamily
fontsize
geometry
graphics
header-includes
highlighting-macros
include-after
include-before
lang
lhs
linestretch
linkcolor
links-as-notes
listings
lof
lot
mainfont
mainlang
mathfont
monofont
natbib
numbersections
papersize
sansfont
sep
strikeout
subtitle
tables
title
title-meta
toc
toc-depth
urlcolor
verbatim-in-note
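The grep step that produced the list above could look something like this Python sketch. It only understands the plain $name$, $if(name)$ and $for(name)$ forms, so it is a rough approximation and not a real template parser (inline math such as $x$ in a template would be picked up by mistake). The names TOKEN, KEYWORDS and template_variables are hypothetical, and SAMPLE is a made-up excerpt, not the actual output of pandoc -D latex. One thing a naive grep picks up is sep, which in Pandoc templates is the separator keyword used inside $for(...)$ loops and not a variable you set, so the sketch filters it out along with else, endif and endfor.

```python
import re

# Control keywords of the template language; these are not variables.
KEYWORDS = {"else", "endif", "endfor", "sep"}

# Matches $name$, $if(name)$ and $for(name)$ and captures the name.
TOKEN = re.compile(r"\$(?:(?:if|for)\()?([A-Za-z][\w.-]*)\)?\$")


def template_variables(template_text):
    """Return the sorted, de-duplicated variable names used in a template."""
    names = {match.group(1) for match in TOKEN.finditer(template_text)}
    return sorted(names - KEYWORDS)


# Made-up excerpt in the style of a Pandoc LaTeX template.
SAMPLE = r"""\documentclass[$if(fontsize)$$fontsize$,$endif$$for(classoption)$$classoption$$sep$,$endfor$]{$documentclass$}
$for(header-includes)$
$header-includes$
$endfor$
\begin{document}
$body$
\end{document}
"""

for name in template_variables(SAMPLE):
    print(name)
```

Feeding it the text of res\template\template.tex instead of SAMPLE gives a starting list to document one variable at a time.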

Generating Figure References

This code example creates a figure anchor and reference in the final PDF document:

---
header-includes:
  - \usepackage{cleveref}
---

# Section

![A Markdown logo\label{markdown}](img/markdown.png)

See \cref{markdown}.

The following command line will generate the appropriate PDF:

> pandoc -o ..\pdf\output.pdf txt\cleverref.md

Execute the Pandoc command above and the resulting PDF will have a figure reference after the ‘See’.

Generating Formatted Code Snippets

You can use a tab at the beginning of each line to denote a code block:

    if (a > 3) {
        moveShip(6*gravity,DOWN);
    }
    
Four spaces work just as well as a tab for this method.

This is a fenced code block - the most basic kind (no highlighting):

~~~~~~~ 
if (a > 3) {
  moveShip(5 * gravity, DOWN);
}
~~~~~~~

This fenced code snippet has a language specifier, so it will have appropriate highlighting:

~~~~~~~ c
if (a > 3) {
  moveShip(5 * gravity, DOWN);
}
~~~~~~~

This code snippet has attributes: it will number the lines of source starting at 100 and highlight the source as if it were Haskell:

~~~~ {#mycode .haskell .numberLines startFrom="100"}
qsort []     = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++
               qsort (filter (>= x) xs)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The list of languages that can be highlighted with Pandoc can be found by executing this command-line:

Resources