New Workshop, Some Data-ish Pipeline Tricks
Lots of admin stuff today! First, we have a new blogpost, the full version of the complexity preview I shared last week. I'm also announcing a new TLA+ workshop! Or more precisely, three workshops. To make it easier on people's schedules, there are three dates you can sign up for: March 20, May 15, and June 12. And there's no fee to move between classes if something comes up and you can't make your session. Use the code C0MPUT3RTHINGS for 15% off.1
Anyway, on to the main thing. A couple of years ago I started work on a Logic for Programmers pamphlet, then got sidetracked into some other project. I started work on it again last week with the hope (the hope) of having an early version available by the end of winter. I'm writing the book in Sphinx but compiling it to LaTeX and then a PDF. I like using Sphinx because it's (relatively) easy to create "directives", or new types of content with special processing rules.
The main technical challenge so far has been making an "exercises" directive. The pamphlet will have a lot of exercises. I want solutions to automatically be placed in the back of the book, and I want exercises and solutions to hyperlink to each other.2 So I need to turn this:
.. exercise:: Implication
   :name: impl-1

   My Exercise!

   .. sol::

      My Solution!
Into this:
\begin{Exercise}[label={ex-impl-1},title={Implication}]
\label{\detokenize{basics/implication:ex-impl-1}}
My Exercise!
\hyperref[ex-impl-1-Answer]{Solution}
\end{Exercise}
\begin{Answer}[ref={ex-impl-1}]
My Solution!
\end{Answer}
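That target markup can be sketched as a plain string-building function, which is handy for eyeballing the output outside of Sphinx. Everything here (the function name, its parameters) is hypothetical; the real conversion happens inside Sphinx's LaTeX builder.

```python
# Hypothetical helper that renders the exercise/answer pair shown above.
# The real version runs inside Sphinx's LaTeX translator; this is only a
# sketch of the target markup.
def render_exercise(name, title, body, solution):
    exercise = (
        f"\\begin{{Exercise}}[label={{ex-{name}}},title={{{title}}}]\n"
        f"{body}\n"
        f"\\hyperref[ex-{name}-Answer]{{Solution}}\n"
        f"\\end{{Exercise}}\n"
    )
    answer = (
        f"\\begin{{Answer}}[ref={{ex-{name}}}]\n"
        f"{solution}\n"
        f"\\end{{Answer}}\n"
    )
    return exercise, answer

ex, ans = render_exercise("impl-1", "Implication", "My Exercise!", "My Solution!")
print(ex)
```

Keeping this mapping as a standalone function also makes it trivial to unit-test the markup without running a full Sphinx build.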
Fortunately, putting the solutions at the end of the book isn't too hard; there are lots of "exercise" LaTeX packages, and they all have this feature. So the problem is now just figuring out Sphinx's poorly-documented and undebuggable build process so I can convert my directives into notoriously-finicky LaTeX markup.
Joy.
This looks like a data pipeline of sorts: text becomes doctree becomes LaTeX becomes PDF. Each step can independently fail, and sometimes the "fix" is to tweak a distant ancestor step. I've run into these kinds of pipelines in a lot of projects.
The foundation of any development process is the feedback loop. The faster I can get feedback on what I'm doing, the more productive I'll be. I want to make it easy to change one step of the pipeline and see how it affects all downstream outputs. Here are some things I find helpful when making these pipelines:
Mise en Place
I don't develop the pipeline logic on my book project. Instead I have a separate Sphinx project which just consists of a single exercise and solution. It's a lot easier to see impacts on just an isolated sample. I also take notes on everything I do and any changes I might need to make to the target project's configuration.
Once the code looks good, I transfer it over to my main project. The main book has a lot more exercises and a bunch of actual content, so it's likely the pipeline will break. I do not fix the code here. Once I've isolated the source of the break, I reproduce it in the toy environment, fix it there, and then repeat the transfer. This keeps the "production environment" in sync with "dev".
(It'd probably be easier to do this by making the pipeline into a package that the book project imports, but that's a lot of extra overhead rn.)
Avoid manual steps
Manual steps lengthen the feedback cycle. I'm pretty sensitive to small friction and if I can't do a step from pure muscle memory, it's going to break my concentration. Even something like this isn't great:
cp -Force /path/to/mise/en/code.py /project/code.py
The paths are too long to type out each time. I could search the shell history for, say, mise/en, but then I'd have to check each search result to make sure it's the one I want. I instead made update-exercise.ps1 with that one single line. Then it's just up<tab><enter>, which I can do automatically.
Avoid manual cleanup
It's okay to tweak inputs to the whole pipeline, but you shouldn't modify anything in the intermediate stages. If your intermediate stages are wiped before every pipeline run, then you have to repeat your manual cleanup every single time. If intermediate stages are persistent, then your data pipeline is stateful. You do not want a stateful pipeline.
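One cheap way to enforce that statelessness, sketched here with hypothetical directory names, is to have the pipeline wipe and recreate its intermediate directory at the start of every run:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical layout: all intermediate artifacts live under one directory
# that is wiped and recreated at the start of every run, so nothing stale
# (including manual edits) survives between pipeline invocations.
root = Path(tempfile.mkdtemp())
build_dir = root / "intermediate"

def reset_build_dir():
    shutil.rmtree(build_dir, ignore_errors=True)
    build_dir.mkdir(parents=True)

reset_build_dir()
(build_dir / "stage1.tex").write_text("left over from a previous run")
reset_build_dir()  # the next run starts from a clean slate
print(list(build_dir.iterdir()))
```

Since the wipe happens inside the pipeline itself, forgetting to clean up by hand is no longer possible.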
Instead, try to turn the manual cleanup into its own pipeline step. When compiling computer things 2021, I used pandoc to convert 50+ markdown files into a single tex file. I ran into a problem: unicode symbols in the files were being placed verbatim in the tex, and my latex engine couldn't handle them. Rather than manually change the latex every single time I ran the pipeline, I added another pipeline layer:
from pathlib import Path
from re import sub

# Unicode symbols -> LaTeX-safe replacements
replacements = (
    (r'⋀', r'$\\wedge$'),
    (r'≤', r'$\\leq$'),
    (r'(\w)₀', r'$\1_0$'),
    (r'(\w)₁', r'$\1_1$'),
    (r'(\w)₂', r'$\1_2$'),
    (r'(\w)₃', r'$\1_3$'),
    # …
)

file = Path(path)  # path to the generated tex file, set elsewhere
text = file.read_text(encoding="utf-8")
for old, new in replacements:
    text = sub(old, new, text)
file.write_text(text, encoding="utf-8")
Hooray for regexes!
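To see what those substitutions actually do, here's the same pass applied to an in-memory string (the sample input is invented for illustration):

```python
from re import sub

# The same substitution pass as in the pipeline step, applied to an
# invented sample string to show what the regexes produce.
replacements = (
    (r'⋀', r'$\\wedge$'),
    (r'≤', r'$\\leq$'),
    (r'(\w)₀', r'$\1_0$'),
    (r'(\w)₁', r'$\1_1$'),
)

text = "x₀ ⋀ x₁ ≤ y"
for old, new in replacements:
    text = sub(old, new, text)
print(text)  # → $x_0$ $\wedge$ $x_1$ $\leq$ y
```

Note the `(\w)` capture groups: the subscript patterns keep the preceding letter via `\1`, so `x₀` becomes `$x_0$` rather than just replacing the subscript character on its own.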
Not feeling too well and this is already ~1000 words, so I'm just gonna cut the newsletter early. The point is that I got this all figured out for the exercise directive so I can include a lot of exercises in the logic book. Hope you all have a good week!
1. There's one person who had to drop out of the December session and asked for a raincheck ticket, but I can't find who it was! I've trawled my email, my Eventbrite notes, and even logged into Twitter DMs, but no sign of it anywhere. If this is you, please let me know so I can give you a free ticket! ↩
2. One cool thing I can do with Sphinx: build both "computer" and "printable" PDFs, where the computer PDFs have hyperlinks and the printable PDFs have page references. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.