Converting a website to an ebook

2022.05.30 :: {web, linux} :: #readability-cli #pandoc #scripting

I recently ran through the ups and downs of taking a very well written blog and converting it to an EPUB ebook in order to read it on my Kindle.

Using a few command line applications I was able to produce a rather well formatted EPUB file that I could send to my Kindle using Amazon's Email to Kindle service.

The general principal can be applied to almost any website that renders using HTML on the server side (throws shade at SPAs), so blogs are often perfect for this, or Wikipedia articles, whatever suits your fancy.

Step 1 - Collect your URLs§

Start by collecting all of the URLs that you want into a single file, with each URL separated by a newline. I called my file urls (original I know).

Example of urls file:

https://austindw.com/harden-openwrt/
https://austindw.com/precompressed-assets-caddy/

Step 2 - Create the book§

Wow, two steps and we're done? Kinda.

I wrote up a script to do the rest of the steps in one fell swoop.

It requires a few tools be installed, specifically:

readability-cli
- Homepage
- For Arch Linux users, this package is provided in the AUR, so a simple yay -Sa readability-cli should be all you need.
pandoc
- The macdaddy of document processors as far as I can tell. It's available on most Linux distro's in the package manager.
- For Arch Linux users, a simple pacman -S pandoc will do the trick.

Now that we have the required dependencies, let's take a look at the script:

#!/bin/sh

set -eu

# Point this at whatever file your URLs are stored in.
urls="urls"

# Make the directory where we'll store the clean HTML for each post.
mkdir -p posts

# Iterate over the URLs, download them and clean them up with the readability-cli
# We use a count here to ensure that we organize the output posts in the same order that they are specified
# in the input file. This is helpful as you can lay out the full order of your book by just editing the URLs file.
count=1
cat "$urls" | while read url
do
    output=$(printf "posts/%03d.html" $count)
    readable -q --low-confidence force "$url" -o "$output" 2>&1 > /dev/null
    count=$((count+1))
done

# Take all of the posts and put them into a book.
pandoc -o TheBook.epub posts/*.html

Save, make executable (chmod +x script.sh), and run the script and you'll have a fresh new EPUB named TheBook.epub at the end of it.

Open that up in your preferred ebook reader and you'll see a nicely formatted output like the following:

Screenshot of the zathura application rendering an ebook of blog posts derived from austindw.com

Obviously, you should edit the script to your needs. This post just acts as a guide, demonstrating the power of the readability-cli and pandoc.

Note: I used the --low-confidence force option when executing the readable command as I noticed a few posts that failed to extract with the default keep setting. Using force here worked just fine for my use case. Obviously, this may or may not work for your use case, so experiment.