In part two of my XKCD font saga I was able to separate strokes from the XKCD handwriting dataset into many smaller images. I also handled the easier cases of merging some of the strokes back together – I particularly focussed on “dotty” or “liney” type glyphs, such as i, !, % and =.

Now I want to attribute a unicode character to my segmented images, so that I can subsequently generate a font-file. We are well and truly in the domain of optical character recognition (OCR) here, but because I want absolute control of the results (and 100% accuracy) I’m going to take the simple approach of mapping glyph positions to characters myself.

If you’d like to follow along, the input files for this article may be found at https://gist.github.com/pelson/b80e3b3ab9edbda9ac4304f742cf292b, the notebook and output may be found https://gist.github.com/pelson/1d6460289f06acabb650797b88c15ae0.

As a reminder, here is a downsampled version the XKCD handwriting file:

First, I want to read in the files that we produced previously. These can be found at https://gist.github.com/pelson/b80e3b3ab9edbda9ac4304f742cf292b.

I get metadata from the filename using the excellent parse library.

In [1]:

import glob
import os

import matplotlib.pyplot as plt
import numpy as np
import parse


pattern = 'stroke_x{x0:d}_y{y0:d}_x{x1:d}_y{y1:d}.png'
strokes_by_bbox = {}
for fname in glob.glob('../b80e3b3ab9edbda9ac4304f742cf292b/stroke*.png'):
    result = parse.parse(pattern, os.path.basename(fname))
    bbox = (result['x0'], result['y0'], result['x1'], result['y1'])
    img = (plt.imread(fname) * 255).astype(np.uint8)
    strokes_by_bbox[bbox] = img

In order to map strokes to glyphs, I’m going to want to sort them first by line, then from left to right.

At this point, I could either hard code the approximate y location of each baseline or attempt to cluster them. I opt for the latter, using a k-means implementation from scipy.cluster.vq.kmeans.

To confirm I have the correct information and to visualise the data before clustering, I plot the baseline point (maximum y pixel location) of the stroke against its left-most x position:

In [2]:

import matplotlib.pyplot as plt
import numpy as np

xs, ys = zip(*[[bbox[0], bbox[3]]
               for bbox, img in strokes_by_bbox.items()])
xs = np.array(xs, dtype=np.float)
ys = np.array(ys, dtype=np.float)
plt.scatter(xs, ys, s=50)
plt.gca().invert_yaxis()
plt.autoscale(tight=True)
plt.show()

There appears to be enough structure to group the glyphs into lines… let’s go for it.

In [3]:

import scipy.cluster.vq

n_lines = 11

lines, _ = scipy.cluster.vq.kmeans(ys, n_lines, iter=1000)
lines = np.array(sorted(lines))

print('Approx baseline pixel of lines:', repr(lines.astype(np.int16)))

# Plot the strokes, and color by their nearest line.
plt.scatter(xs, ys, s=50,
            c=np.argmin(np.abs(ys - lines[:, np.newaxis]), axis=0), cmap='Paired')
plt.gca().invert_yaxis()
plt.autoscale(tight=True)

# Draw the line positions
plt.hlines(lines, xmin=0, xmax=xs.max())
plt.show()

Approx baseline pixel of lines: array([ 447, 1103, 1708, 2316, 2932, 3380, 3830, 4276, 4713, 5584, 6045], dtype=int16)

Using the baseline position, group the strokes into their respective lines, and then sort each line from left to right.

In [4]:

glyphs_by_line = [[] for _ in range(n_lines)]
for bbox, img  in list(strokes_by_bbox.items()):
    nearest_line = np.argmin(np.abs(bbox[3] - lines))
    glyphs_by_line[nearest_line].append([bbox, img])

# Put the glyphs in order from left-to-right.
for glyph_line in glyphs_by_line:
    glyph_line.sort(key=lambda args: args[0][0])

Let’s take a look using the same notebook trick we did in part 1…

In [5]:

from IPython.display import display, HTML
from io import BytesIO
import PIL
import base64


def html_float_image_array(img, downscale=5, style="display: inline; margin-right:20px"):
    ""
    Generates a base64 encoded image (scaled down) that is displayed inline.
    Great for showing multiple images in notebook output.

    ""
    im = PIL.Image.fromarray(img)
    bio = BytesIO()
    width, height = img.shape[:2]
    if downscale:
        im = im.resize([height // downscale, width // downscale],
                       PIL.Image.ANTIALIAS)
    im.save(bio, format='png')
    encoded_string = base64.b64encode(bio.getvalue())
    html = ('<img src="data:image/png;base64,{}" style="{}"/>'
            ''.format(encoded_string.decode('utf-8'), style))
    return html

To keep this somewhat short, I’ll just show a few lines:

In [6]:

for glyph_line in glyphs_by_line[::4]:
    display(HTML(''.join(html_float_image_array(img, downscale=10)
                         for bbox, img in glyph_line)))

Now the tedious part – mapping glyph positions to characters (and ligatures)…

In [7]:

paragraph = r""
a b c d e f g h i j k l m n o p q r s t u v w x y z
u n a u t h o r i t a t i v e n e s s   l e a t h e r b a r k   i n t r a co l i c   m i c r o c h e i l i a   o f f s i d e r
g l as s w e e d   r o t t o l o   a l b e r t i  t e   h e r m a t o r r h a c h i s   o r g a n o m e t a l l i c
s e g r e g a t i o n i s t   u n e v a n g e l i c   c a m p s t oo l
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z I-pronoun
U N A U T H O R I T A T I V E N E S S   L EA T H E R B A R K   I N T R A CO L I C   M I CR OCH E L I A
O F F S I D ER   G LA S S W EE D   R O TT O L O   A LB E R T I T E   H ER M A T O RR H A C H I S
O R G A N O M E T A LL I C   S E G R E G A T I ON I S T   U N E V A N G E L I C   CA M PS TO O L
+ - x * ! ? # @ $ % ¦ & ^ _ - - - ( ) [ ] { } /  < > ÷ ± √ Σ
1 2 3 4 5 6 7 8 9 0 ∫ = ≈ ≠ ~ ≤ ≥ |> <| ? . , ; : “ H I ” ’ ‘ C A N ' T ' "
É Ò Å Ü ≪ ≫ ‽ Ē Ő “ ”
"".strip()
paragraphs = [[char for char in line.replace('   ', ' ').split(' ') if char]
              for line in paragraph.split('n')]

As I mentioned in the previous post, joining together all strokes into constituent glyphs wasn’t easily achievable. We therefore have four glyphs that each require two strokes to be merged before we can map them to the appropriate unicode character(s). The glyphs in question are |>, <|, ≪ and ≫.

In [8]:

glyphs_needing_two_strokes = ['≪', '≫', '|>', '<|']

We need the function from part 2 which merges together two images:

In [9]:

def merge(img1, img1_bbox, img2, img2_bbox):
    bbox = (min([img1_bbox[0], img2_bbox[0]]),
            min([img1_bbox[1], img2_bbox[1]]),
            max([img1_bbox[2], img2_bbox[2]]),
            max([img1_bbox[3], img2_bbox[3]]),
            )
    shape = bbox[3] - bbox[1], bbox[2] - bbox[0], 3
    img1_slice = [slice(img1_bbox[1] - bbox[1], img1_bbox[3] - bbox[1]),
                  slice(img1_bbox[0] - bbox[0], img1_bbox[2] - bbox[0])]
    
    img2_slice = [slice(img2_bbox[1] - bbox[1], img2_bbox[3] - bbox[1]),
                  slice(img2_bbox[0] - bbox[0], img2_bbox[2] - bbox[0])]

    merged_image = np.zeros(shape, dtype=np.uint8)
    merged_image.fill(255)
    merged_image[img1_slice] = img1
    merged_image[img2_slice] = np.where(img2 != 255, img2, merged_image[img2_slice])
    return merged_image, bbox

In [10]:

characters_by_line = []
for line_no, (character_line, glyph_line) in enumerate(zip(paragraphs,

About author

Specialties: Software (development, maintenance and support), Scientific software (Python/SciPy, Fortran, C++), HPC (benchmarking, optimisation, cluster / distributed compute, compilers, MPI), Data analysis & visualisation, Technical writing, Statistics

Classifying segmented strokes as characters – Part 3 of an XKCD font saga

9 Open Source LLMs and Agents to Watch

Conversational Data Analysis: Cutting Through the Noise to Find the Real Deal

Building Production-Grade Vector Search

The Moto Razr+ is a 2024 standout for the Android Police podcast

The user-facing side of Android 16 starts taking shape in this week’s news

Apple could soon be forced to make AirDrop work with Android

LEAVE A REPLY Cancel reply

Most Popular

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

This is my surprise phone of the year [Video]

Recent Comments

EDITOR PICKS

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

POPULAR POSTS

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

POPULAR CATEGORY

ABOUT US

FOLLOW US