Let’s do the time warp again.

A while ago a friend of mine told me about a really cool problem he’d worked on, and described an algorithm to solve it that was both very clever and very novel. As a problem statement, let’s say that you’ve got two sets of sequential data to compare which may be of different lengths and may each contain different kinds of non-linear distortions. To borrow an example, think of a series of measurements of the positions of two objects as they move, one object might move faster than the other, or they may accelerate at different rates and for different durations; all we have is measurements of their positions at set time intervals. How would you determine the ‘similarity’ of these sequences? How could you compute some kind of ‘distance’ between these two temporal sequences?

It’s not at all obvious, or at least it wasn’t to me when I first heard the problem described. But it turns out there’s a really interesting way to solve this problem called Dynamic Time Warping. DTW has been around since the 80’s, and it’s both a really satisfying solution and kinda unlike any other algorithm that I can think of.

Here’s the idea: Let’s say that we have a sensor that makes $n$ measurements every second, and we use it to measure two phenomena for (potentially different) amounts of time.1 This gives us two data sets $x = (n \times t_x)$ and $y = (n \times t_y)$. We use these data to compute a distance matrix, $D$, where we compare the sensor measurements at each time step. We can use any method we want to compute the distances, for the sake of example let’s use the $L_1$ norm:

\[D_{ij} = \sum_0^n |x_k - y_k|\]

After this we end up with $D = (t_x \times t_y)$, where each element is the $L_1$ norm applied to the $i^{\rm{th}}$ value of $x$ and $j^{\rm{th}}$ value of $y$. One immediate thing we can see is that if $x = y$ then $\rm{trace}(D) = 0$ with non-zero elements off the diagonal, because when $i = j$ we are comparing identical observations.

Now comes the core idea of the algorithm, based on a nice piece of dynamic programming. We begin at the top left corner of $D$, index $(0, 0)$, and advance through the matrix, incrementing $i$ first then $j$2, and at each step computing:

\[\Delta_{ij} = D_{ij} + min(\Delta_{i-1, j}, \Delta_{i, j-1}, \Delta_{i-1, j-1})\]

This feels like one of those ideas that’s clearer in code than in maths

delta = np.array((tx, ty))
for i in range(dist_matrix.shape[0]):
    for j in range(dist_matrix.shape[1]):
        delta = dist_matrix[i][j] + np.min(
                         (delta[i-1][j], delta[i][j-1], delta[i-1][j-1]))

So why do we do this? Well, if you think about it for a second you’ll see that in the end $\Delta_{ij}$ will contain the minimum total distance (the running sum through $D$) that you need to traverse to get to coordinates $(i, j)$ by taking only steps rightwards, downwards, or diagonally.

Now that we have $\Delta$ we’ve done the hard work and we’re ready for the last step. We now go to index $(0, 0)$ in $\Delta$ and walk towards index $(i, j)$ by only incrementing $i$, or $j$, or both, each step, each time moving to the minimum of the three values.3 Once we hit the edge of the matrix we return the value we land on. Again in code4

def step(delta, i, j):
    moves = [
        (mi, mj)
        for mi, mj in ([i + 1, j], [i, j + 1], [i + 1, j + 1])
        if mi < delta.shape[0] and mj < delta.shape[1]
    vals = [delta[mi, mj] for mi, mj in moves]
    return moves[np.argmin(np.array(vals))]

def walk_delta(delta):
    i, j = 0, 0
    i_max, j_max = dist.shape
    while i < i_max and j < j_max:
        i, j = step(dist, i, j)
    return delta[i][j]

And there we have it, the value comes out in the end is that distance between our two time series. It takes some thinking about but it’s not hard to see why this works the way that it does: if $x$ and $y$ are equal then, as we said, the main diagonal of $D$ will be $0$’s, the main diagonal of $\Delta$ will be $0$’s, and so when we walk we’ll follow that seam and return a distance of $0$. If $x$ is some warped version of $y$ – maybe the same at the beginning, then stretched or compressed a little – we’ll still have a seam of $0$’s but now they’ll sit a little off the main diagonal of $D$, but nevertheless our algorithm will find them and follow them when walking through $\Delta$ and report a small distance between $x$ and $y$.

I’m not sure that this algorithm is something that’s easily grasped by explanation alone (at least not my explanation alone), it’s defintely worth coding it up yourself and playing around with it. One unfortunate fact is that is this algorithm is $O(t_x \times t_y)$, and so quadratic. As a result I discovered at first bash, writing what I thought was some decent Python, I got performance so bad as to make the code almost unusable. I ended up re-writing it in Cython and got speed-ups of ~300x.5 You can find all the code here.

  1. The classic example for this algorithm, the one cited on Wikipedia, is speech recognition. Let’s say you have two utterances with different stresses, Heeeello and Helloooo, how do you know they’re the same word? 

  2. In other words, $[(0, 0), (1, 0), (2, 0)\dots(1, t_y), (2, 0)\dots, (i, j)]$ 

  3. In other words we only take steps, rightwards, downwards, or diagonally. Or in other other words, we only move forward in time through our sequence, we don’t allow backtracking. 

  4. I will say that while this code is right in essence I’ve left out all of the house-keeping, so caveat emptor. 

  5. !!!!!