A big map of many bands.

For good and ill, recommender systems are everywhere these days. These systems often work by a process known as collaborative filtering,1 they take a look at the aggregate behaviors of a large number of people and then make recommendations based on the similarity between your behavior and theirs; the classic ‘people who bought this also bought’ system. In very loose terms these systems work by building a co-occurrence matrix, an (item $\times$ item) matrix whose entries tell us how many times a given pair of items have transacted together. We can then do linear algebra with this (probably very sparse matrix), and end up with something that we can use to generate recommendations.

In this post I’m going to focus on this co-occurrence matrix, and try to visualise it to see what connections between items arise. The data that I’ll look are which bands people listen to, and so what we’ll end up with it a nice visaulisation of how different bands relate to one another. Thankfully a data set to do this is available online, here. The data is from last.fm, and comes from about 10 years ago.2 The data themselves look like this:

0 223b die Ärzte 1099
1 daf1 melissa etheridge 897
2 c289 elvenking 717
3 223b juliette & the licks 706
4 223b red hot chili peppers 691

We have usernames, bands, and number of plays, and lots of them. There are 360,000 users and 17 million rows (user-band associations). This is enough data for us to get a good understanding of the connections between bands.

The first thing to do is to create a co-occurrence matrix, and that means deciding what we mean by co-occurrence. To begin with I clip the number of bands to the top listened 1,000.3 I then pivot the table, and iterate over it (in a really gross nested loop, this problem is inherently quadratic) to populate the cells of our co-occurrence matrix. The matrix will be ($1000 \times 1000$), and each cell will contain the similarity between the group of users who listened to band $i$ and the users who listened to band $j$. There are a lot of ways that we could compute this similarity, but in the end I settled on,

$W_{ij} = \frac{\mu(U_i \cap U_j)}{\mu(U_i \cup U_j)} \\ \mu(s) = \sum_{u \in s} \frac{1}{20 + n(u)}$

Here we compute the weight by looking at the ratio of $\mu$ for the intersection of users who’ve listened to both bands, and the union of users who’ve listened to both bands. Without the $\mu$ this would be exactly the Jaccard similarity, the $\mu$ function (which uses the function $n(u)$, which returns the number of bands listened to by user $u$) serves to down-weight users who listen a lot, preventing these users from dominating the counts. After having done this I trim the weaker connections between points, to give us only the strongest and most relevant relationships.

Applying this we have a nice 2D matrix of band similarities. Now we need to figure out how to visualise it. There are a lot of ways we can visualise data like this, but one tool that I’ve found useful is Gephi, a graph layout and visualisation tool. It can take in a distance matrix, like the one we’ve just computed, and it uses a force-based algorithm to determine a 2D layout for the points. In very hand-wavy terms this algorithm treats each pair of point as connected by springs whose springiness is set by the similarity between the points, and then tries to find the minimal energy configuration of the springs. Having done this we can then output the 2D positions of the points and then plot them up in D3.

A full-screen verison of the map can be found here, it’s a little easier to steer in full-screen mode. This is a lot of fun to play around with, and if you spend a bit of time looking at it you can see bands of the same genre group together; there are a couple of different hubs for metal (I can’t say I know a lot about metal, but it seems that different individual sub-genres group together), rap, indie, jazz, etc. It even works as a sort of manual recommender system, looking at a band you like and following their links to others can turn up some interesting new finds.

Some of the other connections are very sensible, for instance it’s satisfying that the only connection Sting has is to The Police. This is pretty cool to see, based on the fact that people who listen to Sting tend to listen to The Police we can learn to associate those artist with one another. This gives us a glimpse of how a recommender system makes the decisions that it does, and an idea of how this fairly simple idea can lead to some really useful predictions.

1. It would be remiss of me at this stage not to point out that item-item collaborative filtering was developed by my current employer, Amazon

2. This turns out to be a stroke of good luck because my music knowledge definitely peaked some time in the mid-00’s.

3. As you might expect the data here have a very long tail, so reducing 1,000 bands means we’re still capturing a lot of the plays.