Self-organizing maps, or SOMs, are a category of machine-learning (ML) algorithm used for clustering data points that have similar feature values. They are useful both for exploring the structure of unlabeled data sets and for creating classifiers for complex, messy data that may be problematic for more traditional ML algorithms. This is because they lend themselves to visualization in ways that allow humans to easily pick out important features in the training data.
SOMs are a type of “shallow” neural net: they have only two layers, in contrast to “deep” learning networks that have many hidden layers. SOMs are described and visualized in many different ways. Some describe them topologically, like a landscape of valleys and mountain ranges. Others visualize them as heat maps. Still others describe them as groups of people moving about the room at a party, seeking out their friends.
Each of these different descriptions is grounded in what SOMs do, but each focuses on different aspects of how the algorithm works. The various analogies used to describe SOMs are often interesting and useful, but ultimately insufficient for anyone trying to grasp exactly how SOMs can be used in a serious machine learning context, what their potential applications are, and why they’re an exciting branch of AI/ML research.
This article is my attempt to fill that gap: a brief exploration of the algorithm at the heart of SOMs, and of an exciting application we are actively researching at Galois – creating classifiers for complex data challenges where more traditional ML models struggle.
Let’s dive in.
The Algorithm
Like the input layer of other neural nets, nodes in a SOM have weights over the data features. However, the nodes also have topological relationships to other nodes, and node-to-node distances change over the course of training. This adds a semantic layer to what SOMs can capture compared to neural nets where the structure of the output layer is fixed. Also unlike traditional neural nets, which require labeled data (e.g., an input image of a cat plus its correct output label, “domestic shorthair”), SOMs operate on unlabeled data – their goal is to identify which training data points are most similar to each other and then spatially cluster them by similarity. This groups data in ways that make it visually easy to identify higher-level features, which is extremely useful when we don’t know how many categories we’re looking for. That’s not to say SOMs can’t be used with labeled data where we already know the categories of interest. In fact, SOMs’ human-interpretability offers an interesting possibility for creating classifiers over data that is problematic for more traditional neural nets.
SOMs are initialized by randomly assigning feature weights to each node. Next, we test a training data point against all of the nodes. One node will be activated more strongly than the others by that data point; we call this node the winner for that data point. The winner then influences the nodes it is topologically connected to, according to a neighborhood function that takes node proximity into account (since two nodes can be topologically connected by an edge but still “far apart” in terms of their distance measure). As this process is repeated over more data points, nodes that react to the same data points are drawn closer together, while those that respond to different data grow farther apart. This training process is sometimes visualized as an animation in which the nodes move around to respect the changing distances along their connecting edges. A sketch of this update loop appears below.
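To make the loop concrete, here is a minimal sketch of a classic Kohonen-style SOM in Python with NumPy. One hedge: in this standard formulation, nodes occupy fixed grid positions and their feature weights move through the data space; the grid size, decay schedule, and the train_som and winner helpers are illustrative choices for this sketch, not the exact implementation we use.

```python
import numpy as np

def train_som(data, grid_w=10, grid_h=10, epochs=20,
              lr0=0.5, sigma0=3.0, seed=0):
    """Train a minimal Kohonen-style SOM on data (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    # Each node starts with random feature weights.
    weights = rng.random((grid_w, grid_h, data.shape[1]))
    # Fixed grid coordinates, used by the neighborhood function.
    gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")
    coords = np.stack([gx, gy], axis=-1).astype(float)

    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Decay the learning rate and neighborhood radius over time.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.5
            # The "winner" is the node whose weights best match this point.
            dists = np.linalg.norm(weights - x, axis=-1)
            win = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood: nodes near the winner on the grid
            # are pulled toward the data point more strongly.
            grid_d2 = np.sum((coords - coords[win]) ** 2, axis=-1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def winner(weights, x):
    """Grid coordinates of the node most strongly activated by x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)
```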
Individual nodes can be the winner for multiple training data points. This means that a single node can be a cluster in itself – but the topology offers a higher-level grouping of those small clusters into larger ones. For unlabeled data, this can be used to explore the data and characterize groups that emerge at multiple levels in the clustering. For labeled training data, we can use those features to create classifiers for novel data points.
A Simple SOM
Suppose we have some data that is mostly, but not perfectly, separable in its feature space – in other words, most of the time we can easily tell which label to assign to a point just by knowing what region it lies in, but there is a bit of fuzziness for points at the edges of those regions. For example, if we were trying to classify vertebrates into the three categories of mammals, fish, and birds based on just a few features like whether they have wings, legs, and/or fins, we would get the vast majority of species right – but we might have trouble with bats, penguins, and whales. Suppose we train a SOM on such a data set, coding each data point as a 3-element vector along the features of interest – e.g., (1, 1, 0) could be the input for a vertebrate that has wings, has legs, and does not have fins. With the three categories occupying three mostly distinct regions in the feature space, we get a result like this:
The blue figure shows the SOM in terms of nodes’ neighbor distances. Thus, in our example, vertebrates with similar features appear close to one another, while those with more dissimilar features sit farther apart. This works a lot like a topographic map. The green figures are activation maps – the nodes are in the same locations as in the distance map, but this representation is more like a heat map, where a darker color indicates that a node is the winner for more data points. Back to our example: a darker color here could indicate clusters of vertebrates with similar features, such as wings, legs, or fins. Because this data is mostly separable (birds don’t usually have fins, and fish don’t usually have legs), the SOM forms three clear regions associated with the different kinds of data points. It would be easy to turn this SOM into a classifier for novel data points from the same population. Other traditional ML approaches also perform well with this kind of relatively straightforward data.
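Both kinds of figures are straightforward to derive from a trained SOM. The sketch below reuses the train_som and winner helpers from above; the u_matrix and activation_map helpers and the tiny (wings, legs, fins) data set are hypothetical illustrations, not our actual code or data.

```python
import numpy as np

def u_matrix(weights):
    """Neighbor-distance map (the blue figure): for each node, the mean
    distance between its weights and those of its grid neighbors."""
    w, h = weights.shape[:2]
    out = np.zeros((w, h))
    for i in range(w):
        for j in range(h):
            neighbors = [
                np.linalg.norm(weights[i, j] - weights[ni, nj])
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                if 0 <= ni < w and 0 <= nj < h
            ]
            out[i, j] = np.mean(neighbors)
    return out

def activation_map(weights, data):
    """Activation map (the green figures): how many points each node wins."""
    counts = np.zeros(weights.shape[:2], dtype=int)
    for x in data:
        counts[winner(weights, x)] += 1
    return counts

# Hypothetical (wings, legs, fins) encodings for a few vertebrates.
data = np.array([
    [1, 1, 0],  # sparrow: wings and legs, no fins
    [0, 1, 0],  # dog: legs only
    [0, 0, 1],  # trout: fins only
    [1, 1, 0],  # crow
    [0, 1, 0],  # cat
    [1, 1, 1],  # penguin-ish edge case: flippers blur the categories
], dtype=float)

weights = train_som(data, grid_w=8, grid_h=8)
print(u_matrix(weights).round(2))
print(activation_map(weights, data))
```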
SOM Application: Detecting Suspicious Ship Behavior
Despite the fact that SOMs don’t consider data labels during their training process, they offer an interesting way to create classifiers with a bit of human oversight. This is particularly true where more traditional ML methods struggle, such as with complex behavioral data in which meaningful clusters are difficult to identify and many initializations may be necessary to find useful results. In this situation, SOMs also have the added benefit of being exceedingly fast to train compared to those other methods. This means we can easily generate many unique SOMs with different random initializations and choose the best one to use as a classifier.
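As a sketch of that selection step, the snippet below trains several candidate SOMs and keeps the one with the lowest quantization error. Quantization error is just one reasonable criterion – an assumption for this sketch, as is the reuse of the train_som helper and toy data from the earlier examples.

```python
def quantization_error(weights, data):
    """Mean distance from each data point to its winning node's weights;
    lower means the SOM fits the data more tightly."""
    return float(np.mean(
        [np.min(np.linalg.norm(weights - x, axis=-1)) for x in data]
    ))

# Train several SOMs from different random seeds and keep the best one.
candidates = [train_som(data, seed=s) for s in range(10)]
best = min(candidates, key=lambda w: quantization_error(w, data))
```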
Let’s explore a particular example where SOMs shine: helping to anticipate bad actors at sea. Global Fishing Watch provides data on Automatic Identification System (AIS) disabling behavior for fishing vessels – events in which a ship disables its AIS transponder, which may be a simple malfunction but is also frequently done for nefarious reasons (such as illegal fishing or cargo transfers). The ultimate hope is that analyzing transponder data could reveal behavioral patterns that allow researchers to determine when a vessel is behaving innocently and when it’s up to no good. Yet determining whether and why a vessel is about to turn off its transponder is a needle-in-a-haystack problem, since even ships with a significant history of such events have their transponders on most of the time.
This is known as the “class imbalance” problem – unlike our earlier example, where a data set might have an even mix of mammals, birds, and fish, data sets like this one are characterized by a small number of “hits” in a sea of “misses.” Success at this task has wider-reaching implications for behavioral analysis, since predicting rare events on short notice is always more difficult than working with more balanced data.
Using both SOMs and feed-forward neural networks (FFNNs), we attempted to predict whether a ship will disable its transponder in the near future using data on (1) vessel location, (2) the vessel’s lifetime AIS-disabling history to date, and (3) statistics on AIS-disabling “hot spots” – geographic areas where more than one ship has recently disabled its transponder. A hypothetical encoding of these inputs is sketched below.
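For illustration only, a feature vector along these lines might look like the following; the make_features helper, its field names, and the normalizations are invented for this sketch and are not the actual encoding from our experiments.

```python
import numpy as np

def make_features(lat, lon, lifetime_gap_count, hours_dark_to_date,
                  hotspot_gap_count):
    """Hypothetical per-vessel feature vector: location, lifetime
    AIS-disabling history, and local hot-spot statistics."""
    return np.array([
        lat / 90.0,            # normalized latitude
        lon / 180.0,           # normalized longitude
        lifetime_gap_count,    # past AIS-disabling events for this ship
        hours_dark_to_date,    # cumulative hours with transponder off
        hotspot_gap_count,     # recent disabling events in nearby waters
    ], dtype=float)

x = make_features(lat=-2.1, lon=143.7, lifetime_gap_count=4,
                  hours_dark_to_date=312.0, hotspot_gap_count=9)
```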
A simple FFNN with two hidden layers can easily achieve 97% accuracy at this task – but there’s a catch to that statistic: the 3% of failure cases are exactly the positive AIS-disabling cases, the subset of the data we’re most interested in. Essentially, the FFNN has learned to classify all data in the “no AIS disabling event predicted” category, because for 97% of the data that is the correct assessment. SOMs provide some useful insight into why a simple FFNN fails at the task.
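A toy calculation makes the trap obvious. The 97/3 split below mirrors the proportions described above, but the arrays are otherwise made up:

```python
import numpy as np

# With a 97/3 class split, a degenerate "always predict negative"
# model scores 97% accuracy while catching zero positive cases.
y_true = np.array([0] * 97 + [1] * 3)   # 3% positives, as in the text
y_pred = np.zeros_like(y_true)          # model that never predicts positive

accuracy = float((y_pred == y_true).mean())           # 0.97
positive_recall = float(y_pred[y_true == 1].mean())   # 0.0
print(f"accuracy={accuracy:.2f}, positive recall={positive_recall:.2f}")
```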
These figures are different views of a SOM trained on the AIS-disabling data. The distance map shows that there are some regions of similarity, but most of the divisions aren’t very clean. The distribution of positive and negative activations (winner nodes for subsets of the data) shows that the data has separability issues: the nodes dominated by negative data points (which make up most of the data) are also winners for a significant number of positive cases. The positive cases elsewhere in the SOM are widely dispersed – indicating that positive cases don’t necessarily look much like each other. Given the combination of poor separability and a needle-in-a-haystack task (also indicated by the fact that the whole-data-set activation map looks a lot like the negative activation map), it’s not surprising that the FFNN had trouble clustering data points and finding meaningful patterns.
The ideal fix for this separability problem would be to include additional data about the ships and their behavior in hopes of distinguishing the positive cases. However, absent additional data, we can still use this SOM to help identify positive points. If catching positive points matters more than avoiding some false positives, we can mark out the region of nodes occupied by the negative points and then treat all other nodes as indicative of a positive. Using this type of approach (sketched below), our SOM can identify 62% of positive points correctly and 80% of negative points correctly – a more balanced classification than the simple FFNN provided.
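Here is a minimal sketch of that node-labeling strategy, reusing the activation_map and winner helpers from earlier; the zero-negative-hits threshold is an assumption, and the actual rule for carving out the negative region may differ.

```python
def positive_node_mask(weights, negative_data, max_neg_hits=0):
    """True for nodes NOT claimed by negative training points; such
    nodes are treated as indicative of a positive."""
    neg_counts = activation_map(weights, negative_data)
    return neg_counts <= max_neg_hits

def predict_positive(weights, node_mask, x):
    """Classify a novel point by the node it activates most strongly."""
    return bool(node_mask[winner(weights, x)])
```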
Conclusion
Are these results perfect and immediately actionable for the problem at hand? Well, no. Are they promising? Absolutely.
In our experiments, SOMs have performed better on ambiguous, dynamic, or particularly complex data sets (both labeled and unlabeled) than more popular machine learning algorithms. Because SOMs are fast to train, they offer a unique opportunity in domains where concepts may shift over time, making it possible to retrain on new data far more frequently and easily, with fewer computational resources than other popular neural-net-based algorithms require. Training can also take place on a single device – no cloud access required – making SOMs potentially useful for doing machine learning in situations where time is of the essence and connectivity is limited. Our exploration of SOMs in these kinds of situations is ongoing.