Computers are reasonably good at analyzing large datasets, but there is one class of problem where they require a bit of help from puny humans – high-dimensional datasets. By “high-dimensional” we mean “wide”, as in lots of columns. When we have wide data, it’s very hard to spot commonalities across a number of those columns. For example, if we have data from a large number of sensors, all of which have something to say about what’s going on, it’s very hard to detect what is similar about all those readings when a particular type of event occurs.
A classically difficult problem of this kind is predicting cancer outcomes based on levels of gene expression. We have measures of hundreds (sometimes thousands) of genes, all of which interact to influence medical outcomes – e.g. survival rates, response to particular chemotherapies, growth rates, etc. We also have thousands of patients, so we end up with a table that is hundreds (or thousands) of columns wide and thousands of rows deep. It goes without saying that it’s impossible for a human to work out which combinations of gene-expression levels predict a given outcome. However, it’s also a very difficult problem for a computer, because the number of possible comparisons across columns and values explodes as the table gets wider – this is known as the “Curse of Dimensionality”. A computer also needs to know what to look for if it’s going to come up with anything useful in a reasonable timeframe, which pretty much means you need to know the outcome before you run the analysis. Fat lot of good that is.
One answer to this problem is to combine human cognition with machine learning to circumvent some of the heavier computation. The computer organizes the data in an unbiased fashion and presents it as a graphical model to the analyst. At this point, the (as yet) uniquely human skill of pattern recognition comes into play, and the analyst is able to spot shapes and clusters in the data that indicate something interesting is going on. The tricky bit is getting the computer to present something that actually helps the analyst. The traditional approach is to conduct what is known as a dimensional collapse (or dimension reduction) on the data. This sounds like a cataclysmic event from an episode of Doctor Who, but it is actually just a way to simplify the data.
Before looking at dimension reduction, let’s look at a simpler example: a dataset with three columns of numbers. We can treat each row as a coordinate in 3D and plot the points. When the points are visualized, we can immediately tell whether the distribution is random (a cloud of points) or whether some clusters are present. If we’re seeing clusters, it’s a good indication that there is some commonality in the data. If we then colour-code our points based on some other variable (e.g. patient survival rate), we can immediately see if any of the clusters have a dominant colour. If they do, then we’ve found a pattern in our data that is worth investigating.
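To make this concrete, here is a minimal sketch in Python using matplotlib and made-up data – the “sensor” columns, the clusters and the outcome variable are purely illustrative:

```python
# A minimal sketch of the three-column example: plot each row as a point in 3D
# and colour it by an outcome variable. The data below is entirely synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Two hypothetical clusters plus background noise (illustrative only)
cluster_a = rng.normal(loc=[2, 2, 2], scale=0.3, size=(50, 3))
cluster_b = rng.normal(loc=[-2, -1, 0], scale=0.3, size=(50, 3))
noise = rng.uniform(-4, 4, size=(100, 3))
points = np.vstack([cluster_a, cluster_b, noise])

# A made-up outcome (e.g. survival) used only for colour-coding the points
outcome = np.concatenate([np.ones(50), np.zeros(50), rng.integers(0, 2, 100)])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=outcome, cmap="coolwarm")
ax.set_xlabel("sensor 1")
ax.set_ylabel("sensor 2")
ax.set_zlabel("sensor 3")
plt.show()
```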
But what if I’ve got a thousand columns? How do I visualise that? This is where the dimensional collapse comes in. Traditional approaches such as Principal Component Analysis (PCA), and more recent machine learning algorithms such as t-SNE, allow us to collapse thousands of dimensions down to 2 or 3 whilst still (hopefully) preserving proximity between the data points – i.e. if two points were close in n dimensions, they’ll also be close to each other in 2 or 3 dimensions. What this means is that if we see clusters in 2D or 3D, it is a direct result of there being some similarity in the data, and we may have something worth investigating.
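As a rough illustration of what that collapse looks like in code, here is a hedged sketch using scikit-learn; the random matrix X simply stands in for a real wide table such as the gene-expression data described above:

```python
# A sketch of a dimensional collapse with scikit-learn: PCA and t-SNE both
# reduce a wide (n-dimensional) table to 2 columns that can be scatter-plotted.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 1000 rows (patients/readings) x 500 columns (measurements)
X = np.random.default_rng(0).normal(size=(1000, 500))

pca_2d = PCA(n_components=2).fit_transform(X)                   # linear collapse
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)  # non-linear collapse

print(pca_2d.shape, tsne_2d.shape)  # both (1000, 2) – ready to plot and colour-code
```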
Such approaches have been around for years, and work well for spotting clusters. They’re not without their drawbacks, though. PCA can have the effect of introducing a bias into the data if not used carefully. t-SNE is a more neutral approach, but if the data is noisy and the signal is weak, it becomes very hard to find the signal in a cloud of points. What is required is a way to increase the contrast and reveal the patterns to the analyst in a more obvious way.
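As one example of the kind of care PCA needs, the sketch below (synthetic data, scikit-learn) shows how a single column on a much larger scale can dominate the components unless the data is standardised first:

```python
# If one column is measured on a much larger scale than the others, it swamps
# the principal components. Standardising (zero mean, unit variance) first
# removes that artificial bias. Synthetic data, for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
X[:, 0] *= 1000  # one feature recorded in much larger units

raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)     # ~[1.0, 0.0] – the big column dominates
print(scaled.explained_variance_ratio_)  # a far more even split across components
```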
This is what topological data analysis (TDA) is all about. TDA attempts to build a topological model of the data by grouping and linking data points that are similar in n dimensions. This can then be visualized as a network chart to show the “shape of the data”. This has the effect of making the clusters more obvious to the human eye, and it is also able to pick out smaller clusters that would simply be lost in the noise in a traditional dimensional-collapse approach. It also reveals a shape to the data, which tends to draw the eye to particularly interesting features.
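To give a flavour of how such a topological model can be built, here is a rough, Mapper-style sketch in Python. It is a generic teaching example, not Ayasdi’s or illumr’s actual algorithm, and the parameters (number of intervals, overlap, DBSCAN settings) are arbitrary choices for illustration:

```python
# A Mapper-style sketch of the TDA idea: project the data with a filter
# function, cover the projection with overlapping intervals, cluster within
# each interval, then link clusters that share points. The resulting graph is
# a network-chart view of the "shape of the data".
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def mapper_graph(X, n_intervals=10, overlap=0.3):
    f = PCA(n_components=1).fit_transform(X).ravel()   # filter (lens) function
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_intervals
    graph, members = nx.Graph(), {}
    for i in range(n_intervals):
        # an overlapping window over the filter's range
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) < 2:
            continue
        labels = DBSCAN(eps=0.5).fit_predict(X[idx])
        for lab in set(labels) - {-1}:                 # -1 means DBSCAN noise
            node = (i, int(lab))
            members[node] = set(idx[labels == lab])
            graph.add_node(node, size=len(members[node]))
    # link any two local clusters that share at least one data point
    nodes = list(members)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if members[u] & members[v]:
                graph.add_edge(u, v)
    return graph
```

Drawing the resulting graph (for example with networkx’s drawing functions) gives the kind of network chart described above, where connected blobs, branches and loops correspond to structure in the original n-dimensional data.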
The market leader in this sort of analytics is Ayasdi, and they make some bold, but supportable, claims about what the shape of the data can reveal – e.g. circular plots are indicative of time-series data.
The company I work for (illumr) takes a slightly different approach, but still with an emphasis on using shapes to draw the analyst’s eye to interesting aspects of the data. We don’t emphasise TDA, though our approach results in a topological model of the data. Debate rages (as only mathematicians can rage) on the interwebs about how TDA differs from dimensional collapse. This is somewhat missing the point, as the end result is still a dimensional collapse, but with the topology of the n-dimensional shape retained and presented in 2D (Ayasdi) or 3D (illumr).
There are different ways to achieve this. Ayasdi tend to focus on a persistent homology approach – extracting the overall topology and displaying it as nodes and links – where each node corresponds to either a single data point or a cluster of points, depending on how they connect (the topology). The illumr approach results in a node for every data point, as the emphasis there is on finding weak signals in the data.
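For a very simplified taste of the persistent-homology idea (again, a teaching sketch rather than either company’s actual pipeline), we can link points whose distance falls below a threshold and watch how the number of connected components changes as that threshold grows; features that survive across a wide range of thresholds are the “persistent” ones:

```python
# Link points closer than a threshold and count connected components (the
# simplest topological feature) as the threshold grows. Two synthetic blobs
# should persist as two components over a wide range of thresholds.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.2, (30, 2)),   # blob one
               rng.normal([3, 3], 0.2, (30, 2))])  # blob two
dist = squareform(pdist(X))

for eps in [0.1, 0.5, 1.0, 2.0, 5.0]:
    g = nx.Graph()
    g.add_nodes_from(range(len(X)))
    i, j = np.where(np.triu(dist < eps, k=1))
    g.add_edges_from(zip(i, j))
    print(f"threshold {eps}: {nx.number_connected_components(g)} components")
```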
For instance, illumr worked with Metropolitan, a leading social housing provider in the south of England who manage over 50,000 homes. Metropolitan have a legal obligation to safeguard their tenants, many of whom are on low incomes and have behavioural problems.
Using their existing methodologies, Metropolitan had been unable to find any signal in their data indicating which homes were likely to become the subject of domestic abuse.
Using illumr’s technology, we were able to identify four small but highly significant clusters of homes where the tenants were 10–50 times more likely to be the subject of abuse. The Head of Insight at Metropolitan is quoted as saying, “I can certainly recommend illumr; their ability to reveal unique insights was ground breaking”.
These signals would have been impossible to identify without illumr’s exploratory data analysis, guided by self-organising the data in 3D space. Most importantly, because the analysis is based on the natural, inherent structure within the data rather than blind machine learning, the interactions it reveals can be summarised in a human-understandable format.
Such findings are now empowering housing providers to better understand and predict the patterns of behaviour associated with their tenants – something they had previously been unable to do effectively.