A simple tutorial for visualization of large, high-dimensional data
I recently showed some examples of using Datashader for large-scale visualization (post here), and the examples seemed to catch people's attention at a workshop I attended earlier this week (Web of Science as a Research Dataset). Based on that interest, I've decided to write up a little tutorial here to share with people. Hope it's useful!
So, the overall goal here is to take a big text dataset where each document is represented by a high-dimensional feature vector, and arrive at a meaningful 2D visualization of those points.
For the example here, I'm using the abstracts of 20M+ publications from the Web of Science (WoS contains 50-60M papers in total, but abstracts are only available for those published after ~1991) to generate a Doc2Vec model that represents each document (abstract) as a 200-dimensional feature vector. I then visualize a random sample of 5M of these papers using a tool called LargeVis (doing more is totally possible, but the system I ran this particular analysis on had "only" 48GB of RAM, and attempting more than 5M documents gave me memory errors). You can of course use a similar procedure with other ways of representing documents (e.g. LDA).
Step 1: Generate feature vectors
The particulars of the preprocessing I do here are unique to my data format, and to the fact that I'm using Gensim's implementation of Doc2Vec, so I won't go into too much detail. The important thing (again assuming you're using Doc2Vec) is that you simply generate a single text file with one document per line. Don't forget to save some sort of index file so you know which line in your document file corresponds to which original document (we'll need that later when we want to enrich our visualization with metadata from the original documents).
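Something along these lines should work; this is a minimal sketch, where the `docs` iterable and the file names are placeholders for your own data source:

```python
from gensim.utils import simple_preprocess

# Write one preprocessed document per line, plus an index file mapping
# each line number back to the original document ID.
# `docs` is a placeholder for your own iterable of (doc_id, abstract) pairs.
with open('abstracts.txt', 'w') as corpus, open('abstracts_index.txt', 'w') as index:
    for doc_id, text in docs:
        tokens = simple_preprocess(text)  # lowercase, tokenize, strip punctuation
        corpus.write(' '.join(tokens) + '\n')
        index.write(str(doc_id) + '\n')
```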
Now we run the actual Doc2Vec model and save the results, which is all pretty straightforward. I'm using pretty standard parameters for Doc2Vec here, but there are good resources out there on parameter tuning if you're curious.
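Roughly, that looks like the sketch below. Parameter names follow recent Gensim releases (older versions used `size` instead of `vector_size`), and the hyperparameter values shown are reasonable defaults rather than tuned choices:

```python
import gensim

# Stream TaggedDocuments straight from the one-document-per-line file,
# tagging each document with its line number
class LineCorpus:
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield gensim.models.doc2vec.TaggedDocument(line.split(), [i])

model = gensim.models.Doc2Vec(LineCorpus('abstracts.txt'),
                              vector_size=200, window=5, min_count=10,
                              workers=8, epochs=10)
model.save('wos_doc2vec.model')
```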
Step 2: Generate our 2D embedding
Both t-SNE and LargeVis require particular (but quite simple) data formats. t-SNE's Python wrapper is actually easy to modify so that it reads the numpy array we generated above directly, but in any case I'm going to focus on using LargeVis in the standard way (though you could probably get it to work on numpy arrays directly if needed). This all assumes you've compiled and installed LargeVis following the instructions on their Github repo.
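Here's a sketch of writing the LargeVis input file: per the LargeVis README, the first line gives the number of points and the dimensionality, and each subsequent line is one whitespace-separated vector. I'm assuming the Doc2Vec vectors have already been pulled into a numpy array `vecs`:

```python
import numpy as np

# `vecs` is assumed to be an (n_docs, 200) array of Doc2Vec vectors;
# draw the random 5M-document sample and save the indices so we can
# join metadata back on later
sample = np.random.choice(len(vecs), size=5000000, replace=False)
np.save('sample_indices.npy', sample)

with open('largevis_input.txt', 'w') as f:
    f.write('{} {}\n'.format(len(sample), vecs.shape[1]))  # header: N dims
    for row in vecs[sample]:
        f.write(' '.join(map(str, row)) + '\n')

# then, from the shell (flag names per the LargeVis README):
#   ./LargeVis -input largevis_input.txt -output largevis_output.txt
```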
Now we have our 2D embedding!
Step 3: Visualization
Now let’s walk through an example visualization pipeline. There are of course plenty of different things you could do here, but this example will hopefully give you the basics. We’ll start with some preliminary imports, and define a little legend function we’ll use below.
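Here's one possible version of that boilerplate; the legend helper is my own little utility, not part of datashader:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Render a simple swatch legend for a {category: color} mapping
def create_legend(color_key, filename='legend.png'):
    handles = [Patch(color=color, label=cat) for cat, color in color_key.items()]
    fig = plt.figure(figsize=(4, 0.3 * len(color_key)))
    fig.legend(handles=handles, loc='center', frameon=False)
    fig.savefig(filename, bbox_inches='tight')
    plt.close(fig)
```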
Now, let's load our learned embedding, along with the indices of our random sample of documents, into a Pandas dataframe:
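A minimal version, assuming the file names from the LargeVis step above (the output file has a one-line header, then one "x y" pair per input row):

```python
# Skip the LargeVis header line, then read the 2D coordinates
df = pd.read_csv('largevis_output.txt', sep=' ', skiprows=1,
                 header=None, names=['x', 'y'])
df['doc_id'] = np.load('sample_indices.npy')  # indices saved earlier
```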
Now, for my case I have some Graphlab sframes in which I’ve previously stored a bunch of metadata on the abstracts we’re modeling. Your data will surely be different (and the use of Graphlab here is totally incidental), but the gist of what I’m doing here is joining the xy coordinates from the LargeVis 2D embedding with the metadata.
However you get there, what you want is a pandas dataframe with one row per document, with the xy coordinates and whatever metadata you care about using to enrich your visualization. (Datashader, which we'll be using below, also works nicely with out-of-core Dask dataframes, so don't feel like you're limited to data structures that fit in memory!). For the purposes of this example, the only metadata I care about is the WoS subheading for each paper, and the WoS-assigned subject categories.
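In pandas terms the join is a one-liner; `meta` here is a hypothetical dataframe of per-document metadata keyed on the same document IDs we saved in step 1:

```python
# `meta` is a placeholder for your own metadata table with a 'doc_id' key
df = df.merge(meta[['doc_id', 'subheading', 'categories']],
              on='doc_id', how='left')
df[['x', 'y', 'subheading', 'categories']].head(4)
```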
| x | y | subheading | categories |
|---|---|---|---|
| -0.947329 | -3.782668 | Life Sciences & Biomedicine | Pharmacology & Pharmacy |
| -6.668430 | 7.766551 | Life Sciences & Biomedicine | Biology; Mathematical & Computational Biology |
| 10.640243 | -16.039352 | Life Sciences & Biomedicine | Oncology; Radiology, Nuclear Medicine & Medical… |
| -21.193598 | 0.940371 | Criminology & Penology | null |
Now, datashader requires that we explicitly represent categorical data as categorical data types, so let's do that.
Let's also simplify the category data by just looking at the first category assigned to each paper. We won't convert it to a categorical data type yet, as we need to do some other preprocessing of the category data below first.
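Both steps are sketched below, using the column names from the example dataframe above and assuming the categories field is semicolon-separated as in the WoS data:

```python
# Represent the subheading as a categorical dtype for datashader
df['subheading'] = df['subheading'].astype('category')

# Keep only the first of the semicolon-separated subject categories;
# this column stays as plain strings until the preprocessing below
df['top_category'] = df['categories'].str.split(';').str[0].str.strip()
```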
Now we're (almost) ready to actually generate some visuals! There's a bit of boilerplate we'll want to start with for datashader. Most of it is adapted from this tutorial.
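Something like the following; the plot dimensions and padding are arbitrary choices, and the `export` helper just wraps datashader's `export_image` utility:

```python
from datashader.utils import export_image

background = 'black'
plot_width, plot_height = 1000, 600

# Bounds of the embedding, with a little padding
x_range = (df['x'].min() - 1, df['x'].max() + 1)
y_range = (df['y'].min() - 1, df['y'].max() + 1)

def export(img, filename='tempfile'):
    """Save a datashader image as a PNG in the working directory."""
    return export_image(img, filename, export_path='.', background=background)
```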
Now we can make an image with datashader. In the words of Datashader's readme:
Datashader is a graphics pipeline system for creating meaningful representations of large amounts of data. It breaks the creation of images into 3 main steps:
- Projection: Each record is projected into zero or more bins, based on a specified glyph.
- Aggregation: Reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate.
- Transformation: These aggregates are then further processed to create an image.
In terms of code, that requires that we (1) define a canvas, (2) define our (aggregated) glyphs, and (3) define the transformation function that turns the aggregation into an image. Note that I'm saving all my images as "tempfile" since I'm just working in the notebook anyway.
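Here's a minimal version of those three steps (`tf.shade` is the current name of the transformation function; older datashader releases called it `tf.interpolate`):

```python
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height,
                x_range=x_range, y_range=y_range)   # (1) the canvas
agg = cvs.points(df, 'x', 'y')                      # (2) per-pixel point counts
img = tf.shade(agg)                                 # (3) counts -> image
export(img, 'tempfile')
```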
Pretty neat (and as you’ll see when you try running it, fast too). But because the point density is pretty low for any given pixel, the structure is hard to see. We can fix this by using a non-linear shading scheme, like eq_hist (logarithmic shading is also an option). From the tutorial linked above:
…let’s try the image-processing technique called histogram equalization. I.e., given a set of raw counts, map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric – it will equalize any linearly or nonlinearly distributed integer data, regardless of the distribution:
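In code, that's just a different `how` argument to the shading step:

```python
# Same aggregation as before, with histogram-equalized shading
img = tf.shade(agg, how='eq_hist')  # 'log' and 'linear' are also options
export(img, 'tempfile')
```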
Ok, that looks better, and we have some nice structure. But of course we don’t know yet if this is just a pretty picture, or actually captures something meaningful about our data. Now let’s look at some methods for incorporating metadata that will help resolve this.
To accomplish this, we're going to define a more sophisticated plotting function that will colorize points based on either (a) the WoS subheading, or (b) the top category for each paper. In the latter case, there are more categories than distinguishable colors, so we'll design the function to only draw the top N most popular categories (for the example I'll use 9, which is about the maximum number of colors I've found easy to distinguish).
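A sketch of such a function is below. The key pieces are datashader's `count_cat` aggregator, which counts points per category per pixel, and the `color_key` argument to `tf.shade`, which mixes per-category colors; the palette choice and the function name are illustrative:

```python
from datashader.colors import Sets1to3  # a long categorical palette

def plot_category(df, col, top_n=9):
    # Restrict to the top_n most common categories in `col`
    top = list(df[col].value_counts().index[:top_n])
    subset = df[df[col].isin(top)].copy()
    subset[col] = subset[col].astype(str).astype('category')
    color_key = dict(zip(top, Sets1to3[:top_n]))

    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height,
                    x_range=x_range, y_range=y_range)
    agg = cvs.points(subset, 'x', 'y', ds.count_cat(col))  # counts per category
    img = tf.shade(agg, color_key=color_key, how='eq_hist')
    create_legend(color_key)  # the helper we defined earlier
    return export(img, 'tempfile')
```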
So let’s run this code, colorizing by WoS subheading:
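With the sketch function above, that's simply:

```python
plot_category(df, 'subheading')
```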
There we go! While it’s not perfect, we can clearly see that papers that cluster together in the Doc2Vec feature space also tend to be within the same broad category according to WoS (note that blue is blank because it corresponds to the social sciences and humanities top level WoS heading, for which there are no subheadings). To see the organization at a finer granularity, let’s do the same thing, but colored by the top 9 most common subject categories:
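Again with the hypothetical helper, this time on the simplified category column:

```python
plot_category(df, 'top_category', top_n=9)
```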
While it’s not quite as clean, and of course sparser (because we’re looking at a much smaller subset of papers), we can again see an impressive level of coherent organization, even though we had to go through a lot of preprocessing steps to get here.
To wrap up, I want to close with some further ideas on cool directions to go with datashader visualization. First off, for those of you who are used to working with Matplotlib (like me!), datashader is undoubtedly awesome, but it doesn't make it easy to do seemingly basic things like annotations, axis labels, etc. Luckily there's a workaround: because datashader generates rasterized images, it's actually pretty straightforward to pull them into a matplotlib figure with imshow, and then layer traditional matplotlib visual elements on top of them, like so:
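A sketch of that round trip, assuming the `export` helper above wrote `tempfile.png` to the working directory (the axis labels and annotation coordinates are made up for illustration):

```python
import matplotlib.image as mpimg

fig, ax = plt.subplots(figsize=(10, 6))
img = mpimg.imread('tempfile.png')

# `extent` maps the raster's pixel space back onto our data coordinates,
# so matplotlib elements land in the right place
ax.imshow(img, extent=[x_range[0], x_range[1], y_range[0], y_range[1]])
ax.set_xlabel('LargeVis dimension 1')
ax.set_ylabel('LargeVis dimension 2')
ax.annotate('an interesting cluster', xy=(10, -16), xytext=(18, -28),
            color='white', arrowprops=dict(color='white', arrowstyle='->'))
```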
From here, you can add anything you want to the image using the matplotlib (or seaborn, or whatever) tools you're used to. Just be careful to scale the matplotlib axis elements appropriately to your raster image. As a more practical example (from a different project), here's a similarity matrix I generated for 10,000 musical artists from Last.fm (so yes, datashader here is actually plotting 100 million data points, and it handles data at this scale with no problems whatsoever):
Using the same process as above, I imported my datashader PNG image into a matplotlib figure, defined custom tick labels and locations, and used axvline to annotate the image. Here artists are sorted by genre, and I label the regions of the image corresponding to each genre.
The last thing I'll introduce (only briefly) is interactive plotting. Datashader actually interacts very nicely with Bokeh to generate interactive plots. The results won't display nicely in the browser, so for now I'll just share the code and some screenshots.
The process is actually fairly simple, and requires that you first define a base Bokeh plot:
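For example (keyword names follow Bokeh 2.x; in Bokeh 3, `plot_width`/`plot_height` became `width`/`height`):

```python
from bokeh.plotting import figure, output_notebook

output_notebook()  # render plots inline in the notebook

p = figure(tools='pan,wheel_zoom,box_zoom,reset',
           plot_width=700, plot_height=500,
           x_range=x_range, y_range=y_range,
           background_fill_color='black')
p.axis.visible = False
p.grid.grid_line_color = None
```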
Next we define an image callback function (which is essentially the same as the image creation code we used before), and then generate the interactive image:
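At the time of writing, datashader shipped an `InteractiveImage` helper for exactly this (newer versions have moved interactivity to HoloViews). The callback re-aggregates whatever data falls in the current viewport, at the current on-screen resolution:

```python
from datashader.bokeh_ext import InteractiveImage

def image_callback(x_range, y_range, w, h):
    # Re-render for the current viewport and resolution
    cvs = ds.Canvas(plot_width=w, plot_height=h,
                    x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'x', 'y')
    return tf.shade(agg, how='eq_hist')

InteractiveImage(p, image_callback)
```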
Again, you'll have to run this yourself to get the interactive version, but what you end up with should look something like this (note the interaction controls in the corner):
Where the real power of combining Datashader and Bokeh shows is when you zoom in:
When you zoom, Datashader dynamically re-renders the visualization to increase the granularity. It's awesome to see in action, and I strongly recommend you look at the Datashader census tutorial to see some exceptionally elegant illustrations of how data rendering at different zoom levels works.
Well, that’s all for now. Please feel free to reach out to jared.j.lorince (at) gmail (dot) com with any thoughts or questions! The notebook of this tutorial is available here, but unfortunately I am not able to share the raw data.