
Exploring the State Papers with Word Embeddings

Text and Models

Digital Humanities is often concerned with creating models of text: a general name for a kind of representation of text which makes it in some way easier to interpret. TEI-encoded text is an example of a model: we take the raw material of a text document and add elements to it to make it easier to work with and analyse.

Models are often further abstracted from the original text. One way we can represent text in a way that a machine can interpret is with a word vector. A word vector is simply a numerical representation of a word within a corpus (a body of text, often a series of documents), usually consisting of a series of numbers in a specified sequence. This type of representation is used for a variety of Natural Language Processing tasks - for instance measuring the similarity between two documents.

This post uses a couple of R packages and a method for creating word vectors called GloVe, which learns vectors from word co-occurrence counts, to produce a series of vectors which give useful clues as to the semantic links between words in a corpus. The method is then used to analyse the printed summaries of the English State Papers, from State Papers Online, and to show how they can be used to understand how the association between words and concepts changed over the course of the seventeenth century.

What is a Word Vector, Then?

Imagine you have two documents in a corpus. One of them is an article about pets, and the other is a piece of fiction about a team of crime fighting animal superheroes. We’ll call them document A and document B. One way to represent the words within these documents as a vector would be to use the counts of each word per document.

To do this, you could give each word a set of coordinates, \(x\) and \(y\), where \(x\) is a count of how many times the word appears in document A and \(y\) the number of times it appears in document B.

The first step is to make a dataframe with the relevant counts:

library(ggrepel)
library(tidyverse)
word_vectors = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
      x = c(10, 8, 6, 5, 6, 5, 2, 1),
      y = c(0, 1, 3, 5, 8, 8, 10, 9))

word_vectors
## # A tibble: 8 x 3
##   word       x     y
##   <chr>  <dbl> <dbl>
## 1 crufts    10     0
## 2 feed       8     1
## 3 cat        6     3
## 4 dog        5     5
## 5 mouse      6     8
## 6 rabbit     5     8
## 7 cape       2    10
## 8 hero       1     9

This data can be represented as a two-dimensional plot where each word is placed on the x and y axes based on their x and y values, like this:

ggplot() + 
  geom_point(data = word_vectors, aes(x, y), size =4, alpha = .7) + 
  geom_text_repel(data = word_vectors, aes(x, y, label = word)) + 
  theme_bw() + 
  labs(title = "Words Represented in Two-dimensional Space") + 
  theme(title = element_text(face = 'bold')) + 
  scale_x_continuous(breaks = 1:10) + 
  scale_y_continuous(breaks = 1:10)

Each word is represented as a vector of length 2: ‘rabbit’, for example, is a vector containing the two numbers {5, 8}. Using very basic maths we can calculate the euclidean distance between any pair of words. More or less the only thing I can remember from secondary school maths is how to calculate the distance between two points on a graph, using the following formula:

\[ \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2 } \]

where \((x_1, y_1)\) are the coordinates of the first point and \((x_2, y_2)\) those of the second. This can easily be turned into a function in R, which takes two sets of coordinates (the arguments pointA and pointB) and returns the euclidean distance:

euc.dist <- function(pointA, pointB) sqrt(sum((pointA - pointB) ^ 2))

To get the distance between crufts and mouse, set pointA to the \(x\) and \(y\) coordinates for the first entry in the dataframe of coordinates we created above, and pointB to the coordinates for the fifth entry:

pointA = c(word_vectors$x[1], word_vectors$y[1])
pointB = c(word_vectors$x[5], word_vectors$y[5])

euc.dist(pointA, pointB)
## [1] 8.944272

Representing a pair of words as vectors and measuring the distance between them is commonly used to suggest a semantic link between the two. For instance, the distance between hero and cape in this corpus is small, because they have similar properties: they both occur mostly in the document about superheroes and rarely in the document about pets.

pointA = c(word_vectors$x[word_vectors$word == 'hero'], word_vectors$y[word_vectors$word == 'hero'])

pointB = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])

euc.dist(pointA, pointB)
## [1] 1.414214

This suggests that the model has ‘learned’ that in this corpus, hero and cape are semantically more closely linked than other pairs in the dataset. The distance between cape and feed, on the other hand, is large, because one appears often in the superhero story and rarely in the pets article, and vice versa.

pointA = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])

pointB = c(word_vectors$x[word_vectors$word == 'feed'], word_vectors$y[word_vectors$word == 'feed'])

euc.dist(pointA, pointB)
## [1] 10.81665

Multi-Dimensional Vectors

These vectors, each consisting of two numbers, can be thought of as two-dimensional vectors: a type which can be represented on a 2D scatterplot as \(x\) and \(y\). It’s very easy to add a third dimension, \(z\):

word_vectors_3d = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
      x = c(10, 8, 6, 5, 6, 5, 2, 1),
      y = c(0, 1, 3, 5, 8, 8, 10, 9),
      z = c(1,3,5,2,7,8,4,3))

Just like the plot above, we can plot the words, this time in three dimensions, using Plotly:

library(plotly)

plot_ly(data = word_vectors_3d, x =  ~x, y = ~y,z =  ~z, text = ~word) %>% add_markers()

You can start to understand how the words now cluster together in the 3D plot: rabbit and mouse are clustered together, but now in the third dimension they are further away from dog. We can use the same formula as above to calculate these distances, just by adding the z coordinates to the pointA and pointB vectors:

pointA = c(word_vectors_3d$x[word_vectors_3d$word == 'dog'], word_vectors_3d$y[word_vectors_3d$word == 'dog'], word_vectors_3d$z[word_vectors_3d$word == 'dog'])
pointB = c(word_vectors_3d$x[word_vectors_3d$word == 'mouse'], word_vectors_3d$y[word_vectors_3d$word == 'mouse'], word_vectors_3d$z[word_vectors_3d$word == 'mouse'])

euc.dist(pointA, pointB)
## [1] 5.91608

The nice thing about the method is that while my brain starts to hurt when I think about more than three dimensions, the maths behind it doesn’t care: you can just keep plugging in longer and longer vectors and it’ll continue to calculate the distances as long as they are the same length. This means you can use this same formula not just when you have x and y coordinates, but also z, a, b, c, d, and so on for as long as you like. This is often called ‘representing words in multi-dimensional euclidean space’, or something similar. Which means that if you represent all the words in a corpus as a long vector (series of coordinates), you can quickly measure the distance between any two.
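
To see this in action, here’s the same euc.dist() function applied to a pair of made-up five-dimensional vectors (the numbers are arbitrary, purely for illustration):

pointD = c(1, 3, 5, 7, 9)
pointE = c(2, 4, 6, 8, 10)

euc.dist(pointD, pointE)
## [1] 2.236068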

In a large corpus with a properly-constructed vector representation, the semantic relationships between the words start to make a lot of sense. What’s more, because it’s all just vector maths, you can add and subtract word vectors to produce new vectors, and then find the existing words closest to the result. Here, we create a new vector, pointC, which is pointA - pointB (dog minus mouse). Then we loop through each word, calculate its distance to this new vector, and display the results in a new dataframe:

pointC = pointA - pointB

df_for_results = tibble()
for(i in 1:8){
  # coordinates for word i
  word_coords = c(word_vectors_3d$x[i], word_vectors_3d$y[i], word_vectors_3d$z[i])
  u = tibble(dist = euc.dist(pointC, word_coords), word = word_vectors_3d$word[i])
  df_for_results = rbind(df_for_results, u)
}

df_for_results %>% arrange(dist)
## # A tibble: 8 x 2
##    dist word  
##   <dbl> <chr> 
## 1  12.2 dog   
## 2  12.7 feed  
## 3  12.9 crufts
## 4  13.6 cat   
## 5  14.6 hero  
## 6  16.1 cape  
## 7  17.7 mouse 
## 8  18.1 rabbit

We see that, in this tiny example, the vector closest to dog minus mouse is dog itself, followed by feed. With only three hand-made dimensions the result isn’t very meaningful, but the same arithmetic applied to vectors learned from a real corpus can surface genuinely interesting relationships between words.
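
If you find yourself doing this kind of lookup repeatedly, the loop above can be wrapped in a small helper function. This is just a sketch of my own (not from the original walkthrough), and it assumes the coordinate table has columns named x, y and z, as word_vectors_3d does:

# rank every word in a coordinate table by its distance to a target vector
closest_words = function(coords, target) {
  coords %>%
    rowwise() %>%
    mutate(dist = euc.dist(c(x, y, z), target)) %>%
    ungroup() %>%
    arrange(dist)
}

# returns the same ranked table as the loop above
closest_words(word_vectors_3d, pointC)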

From Vectors to Word Embeddings

These vectors are also known as word embeddings. Real algorithms base the vectors on more sophisticated measures than the simple counts I used above. Some, such as Word2Vec or GloVe, work from co-occurrence statistics: the likelihood that each pair of words in a corpus will co-occur within a set ‘window’ of words either side. Word2Vec learns its vectors with a shallow neural network, while GloVe fits them directly to the co-occurrence counts, and the resulting models are often pre-trained on enormous corpora of text. The vectors are often used to represent the relationships between modern meanings of words, to track semantic changes over time, or to understand the history of concepts, though it’s worth pointing out they’re only as representative as the corpus used (many are trained on sources such as Wikipedia or Reddit, which tend to be produced by relatively privileged groups, so there’s a danger of bias towards those groups).

Word embeddings are often critiqued as reflecting or propagating the biases of their source texts (I highly recommend Kaspar Beelen’s post and tools to understand more about this). The source used here is a corpus consisting of the printed summaries of the Calendars of State Papers, which I’ve described in detail here. As such it is likely highly biased, but if the purpose of an analysis is historical - for example, to understand how a concept was represented at a given time, by a specific group, in a particular body of text - I argue that the biases captured by word embeddings can be seen as a research strength rather than a weakness.

The data is in no way representative of early modern text more generally, and, what’s more, the summaries were written in the nineteenth century and so will reflect what editors at the time thought was important. In these two ways, the corpus will reproduce a very particular worldview of a very specific group, at a very specific time. Because of this, we can use the embeddings to get an idea of how certain words or ideas were semantically linked, specifically in the corpus of calendar abstracts. The data will not show us how early modern concepts were related, but it might help to highlight conceptual changes in words within the information apparatus of the state.

The following instructions are adapted from the text2vec package vignette and this tutorial. First, load the packages we’ll need, along with tidytext’s list of very common ‘stop words’, and set a seed so that the results are reproducible:

library(text2vec)
library(tidytext)
library(textstem)
data("stop_words")
set.seed(1234)

Next, load and pre-process the abstract text:

spo_raw = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/fromto_all_place_mapped_stuart_sorted', delim = '\t', col_names = F)
spo_mapped_people = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/people_docs_stuart_200421', delim = '\t', col_names = F)

load('/Users/yannryanpersonal/Documents/blog_posts/g')
g = g %>% group_by(path) %>% summarise(value = paste0(value, collapse = "<br>"))

spo_raw = spo_raw %>%
  mutate(X7 = str_replace(X7, "spo", "SPO")) %>%
  separate(X7, into = c('Y1', 'Y2', 'Y3'), sep = '/') %>%
  mutate(fullpath = paste0("/Users/Yann/Documents/non-Github/spo_xml/", Y1, '/XML/', Y2, "/", Y3)) %>%
  mutate(uniquecode = paste0("Z", 1:nrow(spo_raw), "Z"))

withtext = left_join(spo_raw, g, by = c('fullpath' = 'path')) %>%
  left_join(spo_mapped_people %>% dplyr::select(X1, from_name = X2), by = c('X1' = 'X1')) %>%
  left_join(spo_mapped_people %>% dplyr::select(X1, to_name = X2), by = c('X2' = 'X1'))

Tokenise the text using the tidytext function unnest_tokens(), remove stop words, lemmatise the text (reduce each word to its dictionary form) using textstem, and do a couple of other bits to tidy up. This creates a new dataset with one row per word - the basis for the algorithm input.

words = withtext %>% 
  ungroup() %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% 
  anti_join(stop_words) %>% 
  mutate(word = lemmatize_words(word)) %>% 
  filter(!str_detect(word, "[0-9]{1,}")) %>% 
  mutate(word = str_remove(word, "\\'s"))

Create a ‘vocabulary’: a list of each unique word found in the dataset and the number of times it occurs. Then ‘prune’ it to include only words which occur at least five times.

words_ls = list(words$word)

it = itoken(words_ls, progressbar = FALSE)

vocab = create_vocabulary(it)

vocab = prune_vocabulary(vocab, term_count_min = 5)
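
At this point it can be useful to glance at what’s left. The vocabulary is just a data frame of terms and counts, so, for example, you can list the most frequent terms with something like the following (a quick check of my own, not part of the original walkthrough):

# the ten most frequent terms remaining after pruning
vocab %>% arrange(desc(term_count)) %>% head(10)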

With the vocabulary, construct a ‘term co-occurrence matrix’: a matrix with a row and a column for every word, counting all the times each word co-occurs with every other word within a window, which is set with the argument skip_grams_window. A window of 5 seems to give me good results - I think because many of the documents are so short.

vectorizer = vocab_vectorizer(vocab)

# use a window of 5 for context words
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
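
As a quick sanity check (my own addition, not part of the original walkthrough), the term co-occurrence matrix should be a sparse, square matrix with one row and one column per vocabulary term:

# should return two identical numbers, both equal to the vocabulary size
dim(tcm)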

Now use the GloVe algorithm to train the model and produce the vectors, over a set number of iterations: here we’ve used 20, which seems to give good results. rank is the number of dimensions we want in the resulting vectors. x_max sets the co-occurrence count above which a word pair gets no extra weight - keeping it relatively low means the result won’t be skewed towards a small number of word pairs that occur together hundreds of times. The algorithm can be quite slow, but as it’s a relatively small dataset (in comparison to something like the entire English Wikipedia), it shouldn’t take too long to run - a couple of minutes for 20 iterations.

glove = GlobalVectors$new(rank = 100, x_max = 100)

wv_main = glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [13:10:58.951] epoch 1, loss 0.0539 
## INFO  [13:11:12.651] epoch 2, loss 0.0318 
## INFO  [13:11:26.593] epoch 3, loss 0.0261 
## INFO  [13:11:40.493] epoch 4, loss 0.0234 
## INFO  [13:11:54.449] epoch 5, loss 0.0217 
## INFO  [13:12:08.337] epoch 6, loss 0.0204 
## INFO  [13:12:22.475] epoch 7, loss 0.0195 
## INFO  [13:12:36.630] epoch 8, loss 0.0187 
## INFO  [13:12:50.738] epoch 9, loss 0.0181 
## INFO  [13:13:04.883] epoch 10, loss 0.0176 
## INFO  [13:13:18.997] epoch 11, loss 0.0172 
## INFO  [13:13:32.954] epoch 12, loss 0.0168 
## INFO  [13:13:46.961] epoch 13, loss 0.0165 
## INFO  [13:14:00.915] epoch 14, loss 0.0162 
## INFO  [13:14:14.873] epoch 15, loss 0.0159 
## INFO  [13:14:28.856] epoch 16, loss 0.0157 
## INFO  [13:14:42.807] epoch 17, loss 0.0155 
## INFO  [13:14:56.780] epoch 18, loss 0.0153 
## INFO  [13:15:10.708] epoch 19, loss 0.0151 
## INFO  [13:15:24.759] epoch 20, loss 0.0149

GloVe produces two sets of word vectors: the main vectors and the context vectors. The authors of the GloVe method suggest that combining the two results in higher-quality embeddings:

wv_context = glove$components

# Either word-vector matrix could work on its own, but the developers of the
# technique suggest that the sum/mean of the two may work better
word_vectors = wv_main + t(wv_context)
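
Before moving on to visualisation, one quick way to check that the embeddings are at all sensible is to look at the nearest neighbours of a single term, using cosine similarity via text2vec’s sim2() function. This is a sanity check of my own rather than part of the main walkthrough, and ‘parliament’ is simply a guess at a word likely to appear in the State Papers vocabulary - substitute any term from vocab$term:

# vector for a single word (the rows of word_vectors are named by term)
query = word_vectors["parliament", , drop = FALSE]

# cosine similarity between this word and every word in the vocabulary
cos_sim = sim2(x = word_vectors, y = query, method = "cosine", norm = "l2")

# the ten most similar terms
head(sort(cos_sim[, 1], decreasing = TRUE), 10)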

Reducing Dimensionality for Visualisation

Now that’s done, it’d be nice to visualise the results as a whole. This isn’t strictly necessary: as I mentioned earlier, the computer doesn’t care how many dimensions you give it when working out the distances between words. However, in order to visualise the results, we can reduce the 100 dimensions down to two or three and plot them. We can do this with an algorithm called UMAP.

There are a number of parameters which can be set - the most important is n_components, the number of dimensions, which should be set to two or three so that the results can be plotted.

library(umap)
glove_umap <- umap(word_vectors, n_components = 3, metric = "cosine", n_neighbors = 25, min_dist = 0.01, spread=2)

df_glove_umap <- as.data.frame(glove_umap$layout, stringsAsFactors = FALSE)

# Add the labels of the words to the dataframe
df_glove_umap$word <- rownames(df_glove_umap)
colnames(df_glove_umap) <- c("UMAP1", "UMAP2", "UMAP3", "word")

Next, use Plotly as above to visualise the resulting three dimensions:

plot_ly(data = df_glove_umap, x =  ~UMAP1, y = ~UMAP2, z =  ~UMAP3, text = ~word, alpha = .2, size = .1) %>% add_markers(mode = 'text')