The Networking Archives project is reconciling three separate datasets—Early Modern Letters Online, the Tudor State Papers Online, and Stuart State Papers Online—into one meta-archive. One commonality between the three is that they all, to some degree, contain missing and partial data—potentially a source of anxiety when we come to consider the veracity of our findings. In a recent paper authored by some of the project team, presented at the first Computational Humanities Research Workshop, we outlined some strategies for dealing with missing data, and argued that perhaps we shouldn’t be so worried after all.
First we set out to understand the data in detail, and to this end we’re working on a set of ‘views’ which visualise the shape of the data along different dimensions. What struck us first is how remarkably similar the State Papers data looks to EMLO, despite their very different origins. These visualisations also help us to analyse the precise ways in which the data is missing and partial: we’re mapping absences as well as presences. Mapping absences has shown us, for example, that dates in SPO were more reliable during the secretaryship of the bureaucrat-extraordinaire Joseph Williamson, and less so during the interregnum. The views also show that some types of missing data are more closely correlated than others: statistically, a record in EMLO that is missing a date is significantly more likely than chance to be missing an author or recipient as well, but a missing date has less bearing on whether that record will also lack an origin or a destination. Potentially these findings can help us to model in even greater detail the effects of very specific types of absence in the data.
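To make the idea of correlated absences concrete, here is a minimal Python sketch that compares how often one field is missing overall with how often it is missing among records that already lack a date. It is an illustration only, not the project's code: the field names (`date`, `author`, `origin`), the use of `None` for a missing value, and the toy records are all assumptions. It also only compares rates; a claim of statistical significance would need a proper test on the real data.

```python
def missing_rate(records, field):
    """Share of records in which `field` is missing (None or absent)."""
    return sum(r.get(field) is None for r in records) / len(records)


def conditional_missing_rate(records, field, given):
    """Share of records missing `field` among those already missing `given`.

    Returns None when no record is missing `given`.
    """
    subset = [r for r in records if r.get(given) is None]
    if not subset:
        return None
    return missing_rate(subset, field)


# Invented toy records, for illustration only.
records = [
    {"date": None, "author": None, "origin": "London"},
    {"date": None, "author": None, "origin": None},
    {"date": None, "author": "A", "origin": "London"},
    {"date": "1640", "author": "B", "origin": None},
    {"date": "1641", "author": "A", "origin": "Paris"},
    {"date": "1642", "author": None, "origin": "London"},
]

for field in ("author", "origin"):
    base = missing_rate(records, field)
    given_no_date = conditional_missing_rate(records, field, "date")
    print(f"{field}: missing overall {base:.2f}, "
          f"missing when date is missing {given_no_date:.2f}")
```

In this toy example the author field is missing more often among undated records than overall, while the origin field is not, which is the kind of pattern described above.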
Many of the findings on the Networking Archives project are based on network science metrics. We might, for example, use a ranked list of a particular metric to make a claim about an individual’s proximity to the centre of power, or to find individuals who acted as ‘sustainers’ between different parts of a network. To understand the impact of the missing data described above on these kinds of rankings, we ran a series of experiments inspired by Matthew Peeples’s work on archaeological networks. To put it simply, we removed random chunks of letter records from the datasets, re-ran the network algorithms, and compared how individuals ranked on each metric in the original and ‘sample’ networks. Surprisingly, we found that the rankings for most metrics stayed very similar, even when 60 or 70% of the network was removed.
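The shape of such an experiment can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure from the paper: it uses plain degree (number of distinct correspondents) as a stand-in for the network metrics, Spearman's rank correlation as the way of comparing rankings, and letters represented as hypothetical `(sender, recipient)` pairs.

```python
import random


def degree_scores(letters):
    """Count distinct correspondents for each person (unweighted degree)."""
    neighbours = {}
    for sender, recipient in letters:
        neighbours.setdefault(sender, set()).add(recipient)
        neighbours.setdefault(recipient, set()).add(sender)
    return {person: len(others) for person, others in neighbours.items()}


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (var_x * var_y) ** 0.5


def rank_stability(letters, fraction_removed, seed=0):
    """Drop a random fraction of letters and compare degree rankings.

    People who vanish from the sampled network are scored 0 there.
    """
    rng = random.Random(seed)
    keep = round(len(letters) * (1 - fraction_removed))
    sample = rng.sample(letters, keep)
    full, partial = degree_scores(letters), degree_scores(sample)
    people = sorted(full)
    return spearman([full[p] for p in people],
                    [partial.get(p, 0) for p in people])
```

Calling `rank_stability(letters, 0.6)` with several different seeds and averaging the results gives a rough sense of how stable a ranking is when 60% of the records are removed; a value close to 1 means the ranking has barely changed.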
The last part of the paper is about why we’re interested in studying these joined-up catalogues. One reason is that it allows us to find new, ‘informal’ catalogues at the intersection of the formal collections. Take the example of John Dury: a Scottish minister who worked as a diplomat and campaigned for peace amongst Christian factions, he spent much of his life travelling across Europe trying to convince secular leaders of his cause. As a result, rather than being collected in a single ‘Dury Archive’, his letters are scattered across a number of others (many are in the archive of his friend Samuel Hartlib, but we found him in eight other catalogues in EMLO as well as in the Stuart State Papers). Computational methods allow us to find other individuals like this and, in the case of Dury, to gather his dispersed correspondence into a single, informal catalogue, and through this get a more complete picture of his role in seventeenth-century religious, intellectual and diplomatic networks.
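As a rough illustration of how dispersed individuals might be found computationally, the Python sketch below counts the number of catalogues each person appears in and then pulls one person's letters together. The record layout (`author`, `recipient`, `catalogue` keys) and the sample data are invented for the example, and it assumes that names have already been reconciled across the datasets, which in practice is a large part of the work.

```python
from collections import defaultdict


def catalogues_by_person(letters):
    """Map each person to the set of catalogues in which they appear."""
    spread = defaultdict(set)
    for letter in letters:
        for person in (letter["author"], letter["recipient"]):
            if person is not None:
                spread[person].add(letter["catalogue"])
    return spread


def informal_catalogue(letters, person):
    """Gather every letter written by or sent to `person`, whatever its source."""
    return [l for l in letters if person in (l["author"], l["recipient"])]


# Invented records, for illustration only.
letters = [
    {"author": "P1", "recipient": "P2", "catalogue": "Catalogue A"},
    {"author": "P3", "recipient": "P1", "catalogue": "Catalogue B"},
    {"author": "P1", "recipient": None, "catalogue": "Catalogue C"},
    {"author": "P2", "recipient": "P3", "catalogue": "Catalogue A"},
]

spread = catalogues_by_person(letters)
# People ordered by how many catalogues they are scattered across.
most_dispersed = sorted(spread, key=lambda p: len(spread[p]), reverse=True)
print(most_dispersed[0], len(informal_catalogue(letters, most_dispersed[0])))
```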
Historians are often—understandably—skeptical about quantitative results of this kind, because working in historical archives makes one only too aware of their partial, often chaotic nature. We suggest that in terms of network science at least, this partiality has less effect than might be expected. In fact, what we’ve discovered is that in most fields using network analysis, complete data is more of an illusion than a fact, and that we should work around absence, rather than without it.