Definition of a ‘source’ in OSINT

The most rewarding part of lecturing is when students come with questions that you cannot answer immediately. While some may feel that a lecturer should be able to answer every question, I like to be challenged. Student questions often challenge what I take for granted; in fact, many concepts seem clear until you really start thinking about them.

Recently, during the last of my series of lectures on OSINT in Qualitative Analysis Techniques for Intelligence Studies (part of the Leiden University Intelligence Minor), a discussion ensued on the definition of a source. The key question was: ‘What actually is a ‘source’ in open sources?’

The relevant context is that we ask the students to validate the ‘sources’ they use for their research. The question was, what should we consider as the ‘source’: the (library) databases which we query (e.g. Leiden University library), the journals or media that contain potentially relevant information (e.g. the Journal of Water & Climate Change), or an actual article in a journal or newspaper which contains relevant information for the task?

All three options (library database, journal and article) appear, in one way or another, to meet the dictionary definition of a source as ‘someone or something that supplies information’. As a result, defining what a ‘source’ is in OSINT may be more complicated than, for example, in HUMINT, where the ‘source’ is by definition a human who supplies information, overtly or covertly.

This blogpost briefly dives into the definition of a ‘source’ in open sources and we’ll come to the conclusion that there might not be a simple unambiguous answer.

Example

To aid the discussion, let’s take another concrete example. Suppose we’re looking for information on the development of Fractional Orbital Bombardment Systems (FOBS) and the LexisNexis database is our first stop. In LexisNexis you would most likely find a recent article from the Financial Times in which journalist Demetri Sevastopulo, based on ‘five people familiar with the test’, breaks the news that in August 2021 China tested a system that could qualify as a FOBS. (If you are interested in the specifics, listen to the relevant episode of the Arms Control Wonk Podcast.)

However, when doing open-source collection, what do we now consider to be the ‘source’? Initially LexisNexis was queried as a source, which revealed that the archive of the Financial Times holds a relevant article on the research subject. Is LexisNexis the source? The Financial Times? Or the article itself?

In general this question does not seem very relevant as long as you can reference exactly the information found. A challenge arises, however, when you need to evaluate the source, which is a standard element in intelligence analysis (pdf) and aids our understanding of the value of the information supplied. See also my blogposts on information grading systems.

Which ‘source’ should be evaluated, however? The obvious answer may seem to be ‘the article’, as that is the original source which contains the actual information. On closer consideration, our confusion most likely has to do with the aggregation of information in OSINT at multiple levels. Going back to the FOBS example, the original source was no doubt the article by Demetri Sevastopulo, who amalgamated different pieces of information into it. In the current information landscape, articles are in turn collated and published by newspapers, that is, at an aggregate level.

At a still higher level of aggregation, LexisNexis collates information from tens of thousands of newspapers, databases and other sources into a single database, which is therefore also an aggregated source. However, we still do not have a comprehensive answer. Let’s see what the experts say.
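As an aside, the layering in the FOBS example can be written down explicitly. The following is only a minimal sketch in Python: the `Layer` class, its fields and the `chain` helper are names I made up for illustration, not part of any OSINT standard or tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    """One level at which a piece of information is held (hypothetical model)."""
    name: str
    kind: str                            # e.g. "article", "newspaper", "database"
    found_in: Optional["Layer"] = None   # the next, more aggregated level


def chain(layer: Optional[Layer]) -> list:
    """Walk from the given layer up to the highest level of aggregation."""
    names = []
    current = layer
    while current is not None:
        names.append(current.name)
        current = current.found_in
    return names


# The FOBS example: article -> newspaper -> database.
lexisnexis = Layer("LexisNexis", "database")
financial_times = Layer("Financial Times", "newspaper", found_in=lexisnexis)
article = Layer("FOBS article by Demetri Sevastopulo", "article",
                found_in=financial_times)

print(chain(article))
```

Recording all three levels does not settle which one is the ‘source’, but it does make explicit at which level of aggregation an evaluation is being made.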

What do the experts say?

So what does the OSINT literature say on the definition of a ‘source’? Well, in fact, not much. None of the articles on OSINT in the OSINT Library provide a definition of ‘source’. Many articles do discuss the definition of ‘open’ and on this point the literature does show some discussion. However, I have not been able to identify any discussion on the definition of a ‘source’ in the OSINT literature.

The ICD 206 standard (pdf) does contain a definition of a source which reads: “An originator or discrete parcel of data or information that provides material that comprises, contributes to, affects, or is used to evaluate, the basis for intelligence analysis.” However, this is not a specific definition for open sources as the explanation shows: “A source can be a person, document, passage, quotation, data record, database, tweet, email, book, web page, etc.” Nonetheless, it will be helpful to keep the key element ‘originator or discrete parcel of data or information’ in mind.

Now it was time to consult an expert, so I contacted the doyen of Dutch OSINT, Arno Reuser, about this problem. He confirmed that the OSINT literature likely does not cover this definitional discussion and pointed me towards the literature of information specialists (formerly known as ‘librarians’).

Arno pointed out, and I agree, that the only true source in OSINT is the original document in which the information was recorded and relayed for the first time. In other words, what ICD 206 refers to as the ‘originator’. It is therefore most likely helpful to use the adjective ‘primary’ for these original sources, especially in contrast to the other types of sources. Any other source which holds information derived from the primary source is at best a secondary source. These often offer analysis or restatement of the data contained in primary sources, but also think of translations or collections of primary sources.

So are newspapers therefore secondary sources? That may sound odd as it is very common to cite a newspaper as the ‘source’. However, what is a news’paper’ in the current digital age? Published articles aren’t dependent on a specific paper edition anymore and are more often than not available online long before being actually printed on paper (if they still are).

If we look at the Financial Times, we can see that the digital version has existed since 1995 (pdf). So for any reference to an article before that moment, the actual document in which the information was recorded and relayed for the first time was the specific printed edition of the whole newspaper. Newspaper articles were then an inseparable part of the whole newspaper of that day (or that edition), so you may argue that the newspaper was the primary source. However, that has changed completely, to the extent that you could argue that the newspaper (website) has become a secondary source in which you may find a representation of the primary sources (the individual articles) or information derived from primary sources.

The question whether a newspaper is a primary or secondary source isn’t fully answered though and may depend on the circumstances. To quote Arno: “A primary source however is always a document, broadly meaning any object that serves the purpose of providing information or is assigned that purpose of providing information that is contained in it”. And, as he pointed out to me, while information specialists still have further discussions on the definition of a document, in any case they define ‘document’ much more broadly than the general dictionary definition of the word. Any object that can contain information could therefore be a source.

Lastly, how shall we qualify LexisNexis in this context? While it is certainly not a primary source (as it does not record and relay original information), there are arguments to say that it is a secondary source, as it relays information from primary sources. However, it can also be argued that since LexisNexis (just like similar information providers) has no editorial control over the content, it should not be included in the category of sources. Even though such providers perform some forms of curation (indexing, categorising), classifying them as information providers or aggregators appears to fit better. In fact, LexisNexis indeed classifies itself as a global content aggregator.

Conclusion

While I’m not completely confident that I have explored the use of the word ‘source’ in OSINT comprehensively, in the interim, with gratitude to Arno Reuser, a working approach appears to be to speak of ‘primary sources’, ‘secondary sources’, and ‘aggregators’ (which some define as ‘tertiary sources’).

The primary source in OSINT is always a document (in the broad sense) in which the information was recorded and relayed for the first time. Any sources that hold collections of primary sources or representations of them (such as translations) are therefore secondary sources.
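For readers who like to see such a working approach spelled out, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names `SourceType`, `Item` and `classify` are hypothetical, and the two yes/no fields are judgments the analyst still has to make; they merely encode the criteria discussed above (first recording of the information, and editorial control over the content).

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    PRIMARY = "primary source"
    SECONDARY = "secondary source"
    AGGREGATOR = "aggregator"


@dataclass(frozen=True)
class Item:
    """Something we collected from (hypothetical model, analyst-supplied flags)."""
    name: str
    first_recording: bool    # records and relays the information for the first time
    editorial_control: bool  # whoever publishes it decides on the content


def classify(item: Item) -> SourceType:
    """Apply the working approach: primary, secondary, or aggregator."""
    if item.first_recording:
        return SourceType.PRIMARY
    if item.editorial_control:
        return SourceType.SECONDARY
    return SourceType.AGGREGATOR


# The FOBS example, with flags set according to the discussion above.
examples = [
    Item("FOBS article by Demetri Sevastopulo", True, True),
    Item("Financial Times (website)", False, True),
    Item("LexisNexis", False, False),
]

for item in examples:
    print(f"{item.name}: {classify(item).value}")
```

The sketch deliberately keeps the hard part outside the code: deciding whether something is the first recording, as the newspaper discussion shows, may depend on the circumstances.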

Given the vastness of the OSINT landscape, I’m sure there will be situations where we’ll struggle with the definition again. I will come back to this discussion when I dive further into an evaluation methodology for open sources in a future blog. Meanwhile, comments and suggestions are always welcome.

(updated 10 November 2021)