Challenges in Social Network Analysis of social media data

Recently I watched an excellent talk on Social Network Analysis (SNA) given at the German Open Source Conference (GOSINTCon). In the last presentation of the day, Rebecca Zinke explained what SNA is and discussed its possibilities for investigations using social media data. (NB: The recording of the conference is fully available online. All contributions are in German, but if you understand the language even slightly, you will be able to follow most of it.)

I was happy to see SNA being discussed in more detail, because too often SNA is just referred to as a ‘buzz word’ or some sort of magical tool, without much explanation of what it is actually about. Even worse, it is sometimes conflated with ‘link analysis’, which is a completely different field altogether.

However, in spite of the concise and understandable manner in which Zinke explained SNA, there are a number of challenges in using SNA on social media data for investigations that deserve more attention. In this blog post I will focus on the three key challenges which, I believe, will be encountered in every investigative SNA project and which are most relevant to social media data. These challenges relate to:

  • type and weight of the relations
  • node attributes
  • data limitations

For the sake of brevity I will assume that readers have a basic understanding of what SNA is, including the most commonly used terminology. I will use nodes/entities/actors and edges/links/ties interchangeably, I will look only at one-mode networks with individuals as nodes, and I will not go into the mathematical details. If you would like to read up on the basics of SNA first, good introductions to the topic can be found for example in this First Draft article, as well as here, in a UK Home Office guide (pdf) or in Van der Hulst (2009).

Type and weight of relations

The most commonly used elements in social network analysis are the so-called centrality measures. These are indices that allow us to understand which nodes occupy important positions in the network. The three most commonly used centrality measures are degree centrality, closeness centrality and betweenness centrality, which are calculated from the number of connections a node has and its position in the network. (NB: Many more centrality measures and other network analysis methods exist, but discussing these goes far beyond the scope of this blog post.)
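As a rough illustration of how two of these measures are computed, the sketch below calculates degree and closeness centrality for a small undirected network in plain Python. The five actors and their ties are invented for this example, and closeness is taken as (n - 1) divided by the sum of shortest-path distances, which is one common normalisation and assumes a connected network.

```python
from collections import deque

# Hypothetical ties between five invented actors (undirected, unweighted).
EDGES = [("Ann", "Bob"), ("Bob", "Cem"), ("Cem", "Dee"), ("Cem", "Eli")]

def build_graph(edges):
    """Return an adjacency dict {node: set(neighbours)}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree_centrality(graph):
    """Number of direct ties per actor."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

def distances_from(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closeness_centrality(graph):
    """(n - 1) / sum of distances; assumes the network is connected."""
    n = len(graph)
    return {
        node: (n - 1) / sum(distances_from(graph, node).values())
        for node in graph
    }

graph = build_graph(EDGES)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
for actor in sorted(graph):
    print(f"{actor}: degree={degree[actor]}, closeness={closeness[actor]:.2f}")
```

In this toy network Cem scores highest on both measures. Betweenness centrality additionally requires counting the shortest paths that run through each actor and is left out here for brevity.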

The standard calculation methods for the centrality measures assume the same weight (or strength) for each link. However, as a link in a social network represents a social relationship, the actual weight is not automatically the same for each link. For example, a married couple, or siblings who also happen to be connected on Facebook, obviously have much stronger mutual relationships than people have with Facebook friends they haven’t actually seen for years (if at all). Being linked or not on social media is a very shallow concept and does not provide enough information to interpret the nature of mutual relationships.

Nonetheless, in SNA based on social media data, all links are generally treated as having equal weight. As a result, the importance of some nodes in the network could easily be under- or overestimated. More advanced analyses therefore use weighted links, which, however, requires a good understanding, defined in advance, of how the weight of a link should be determined.
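To make the effect of weighting concrete, the sketch below compares plain degree with a weighted degree (sometimes called strength), in which each tie counts for its weight. The actors and the 1-to-5 weight scale are assumptions made purely for illustration; how weights should be assigned in a real investigation is exactly the open question raised above.

```python
from collections import defaultdict

# Hypothetical ties as (actor, actor, weight).
# Assumed scale: 1 = distant online contact, 5 = close family.
WEIGHTED_EDGES = [
    ("Ann", "Bob", 1), ("Ann", "Cem", 1), ("Ann", "Dee", 1), ("Ann", "Eli", 1),
    ("Fay", "Gus", 5), ("Fay", "Eli", 5),
]

def degree_and_strength(weighted_edges):
    """Return (degree, strength): tie counts and summed tie weights per actor."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for a, b, weight in weighted_edges:
        for actor in (a, b):
            degree[actor] += 1
            strength[actor] += weight
    return dict(degree), dict(strength)

degree, strength = degree_and_strength(WEIGHTED_EDGES)
top_by_degree = max(degree, key=degree.get)
top_by_strength = max(strength, key=strength.get)
print("most ties:      ", top_by_degree, degree[top_by_degree])
print("strongest ties: ", top_by_strength, strength[top_by_strength])
```

Ann has the most ties, but Fay has the strongest ones, so the ranking of the two actors flips depending on whether weights are taken into account.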

Uncertainty about the weight of a link also arises in networks with directed edges. Zinke showed the difference between directed and undirected edges and the effect on the analysis. For example, analysing a network with directed edges, such as Twitter, can beautifully reveal how information spreads through a network. A very interesting article on this topic is Benjamin Strick’s Uncovering a pro-Chinese Government Information Operation, in which he shows how he analysed a network of Twitter bots spreading and amplifying propaganda.
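A minimal sketch of why direction matters, assuming each retweet is stored as a (retweeting account, original author) pair; the account names are invented. In-degree then counts how often an account is amplified, while out-degree counts how often it amplifies others.

```python
from collections import Counter

# Hypothetical retweet records: (retweeting account, original author).
RETWEETS = [
    ("bot_1", "seed_account"),
    ("bot_2", "seed_account"),
    ("bot_3", "seed_account"),
    ("bot_1", "bot_2"),
]

# Edges point from the retweeter to the author whose tweet was spread.
out_degree = Counter(retweeter for retweeter, _ in RETWEETS)
in_degree = Counter(author for _, author in RETWEETS)

most_amplified = in_degree.most_common(1)[0]
most_active_amplifier = out_degree.most_common(1)[0]
print("most amplified account:", most_amplified)
print("most active amplifier: ", most_active_amplifier)
```

Treating the same records as undirected ties would still give seed_account three connections and bot_1 two, but it would no longer show who amplifies whom.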

However, the same type of data may not be equally useful to perform any centrality analysis for an investigation into actual individuals. This is in part because Strick uses social actions such as retweets and likes as edges representing an interaction. For the type of analysis he conducts, that is indeed useful. For other types of analysis, treating social actions such as ‘likes’ as an actual relationship can be problematic.

After all, I may ‘like’ a social media post just because I stumbled upon it and actually liked the content, or because it was posted by a good friend and I want to show my support even if I’m indifferent to the content. Understanding the social meaning of the interaction represented by the ‘link’ is therefore very important before any meaningful analysis can take place.

Node attributes

Another challenge in SNA is to understand what we (don’t) know about the attributes of the nodes or actors. In the beginning of the presentation Zinke shows how the age of one of the actors in the network could be an explanatory factor for the central position he appeared to occupy. I believe that this point needs more emphasis especially because different research articles have shown the importance of the attributes of the nodes.

In general, social network analysis focuses on the ‘social capital’ of actors, assuming that those with better access to others in the network are more important or influential. However, if our actors are human beings, they may have different skills (in network analysis terms: attributes), and actors with specific skills may not be easy to substitute (Sparrow 1991). Hence they could be more important to the network than is revealed by centrality measures only.

In their research on an organised cannabis cultivation network, Duijn and Klerks (2014) applied a more qualitative perspective, including crime script analysis, in addition to SNA. They concluded that because cannabis cultivation is a complex and delicate criminal business involving many roles and tasks, centrality measures alone are not sufficient to obtain a proper understanding of who the most important actors in the network are.

The implication of the above for social media investigations is that SNA can at best give a limited picture of the roles and importance of the actors in the network, and that a more qualitative analysis of actor attributes should not be forgotten.

Data limitations

The third challenge, and I believe one of the biggest challenges in SNA (and perhaps in every type of analysis), is data quality. For SNA on social media data specifically, the extent to which the data is complete is a significant challenge.

If we look at the completeness of data for investigations or intelligence analysis, there is an important difference between what is called ‘Missing At Random’ (MAR) and ‘Missing Not At Random’ (MNAR). If data is ‘missing at random’ in the sense used here, the probability of being missing is equal for all potentially available data points (statisticians call this strictest form ‘missing completely at random’). If that is the case, it is less likely that including the missing data would significantly change the outcome of the analysis.

As an example we could look again at the Twitter investigation by Benjamin Strick discussed above. He collected tweets in a certain timeframe based on two specific hashtags. That data collection method could have missed some (re)tweets or likes as a result of the sometimes erratic behaviour of the Twitter platform or random glitches in the Twint collection process. However, given the randomness of the missing data and the subject of his research (i.e. Twitter bot networks that purposefully amplify propaganda), missing some of the (re)tweets or likes will not significantly change the outcome of the analysis. That will, however, be different when data is missing not at random.

If we investigate individuals who shield (some of) their actions and/or (mutual) connections, they may not use certain social media platforms, or may use them in the most private mode, to purposefully retain some privacy. Any missing data is then missing because of the specific nature of the data, caused by the purposeful (lack of) action of an actor under observation. As a result, the probability of data points being missing from the observed set is not equal for all potentially available data points. Hence, in those cases any missing data is ‘missing not at random’ and can therefore significantly alter the outcome of any SNA applied to the data.
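The difference between the two mechanisms can be simulated on a toy network. In the sketch below, hide_at_random drops every tie with the same probability, whereas hide_not_at_random only drops ties that involve a privacy-conscious actor. The network, the function names and the probabilities are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical network: Kim is the hub, but keeps all accounts private.
TRUE_EDGES = [
    ("Kim", "Lars"), ("Kim", "Mia"), ("Kim", "Noor"), ("Kim", "Omar"),
    ("Lars", "Mia"), ("Noor", "Omar"),
]

def hide_at_random(edges, p_missing, rng):
    """Every tie has the same probability of going unobserved."""
    return [edge for edge in edges if rng.random() >= p_missing]

def hide_not_at_random(edges, private_actors, p_missing, rng):
    """Only ties involving a private actor can go unobserved."""
    observed = []
    for edge in edges:
        involves_private_actor = bool(set(edge) & private_actors)
        if involves_private_actor and rng.random() < p_missing:
            continue  # the tie exists but is invisible to the analyst
        observed.append(edge)
    return observed

def degree(edges):
    """Observed number of ties per actor (0 for actors with no visible ties)."""
    return Counter(actor for edge in edges for actor in edge)

rng = random.Random(42)  # fixed seed so a run can be repeated
random_loss = hide_at_random(TRUE_EDGES, 0.3, rng)
targeted_loss = hide_not_at_random(TRUE_EDGES, {"Kim"}, 1.0, rng)

print("true degree:   ", dict(degree(TRUE_EDGES)))
print("random loss:   ", dict(degree(random_loss)))
print("targeted loss: ", dict(degree(targeted_loss)))
```

With the probability set to 1.0, Kim, the actual hub, disappears from the observed network entirely, and the remaining actors look like two unrelated pairs. Random loss thins out the network as well, but without systematically hiding any particular actor.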

To illustrate what the consequences of ‘missing not at random’ data could be, we will look at an exercise I recently used in a foundational SNA training for a Dutch government entity. In this example we look at a group of ten teenagers who want to throw a private party to escape the Covid lockdown. They know each other as friends, family members or neighbours.

In the exercise the participants record the links between the teenagers based on a text and draw the relations in a sociogram as depicted below. To understand the structure of the group and to identify intervention options, they then calculate different centrality measures. In the sociogram below, the size of the figures represents the betweenness centrality of each teenager relative to the others. Heleen appears to occupy a gatekeeping position here and has the highest betweenness centrality.

Betweenness centrality original

Then, in preparation for the party, some of the group members meet, and it turns out that Brigitte and Jeanine were very close childhood friends who had unfortunately lost track of each other because Jeanine moved to another city with her parents. They now meet again for the first time in 10 years and renew their friendship immediately.

If we add that previously unknown relationship to the sociogram and recalculate the betweenness centrality, we get a somewhat different picture, in which Brigitte rather than Heleen has the highest betweenness centrality (NB: the degree centrality and closeness centrality change as well).

Betweenness centrality recalculated after link addition

This very simple case shows that one added link can have significant consequences for the centrality measures (particularly in smaller networks, I should add). In real life we need to take into account that there are many relations we do not see in open source (social media) data. People are increasingly privacy conscious on social media and lock down their accounts so that we cannot see (all of) their connections. As a result, the amount of data ‘missing not at random’ is likely to increase.
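The shift described in this exercise can be reproduced on a small stand-in network. The ties of the original sociogram are not listed in the text, so the eight-person network below is hypothetical: only the names Heleen, Brigitte and Jeanine come from the exercise, while the other actors and all ties are invented. Betweenness is computed with a plain-Python version of Brandes' algorithm for unweighted, undirected networks and is not normalised.

```python
from collections import deque

# Hypothetical stand-in network; only three of the names come from the exercise.
KNOWN_EDGES = [
    ("Anna", "Brigitte"), ("Bas", "Brigitte"), ("Brigitte", "Heleen"),
    ("Heleen", "Eva"), ("Heleen", "Carla"), ("Carla", "Jeanine"),
    ("Jeanine", "Daan"),
]
HIDDEN_EDGE = ("Brigitte", "Jeanine")  # the friendship that was not visible at first

def build_graph(edges):
    """Return an adjacency dict {node: set(neighbours)}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def betweenness_centrality(graph):
    """Unnormalised betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        dist = {source: 0}
        n_paths = {node: 0 for node in graph}
        n_paths[source] = 1
        predecessors = {node: [] for node in graph}
        visit_order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            visit_order.append(node)
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
                if dist[neighbour] == dist[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes and accumulate path dependencies.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(visit_order):
            for pred in predecessors[node]:
                share = n_paths[pred] / n_paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was counted from both ends, so halve the totals.
    return {node: value / 2 for node, value in score.items()}

before = betweenness_centrality(build_graph(KNOWN_EDGES))
after = betweenness_centrality(build_graph(KNOWN_EDGES + [HIDDEN_EDGE]))
for name in ("Heleen", "Brigitte"):
    print(f"{name}: before={before[name]}, after={after[name]}")
```

In this stand-in network Heleen scores 15 and Brigitte 11 before the link is added; afterwards Brigitte scores 13 and Heleen 7.5, so the apparent gatekeeper changes with a single additional tie.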

Conclusion

To conclude, social network analysis can be an important tool for intelligence and investigations, and it is good to see some of its concepts and methodology being discussed at conferences such as GOSINTCon.

Nonetheless, as the previous sections have shown, significant challenges with SNA exist which cannot be ignored, especially when using social media data. Before the results of a social network analysis can be used for anything conclusive, a good understanding of these challenges is needed. Even then, the outcome of any SNA should always be understood as complementary to other analysis and never as a silver bullet that solves a case on its own.

Throughout the text I have already referenced some academic papers on the use of social network analysis in (law enforcement) investigations and the associated challenges. These papers are listed below, together with a number of other papers that contain further insight into (other) challenges in applying SNA to investigations. Of course these papers also show the different approaches researchers have used in applying SNA to investigations, which in turn could be an inspiration for your own work. Feel free to contact me if you have any questions.

Literature

Berlusconi, G. (2013) ‘Do all the pieces matter? Assessing the reliability of law enforcement data sources for the network analysis of wire taps’, in Global Crime, Vol. 14, No. 1, 61–81.

Bichler, G., A. Malm and T. Cooper (2017) ‘Drug supply networks: a systematic review of the organizational structure of illicit drug trade’, in Crime Science, Vol. 6:2.

Cavallaro, L., A. Ficara, P. De Meo, G. Fiumara, S. Catanese, O. Bagdasar and A. Liotta (2020) ‘Disrupting Resilient Criminal Networks through Data Analysis: The case of Sicilian Mafia’, in PLoS ONE 15(8): e0236476. http://arxiv.org/abs/2003.05303v1

Diviák, T. (2019) ‘Key aspects of covert networks data collection: Problems, challenges, and opportunities’, in Social Networks. https://doi.org/10.1016/j.socnet.2019.10.002

Duijn, P. and P. Klerks (2014) ‘Social Network Analysis Applied to Criminal Networks: Recent Developments in Dutch Law Enforcement’, in A.J. Masys (ed.) Networks and Network Analysis for Defence and Security (pp. 121-159).

Van der Hulst, R. (2009) ‘Introduction to Social Network Analysis (SNA) as an investigative tool’, in Trends in Organised Crime 12:101–121.

Sparrow, M. (1991) ‘The application of network analysis to criminal intelligence: An assessment of the prospects’, in Social Networks 13: 251-274.