---
title: 'SOCI424: Centrality'
author: '(anonymous for peer evaluation)'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# load the 'igraph' package
library(igraph)
```
## Part 1: Load and inspect the data
### Film actor networks
In this lab, you will use data sourced from [IMDB](https://www.imdb.com) to examine the co-starring network of director [Bong Joon Ho](https://www.imdb.com/name/nm0094435/) (*Parasite* (2019), *Snowpiercer* (2013), *The Host* (2006)). The network was constructed by enumerating the cast lists of all seven of Bong's feature-length films, then measuring how many movies *not* directed by Bong each pair of actors co-starred in. Bong is a Korean director who has moved into English-language (and American-audience) filmmaking in recent years, which yields interesting structures in the professional network of his actors.
> ***Aside:*** This dataset illustrates one of the many ways that data science can implicitly prioritize Western norms. Since many databases record family name and given name as separate variables, public-facing representations of that data must choose the order to present those names in. Korean names are traditionally written with the family name first (e.g. "Song Kang-ho"), while Western European cultures traditionally put the given name first (e.g. "Chris Evans"). However, IMDB (the source of this data) lists Korean actors' names in the Western order (e.g. "Kang-ho Song"), even for actors who only appear in Korean-language films. Moreover, it lists those names in Roman script: "Kang-ho Song" rather than "송강호".
### Task 1A: Loading the data
The data is available in a now-familiar format: two CSV files. The actor information is located at , and the co-appearance data is at .
- *Use R's `read.csv()` function to download the relational data directly from the URLs above. You can name the vertex data frame `actors` and the edge data frame `edges`. Inspect these data frames visually. What vertex attributes are included? What edge attributes?*
```{r 1a_1}
# (your code here)
```
- *Use `graph_from_data_frame()` to make a network object named `costar` from the two data frames (you will want to specify `directed=FALSE`, since there is no meaning to the edge directions here). How many vertices (actors) are there in the network? How many edges (co-appearance relations)?*
```{r 1a_2}
# (your code here)
```
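As a minimal sketch of how `graph_from_data_frame()` combines a vertex table and an edge table, here is a toy example. The data frames `toy_vertices` and `toy_edges` and their contents are made up for illustration and are not the lab data.

```r
library(igraph)

# Hypothetical toy data (NOT the lab data): three people and two weighted ties
toy_vertices <- data.frame(name  = c("A", "B", "C"),
                           group = c("x", "x", "y"))
toy_edges <- data.frame(from   = c("A", "B"),
                        to     = c("B", "C"),
                        weight = c(2, 1))

# The first two columns of the edge data frame identify the endpoints;
# any remaining columns (here, `weight`) become edge attributes
toy <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_vertices)

vcount(toy)  # number of vertices
ecount(toy)  # number of edges
```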
### Task 1B: Inspect the network
- *Look at the `name` _vertex_ attribute. At first glance, does it seem like there are more Korean or non-Korean names listed? Then look at the `weight` _edge_ attribute, which represents the number of films each pair of actors co-starred in. Is there much variation? What is the largest value in this attribute?*
```{r 1b_1}
# (your code here)
```
- *Create a simple visualization of the network to see if there is any obvious structure to it. Adjust the vertex attribute `size` to change the size of the vertices, and the vertex attribute `label.cex` to adjust the font size of the labels (values of `label.cex` less than one make the labels smaller, and values greater than one make them larger). You can even suppress vertex labels altogether by setting the `label` attribute to `NA`. What patterns does the clustering in the network suggest? Can you think of a possible cause for this kind of structure?*
```{r 1b_2}
# (your code here)
```
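The vertex attributes mentioned above can be set with the `V(graph)$attribute` syntax before plotting. The sketch below uses a made-up five-vertex ring (`toy`) rather than the co-star network, and the particular values are arbitrary choices for illustration.

```r
library(igraph)

# Hypothetical toy network: five vertices arranged in a ring
toy <- make_ring(5)

V(toy)$size      <- 10   # vertex size used by plot()
V(toy)$label.cex <- 0.7  # label font size; values below 1 shrink the labels
V(toy)$label     <- NA   # NA suppresses the labels entirely

# plot() picks up the vertex attributes set above
plot(toy)
```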
## Part 2: Measuring centrality
### Task 2A: Degree centrality
The most straightforward measure of centrality is *degree centrality*: the number of ties a vertex has. Degree can be measured with the `degree()` function that is built into igraph.
- *First, look at the documentation for `degree()`, either by typing `?degree` into the console or searching for "degree" in the help panel of RStudio. What are the possible values for the `mode` argument? Which `mode` makes the most sense for our co-star network?*
- (your response here)
- *Use the `degree()` function to calculate the degree centrality for each vertex in `costar`. Which actor has the highest value of degree centrality, and what is that value? In plain language (that is, without using network terminology), what does that value mean?*
```{r 2a_2}
# (your code here)
```
- *The degree you just calculated ignores edge weights. There is a different function in igraph, `strength()`, that measures the _weighted degree_ of the vertices. Use this function to get the weighted degree ("strength") of the vertices in `costar`. Which actor has the highest value of _weighted_ degree centrality, and what is that value? In plain language (that is, without using network terminology), what does that value mean?*
```{r 2a_3}
# (your code here)
```
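To see how `degree()` and `strength()` differ, it can help to try both on a small network where the answer is easy to check by hand. The sketch below uses a made-up star-shaped network (`toy`), not the lab data. It relies on `strength()` using the `weight` edge attribute by default when one is present.

```r
library(igraph)

# Hypothetical toy network: "A" is tied to B, C and D, with one heavy tie
toy_edges <- data.frame(from   = c("A", "A", "A"),
                        to     = c("B", "C", "D"),
                        weight = c(3, 1, 1))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Degree counts ties and ignores their weights
toy_degree <- degree(toy)

# Strength sums the weights of each vertex's ties
toy_strength <- strength(toy)

toy_degree
toy_strength

# Name of the vertex with the largest degree
names(which.max(toy_degree))
```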
- *Re-plot the network, but this time color all the vertices with a degree of at least 30 in red so they will stand out. What does a visual inspection of the new plot tell you about where the high-degree actors are in the network?*
```{r 2a_4}
# (your code here)
```
- ***Bonus question for more experienced R users**: This network contains a lot of _isolates_ (vertices that have no connections to any other vertices). Since isolates don't tell us much about a network's structure, it is very common practice to simply remove them from visualizations and analysis. Use the igraph function `induced_subgraph()` to make a new network that is a sub-network of `costar`, containing only vertices with a degree of at least 1. You should feel free to use this reduced network for the rest of the worksheet.*
```{r 2a_5}
# (your code here)
```
### Task 2B: Eigenvector centrality
Degree is a 'local' measure of an actor's centrality---it pays attention to an actor's immediate network neighbors but does not incorporate any information about those neighbors' neighbors (etc.). *Eigenvector centrality* can be thought of as degree centrality that weights ties by the importance of an actor's neighbors.
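
A small made-up example can make the contrast with degree concrete. In the toy network sketched below (not the lab data), vertices "A" and "D" have the same degree, but "A" sits in a triangle with the best-connected vertex while "D" is tied to a peripheral one, so their eigenvector centralities differ.

```r
library(igraph)

# Hypothetical toy network: a triangle (A, B, C) with a tail (C - D - E)
toy_edges <- data.frame(from = c("A", "A", "B", "C", "D"),
                        to   = c("B", "C", "C", "D", "E"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_degree <- degree(toy)
toy_eigen  <- eigen_centrality(toy)$vector

# A and D have the same degree...
toy_degree[c("A", "D")]

# ...but A's neighbors are themselves better connected than D's
toy_eigen[c("A", "D")]
```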
- *Use the function `eigen_centrality()` to measure the eigenvector centrality of all of the actors in the network. (Note: a call to `eigen_centrality()` yields a list with several different elements that give some extra detail about the calculation. The element we are interested in is called `vector`, so to get just that information you can use the command `eigen_centrality(costar)$vector`.) Which actor has the highest eigenvector centrality? What is the value of that actor's eigenvector centrality? Does this number mean something on its own, or only in relation to the other actors' scores?*
```{r 2b_1}
# (your code here)
```
- *The function `gray()` is a handy way to create a spectrum of colors between white and black for visualization. For instance, `gray(0)` will yield the color black, and `gray(.9)` will yield a shade of gray that is close to white. Use this function to color the vertices in your network depending on their eigenvector centrality. (Eigenvector centrality, as returned by `eigen_centrality()`, is scaled to lie between zero and one, so you do not need to rescale the values.) When you plot the newly colored network, where are the actors with high eigenvector centrality in the network structure? How does this differ from the degree centrality you analyzed in **2A**?*
```{r 2b_2}
# (your code here)
```
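Because `gray()` is vectorized, it can turn a whole vector of scores into a vector of colors in one call. The sketch below uses a made-up vector `scores` rather than real centrality values. Note that larger inputs give lighter shades, so subtracting from one is one possible way to make the highest scores the darkest.

```r
# Hypothetical scores between 0 and 1 (NOT real centrality values)
scores <- c(0, 0.5, 1)

# Larger values map to lighter shades: 0 is black, 1 is white
shades <- gray(scores)
shades

# Flipping the scale makes the largest score the darkest color
shades_dark_high <- gray(1 - scores)
shades_dark_high
```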
- ***Bonus question for more experienced R users**: Plot the network with each vertex's size determined by its eigenvector centrality. Since most centralities are very small and none are greater than one, you will want to both transform the values (using, e.g., the square root, or a log with an offset, since the log of a value below one is negative) and scale them (e.g., by multiplying them by a constant value).*
```{r 2b_3}
# (your code here)
```
### Task 2C: Betweenness centrality
The final centrality measure we will look at is *betweenness centrality*. Betweenness centrality takes a fundamentally different approach than degree or eigenvector centrality, focusing on *paths* through the network rather than the neighbors of each vertex.
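
To illustrate the path-based idea, consider a made-up chain of five vertices (not the lab data). Every shortest path between the two ends has to pass through the middle, so the middle vertex gets the highest betweenness while the endpoints, which sit on no one else's shortest paths, get zero.

```r
library(igraph)

# Hypothetical toy network: a simple chain A - B - C - D - E
toy_edges <- data.frame(from = c("A", "B", "C", "D"),
                        to   = c("B", "C", "D", "E"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Number of shortest paths between other pairs that pass through each vertex
toy_betweenness <- betweenness(toy)
toy_betweenness

# Every vertex in the middle of the chain has degree 2,
# yet their betweenness scores differ
degree(toy)
```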
- *Skim the documentation for the `betweenness()` function. Look at the description of the `weights` argument. How does the interpretation of the `weight` edge attribute differ between `eigen_centrality()` and `betweenness()`? Does it make sense to use the same edge weights to calculate betweenness as you did for eigenvector centrality?*
- (your response here)
- *Use the `betweenness()` function to calculate the betweenness centrality of every actor in the network. You will need to specify the edge weights directly in your call to `betweenness()`. (There are several common ways to 'invert' a variable so that low values become high and high values become low. A good one to use for edge weights is to divide 1 by the weights so that a value of 10 becomes 1/10, and a value of 2 becomes 1/2.)*
```{r 2C_2}
# (your code here)
```
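The effect of inverting the weights is easiest to see on a tiny made-up triangle (not the lab data). Since `betweenness()` treats weights as distances, using the raw counts makes the heavily weighted ties look "long"; dividing 1 by the weights makes them "short" instead, which changes which vertex lies on the shortest paths.

```r
library(igraph)

# Hypothetical toy network: B has strong ties to A and C; A and C have a weak tie
toy_edges <- data.frame(from   = c("A", "B", "A"),
                        to     = c("B", "C", "C"),
                        weight = c(10, 10, 1))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Raw weights as distances: the weak A - C tie is the shortest route,
# so no vertex sits between any other pair
raw_betweenness <- betweenness(toy, weights = E(toy)$weight)

# Inverted weights as distances: going A - B - C (0.1 + 0.1) is now shorter
# than going directly A - C (1), so B lies on that shortest path
inv_betweenness <- betweenness(toy, weights = 1 / E(toy)$weight)

raw_betweenness
inv_betweenness
```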
- *Look at the actors with the five highest betweenness scores (the `sort()` function might be useful here). Who has the highest score? Why do you think that actor might have a high betweenness score? (Feel free to look these actors up online!)*
```{r 2C_3}
# (your code here)
```
- ***Bonus question for more experienced R users**: Plot the network with gray-scale vertex colors like you did in **2B**, but this time using the actors' betweenness centrality. (Because `gray()` expects values from 0 to 1, but betweenness centrality can take on very large values, you will need to "normalize" the betweenness values first. This just means dividing every actor's betweenness score by the maximum betweenness score.)*
```{r 2C_4}
# (your code here)
```
## Part 3: Discussion
- *Compare the three centralities you measured above. In the case of the co-star network, what real-world characteristics might the measures reflect about the actors?*
- (your response here)
- *This network incorporates a long time frame, flattening it down into a single 'snapshot'. How could that feature affect the interpretation of centrality for the actors?*
- (your response here)
- *The actors in this network were sampled with a very specific focus. How sensitive are the centrality measures you calculated above to the "boundaries" of this network? What if we looked at a network that included multiple directors' work? Or only one of Bong Joon Ho's movies? Or all actors in IMDB?*
- (your response here)