- Week 14: Network data
Graph theory is the study of graphs, which are mathematical structures used to model
pairwise relationsbetween objects. A graph in this context is made up ofnodes, which are connected byedges, arcs, or lines. A graph may beundirected, meaning that there is no distinction between the two vertices associated with each edge. (from wiki)
There are different kind of graphs.

The criteria to judge whether one graph is the same as others is not the appearance, but the relationship between different nodes.
- Nodes: the fundamental unit of which graphs are formed,also called endpoint. They are connected by the edge like
AandBin above picture. - Edge:the line that connect two nodes. Each edge has two nodes to which it is attached. Edges may be directed or undirected
Case1: Try to count the edge between those circles.
This undirected table is symmetric. It shows that 1 and 2 has one edge. 2 and 3 is the same, but there is no edge between 1 and 3.
While in the directed relationships, like the above picture, 3->2 has one edge, 2->3 has zero edge, this is caused by the edge directions.
There are different ways to show the relationships.
NetworkX is a Python package for study of the structure, dynamics, and functions of complex networks. With which we can analyze the network structure, the relationship between different nodes and generate different kind of graphs.
Import and install:
!pip install networkx
import networkx as nxBasic usage: From the above explanation, we can know that, the network is formed by nodes and edges. Therefore, when using networkx, we need get the nodes and edges first, then we can draw the graph.
#draw an empty graph
g=nx.Graph()
g
#<networkx.classes.graph.Graph at 0x10eb6a1d0>#add nodes
#help(g.add_node) to check out the parameters and syntax
g.add_node('A')
g.add_node('B')
g.add_node('C')It adds the nodes. Theng.nodes to check.
g.nodes
#NodeView(('A', 'B', 'C'))Add edges, draw the graph.
#help(g.add_edge) to check out the parameters and syntax
g.add_edge('A','B')Graph show.
nx.draw(g)#add another node between c and b
g.add_edge('C','B')
nx.draw(g)After that, we can get one simple graph.
In the following notes, we will use characters in book Les Misérables to demo the analysis process. You can download the dataset here
Add all nodes and edges.
import json
data = json.loads(open('miserables.json').read())
data
#data.keys()
#data['nodes'] checkout nodes
#data['links'] checkout links
import networkx as nx
g = nx.Graph()
#add nodes
for n in data['nodes']:
g.add_node(n['id'], group=n['group'])
for l in data['links']:
g.add_edge(l['source'], l['target'])
nx.draw(g)From this graph, we can know that there are some groups, but we don't know who are them. Next step for us is improving the graphs. To solve the following questions:
- Are there some groups in the network?
- Who are in the same group?
Add labels on the graph.
help(nx.draw) #to learn about the function and parameters. What may be useful for us are parameters and see also functions. Those are helpful for optimizing our graphs. Parameters
----------
G : graph
A networkx graph
pos : dictionary, optional
A dictionary with nodes as keys and positions as values.
If not specified a spring layout positioning will be computed.
See :py:mod:`networkx.drawing.layout` for functions that
compute node positions.
ax : Matplotlib Axes object, optional
Draw the graph in specified Matplotlib axes.
kwds : optional keywords
See networkx.draw_networkx() for a description of optional keywords.
Examples
--------
>>> G = nx.dodecahedral_graph()
>>> nx.draw(G)
>>> nx.draw(G, pos=nx.spring_layout(G)) # use spring layout
See Also
--------
draw_networkx()
draw_networkx_nodes()
draw_networkx_edges()
draw_networkx_labels()
draw_networkx_edge_labels()
from matplotlib import pyplot as plt
plt.figure(figsize=(15, 15))
pos =nx.spring_layout(g)
# nx.draw(g, pos=nx.spring_layout(g))
nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5) #change nodes style
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3) #change edges style
labels = dict([(n, n) for n in g.nodes]) #add labels
_ = nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666') #draw labelsGroup the nodes according to the group number in the json. Every node has a group number, we can group those nodes with the color.
import matplotlib
color = matplotlib.cm.Accent
#import color map, there are many color maps, you can checkout the color maps by the following:
# import matplotlib.cm as cm
# dir(cm)
plt.figure(figsize=(15, 15))
pos =nx.spring_layout(g)
nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5)
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3)
labels = dict([(n, n) for n in g.nodes])
_ = nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666')
for group in range(1, 20):
nodelist = [n for n in g.nodes if g.nodes[n]['group'] == group]
# If g.nodes's group = 1, add those nodes into the nodelist. They will be the same color 1 . If g.nodes's group = 2, they will be added to another nodelist ,and be colored 2.
#print(nodelist)
nx.draw_networkx_nodes(g, pos, nodelist=nodelist, node_color=color(group), alpha=0.8)In graph theory, the degree is the number of edges incident to the nodes. The degree usually represent the importance of nodes. Degree can be divided into In-Degree and Out-Degree. In-Degree is how many other nodes point to one node, while Out-Degree is how many other nodes one node points to. Degree is the sum of In-Degree and Out-Degree.
As we mentioned above, Degree can be divided into In-Degree and Out-Degree. The importance of nodes with the same total degree but different In-Degree and Out-Degree are different. That's the reason why We need Centrality Measures.
Centrality is a classical concept in graph analysis. It measures the "importance" of nodes. The notions of "importance" are different. We only provide some samples in following sections.
There are different centrality measures, like Degree, Eigenvector Closeness and Betweenness. The following is the interpretation of measures:
| Centrality measure | Interpretation in social networks |
|---|---|
| Degree | How many nodes can one node reach directly? |
| Betweenness | How likely is this node to be the most direct route between two nodes in the network? |
| Closeness | How fast can this node reach every other nodes in the network? |
| Eigenvector | How well is this node connected to other well- connected nodes? |
You can refer to the documentation and online resources to understand those centrality measures. Try other centrality measures that are not covered in this tutorial. See what interesting findings you can get.
#check out the methods of centrality, generally speaking, the greater centrality of a node, and the more important the node is in the network.
nx.degree_centrality(g)
#nx.closeness_centrality(g)
#nx.betweenness_centrality(g)
#nx.eigenvector_centrality(g)
import pandas as pd
df = pd.DataFrame()
df['degree'] = pd.Series(nx.degree_centrality(g))
df['closeness'] = pd.Series(nx.closeness_centrality(g))
df['betweenness'] = pd.Series(nx.betweenness_centrality(g))
df['eigenvector'] = pd.Series(nx.eigenvector_centrality(g))#draw degree centrality
df_top_nodes = df.sort_values('degree', ascending=False)[:5]
plt.figure(figsize=(30, 15))
# we don't run the spring layout again; to keep the positions in this section
#pos =nx.spring_layout(g)
nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5)
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3)
labels = dict([(n, n) for n in g.nodes])
_ = nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666')
nx.draw_networkx_nodes(g, pos, nodelist=list(df_top_nodes.index), node_color='#ff7700', alpha=0.5)
df_top_nodes
#change degree to other three columns to see the different top nodes.From centrality analysis, we can figure out the key figures and nodes in the network, and get the next step analysis leads.
Relative reading:
In the above session, we mentioned that the degree of a node is the number of edges it has to other nodes. The degree distribution is the probability distribution of these degrees over the whole network.
g.degree
pd.Series(dict(g.degree())).hist(bins=20)
dict(g.degree()) and then Series. Then Draw a picture.
From the histogram, we can see that the minority nodes have large number of edges while majority have less edges. This is a Heave tail distribution, which is famous for rich will be richer and poor will be poorer.
A clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. From wiki
There are different kind of clustering coefficient, including global clustering coefficient, local clustering coefficient, average clustering coefficient. The simplest is global clustering coefficient, which is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed). The larger the clustering coefficient is, the closer one node is wth other nodes. For usage in networkx, you can refer here.
nx.algorithms.clustering(g,['XXX','XXX','XXX']) #pass a set of nodes in the list
nx.average_clustering(g)gorithms.clustering(g, ['Myriel', 'Champtercier', 'Count', 'Cravatte', 'Napoleon', 'Geborand', 'CountessdeLo', 'OldMan'])
# {'Champtercier': 0,
# 'Count': 0,
# 'CountessdeLo': 0,
# 'Cravatte': 0,
# 'Geborand': 0,
# 'Myriel': 0.06666666666666667,
# 'Napoleon': 0,
# 'OldMan': 0}
nx.average_clustering(g)
0.5731367499320134
nx.average_clustering(nx.complete_graph(5))
1.0Simply speaking, a clique is a subset of nodes in an undirected graph. Which is a segmented group in a bigger community. Highlight those cliques may help us know the core figures and groups in one network.
cliques = list(nx.find_cliques(g))
#len(cliques)
cliques[0:2]
plt.figure(figsize=(15, 15))
pos =nx.spring_layout(g)
nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5)
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3)
labels = dict([(n, n) for n in g.nodes])
_ = nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666')
nx.draw_networkx_nodes(g, pos, nodelist=cliques[12], node_color='#ff7700', alpha=0.5)
#draw any clique by changing nodelist=cliques[12]To find those who are not connected by any others.
components =list(nx.connected_components(g))
len(components)In the network analysis, community structure refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network From wiki.
Community detection can help us categorize every single node into different groups based on different characteristics so that we can study them as whole. Other importance of community detection is to help us find the missing links and identify the false links in one network.
from networkx.algorithms import community
communities = list(community.label_propagation_communities(g))
#communities[0]
plt.figure(figsize=(15, 15))
pos =nx.spring_layout(g)
#nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5)
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3)
for i in range(0, len(communities)):
nodelist = communities[i]
print(nodelist)
nx.draw_networkx_nodes(g, pos, nodelist=nodelist, node_color=color(i), alpha=0.8)
labels = dict([(n, '%s:%s' % (n, g.nodes[n]['group'])) for n in nodelist])
nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666')Draw the shortest path between two nodes.
sp = nx.shortest_path(g, 'Gribier', 'Child2')
#you can change to any other two nodes
sp
plt.figure(figsize=(15, 15))
#pos =nx.spring_layout(g)
nx.draw_networkx_nodes(g, pos, node_color='#ccccff', alpha=0.5)
nx.draw_networkx_edges(g, pos, width=1.0, alpha=0.3)
labels = dict([(n, n) for n in g.nodes])
_ = nx.draw_networkx_labels(g, pos, labels=labels, font_color='#666666')
nx.draw_networkx_edges(g,
pos,
edgelist=list(zip(sp[:-1], sp[1:])),
width=5,
edge_color='r'
)
#help(nx.draw_networkx_edges)
#edgelist:collection of edge tuples
#list(zip(sp[:-1], sp[1:])) check out the edgelist
# [('Gribier', 'Fauchelevent'),
# ('Fauchelevent', 'Javert'),
# ('Javert', 'Gavroche'),
# ('Gavroche', 'Child2')]References:
- Use Kumu To Make A Relationship Graph For Les Misérables https://dnnsociety.org/2018/01/17/use-kumu-to-make-a-relationship-graph-for-les-miserables/
- 關係圖表速成方案:Google Fusion Tables & Kumu , https://dnnsociety.org/2017/08/28/%E9%97%9C%E4%BF%82%E5%9C%96%E8%A1%A8%E9%80%9F%E6%88%90%E6%96%B9%E6%A1%88%EF%BC%9Agoogle-fusion-tables-kumu/
- 關係圖表速成方案:Google Fusion Tables & Kumu , https://dnnsociety.org/2017/08/28/%E9%97%9C%E4%BF%82%E5%9C%96%E8%A1%A8%E9%80%9F%E6%88%90%E6%96%B9%E6%A1%88%EF%BC%9Agoogle-fusion-tables-kumu/
- Who control the discourse power in 红楼梦 by Group 8 2018S
- Li's family business map and spring layout analysis by Group 7 2018S
If you have any questions, or seek for help troubleshooting, please create an issue here















