R Tutorial – Wikipedia as a Graph
By Rinke Hoekstra and Willem Robert van Hage.
In this tutorial, we are going to crawl pages in Simple Wikipedia, explore this part of the Web link graph, and apply a number of analytical metrics.
Simple Wikipedia is a variant of Wikipedia that uses “Basic English”:
“Basic English (British American Scientific International Commercial) is a constructed (made-up) language to explain complex thoughts with 850 basic English words chosen by Charles Kay Ogden.”
We will use the open source R language and RStudio program for statistical analysis in combination with Gephi, an open source graph analysis tool. These programs will be used throughout the course, so make sure to familiarize yourself with the user interface.
Getting Started
- Start RStudio. It makes sense to start a new “project” in a directory of your choosing.
- This tutorial uses the RCurl, XML, igraph and bitops libraries. On Linux or OS X you can install these like this:
install.packages(c('RCurl','XML','igraph','bitops'), dependencies=TRUE)
On Windows you will have to manually download the packages from this directory and install them like this:
install.packages(choose.files(), repos=NULL)
and then select the downloaded packages.
- Download crawler.R, save it in your project directory, and open it in RStudio.
- Click on “Source” in the top right corner of the pane in which the file was loaded.
- The “console” should now show something similar to the following:
> source('~/WebInformation/crawler.R')
Loading required package: bitops
Warning message:
package 'graph' was built under R version 2.15.1
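To get a feel for what crawler.R roughly does with these libraries, here is a simplified sketch that collects the outgoing wiki links of a single page and builds a one-step graph. It is not the actual crawler.R (which adds the step limit and the selection filters); the helper get_links and the one-step graph construction are only illustrative:
# A rough sketch (not the actual crawler.R) of how a page's outgoing
# wiki links can be collected with RCurl and XML, and turned into an
# igraph object.
library(RCurl)
library(XML)
library(igraph)

get_links <- function(url) {
  html  <- getURL(url, followlocation = TRUE)  # download the page
  doc   <- htmlParse(html, asText = TRUE)      # parse the HTML
  hrefs <- xpathSApply(doc, "//a/@href")       # all hyperlink targets
  free(doc)
  # keep only internal article links and turn them into absolute URLs
  hrefs <- hrefs[grepl("^/wiki/", hrefs)]
  paste("http://simple.wikipedia.org", hrefs, sep = "")
}

# build a one-step graph around a start page
start   <- "http://simple.wikipedia.org/wiki/Kevin_Bacon"
targets <- unique(get_links(start))
edges   <- cbind(rep(start, length(targets)), targets)
g_sketch <- graph.edgelist(edges, directed = TRUE)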
Your first crawl
The crawl function takes as arguments a URL to start from, a step_limit (to reduce the depth of the search) and an optional selection function to filter hyperlinks in pages. It returns a graph object. It is called as follows, e.g.:
g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 1)
This will make the crawler start at http://simple.wikipedia.org/wiki/Kevin_Bacon, look up any hyperlinks in that page, and follow them until it reaches the step limit of 1 (i.e. it will not follow the links on the pages it reaches):
> g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 1)
[1] "Getting http://simple.wikipedia.org/wiki/Kevin_Bacon"
[1] "following http://simple.wikipedia.org/wiki/Kevin_Bacon"
The output is a graph that has edges between the Kevin Bacon page and all pages it links to. (Note that g is a variable; the name “g” is arbitrary, and a more informative name such as “kevin_bacon_graph_1” may be clearer. Choosing informative names allows you to keep more than one graph available at a time.)
You can type plot(g) to get a rough idea of what that graph looks like.
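The default plot can be quite cluttered. A small sketch with a few standard igraph plotting options (these are ordinary plot.igraph parameters, not part of crawler.R) may make it easier to read:
plot(g,
     vertex.size      = 3,    # smaller nodes
     vertex.label.cex = 0.6,  # smaller labels
     edge.arrow.size  = 0.3,  # smaller arrow heads
     layout           = layout.fruchterman.reingold)  # force-directed layout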
Your first graph analysis
Run the crawl function again on the Kevin Bacon page, but now use a slightly larger step limit of two and store the graph in a variable called kevin_bacon_graph_2. Again, have a look at the graph by plotting it. Very informative? Save the graph using the save.graph function:
> save.graph(kevin_bacon_graph_2,"kevin_bacon_graph_2")
[1] "Saving to: kevin_bacon_graph_2.csv"
[1] "Done"
- Note that the first argument of the save.graph function is the variable name of our graph, and the second argument is the name of the file to which the graph is written. (NB: the function appends the extension “.csv” for “comma separated value”)
- Now start Gephi (you need at least version 0.8.1). In the Welcome screen, select “New Project”
- Go to the “Data Laboratory”, and click “Import Spreadsheet”
- Click on the “…”, and open the CSV file that was just created
- In the “Separator” drop down, click “Comma”
- In the “As table” drop down, select “Edges table”
- Click “Next”
- Make sure “Create missing nodes” is checked, and click “Finish”
- Go back to the “Overview”
Layout & Labels
- In the “Layout” pane, choose one of the visualization options (e.g. “Force Atlas 2” with “Prevent Overlap” checked), and click “Run”. Click “Stop” at some point where you think your graph looks nice.
- To see what the nodes in the graph are, do the following:
- Select the leftmost “T” at the bottom of the Graph pane.
- Click on the leftmost “A”, and select “Node Size”
- Click on the rightmost icon, and select “Id”
- You can use the rightmost slider to increase or decrease the font size.
Ranking & Node Size
- In the “Ranking” pane, select “Degree” as rank parameter
- Select the diamond-shaped icon, and click “Apply”
- If needed, you can run the layout again, or decrease the label size.
Tip: you can also select the color disc to color the nodes in your graph based on a particular rank parameter.
Analyzing & Statistics
- The “Statistics” pane lists a number of analyses that you can run on your graph.
- Run the “Avg. Path Length” analysis to compute centrality measures for the nodes.
- Color your nodes according to the betweenness centrality.
Take some time to play around!
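Gephi’s Statistics pane is convenient for visual exploration, but the same kinds of measures can also be computed directly in RStudio with igraph. A short sketch, using the kevin_bacon_graph_2 object from before:
degree(kevin_bacon_graph_2)               # degree of each node
betweenness(kevin_bacon_graph_2)          # betweenness centrality of each node
average.path.length(kevin_bacon_graph_2)  # average shortest path length of the graph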
Question 1
- Compare the nodes in the graph to the pages in Simple Wikipedia. Did the crawler do a good job?
- What do you think are the most important pages in the graph of Kevin Bacon?
- What are the most important clusters in the graph?
- Compare the size of the Kevin Bacon graphs generated with a step limit of 1 and a step limit of 2. What do you expect to happen if we increase the step limit to 3, or higher?
- What would be a good way to limit the number of nodes in our graph?
Six Degrees of Kevin Bacon
Now that you’re familiar with the basics of crawling and graph analysis, we will go and test the hypothesis that Kevin Bacon is one of the most central people in movie land.
The crawler.R script contains three filter functions that can be used as the selection argument to the crawl function:
- living_people_filter ensures that the crawler only adds nodes to the graph for pages that fall in the category http://simple.wikipedia.org/wiki/Category:Living_people.
- movie_filter ensures that the crawler only adds nodes to the graph for pages that fall in a category whose name contains the word “movie”.
- living_people_or_movie_filter combines the two filters.
For example:
> g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon",2,living_people_filter)
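If you want to compare the effect of the filters before exporting anything to Gephi, a quick sketch is to crawl with each filter and count nodes and edges with igraph (the step limit of 2 is the same as above, so the crawls may take a while):
url <- "http://simple.wikipedia.org/wiki/Kevin_Bacon"
g_living <- crawl(url, 2, living_people_filter)
g_movie  <- crawl(url, 2, movie_filter)
g_both   <- crawl(url, 2, living_people_or_movie_filter)
vcount(g_living); ecount(g_living)  # number of nodes and edges per graph
vcount(g_movie);  ecount(g_movie)
vcount(g_both);   ecount(g_both)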
Question 2
- Compare the output of applying the three filters, either by loading them (separately) into Gephi, or by running the plot function from within RStudio.
- What is the best filter to use if we want to test our hypothesis? Will the results still be manageable? Discuss.
- What should the step size be if you want to test our hypothesis? Will the results still be manageable? Discuss.
- Run the crawler using the filter and step size you think are best, and analyze the resulting graph in Gephi using some of the metrics we discussed in class. Try to describe some of the properties of the graph.
Question 3
- According to your analysis, what is the most important node in the network around Kevin Bacon? Do you think Kevin Bacon is the right choice if you want to prove that movie land is a small world? Are there alternatives?
- Run your crawler again on the URL that you think is most important, and compare the results with the Kevin Bacon graph.
- What happens if you merge the graphs? To do this, simply import both graphs into the edges table of a new Gephi project (an R-based alternative is sketched below). Discuss the results: do the graphs overlap?
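If you prefer to merge the two graphs in R rather than in Gephi, igraph’s graph.union should combine two named graphs by vertex name (here, the page URLs). A sketch, where other_graph stands for the graph you crawled from your alternative start URL:
merged <- graph.union(kevin_bacon_graph_2, other_graph)
vcount(merged)  # number of distinct pages in the merged graph
ecount(merged)  # number of links in the merged graph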
Finally
- With the merged graph loaded, go to the “Preview” pane in Gephi
- Select “Show Labels”, and click on “Refresh”
- Export the graph as PDF (bottom right).
VU University Amsterdam R Tutorials by Willem Robert van Hage is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
This tutorial has been developed within the COMBINE project supported by the ONR Global NICOP grant N62909-11-1-7060.