VU University Amsterdam
R Tutorials

R Tutorial – Wikipedia as a Graph

By Rinke Hoekstra and Willem Robert van Hage.

In this tutorial, we are going to crawl pages in Simple Wikipedia, explore this part of the Web link graph, and apply a number of analytical metrics to it.

Simple Wikipedia is a variant of Wikipedia that uses “Basic English”:

“Basic English (British American Scientific International Commercial) is a constructed (made-up) language to explain complex thoughts with 850 basic English words chosen by Charles Kay Ogden.”

We will use the open source R language and RStudio program for statistical analysis in combination with Gephi, an open source graph analysis tool. These programs will be used throughout the course, so make sure to familiarize yourself with the user interface.

Getting Started

Start R (or RStudio) and load the crawler script (crawler.R) by sourcing it from the location where you saved it, for example:

    > source('~/WebInformation/crawler.R')  
    Loading required package: bitops  
    Warning message: 
    package 'graph' was built under R version 2.15.1  

Your first crawl

The crawl function takes as arguments a URL to start from, a step_limit (to limit the depth of the search), and an optional selection function to filter the hyperlinks found in pages. It returns a graph object. It is called as follows:

    g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 1)

This will make the crawler start at http://simple.wikipedia.org/wiki/Kevin_Bacon, look up all hyperlinks in that page, and follow them until it reaches the step limit of 1 (i.e. it will not look at the hyperlinks in the pages it finds there):

    > g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 1)  
    [1] "Getting http://simple.wikipedia.org/wiki/Kevin_Bacon"  
    [1] "following http://simple.wikipedia.org/wiki/Kevin_Bacon"

The output is a graph with edges between the Kevin Bacon page and all pages it links to. (Note that g is a variable; the name “g” is arbitrarily chosen, and a more informative name such as “kevin_bacon_graph_1” may be used instead. Choosing informative names allows you to keep more than one graph available at a time.)

You can type plot(g) to get a rough idea of what that graph looks like.
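
Besides plotting, you can also inspect the graph object directly. A minimal sketch, assuming the crawled graph is an object from the graph package (which the package loading message in Getting Started suggests):

    > numNodes(g)       # number of pages (nodes) in the crawl
    > numEdges(g)       # number of hyperlinks (edges) between them
    > head(nodes(g))    # the first few page URLs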

Your first graph analysis
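
The save.graph function writes a crawled graph to a CSV file that can be opened in Gephi for the layout, ranking, and statistics steps below. The transcript underneath uses a graph called kevin_bacon_graph_2; this sketch assumes it was produced by a second crawl with a step limit of 2 (the variable name is just a suggestion):

    > kevin_bacon_graph_2 <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 2)

Saving this graph then looks like this: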

    > save.graph(kevin_bacon_graph_2,"kevin_bacon_graph_2")  
    [1] "Saving to: kevin_bacon_graph_2.csv"  
    [1] "Done"  

Layout & Labels

Ranking & Node Size

Tip: you can also select the color disc to color the nodes in your graph based on a particular ranking parameter.

Analyzing & Statistics

Take some time to play around!

Question 1

  1. Compare the nodes in the graph to the pages in Simple Wikipedia. Did the crawler do a good job?
  2. What do you think are the most important pages in the graph of Kevin Bacon? (For a quick first impression from within R, see the sketch after this list.)
  3. What are the most important clusters in the graph?
  4. Compare the size of the Kevin Bacon graphs generated with a step limit of 1 and a step limit of 2. What do you expect to happen if we increase the step limit to 3, or higher?
  5. What would be a good way to limit the number of nodes in our graph?
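
For a rough first answer to question 2 from within R (before switching to Gephi), you can count how often each page occurs as a link target. This is a minimal sketch; it assumes the graph package accessor edges works on the crawled graph g:

    > in_links <- table(unlist(edges(g)))           # how often each page is linked to
    > head(sort(in_links, decreasing = TRUE), 10)   # the ten most linked-to pages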

Six Degrees of Kevin Bacon

Now that you’re familiar with the basics of crawling and graph analysis, we will test the hypothesis that Kevin Bacon is one of the most central people in movie land.

The crawler.R script contains two filter functions that can be used as the selection function argument to the crawl function, for example:

    > g <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon",2,living_people_filter)
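
To get a feeling for what a filter does, it can help to compare the filtered crawl with an unfiltered crawl of the same depth. A minimal sketch; the variable names are just suggestions, and numNodes assumes the graph package accessors apply to the crawled graphs:

    > g_all    <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 2)
    > g_living <- crawl("http://simple.wikipedia.org/wiki/Kevin_Bacon", 2, living_people_filter)
    > numNodes(g_all)      # size of the unfiltered graph
    > numNodes(g_living)   # size of the graph restricted by the filter
    > save.graph(g_living, "kevin_bacon_living_2")   # save for inspection in Gephi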

Question 2

  1. Compare the output of applying the filters (and of crawling without a filter), either by loading the resulting graphs (separately) into Gephi, or by running the plot function from within RStudio.
  2. What is the best filter to use if we want to test our hypothesis? Will the results still be manageable? Discuss.
  3. What should the step limit be if you want to test our hypothesis? Will the results still be manageable? Discuss.
  4. Run the crawler using the filter and step limit you think are best, and analyze the resulting graph in Gephi using some of the metrics we discussed in class. Try to describe some of the properties of the graph.

Question 3

  1. According to your analysis, what is the most important node in the network around Kevin Bacon? Do you think Kevin Bacon is the right choice if you want to prove that movie land is a small world? Are there alternatives?
  2. Run your crawler again on the URL that you think is most important, and compare the results with the Kevin Bacon graph.
  3. What happens if you merge the two graphs? (To do this, simply import both graphs into the edges table of a new Gephi project.) Discuss the results: do the graphs overlap?

Finally

VU University Amsterdam R Tutorials by Willem Robert van Hage is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

This tutorial has been developed within the COMBINE project supported by the ONR Global NICOP grant N62909-11-1-7060.