R Tutorial - Alice in Wonderland
By Rinke Hoekstra and Willem Robert van Hage.
In this tutorial, you will (partially) reproduce a search engine.
We will use the open source R language and the RStudio program for statistical analysis.
Getting Started
- Start RStudio. It makes sense to start a new “project” in a directory of your choosing.
- This tutorial uses the RCurl, XML, igraph and bitops libraries. You can install these on Linux or OS X like this:
install.packages(c('RCurl','XML','igraph','bitops'), dependencies=TRUE)
On Windows you will have to manually download the packages from this directory and install them like this:
install.packages(choose.files(), repos=NULL)
and then select the downloaded packages.
- Download alice.R, save it to your project directory, and open it in RStudio.
- Click on “Source” in the top right corner of the pane in which the file was loaded.
- Note that alice.R relies on the crawler.R file of the Wikipedia as a Graph tutorial, so they both need to be in the same directory. Also check that the working directory is set to the directory that contains the files: you can check this using getwd() and set the working directory using setwd("[THE CORRECT PATH]"), replacing [THE CORRECT PATH] with the path to the directory containing your files. A minimal setup sketch follows after this list.
- Download alice.txt, and save it to your project directory as well.
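Before sourcing alice.R, it can help to check that everything is in place. The following is a minimal setup sketch, assuming your project directory contains alice.R, crawler.R and alice.txt (adjust the path to your own setup):
setwd("~/alice")       # replace with the path to your project directory
library(RCurl)         # used for downloading pages
library(XML)           # used for parsing HTML/XML
library(igraph)
library(bitops)
source("crawler.R")    # helpers from the Wikipedia as a Graph tutorial
source("alice.R")      # read_plain_text, process_alice, search, etc.
Clicking “Source” on an open file in RStudio is equivalent to calling source() on it.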
Alice in Web Science Land
The alice.txt file contains the full text of Alice in Wonderland. We will use this book to observe some of the things we have discussed in class (the term frequency matrix, Heaps' law, Zipf's law), and to run a search.
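If the term frequency matrix is new to you, the following toy example (not part of alice.R) shows the idea with plain base R: rows are terms, columns are documents, and each cell counts how often a term occurs in a document.
docs <- c(chapter1 = "alice saw the white rabbit",
          chapter2 = "the rabbit was late very late")
tokens <- strsplit(docs, " ")                              # crude whitespace tokenizer
vocab  <- unique(unlist(tokens))                           # all distinct terms
tfm_toy <- sapply(tokens, function(t) table(factor(t, levels = vocab)))
tfm_toy                                                    # a small terms-by-documents count matrix
The matrix computed by process_alice below is the same kind of object, just much larger and built per chapter.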
Preparation
- First, load the file using the read_plain_text function:
> alice_text <- read_plain_text("alice.txt")
- The variable alice_text now contains all of the text, line by line. Since we want to do a search across chapters, we will split the text into separate chapters using the split() function:
> alice_chapters <- split(alice_text)
- The next step is to calculate the term frequency matrix for Alice in Wonderland. We do this by running process_alice. This function tokenizes each chapter and calculates the term frequency matrix per chapter (a sketch of what such a tokenizer might look like follows after this list):
> alice_index <- process_alice(alice_text,alice_chapters)
- Have a look at the term frequency matrix. We first give it a more convenient name, and then call the View function:
> alice_tfm <- alice_index$tfm
> View(alice_tfm)
- Have a look at the top 30 most common words in the entire collection:
> sort(alice_index$total$dict,decreasing=TRUE)[1:30]
- Have a look at the 30 least common words in the entire collection:
> sort(alice_index$total$dict,decreasing=FALSE)[1:30]
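To get a feel for what the tokenizer has to do (relevant for Question 1 below), here is a minimal, hypothetical tokenizer in base R. It is only a sketch; the actual tokenizer used by process_alice lives in alice.R and may behave differently.
simple_tokenize <- function(text) {
  text <- tolower(text)                      # case folding
  text <- gsub("[^a-z' ]", " ", text)        # replace punctuation and digits with spaces
  tokens <- unlist(strsplit(text, "[ ]+"))   # split on runs of whitespace
  tokens[tokens != ""]                       # drop empty strings
}
simple_tokenize("‘Curiouser and curiouser!’ cried Alice.")
Think about what such a scheme does to contractions, hyphenated words and names when answering Question 1.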
Question 1
- Have a closer look at the term frequency matrix. Given this table, how well do you think the tokenizer did its job? Do you see any room for improvement?
- You just saw the top 30 most common words in the collection. What is the proper term for this table if it contained all words? (and why the top 30?)
- Looking at the top 30, does it make sense to include all of them in the stop words list?
- Looking at the top 30, do you think the frequency follows Zipf’s law? You can run z <- zipf(alice_index$total) to find out. What does it mean again? (A plain-R rank-frequency plot is also sketched after this list.)
- Similarly, run h <- heaps(alice_index$total) to determine whether Heaps' law applies. What does it mean again?
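A common way to eyeball Zipf's law is a rank-frequency plot on log-log axes, where a Zipfian distribution appears as a roughly straight line. The following sketch does this with plain base R, assuming alice_index$total$dict is a named vector of term counts (as the sort() calls above suggest); the zipf() helper from alice.R may produce a nicer version.
freqs <- sort(alice_index$total$dict, decreasing = TRUE)   # frequencies, most common first
plot(seq_along(freqs), as.numeric(freqs), log = "xy",
     xlab = "rank", ylab = "frequency",
     main = "Rank-frequency plot (Zipf)")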
Doing some Search in Wonderland
- To see how well we can search using the term frequency matrix, call the search function, using a query of your choice:
> search("your query here",alice_index$names,alice_tfm)
- Using the term frequency matrix, we can easily compute the inverse document frequency (… how?) and the TFIDF score; one common definition is sketched after this list. Have a look at the IDF value of one (or more) terms from your query:
> compute_idf(alice_tfm)["your term here"]
- Compute the TFIDF matrix as follows:
> alice_tfidfm <- compute_tfidf(alice_tfm)
- Run the same query again, but now using the alice_tfidfm matrix:
> search("your query here",alice_index$names,alice_tfidfm)
- Have a look at the TFIDF matrix by running View:
> View(alice_tfidfm)
- Play around with a couple of queries to come up with interesting results.
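As a hint for the “(… how?)” above: one common definition, sketched here in base R, is idf(t) = log(N / df(t)), where N is the number of documents (chapters) and df(t) is the number of chapters containing term t; the TFIDF weight is then tf * idf. This assumes alice_tfm is a terms-by-chapters matrix of raw counts; compute_idf and compute_tfidf in alice.R may use a slightly different variant (different log base, smoothing, normalization).
N   <- ncol(alice_tfm)                 # number of chapters
df  <- rowSums(alice_tfm > 0)          # in how many chapters each term occurs
idf <- log(N / df)                     # inverse document frequency
tfidf_sketch <- alice_tfm * idf        # weight each term's row by its IDF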
Question 2
- How do you assess the quality of the TF-based search? What can you say about precision and recall?
- Computing the TFIDF matrix takes quite some time. Why is that, do you think?
- Compare the TF-based search with the TFIDF-based search. What has changed? How do the searches compare in quality? What can you say about precision and recall?
- Alice in Wonderland only has 11 chapters; what can you say about the precision at k, where k=5, for both types of searches? (A small precision-at-k helper is sketched after this list.)
- What would happen to precision and recall if you implemented the improvements to the tokenizer you suggested in question 1a?
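For the precision-at-k questions above, a tiny hypothetical helper (not part of alice.R) can make the bookkeeping explicit: given a ranked vector of result names and the set of results you judge relevant, it returns the fraction of the top k that is relevant.
precision_at_k <- function(ranked_results, relevant, k) {
  topk <- head(ranked_results, k)      # the k highest-ranked results
  sum(topk %in% relevant) / k          # fraction of them that is relevant
}
# Example with made-up relevance judgements:
precision_at_k(c("CHAPTER V", "CHAPTER I", "CHAPTER X"),
               relevant = c("CHAPTER I", "CHAPTER II"), k = 3)
Note that you have to decide yourself which chapters or pages count as relevant for your query.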
Simple Wikipedia
Now that you have had a taste of searching through Wonderland, we will turn to the Web again. Wikipedia offers a simple web search form that we can use to find pages about a particular topic. We will use this form to build our own collection of information resources about that topic.
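The process_wikipedia_results function takes care of the retrieval for you. Purely as an illustration of what such retrieval could look like, here is a hedged sketch that queries the MediaWiki search API of Simple Wikipedia with RCurl and XML and lists the titles of the first hits; the tutorial's own code may work quite differently (for example by scraping the search form directly).
library(RCurl)
library(XML)
query <- "web science"                                      # any topic you like
url <- paste0("https://simple.wikipedia.org/w/api.php",
              "?action=query&list=search&format=xml&srlimit=10",
              "&srsearch=", URLencode(query, reserved = TRUE))
doc    <- xmlParse(getURL(url))                             # fetch and parse the XML response
titles <- xpathSApply(doc, "//p", xmlGetAttr, "title")      # result titles
titles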
Preparation
- Call the process_wikipedia_results function with a query of your choice (run the same query yourself, in parallel, on the http://simple.wikipedia.org website, so you can compare the results):
> wikipedia_index <- process_wikipedia_results("your query here")
- The query retrieved 50 result pages, tokenized them, and computed the term frequency matrix.
- Have a look at the term frequency matrix. We first give it a more convenient name, and then call the View function:
> wikipedia_tfm <- wikipedia_index$tfm
> View(wikipedia_tfm)
- Have a look at the top 30 most common words in the entire collection:
> sort(wikipedia_index$total$dict,decreasing=TRUE)[1:30]
- Have a look at the 30 least common words in the entire collection:
> sort(wikipedia_index$total$dict,decreasing=FALSE)[1:30]
- Test the “rule of 30” (the 30 most common words account for roughly 30% of all tokens): does it apply?
> sorted_wikipedia <- sort(wikipedia_index$total$dict,decreasing=TRUE)
> sum(sorted_wikipedia[1:30])
> sum(sorted_wikipedia[31:length(sorted_wikipedia)])
Question 3
- Will your suggested improvements to the tokenizer have the same effect here? Are there additional things you’d like to improve?
- Looking at the top 30, does it make sense to include all of them in the stop words list?
- Looking at the top 30, do you think the frequency follows Zipf’s law? You can run z <- zipf(wikipedia_index$total) to find out.
- Similarly, run h <- heaps(wikipedia_index$total) to determine whether Heaps' law applies.
Doing some Simple Search on Wikipedia
The wikipedia_index variable now contains the term frequency matrix for 50 pages from Simple Wikipedia.
- To see how well we can search using the term frequency matrix, call the search function, using a query of your choice:
> search("your query here",wikipedia_index$names,wikipedia_tfm)
- Have a look at the IDF score of every term in your query:
> get_idf("your term here", wikipedia_tfm)
Question 4
- Discuss the IDF scores for the terms.
- Compute the IDF scores for the terms in the original query (the query you used in process_wikipedia_results), and discuss them.
Continuing…
- Compute the TFIDF matrix as follows:
> wikipedia_tfidfm <- compute_tfidf(wikipedia_tfm)
- Run the same query again, but now using the wikipedia_tfidfm matrix (a sketch of how such a TFIDF-based ranking might work follows after this list):
> search("your query here",wikipedia_index$names,wikipedia_tfidfm)
- Have a look at the TFIDF matrix by running View:
> View(wikipedia_tfidfm)
- Play around with a couple of queries to come up with interesting results.
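To make the scoring step less of a black box, here is a sketch of one simple ranking scheme: sum, per document, the TFIDF weights of the query terms and sort by that score. It assumes wikipedia_tfidfm is a terms-by-documents matrix with terms as row names and that wikipedia_index$names lists the documents in column order; the actual search function in alice.R may score documents differently (for example with cosine similarity).
rank_by_query <- function(query, tfidfm, names) {
  terms  <- tolower(unlist(strsplit(query, "[ ]+")))   # naive query tokenization
  terms  <- intersect(terms, rownames(tfidfm))         # keep only terms we have weights for
  scores <- colSums(tfidfm[terms, , drop = FALSE])     # sum of weights per document
  data.frame(page = names, score = scores)[order(scores, decreasing = TRUE), ]
}
rank_by_query("your query here", wikipedia_tfidfm, wikipedia_index$names)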
Question 5
- How do you assess the quality of the TF-based search? What can you say about precision and recall?
- Compare the TF-based search with the TFIDF-based search. What has changed? How do the searches compare in quality? What can you say about precision and recall?
- What is the effect of using the same query in the process_wikipedia_results and search functions, for the TF-based and the TFIDF-based search? Explain.
- We retrieved only 50 information resources from Wikipedia; what can you say about the precision at k, where k=10, for a query of your choice?
- How many resources do you think you need to retrieve to get better search results (in terms of precision and recall), and why?
BONUS
Run the Wikipedia processing script again, but with a larger limit:
> wikipedia_index <- process_wikipedia_results("your query here", limit=YOURLIMIT)
Compare the results with those for the limit of 50.
VU University Amsterdam R Tutorials by Willem Robert van Hage is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
This tutorial has been developed within the COMBINE project supported by the ONR Global NICOP grant N62909-11-1-7060.