VU University Amsterdam
R Tutorials

R Tutorial - Alice in Wonderland

By Rinke Hoekstra and Willem Robert van Hage.

In this tutorial, you will (partially) reproduce a search engine.

We will use the open source R language and the RStudio environment for statistical analysis.

Getting Started

Alice in Web Science Land

The alice.txt file contains the full text of Alice in Wonderland. We will use this book to observe some of the things we have discussed in class (the term frequency matrix, Heaps’ law, Zipf’s law), and to run some searches.

Preparation

    > alice_text <- read_plain_text("alice.txt")
    > alice_chapters <- split(alice_text)
    > alice_index <- process_alice(alice_text,alice_chapters)
    > alice_tfm <- alice_index$tfm
    > View(alice_tfm)
    > sort(alice_index$total$dict,decreasing=TRUE)[1:30]
    > sort(alice_index$total$dict,decreasing=FALSE)[1:30]
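
The helper functions used above (read_plain_text, split, process_alice) ship with the course materials and are treated as black boxes here. As a rough mental model only — the real helpers will differ — a term frequency matrix can be built in base R like this, on a made-up two-chapter toy text:

```r
# Toy sketch, base R only: the course helpers do more (file I/O,
# better tokenization), so treat this as an illustration, not their code.
chapters <- c(chapter1 = "Alice was beginning to get very tired",
              chapter2 = "Alice opened the door and found a garden")

# Naive tokenizer: lowercase, then split on anything that is not a letter.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

token_lists <- lapply(chapters, tokenize)
vocab <- sort(unique(unlist(token_lists)))

# Rows = terms, columns = documents, cells = raw term counts.
tfm_toy <- sapply(token_lists, function(toks) {
  as.integer(table(factor(toks, levels = vocab)))
})
rownames(tfm_toy) <- vocab

# Collection-wide term frequencies, like the sorted dict above.
total_toy <- sort(table(unlist(token_lists)), decreasing = TRUE)
```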

Question 1

  1. Have a closer look at the term frequency matrix. Given this table, how well do you think the tokenizer did its job? Do you see any room for improvement?
  2. You just saw the top 30 most common words in the collection. What is the proper term for this table if it contained all words? (and why the top 30?)
  3. Looking at the top 30, does it make sense to include all of them in the stop words list?
  4. Looking at the top 30, do you think the frequency follows Zipf’s law? You can run z <- zipf(alice_index$total) to find out. What does it mean again?
  5. Similarly, run h <- heaps(alice_index$total) to determine whether Heaps’ law applies. What does it mean again?
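
If you want an intuition for what zipf() and heaps() measure before running them, the sketch below reconstructs both checks in base R on synthetic data; the course functions may fit and plot things differently.

```r
# Illustrative sketch only, on a synthetic Zipf-distributed token stream.
set.seed(1)
toy_vocab <- paste0("w", 1:1000)
probs <- 1 / (1:1000)
tokens <- sample(toy_vocab, 50000, replace = TRUE, prob = probs / sum(probs))

# Zipf's law: on a log-log scale, frequency vs. rank is near-linear,
# with a slope close to -1 for data like this.
freqs <- sort(table(tokens), decreasing = TRUE)
zipf_fit <- lm(log(as.numeric(freqs)) ~ log(seq_along(freqs)))
zipf_slope <- coef(zipf_fit)[2]

# Heaps' law: vocabulary size V(n) after n tokens grows like K * n^beta.
# Real English text typically gives beta around 0.4-0.6; this finite toy
# vocabulary saturates, so the estimate here comes out lower.
n <- seq(1000, 50000, by = 1000)
V <- sapply(n, function(k) length(unique(tokens[1:k])))
heaps_fit <- lm(log(V) ~ log(n))
heaps_beta <- coef(heaps_fit)[2]
```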

Doing some Search in Wonderland

    > search("your query here",alice_index$names,alice_tfm)
    > compute_idf(alice_tfm)["your term here"]
    > alice_tfidfm <- compute_tfidf(alice_tfm)
    > search("your query here",alice_index$names,alice_tfidfm)
    > View(alice_tfidfm)
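
compute_idf, compute_tfidf, and search are likewise course-provided black boxes. The sketch below shows one common way such a pipeline is put together (raw-count TF, log(N/df) IDF, cosine-similarity ranking); the weighting details and names are assumptions, not the course's exact code.

```r
# Hedged sketch of a TF-IDF + cosine-similarity search in base R.
# Toy term frequency matrix: rows = terms, columns = documents.
tiny_tfm <- matrix(c(3, 0, 1,
                     0, 2, 2,
                     1, 1, 0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("alice", "queen", "rabbit"),
                                   c("doc1", "doc2", "doc3")))

# IDF: log of (number of documents / number of documents containing
# the term). Other variants (smoothing, +1) are also common.
compute_idf_sketch <- function(tfm) {
  df <- rowSums(tfm > 0)
  log(ncol(tfm) / df)
}

# TF-IDF: weight each raw count by its term's IDF (recycled down rows).
tiny_tfidf <- tiny_tfm * compute_idf_sketch(tiny_tfm)

# Rank documents by cosine similarity to a bag-of-words query.
search_sketch <- function(query, m) {
  q_terms <- strsplit(tolower(query), "[^a-z]+")[[1]]
  q <- as.numeric(rownames(m) %in% q_terms)
  sims <- apply(m, 2, function(d) {
    denom <- sqrt(sum(q^2)) * sqrt(sum(d^2))
    if (denom == 0) 0 else sum(q * d) / denom
  })
  sort(sims, decreasing = TRUE)
}

search_sketch("queen rabbit", tiny_tfidf)  # doc2 should rank first
```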

Question 2

  1. How do you assess the quality of the TF-based search? What can you say about precision and recall?
  2. Computing the TFIDF matrix takes quite some time. Why is that, do you think?
  3. Compare the TF-based search with the TFIDF-based search. What has changed? How does the quality of the two searches compare? What can you say about precision and recall?
  4. Alice in Wonderland has only 11 chapters. What can you say about the precision at k, where k = 5, for both types of searches?
  5. What would happen to precision and recall if you implemented the improvements to the tokenizer that you suggested in Question 1.1?
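
Precision at k, which several of the questions above ask about, is simple to compute once you have a ranking and relevance judgments. A sketch with made-up chapter rankings and judgments:

```r
# Precision at k: the fraction of the top-k ranked results that are
# relevant. The ranking and judgments below are invented for illustration.
precision_at_k <- function(ranked, relevant, k) {
  k <- min(k, length(ranked))
  mean(ranked[1:k] %in% relevant)
}

ranked   <- c("ch7", "ch1", "ch11", "ch3", "ch5", "ch2")  # hypothetical ranking
relevant <- c("ch1", "ch7", "ch9")                        # hypothetical judgments

precision_at_k(ranked, relevant, 5)  # 2 of the top 5 are relevant: 0.4
```

Note that with few relevant documents, precision at k is bounded from above: if only 3 chapters are relevant, precision at 5 can never exceed 3/5.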

Simple Wikipedia

Now that you have had a taste of searching through Wonderland, we will turn to the Web again. Wikipedia offers a simple web search form that we can use to find pages about a particular topic. We will use this form to build our own collection of information resources about that topic.

Preparation

    > wikipedia_index <- process_wikipedia_results("your query here")
    > wikipedia_tfm <- wikipedia_index$tfm
    > View(wikipedia_tfm)
    > sort(wikipedia_index$total$dict,decreasing=TRUE)[1:30]
    > sort(wikipedia_index$total$dict,decreasing=FALSE)[1:30]
    > sorted_wikipedia <- sort(wikipedia_index$total$dict,decreasing=TRUE)
    > sum(sorted_wikipedia[1:30])
    > sum(sorted_wikipedia[31:length(sorted_wikipedia)])
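
The last two sums compare how much of the total token mass the 30 most frequent terms carry against all the remaining terms together. A toy illustration of why that head is so heavy in a Zipf-like distribution:

```r
# Synthetic Zipf-ish counts for 200 terms: frequency proportional to 1/rank.
counts <- round(1000 / (1:200))

head_mass <- sum(counts[1:30])               # mass of the top 30 terms
tail_mass <- sum(counts[31:length(counts)])  # mass of the remaining 170

head_mass / (head_mass + tail_mass)  # share of all tokens taken by the top 30
```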

Question 3

  1. Will your suggested improvements to the tokenizer have the same effect here? Are there additional things you’d like to improve?
  2. Looking at the top 30, does it make sense to include all of them in the stop words list?
  3. Looking at the top 30, do you think the frequency follows Zipf’s law? You can run z <- zipf(wikipedia_index$total) to find out.
  4. Similarly, run h <- heaps(wikipedia_index$total) to determine whether Heaps’ law applies.

Doing some Simple Search on Wikipedia

The wikipedia_index variable now contains the term frequency matrix for 50 pages from Simple Wikipedia.

    > search("your query here",wikipedia_index$names,wikipedia_tfm)
    > get_idf("your term here", wikipedia_tfm)

Question 4

  1. Discuss the IDF scores for a few terms of your choice.
  2. Compute the IDF scores for the terms in the original query (the query you used in process_wikipedia_results), and discuss them.

Continuing…

    > wikipedia_tfidfm <- compute_tfidf(wikipedia_tfm)
    > search("your query here",wikipedia_index$names,wikipedia_tfidfm)
    > View(wikipedia_tfidfm)

Question 5

  1. How do you assess the quality of the TF-based search? What can you say about precision and recall?
  2. Compare the TF-based search with the TFIDF-based search. What has changed? How does the quality of the two searches compare? What can you say about precision and recall?
  3. What is the effect of using the same query in the process_wikipedia_results and search functions, for the TF-based and the TFIDF-based searches? Explain.
  4. We retrieved only 50 information resources from Wikipedia. What can you say about the precision at k, where k = 10, for a query of your choice?
  5. How many resources do you think you would need to retrieve to get better search results (in terms of precision and recall), and why?

BONUS

Run the wikipedia processing script again, but with a larger limit:

    > wikipedia_index <- process_wikipedia_results("your query here", limit=YOURLIMIT)

Compare the results with those you obtained with the limit of 50.

Creative Commons License
VU University Amsterdam R Tutorials by Willem Robert van Hage is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

This tutorial has been developed within the COMBINE project supported by the ONR Global NICOP grant N62909-11-1-7060.