MS Computer Science Virtual University of Pakistan

CS726 Information Retrieval Techniques Solved Midterm Questions

INFORMATION RETRIEVAL TECHNIQUES (CS-726)

MIDTERM PAPERS

QUESTION NO.1:

Consider a very small collection C that consists of the following three documents:

d1: “red green rainbow”

d2: “red green blue”

d3: “yellow rainbow”

For all the documents, calculate the tf scores for all the terms in C. Assume that the words in the vectors are ordered alphabetically. Ignore idf values and normalization by maximum frequency.

Given the following query: “blue green rainbow”, calculate the tf vector for the query, and compute the score of each document in C relative to this query, using the cosine similarity measure. (Don’t forget to compute the lengths of the vectors).

What is the final order in which the documents are presented as result to the query?

SOLUTION:

Term Frequency (tf)

Term     Blue   Green   Rainbow   Red   Yellow   Length
d1       0      1       1         1     0        √3
d2       1      1       0         1     0        √3
d3       0      0       1         0     1        √2
Query    1      1       1         0     0        √3

The tf vector for the query “blue green rainbow” appears in the last row of the table. Now we compute the cosine similarity of each document against the query:

\displaystyle \cos\left( d_1, q \right) = \frac{1\times 1 + 1\times 1}{\sqrt{1^2+1^2+1^2}\cdot\sqrt{1^2+1^2+1^2}} = \frac{2}{\sqrt{3}\cdot\sqrt{3}} = \frac{2}{3} = 0.67

\displaystyle \cos\left( d_2, q \right) = \frac{1\times 1 + 1\times 1}{\sqrt{1^2+1^2+1^2}\cdot\sqrt{1^2+1^2+1^2}} = \frac{2}{\sqrt{3}\cdot\sqrt{3}} = \frac{2}{3} = 0.67

\displaystyle \cos\left( d_3, q \right) = \frac{1\times 1}{\sqrt{1^2+1^2+1^2}\cdot\sqrt{1^2+1^2}} = \frac{1}{\sqrt{3}\cdot\sqrt{2}} = \frac{1}{\sqrt{6}} = \frac{1}{2.45} = 0.41

Therefore, d1 and d2 are returned first (in either order, since they tie at 0.67), and d3 is ranked third.
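As a quick check, here is a minimal Python sketch (not part of the original solution) that reproduces these scores; the vectors follow the alphabetical term order blue, green, rainbow, red, yellow.

import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

d1 = [0, 1, 1, 1, 0]  # "red green rainbow"
d2 = [1, 1, 0, 1, 0]  # "red green blue"
d3 = [0, 0, 1, 0, 1]  # "yellow rainbow"
q  = [1, 1, 1, 0, 0]  # "blue green rainbow"

for name, d in (("d1", d1), ("d2", d2), ("d3", d3)):
    print(name, round(cosine(d, q), 2))  # d1 0.67, d2 0.67, d3 0.41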

QUESTION NO.2:

Compute variable byte and γ codes for the postings list (777, 17743, 294068, 31251336). Use gaps instead of docIDs where possible. Write binary codes in 8-bit blocks.

SOLUTION:

DocIDs:   777        17743      294068     31251336
Gaps:     777        16966      276325     30957268

Gap          VB code                                     γ code (unary length, offset)
777          00000110 10001001                           1111111110, 100001001
16966        00000001 00000100 11000110                  111111111111110, 00001001000110
276325       00010000 01101110 11100101                  1111111111111111110, 000011011101100101
30957268     00001110 01100001 00111101 11010100         1111111111111111111111110, 110110000101111011010100
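The codes in the table can be reproduced with the following Python sketch of standard VB and γ encoding (an illustration, not part of the original answer):

def vb_encode(n):
    # variable byte code: 7 payload bits per byte, high bit set on the last byte
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # continuation bit marks the final byte
    return " ".join(f"{b:08b}" for b in out)

def gamma_encode(n):
    # gamma code: unary code for the offset length, then the offset itself
    offset = bin(n)[3:]  # binary representation with its leading 1 removed
    return "1" * len(offset) + "0" + offset

for gap in (777, 16966, 276325, 30957268):
    print(gap, "|", vb_encode(gap), "|", gamma_encode(gap))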

QUESTION NO.3:

Exercise 5.3 Estimate the time needed for term lookup in the compressed dictionary of Reuters-RCV1 with block sizes of k = 4 (Figure 5.6, b), k = 8, and k = 16. What is the slowdown compared with k = 1 (Figure 5.6, a)?

SOLUTION:

We first search for the leaf in the binary tree, then search for the particular term within the block.

The average number of steps needed to look up a term is log2(N/k) − 1 + k/2.

For Reuters-RCV1, N = 400,000 terms.

k     Average steps
4     17.6
8     18.6
16    21.6

By the same formula, k = 1 gives about 18.1 steps, so the slowdown for k = 16 is roughly 21.6/18.1 ≈ 1.2 (and k = 4 is, by this estimate, even marginally faster than k = 1).
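The table can be reproduced by evaluating the formula directly (a small sketch, assuming base-2 logarithms):

import math

N = 400_000  # number of terms in the Reuters-RCV1 dictionary
for k in (4, 8, 16):
    steps = math.log2(N / k) - 1 + k / 2
    print(k, round(steps, 1))  # 4 -> 17.6, 8 -> 18.6, 16 -> 21.6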

QUESTION NO.4:

Exercise 5.2  Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k=8 and k=16 in blocked dictionary storage.

SOLUTION:

k = 8

Blocking saves (8 − 1) × 3 = 21 bytes of term pointers per block, but needs an additional k = 8 bytes for term lengths, so the net saving is 21 − 8 = 13 bytes per 8-term block.

Total space saved = 400,000 × 13 / 8 bytes = 0.65 MB

Total space = 7.6 − 0.65 = 6.95 MB

k = 16

Blocking saves (16 − 1) × 3 = 45 bytes of term pointers per block, but needs an additional k = 16 bytes for term lengths, so the net saving is 45 − 16 = 29 bytes per 16-term block.

Total space saved = 400,000 × 29 / 16 bytes = 0.725 MB

Total space = 7.6 − 0.725 = 6.875 MB
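The arithmetic generalizes to any block size k; a small sketch (assuming 3-byte term pointers, 1 extra byte per term length, and 1 MB = 10^6 bytes):

N = 400_000       # terms in the dictionary
baseline = 7.6    # MB, the dictionary-as-a-string baseline from the text
for k in (8, 16):
    net = (k - 1) * 3 - k              # net bytes saved per k-term block
    saved = N * net / k / 1e6          # MB saved over the whole dictionary
    print(k, saved, round(baseline - saved, 3))  # 8 -> 6.95 MB, 16 -> 6.875 MB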

QUESTION NO.5:

Exercise 5.8  From the following sequence of γ-coded gaps, reconstruct first the gap sequence and then the postings sequence: 1110001110101011111101101111011.

SOLUTION:

Sequence       Length    Offset   Gap   Posting
1110001        1110      001      9     9
11010          110       10       6     15
101            10        1        3     18
11111011011    111110    11011    59    77
11011          110       11       7     84

Hint:

To split the bit string into codes, read from the left: the run of 1s up to the first 0 is the unary length. For example, 1110 gives length 3, so the next 3 bits (001) are the offset, making the first code 1110001; we then continue from the following bit. Each code is the concatenation of a length and an offset.

To recover a gap, prepend a 1 to the offset and read the result in binary, e.g., offset = 001 gives 1001, which is 9 in decimal.

To recover the postings, accumulate the gaps: the first posting is 9; adding the next gap, 6, gives the second posting, 15; and so on.
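The hint translates directly into a small γ decoder (a sketch, not part of the original solution):

def gamma_decode(bits):
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # unary part: count 1s up to the 0 terminator
            length += 1
            i += 1
        i += 1                 # skip the terminating 0
        gaps.append(int("1" + bits[i:i + length], 2))  # restore the implicit leading 1
        i += length
    return gaps

gaps = gamma_decode("1110001110101011111101101111011")
postings, total = [], 0
for g in gaps:
    total += g
    postings.append(total)
print(gaps)      # [9, 6, 3, 59, 7]
print(postings)  # [9, 15, 18, 77, 84]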

QUESTION NO.6:

Exercise 6.19

Compute the vector space similarity between the query “digital cameras” and the document “digital cameras and video cameras” by filling out the empty columns in Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and document, idf weighting for the query only, and cosine normalization for the document only. Treat “and” as a stop word. Enter term counts in the tf columns. What is the final similarity score?

Table 6.1 (query columns: tf, wf, df, idf, qi = wf·idf; document columns: tf, wf, di = normalized wf; final column: qi·di)

word       df
digital    10,000
video      100,000
cameras    50,000

SOLUTION:

Word       Query: tf   wf   df        idf   qi = wf·idf    Document: tf   wf    di = normalized wf    qi·di
digital    1           1    10,000    3     3              1              1     0.52                  1.56
video      0           0    100,000   2     0              1              1     0.52                  0
cameras    1           1    50,000    2.3   2.3            2              1.3   0.68                  1.56

Similarity score = 1.56 + 0 + 1.56 = 3.12
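The cosine normalization step that the table compresses works out as follows, using the document wf vector (1, 1, 1.3) for digital, video, cameras:

\displaystyle \left| d \right| = \sqrt{1^2 + 1^2 + 1.3^2} = \sqrt{3.69} = 1.92

\displaystyle d_{digital} = \frac{1}{1.92} = 0.52,\quad d_{video} = \frac{1}{1.92} = 0.52,\quad d_{cameras} = \frac{1.3}{1.92} = 0.68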

QUESTION NO.7:

Exercise 5.6 Consider the postings list (4, 10, 11, 12, 15, 62, 63, 265, 268, 270, 400) with a corresponding list of gaps (4, 6, 1, 1, 3,47,1,202,3,2,130). Assume that the length of the postings list is stored separately, so the system knows when a postings list is complete. Using variable byte encoding: (i) What is the largest gap you can encode in 1 byte? (ii) What is the largest gap you can encode in 2 bytes? (iii) How many bytes will the above postings list require under this encoding? (Count only space for encoding the sequence of numbers.)

SOLUTION:

i. What is the largest gap you can encode in 1 byte?

In 1 byte: 2^7 − 1 = 127

ii. What is the largest gap you can encode in 2 bytes?

In 2 bytes: 2^14 − 1 = 16,383

iii. How many bytes will the above postings list require under this encoding?

13 bytes: nine of the eleven gaps are smaller than 128 and take 1 byte each, while the gaps 202 and 130 take 2 bytes each, giving 9 × 1 + 2 × 2 = 13.

QUESTION NO.8:

Wildcard word sh*pi* is given and we have to show all steps to find the word?

SOLUTION:

Consider the wildcard query sh*pi*. We are seeking documents containing a term that begins with sh and then contains pi. Accordingly, we run the Boolean query $s AND sh AND pi. This is looked up in the bigram (2-gram) index and yields a list of candidate terms such as shopping and shipping. Because the bigrams may occur anywhere in a term, each candidate must be post-filtered against the original query sh*pi*; each surviving term is then looked up in the standard inverted index to yield documents matching the query.
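A minimal Python sketch of this two-step lookup (the tiny vocabulary, including the contrived false positive sloshpit, is made up for illustration):

import re

def bigrams(term):
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

vocabulary = ["shipping", "shopping", "ship", "spinach", "sloshpit"]
index = {}  # bigram -> set of terms containing it (the k-gram index)
for t in vocabulary:
    for g in bigrams(t):
        index.setdefault(g, set()).add(t)

# Boolean query $s AND sh AND pi over the bigram index:
candidates = index["$s"] & index["sh"] & index["pi"]
print(sorted(candidates))  # ['shipping', 'shopping', 'sloshpit']

# Post-filter against the original wildcard sh*pi* to drop false positives:
matches = [t for t in sorted(candidates) if re.fullmatch(r"sh.*pi.*", t)]
print(matches)             # ['shipping', 'shopping']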

QUESTION NO.9:

Exercise 6.11 Can the tf-idf weight of a term in a document exceed 1?

SOLUTION:

YES, the tf-idf weight of a term in a document can exceed 1.

tf-idf is a family of measures for scoring a term with respect to a document (relevance). The simplest form of tf(word, document) is the number of times the word appears in the document, which by itself can already exceed 1. Adding the idf effect multiplies this by log(number of documents / number of documents in which the word is present); for example, a term occurring twice in a document and present in a tenth of the collection gets weight 2 × log 10 = 2 (with base-10 logarithms).

QUESTION NO.10:

Exercise 6.10

Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, and best, for each document, using the idf values from Figure 6.8.

term        Doc1   Doc2   Doc3
car         27     4      24
auto        3      33     0
insurance   0      33     29
best        14     0      17

Figure 6.9 Table of tf values for Exercise 6.10

term        dft       idft
car         18,165    1.65
auto        6,723     2.08
insurance   19,241    1.62
best        25,235    1.5

Figure 6.8 Example of idf values. Here we give the idfs of terms with various frequencies in the Reuters collection of 806,791 documents.

SOLUTION:

tf-idf = tf × idf

term        Doc1    Doc2    Doc3
car         44.55   6.6     39.6
auto        6.24    68.64   0
insurance   0       53.46   46.98
best        21      0       25.5

QUESTION NO.11:

Exercise 1.11 How should the Boolean queries x AND NOT y and x OR NOT y be handled, and what is the time complexity of evaluating each (where x and y also denote the lengths of the corresponding postings lists)?

SOLUTION:

  1. For x AND NOT y: time is O(x + y). Instead of collecting documents that occur in both postings lists, the merge collects those that occur in the first list and not in the second (see the sketch below).
  2. For x OR NOT y: time is O(N) (where N is the total number of documents in the collection), assuming we need to return a complete list of all documents satisfying the query. This is because the length of the results list is bounded only by N, not by the lengths of the postings lists.
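A minimal sketch of the O(x + y) merge from part 1 (sorted docID lists assumed):

def and_not(xs, ys):
    # keep docIDs present in xs but absent from ys, in one linear pass
    result, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:        # in both lists: excluded by NOT y
            i += 1
            j += 1
        elif xs[i] < ys[j]:       # only in x's list: keep it
            result.append(xs[i])
            i += 1
        else:                     # ys[j] < xs[i]: advance y
            j += 1
    result.extend(xs[i:])         # the tail of xs cannot occur in ys
    return result

print(and_not([1, 3, 5, 7, 9], [2, 3, 4, 9]))  # [1, 5, 7]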

QUESTION NO.12:

Exercise 2.3

The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated? Give your reasoning.

  1. Abandon/abandonment
  2. Absorbency/absorbent
  3. Marketing/market
  4. University/universe
  5. Volume/volumes

SOLUTION:

  1. Conflate: very similar semantics.
  2. Conflate: very similar semantics.
  3. Don’t conflate: marketing is too different from market.
  4. Don’t conflate: University is too different from universe.
  5. Conflate: same semantics in most contexts.

QUESTION NO.13:

Exercise 1.9 [⋆⋆]

For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t.

SOLUTION:

The order is not guaranteed to be optimal. Consider three terms with postings list sizes s1 = 100, s2 = 105, and s3 = 110. Suppose the intersection of s1 and s2 has length 100, while the intersection of s1 and s3 has length 0. The ordering (s1, s2, s3) requires 100 + 105 + 100 + 110 = 415 steps through the postings lists, whereas the ordering (s1, s3, s2) requires only 100 + 110 + 0 + 0 = 210 steps, because the first intersection is empty.

QUESTION NO.14:

Exercise 3.6

Give an example of a sentence that falsely matches the wildcard query mon*h if the search were to simply use a conjunction of bigrams.

SOLUTION:

The bigram conjunction for mon*h is: $m AND mo AND on AND h$

A sentence containing a term such as “moonish” falsely matches: its bigrams

$m, mo, oo, on, ni, is, sh, h$

contain all four query bigrams, yet moonish does not match mon*h, because it does not begin with mon.

QUESTION NO.15:

Exercise 5.13

  • Show that the size of the vocabulary is finite according to Zipf’s law and infinite according to Heaps’ law.
  • Can we derive Heaps’ law from Zipf’s law?

SOLUTION:

  • According to Heaps’ law, the vocabulary size M keeps growing with the number of tokens and hence is infinite. Under Zipf’s law, the vocabulary size is a fixed constant; the law assumes that the vocabulary is fixed and finite.
  • No. Zipf’s law states that the vocabulary is finite, so it would follow that Heaps’ law does not hold.

Note that our definition of Zipf’s law states that the exponent is −1. Sometimes the definition is relaxed to include exponents x < −1. In that case, the vocabulary is infinite.

QUESTION NO.16:

Exercise 6.9

What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

SOLUTION:

The idf is zero: idf = log(N/N) = log 1 = 0.

Like a stop word, such a term occurs in every document, so it tells us nothing about which documents are relevant.

Or

It is 0. For a word that occurs in every document, putting it on the stop list has the same effect as idf weighting: the word is ignored.

QUESTION NO.17:

Exercise 3.12

For each of the prefixes of the query – catched, catched in and catched in the – we have a number of substitute prefixes arising from each term and its alternatives. Suppose that we are to retain only the top 10 of these substitute prefixes, as measured by their number of occurrences in the collection. We eliminate the rest from consideration for extension to longer prefixes: thus, if batched in is not one of the 10 most common 2-term queries in the collection, we do not consider any extension of batched in as possibly leading to a correction of catched in the rye. How many of the possible substitute prefixes are we eliminating at each phase?

SOLUTION:

At catched in, we choose 10 out of the 6 × 6 = 36 possible two-term prefixes, thus eliminating 26. At catched in the, the 10 survivors each extend six ways, so out of 60 alternatives we eliminate 50; at catched in the rye, we again eliminate 50 of the 60 surviving alternatives.

QUESTION NO.18:

Difference between information retrieval system and search engine?

SOLUTION:

Information retrieval is a broad field of research that encompasses many kinds of algorithms, from ranking to text summarization, text retrieval, and indexing. It is broader because it involves retrieving small pieces of information from huge collections. Searching is one application of IR.

The process of information retrieval starts with the input of a query, followed by processing of the query and matching of the search terms against the information in the database; the matching documents are then ranked by relevance.

Performance is evaluated with two measures, precision and recall. Searching is a narrower domain: a search algorithm is any algorithm that finds data within a data structure, using techniques such as linear search, binary search, and hashing. Searching in web search engines is based on information retrieval ranking algorithms.

Or

Information retrieval is an umbrella term for any retrieval mechanism that operates on an unstructured data corpus. IR is therefore the broader notion, while a search engine is one type of information retrieval system that works on web documents collected by a crawler (spider robot).

In most usage, an information retrieval system refers to retrieval from a particular document corpus, such as a digital library, while a search engine retrieves from content scattered across the web. A search engine does not usually store the documents themselves; it keeps only an index and a cache for fast retrieval.

Most of the retrieval algorithms and tools used by both are similar.

Generally, information retrieval is a broad discipline for retrieving content from unstructured corpora such as digital libraries, the web, news feeds, and social network posts; a search engine works best for web content only.

QUESTION NO.20:

Exercise 3.7

If |si| denotes the length of string si, show that the edit distance between s1 and s2 is never more than max{|s1|, |s2|}.

SOLUTION:

Consider two character strings s1 and s2. Without any loss of generality, assume that l1 = length(s1) ≤ l2 = length(s2). To transform s1 into s2, we can replace each character of s1 by the corresponding one of the first l1 characters of s2 and then insert the remaining l2 − l1 characters of s2 at the end of s1. The number of operations incurred is at most l2 = max(l1, l2).

Note: the edit distance problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation.
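The bound can be spot-checked against the standard dynamic-programming edit distance (a sketch, not part of the original solution):

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(edit_distance("cat", "faster"))  # 4, which is <= max(3, 6) = 6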

QUESTION NO.21:

Exercise 2.1

Are the following statements true or false?

  1. In a Boolean retrieval system, stemming never lowers precision.
  2. In a Boolean retrieval system, stemming never lowers recall.
  3. Stemming increases the size of the vocabulary.
  4. Stemming should be invoked at indexing time but not while processing a query.

SOLUTION:

  1. False
  2. True
  3. False
  4. False

QUESTION NO.22:

Consider a query and a document collection consisting of three documents. Rank the documents using the vector space model. Assume the tf-idf weighting scheme.

Query: “gold silver truck”

Document Collection:

d1: “Shipment of gold arrived in a truck.”

d2: “Shipment of gold damaged in a fire.”

d3: “Delivery of silver arrived in a silver truck.” 

SOLUTION:

            Term frequency (tf)                        tf-idf
Term        Q   D1   D2   D3   df   idf = log(N/df)    Q      D1     D2     D3
a           0   1    1    1    3    0                  0      0      0      0
arrived     0   1    0    1    2    0.18               0      0.18   0      0.18
damaged     0   0    1    0    1    0.48               0      0      0.48   0
delivery    0   0    0    1    1    0.48               0      0      0      0.48
fire        0   0    1    0    1    0.48               0      0      0.48   0
gold        1   1    1    0    2    0.18               0.18   0.18   0.18   0
in          0   1    1    1    3    0                  0      0      0      0
of          0   1    1    1    3    0                  0      0      0      0
shipment    0   1    1    0    2    0.18               0      0.18   0.18   0
silver      1   0    0    2    1    0.48               0.48   0      0      0.96
truck       1   1    0    1    2    0.18               0.18   0.18   0      0.18

S(Q,D1) = (Q * D1) / (|Q| * |D1|) = 0.33

S(Q,D2) = 0.08

S(Q,D3) = 0.82

Ranking: D3, D1, D2
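The scores can be reproduced with a short Python sketch (raw tf × idf with idf = log10(N/df) and cosine similarity; not part of the original solution):

import math
from collections import Counter

docs = {
    "D1": "shipment of gold arrived in a truck",
    "D2": "shipment of gold damaged in a fire",
    "D3": "delivery of silver arrived in a silver truck",
}
query = "gold silver truck"

N = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(text.split()))          # document frequency per term
idf = {t: math.log10(N / df[t]) for t in df}

def tfidf(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    return dot / (norm(u) * norm(v))

q = tfidf(query)
for name, text in docs.items():
    print(name, round(cosine(q, tfidf(text)), 2))  # D1 0.33, D2 0.08, D3 0.82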

QUESTION NO.23:

Exercise 5.16

Give some examples of terms that violate the assumption that gaps all have the same size (which we made when estimating the space requirements of a γ-encoded index). What are general characteristics of these terms?

SOLUTION:

For a news collection like Reuters RCV-1, examples of such terms are tennis, football, music, etc. Because the collection contains news gathered over a period of time, words like tennis and football occur in bursts, appearing often in documents from certain periods and rarely in documents collected at other times, so their gaps vary widely. Although the general characteristics of these terms depend on the type of collection, making a common property difficult to formulate, they tend to be terms that are rare overall but temporally or topically clustered.

QUESTION NO.24:

Exercise 5.17

Consider a term whose postings list has size n, say, n = 10,000. Compare the size of the γ-compressed gap-encoded postings list if the distribution of the term is uniform (i.e., all gaps have the same size) versus its size when the distribution is not uniform. Which compressed postings list is smaller?

SOLUTION:

Take n = 2 gaps summing to 26 as a small illustration. Encoding the uniform list (13, 13) needs 7 + 7 = 14 bits, whereas encoding the non-uniform list (2, 24) needs 3 + 9 = 12 bits.

In this case the non-uniform list is smaller, because γ encodes small gaps very efficiently.
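The bit counts follow from the length of a γ code, which for a gap G is

\displaystyle \left| \gamma(G) \right| = 2\left\lfloor \log_2 G \right\rfloor + 1

so γ(13) takes 2 × 3 + 1 = 7 bits (hence 14 bits for the uniform list), while γ(2) takes 3 bits and γ(24) takes 2 × 4 + 1 = 9 bits (hence 12 bits for the non-uniform list).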

QUESTION NO.25:

Exercise 6.14

If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.

SOLUTION:

If jealousy and jealous are stemmed to a common term t, then their tf’s and their df’s would be added together in computing the tf-idf weights.

QUESTION NO.26:

Exercise 3.3

If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?

SOLUTION:

ng$s* (rotate the query term so that the wildcard comes last: s*ng → ng$s*)

QUESTION NO.27:

Exercise 3.11

Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?

SOLUTION:

6 × 6 × 6 × 6 = 6^4 = 1296