__INFORMATION RETRIEVAL TECHNIQUES (CS-726)__

__MIDTERM PAPERS__

__QUESTION NO.1:__

Consider a very small collection C that consists of the following three documents:

d1: “red green rainbow”

d2: “red green blue”

d3: “yellow rainbow”

For all the documents, calculate the *tf* scores for all the terms in C. Assume that the words in the vectors are ordered alphabetically. Ignore *idf* values and normalization by maximum frequency.

Given the following query: “blue green rainbow”, calculate the *tf* vector for the query, and compute the score of each document in C relative to this query, using the cosine similarity measure. (Don’t forget to compute the lengths of the vectors.)

What is the final order in which the documents are presented as result to the query?

__SOLUTION:__

**Term Frequency (TF)**

| | Blue | Green | Rainbow | Red | Yellow | Length |
|---|---|---|---|---|---|---|
| D1 | 0 | 1 | 1 | 1 | 0 | √3 ≈ 1.73 |
| D2 | 1 | 1 | 0 | 1 | 0 | √3 ≈ 1.73 |
| D3 | 0 | 0 | 1 | 0 | 1 | √2 ≈ 1.41 |
| Query | 1 | 1 | 1 | 0 | 0 | √3 ≈ 1.73 |

The tf vector for the query “blue green rainbow” is (1, 1, 1, 0, 0). The cosine scores are:

cos(q, d1) = (0 + 1 + 1 + 0 + 0) / (√3 · √3) = 2/3 ≈ 0.67

cos(q, d2) = (1 + 1 + 0 + 0 + 0) / (√3 · √3) = 2/3 ≈ 0.67

cos(q, d3) = (0 + 0 + 1 + 0 + 0) / (√3 · √2) = 1/√6 ≈ 0.41

Therefore d1 and d2 are returned first (in either order, since they tie), and d3 is third.
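A minimal Python sketch of the same computation (the vector layout and names are ours, not part of the exercise):

```python
import math

# Term order is alphabetical: blue, green, rainbow, red, yellow.
docs = {
    "d1": [0, 1, 1, 1, 0],   # "red green rainbow"
    "d2": [1, 1, 0, 1, 0],   # "red green blue"
    "d3": [0, 0, 1, 0, 1],   # "yellow rainbow"
}
query = [1, 1, 1, 0, 0]      # "blue green rainbow"

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for name, vec in docs.items():
    print(name, round(cosine(query, vec), 3))   # d1 0.667, d2 0.667, d3 0.408
```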

__QUESTION NO.2:__

Compute variable byte and γ codes for the postings list (777, 17743, 294068, 31251336). Use gaps instead of docIDs where possible. Write binary codes in 8-bit blocks.

__SOLUTION:__

| docID | 777 | 17743 | 294068 | 31251336 |
|---|---|---|---|---|
| gap | 777 | 16966 | 276325 | 30957268 |
| VB code | 00000110 10001001 | 00000001 00000100 11000110 | 00010000 01101110 11100101 | 00001110 01100001 00111101 11010100 |
| γ code | 1111111110 100001001 | 111111111111110 00001001000110 | 1111111111111111110 000011011101100101 | 1111111111111111111111110 110110000101111011010100 |
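The VB row can be reproduced with a short sketch of the variable byte encoder described in IIR Chapter 5 (function name and output formatting are our own):

```python
def vb_encode(n):
    """Variable byte encoding: 7 payload bits per byte; the last byte of a
    number has its high (continuation) bit set to 1, all earlier bytes to 0."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128                      # mark the final byte
    return " ".join(f"{b:08b}" for b in chunks)

for gap in [777, 16966, 276325, 30957268]:
    print(gap, "->", vb_encode(gap))       # 777 -> 00000110 10001001, ...
```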

__QUESTION NO.3:__

**Exercise 5.3** Estimate the time needed for term lookup in the compressed dictionary of Reuters-RCV1 with block sizes of k = 4 (Figure 5.6, b), k = 8, and k = 16. What is the slowdown compared with k = 1 (Figure 5.6, a)?

__SOLUTION:__

We first find the correct block by searching down the binary tree, then search for the term within the block by linear scan.

The average number of steps needed to look up a term is log_{2}(N/k) − 1 + k/2.

For Reuters-RCV1, N = 400,000.

| k | Average steps | Slowdown vs. k = 1 |
|---|---|---|
| 1 | 18.1 | 1.00 |
| 4 | 17.6 | 0.97 |
| 8 | 18.6 | 1.03 |
| 16 | 21.6 | 1.19 |
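A quick check of the formula (script is ours; the 18.1-step baseline comes from plugging k = 1 into the same formula):

```python
import math

N = 400_000          # vocabulary size of Reuters-RCV1

def avg_steps(k):
    """Binary search down to a block of k terms (log2(N/k) - 1 steps on
    average), then a linear scan of k/2 terms inside the block."""
    return math.log2(N / k) - 1 + k / 2

base = avg_steps(1)
for k in (1, 4, 8, 16):
    print(k, round(avg_steps(k), 1), f"slowdown x{avg_steps(k) / base:.2f}")
# 1 18.1 x1.00, 4 17.6 x0.97, 8 18.6 x1.03, 16 21.6 x1.19
```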

__QUESTION NO.4:__

Exercise 5.2 Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k=8 and k=16 in blocked dictionary storage.

__SOLUTION:__

K = 8

We save (8 − 1) × 3 = 21 bytes on term pointers, but need an additional k = 8 bytes for term lengths, so space is reduced by a net 13 bytes per 8-term block.

Total space saved = 400,000 × 13 / 8 bytes = 0.65 MB

Total space: 7.6 − 0.65 = 6.95 MB

K = 16

We save (16 − 1) × 3 = 45 bytes on term pointers, but need an additional k = 16 bytes for term lengths, so space is reduced by a net 29 bytes per 16-term block.

Total space saved = 400,000 × 29 / 16 bytes = 0.725 MB

Total space: 7.6 − 0.725 = 6.875 MB
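A small sketch of the same arithmetic (constants and names are ours; 7.6 MB is the dictionary-as-a-string baseline assumed by the exercise):

```python
N_TERMS = 400_000      # Reuters-RCV1 vocabulary size
BASELINE_MB = 7.6      # dictionary-as-a-string size
PTR_BYTES = 3          # bytes per term pointer

def blocked_saving_mb(k):
    """Net bytes saved per k-term block: (k - 1) pointers of 3 bytes are
    dropped, but each of the k terms now needs a 1-byte length."""
    per_block = (k - 1) * PTR_BYTES - k
    return N_TERMS * per_block / k / 1e6

for k in (8, 16):
    saved = blocked_saving_mb(k)
    print(f"k={k}: save {saved:.3f} MB -> {BASELINE_MB - saved:.3f} MB")
# k=8:  save 0.650 MB -> 6.950 MB
# k=16: save 0.725 MB -> 6.875 MB
```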

__QUESTION NO.5:__

Exercise 5.8 From the following sequence of γ-coded gaps, reconstruct first the gap sequence and then the postings sequence: 1110001110101011111101101111011.

__SOLUTION:__

| Sequence | Length | Offset | Gap Sequence | Posting Sequence |
|---|---|---|---|---|
| 1110001 | 1110 | 001 | 1001 = 9 | 9 |
| 11010 | 110 | 10 | 110 = 6 | 15 |
| 101 | 10 | 1 | 11 = 3 | 18 |
| 1111101101 | 111110 | 11011 | 111011 = 59 | 77 |
| 11011 | 110 | 11 | 111 = 7 | 84 |

**Hint:**

To split the bit stream into sequences, read 1s from the left until the first 0 appears; this prefix (e.g. 1110) is the unary **length** code, and it tells you how many **offset** bits follow (e.g. 001). Length and offset together form one sequence (1110001), and decoding restarts at the next bit.

To recover a gap, prepend a 1 to the offset and read the result as binary, e.g. offset 001 → 1001 = 9 in decimal.

The posting sequence is the running sum of the gaps: the first posting is 9, the second is 9 + 6 = 15, the third is 15 + 3 = 18, and so on.
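A sketch of a γ decoder that reproduces the table (the function name and the bit stream as a Python string are our own choices):

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma codes into gaps and postings.
    Read 1s until a 0 (the unary length), then read that many offset
    bits and prepend the implicit leading 1 to recover the gap."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":            # unary part
            length += 1
            i += 1
        i += 1                            # skip the terminating 0
        gaps.append(int("1" + bits[i:i + length], 2))
        i += length
    postings, doc_id = [], 0
    for g in gaps:                        # postings = running sum of gaps
        doc_id += g
        postings.append(doc_id)
    return gaps, postings

print(gamma_decode("1110001110101011111101101111011"))
# ([9, 6, 3, 59, 7], [9, 15, 18, 77, 84])
```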

__QUESTION NO.6:__

**Exercise 6.19**

Compute the vector space similarity between the query “digital cameras” and the document “digital cameras and video cameras” by filling out the empty columns in Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and document, idf weighting for the query only and cosine normalization for the document only. Treat *and* as a stop word. Enter term counts in the tf columns. What is the final similarity score?

| word | tf (query) | wf (query) | df | idf | q_i | tf (doc) | wf (doc) | d_i | q_i · d_i |
|---|---|---|---|---|---|---|---|---|---|
| digital | | | 10,000 | | | | | | |
| video | | | 100,000 | | | | | | |
| cameras | | | 50,000 | | | | | | |

__SOLUTION:__

| Word | tf (query) | wf (query) | df | idf | q_i = wf × idf | tf (doc) | wf (doc) | d_i (normalized wf) | q_i × d_i |
|---|---|---|---|---|---|---|---|---|---|
| digital | 1 | 1 | 10,000 | 3 | 3 | 1 | 1 | 0.52 | 1.56 |
| video | 0 | 0 | 100,000 | 2 | 0 | 1 | 1 | 0.52 | 0 |
| cameras | 1 | 1 | 50,000 | 2.3 | 2.3 | 2 | 1.3 | 0.68 | 1.56 |

**Similarity score = 1.56+1.56=3.12**
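A sketch reproducing the 3.12 score (names are ours; wf(tf) = 1 + log10 tf for tf > 0, as the exercise specifies):

```python
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
query_tf = {"digital": 1, "video": 0, "cameras": 1}
doc_tf = {"digital": 1, "video": 1, "cameras": 2}   # "and" is a stop word

def wf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

q = {t: wf(query_tf[t]) * math.log10(N / df[t]) for t in df}   # idf on query only
d = {t: wf(doc_tf[t]) for t in df}
norm = math.sqrt(sum(w * w for w in d.values()))               # cosine-normalize doc only
score = sum(q[t] * d[t] / norm for t in df)
print(round(score, 2))   # 3.12
```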

__QUESTION NO.7:__

**Exercise 5.6** Consider the postings list (4, 10, 11, 12, 15, 62, 63, 265, 268, 270, 400) with a corresponding list of gaps (4, 6, 1, 1, 3, 47, 1, 202, 3, 2, 130). Assume that the length of the postings list is stored separately, so the system knows when a postings list is complete. Using variable byte encoding: (i) What is the largest gap you can encode in 1 byte? (ii) What is the largest gap you can encode in 2 bytes? (iii) How many bytes will the above postings list require under this encoding? (Count only space for encoding the sequence of numbers.)

__SOLUTION:__

**(i) What is the largest gap you can encode in 1 byte?**

With 7 payload bits per byte (the eighth bit is the continuation bit), the largest gap is 2^{7} − 1 = 127.

**(ii) What is the largest gap you can encode in 2 bytes?**

In 2 bytes, 2^{14} − 1 = 16,383.

**(iii) How many bytes will the above postings list require under this encoding?**

13 bytes: each of the nine gaps ≤ 127 takes 1 byte, while 202 and 130 each take 2 bytes, giving 9 × 1 + 2 × 2 = 13.
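A quick sketch confirming the byte count (function name is ours):

```python
def vb_bytes(gap):
    """Bytes needed under variable byte encoding: 7 payload bits per byte."""
    n = 1
    while gap >= 128 ** n:     # gap does not fit in 7*n bits
        n += 1
    return n

gaps = [4, 6, 1, 1, 3, 47, 1, 202, 3, 2, 130]
print(sum(vb_bytes(g) for g in gaps))   # 13 (nine 1-byte gaps + two 2-byte gaps)
```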

__QUESTION NO.8:__

The wildcard query sh*pi* is given; show all the steps needed to find the matching words.

__SOLUTION:__

Consider the wildcard query sh*pi*. We are seeking documents containing terms that begin with sh and later contain pi. Accordingly, we run the Boolean query $sh AND pi. This is looked up in the 3-gram index and yields a list of matching terms such as shopping and shipping. Each enumerated term is post-filtered against the original query sh*pi* to remove false positives, and each surviving term is then looked up in the standard inverted index to yield the documents matching the query.

__QUESTION NO.9:__

Exercise 6.11 Can the tf-idf weight of a term in a document exceed 1?

__SOLUTION:__

Yes, the tf-idf weight of a term in a document can exceed 1.

tf-idf is a family of measures for scoring a term with respect to a document. The simplest form of tf(t, d) is the raw number of times t appears in d, which by itself can exceed 1. Including the idf factor, idf_t = log(N/df_t), only pushes the weight higher for rare terms. For example, with tf = 2 and idf = 3 (a term occurring in one document in a thousand, base-10 log), the tf-idf weight is 2 × 3 = 6.

__QUESTION NO.10:__

Exercise 6.10

Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.

| term | Doc1 | Doc2 | Doc3 |
|---|---|---|---|
| car | 27 | 4 | 24 |
| auto | 3 | 33 | 0 |
| insurance | 0 | 33 | 29 |
| best | 14 | 0 | 17 |

Figure 6.9 Table of tf values for Exercise 6.10

| Term | df_{t} | idf_{t} |
|---|---|---|
| car | 18,165 | 1.65 |
| auto | 6,723 | 2.08 |
| insurance | 19,241 | 1.62 |
| best | 25,235 | 1.5 |

Figure 6.8 Example of idf values. Here we give idf’s of terms with various frequencies in the Reuters collection of 806,791 documents.

__SOLUTION:__

tf-idf = tf × idf

| term | Doc1 | Doc2 | Doc3 |
|---|---|---|---|
| car | 44.55 | 6.6 | 39.6 |
| auto | 6.24 | 68.64 | 0 |
| insurance | 0 | 53.46 | 46.98 |
| best | 21 | 0 | 25.5 |
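A small sketch reproducing the weight table (the dictionaries simply transcribe Figures 6.8 and 6.9):

```python
tf = {  # Figure 6.9: term frequencies
    "car":       {"Doc1": 27, "Doc2": 4,  "Doc3": 24},
    "auto":      {"Doc1": 3,  "Doc2": 33, "Doc3": 0},
    "insurance": {"Doc1": 0,  "Doc2": 33, "Doc3": 29},
    "best":      {"Doc1": 14, "Doc2": 0,  "Doc3": 17},
}
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}  # Figure 6.8

tf_idf = {t: {doc: round(f * idf[t], 2) for doc, f in row.items()}
          for t, row in tf.items()}
print(tf_idf["car"])   # {'Doc1': 44.55, 'Doc2': 6.6, 'Doc3': 39.6}
```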

__QUESTION NO.11:__

Exercise 1.11 How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query very expensive? Write out a postings merge algorithm that evaluates this query efficiently.

__SOLUTION:__

- The efficient evaluation takes time O(x + y): instead of collecting documents that occur in both postings lists, the merge collects those that occur in the first list and not in the second (see the sketch after this list).
- Naive evaluation takes time O(N), where N is the total number of documents in the collection, assuming we must materialize a complete list of all documents matching NOT y. This is because the length of that list is bounded only by N, not by the lengths of the postings lists.
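A minimal sketch of the O(x + y) merge for x AND NOT y over sorted docID lists (function name and sample lists are ours):

```python
def and_not(px, py):
    """Walk both sorted postings lists once, keeping docIDs that are in
    px but absent from py -- an O(len(px) + len(py)) merge."""
    result, i, j = [], 0, 0
    while i < len(px):
        if j == len(py) or px[i] < py[j]:
            result.append(px[i])   # px[i] cannot occur later in py
            i += 1
        elif px[i] == py[j]:
            i += 1                 # in both lists: excluded from the answer
            j += 1
        else:
            j += 1                 # py[j] < px[i]: advance in py
    return result

print(and_not([1, 3, 5, 8, 13], [2, 3, 8, 21]))   # [1, 5, 13]
```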

__QUESTION NO.12:__

Exercise 2.3

The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated? Give your reasoning.

- Abandon/abandonment
- Absorbency/absorbent
- Marketing/market
- University/universe
- Volume/volumes

__SOLUTION:__

- Conflate: very similar semantics.
- Conflate: very similar semantics.
- Don’t conflate: marketing is too different from market.
- Don’t conflate: University is too different from universe.
- Conflate: same semantics in most contexts.

__QUESTION NO.13:__

**Exercise 1.9** [⋆⋆]

For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t.

__SOLUTION:__

The order is not guaranteed to be optimal. Consider three terms with postings list sizes s1 = 100, s2 = 105, and s3 = 110. Suppose the intersection of s1 and s2 has length 100 and the intersection of s1 and s3 has length 0. Merging two lists costs roughly the sum of their lengths, so the ordering (s1, s2, s3) requires 100 + 105 + 100 + 110 = 415 steps through the postings lists, whereas the ordering (s1, s3, s2) requires only 100 + 110 + 0 + 0 = 210 steps, since the first intersection is already empty.

__QUESTION NO.14:__

**Exercise 3.6**

Give an example of a sentence that falsely matches the wildcard query mon*h if the search were to simply use a conjunction of bigrams.

__SOLUTION:__

Query: $m AND mo AND on AND h$

E.g., any sentence containing the term *moonfish*, whose bigrams are: $m, mo, oo, on, nf, fi, is, sh, h$

All four query bigrams occur in *moonfish*, yet it does not match mon*h, since it does not begin with mon.

__QUESTION NO.15:__

Exercise 5.13

- Show that the size of the vocabulary is finite according to Zipf’s law and infinite according to Heaps’ law.
- Can we derive Heaps’ law from Zipf’s law?

__SOLUTION:__

- According to Heaps’ law, the vocabulary size M keeps growing with the number of tokens and hence is infinite. According to Zipf’s law, the vocabulary size is a fixed constant: the law assumes the vocabulary is fixed and finite.
- No. Zipf’s law states that the vocabulary is finite, so it would follow that Heaps’ law does not hold.

Note that our definition of Zipf’s law states that the exponent is −1. Sometimes the definition is relaxed to include exponents x < −1; in that case, the vocabulary is infinite.

__QUESTION NO.16:__

Exercise 6.9

What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

__SOLUTION:__

The idf is zero: idf = log(N/N) = log 1 = 0.

Such a term behaves like a stop word: it occurs in every document, so it has no power to discriminate between documents.

Or

It is 0. For a word that occurs in every document, putting it on the stop list has the same effect as idf weighting: the word is ignored.

__QUESTION NO.17:__

Exercise 3.12

For each of the prefixes of the query – catched, catched in and catched in the – we have a number of substitute prefixes arising from each term and its alternatives. Suppose that we were to retain only the top 10 of these substitute prefixes, as measured by their number of occurrences in the collection. We eliminate the rest from consideration for extension to longer prefixes: thus, if batched in is not one of the 10 most common 2-term queries in the collection, we do not consider any extension of batched in as possibly leading to a correction of catched in the rye. How many of the possible substitute prefixes are we eliminating at each phase?

**SOLUTION**

At catched in, we choose 10 out of 6 × 6 = 36 possible prefixes, thus eliminating 26. At catched in the, the 10 survivors extend to 10 × 6 = 60 alternatives, of which we eliminate 50, and at catched in the rye we again keep 10 of the 60 surviving alternatives, eliminating 50.

__QUESTION NO.18:__

**What is the difference between an information retrieval system and a search engine?**

__SOLUTION:__

Information retrieval (IR) is the broader field of research: it encompasses algorithms for ranking, text summarization, text retrieval, and indexing, and is concerned with retrieving relevant pieces of information from very large collections. Searching is one application of IR.

The process of information retrieval starts with the input of a query, followed by processing of the query and matching of the search terms against the information in the database; the matching documents are then ranked by relevance. Performance is evaluated on two measures, precision and recall. Searching is the narrower domain: a search algorithm is any algorithm that locates data within a data structure, using techniques such as linear search, binary search, and hashing. Searching in web search engines is built on information retrieval ranking algorithms.

Or

Information retrieval is an umbrella term for any retrieval mechanism that operates on top of an unstructured data corpus. Information retrieval is therefore the broader notion, while a search engine is one type of information retrieval system that works on web documents gathered by a spider (crawler).

In most cases, people use “information retrieval system” to refer to retrieval from a fixed document corpus such as a digital library, while a search engine retrieves content scattered across the web. A search engine does not generally store the documents themselves; it only keeps an index and a cache for fast retrieval. Most of the retrieval algorithms and tools used by both are similar.

Generally, information retrieval is a broad discipline for retrieving content from unstructured corpora such as digital libraries, the web, news feeds, and social network posts; a search engine is tailored to web content.

__QUESTION NO.20:__

Exercise 3.7

If |s_{i}| denotes the length of string s_{i}, show that the edit distance between s_{1} and s_{2} is never more than max{|s_{1}|, |s_{2}|}.

**SOLUTION**

Consider two character strings s1 and s2. Without any loss of generality, assume that l1 = length(s1) ≤ l2 = length(s2). To transform s1 into s2, we can replace each character of s1 by the corresponding character among the first l1 characters of s2, and then insert the remaining l2 − l1 characters of s2 at the end of s1. The number of operations incurred is l2 = max(l1, l2).

**Note:** The edit distance problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation.
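A standard dynamic-programming sketch that computes the distance (this is the classic Levenshtein recurrence; names are ours, and the exercise itself only asks for the bound):

```python
def edit_distance(s1, s2):
    """Edit distance via a (len(s1)+1) x (len(s2)+1) DP table."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                                # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

print(edit_distance("cat", "catch"))   # 2, never more than max(3, 5) = 5
```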

__QUESTION NO.21:__

Exercise 2.1

Are the following statements true or false?

- In a Boolean retrieval system, stemming never lowers precision.
- In a Boolean retrieval system, stemming never lowers recall.
- Stemming increases the size of the vocabulary.
- Stemming should be invoked at indexing time but not while processing a query.

__Solution:__

- False. Stemming can conflate words with different meanings, retrieving nonrelevant documents and lowering precision.
- True. Stemming can only enlarge the set of documents a query term matches, so recall never drops.
- False. Stemming reduces the size of the vocabulary by mapping several word forms to a single stem.
- False. Queries must be stemmed as well; otherwise query terms would not match the stemmed terms in the index.

__QUESTION No.22:__

**Consider a query and a document collection consisting of three documents. Rank the documents using the vector space model. Assume a tf-idf weighting scheme.**

**Query: “gold silver truck”**

**Document collection:**

**d1: “Shipment of gold arrived in a truck.”**

**d2: “Shipment of gold damaged in a fire.”**

**d3: “Delivery of silver arrived in a silver truck.”**

__SOLUTION:__

| Term | tf (Q) | tf (D1) | tf (D2) | tf (D3) | df | idf = log(N/df) | tf-idf (Q) | tf-idf (D1) | tf-idf (D2) | tf-idf (D3) |
|---|---|---|---|---|---|---|---|---|---|---|
| a | 0 | 1 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| arrived | 0 | 1 | 0 | 1 | 2 | 0.18 | 0 | 0.18 | 0 | 0.18 |
| damaged | 0 | 0 | 1 | 0 | 1 | 0.48 | 0 | 0 | 0.48 | 0 |
| delivery | 0 | 0 | 0 | 1 | 1 | 0.48 | 0 | 0 | 0 | 0.48 |
| fire | 0 | 0 | 1 | 0 | 1 | 0.48 | 0 | 0 | 0.48 | 0 |
| gold | 1 | 1 | 1 | 0 | 2 | 0.18 | 0.18 | 0.18 | 0.18 | 0 |
| in | 0 | 1 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| of | 0 | 1 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| shipment | 0 | 1 | 1 | 0 | 2 | 0.18 | 0 | 0.18 | 0.18 | 0 |
| silver | 1 | 0 | 0 | 2 | 1 | 0.48 | 0.48 | 0 | 0 | 0.96 |
| truck | 1 | 1 | 0 | 1 | 2 | 0.18 | 0.18 | 0.18 | 0 | 0.18 |

**S(Q,D1) = (Q · D1) / (|Q| · |D1|) = 0.33**

**S(Q,D2) = 0.08**

**S(Q,D3) = 0.82**

**Ranking: D3, D1, D2**
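A sketch verifying the three scores from the nonzero tf-idf weights in the table (the dictionaries transcribe the table; names are ours):

```python
import math

q  = {"gold": 0.18, "silver": 0.48, "truck": 0.18}
d1 = {"arrived": 0.18, "gold": 0.18, "shipment": 0.18, "truck": 0.18}
d2 = {"damaged": 0.48, "fire": 0.48, "gold": 0.18, "shipment": 0.18}
d3 = {"arrived": 0.18, "delivery": 0.48, "silver": 0.96, "truck": 0.18}

def sim(q, d):
    """Cosine similarity over sparse term-weight dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd)

for name, d in [("D1", d1), ("D2", d2), ("D3", d3)]:
    print(name, round(sim(q, d), 2))   # D1 0.33, D2 0.08, D3 0.82
```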

__QUESTION NO.23__

Exercise 5.16

Give some examples of terms that violate the assumption that gaps all have the same size (which we made when estimating the space requirements of a *γ*-encoded index). What are general characteristics of these terms?

**SOLUTION**

For a news collection like Reuters-RCV1, examples of such terms are tennis, football, music, etc. Since a news collection is gathered over a period of time, a word like tennis occurs in bursts: it appears heavily in documents collected during a tournament and hardly at all at other times, so its gaps are small within a burst and very large between bursts rather than uniform. The general characteristics of these terms depend on the type of collection, so it is difficult to formulate a common property, but they tend to be topical, time-dependent terms, and can also include terms that are rare in the collection.

__QUESTION NO.24__

Exercise 5.17

Consider a term whose postings list has size *n*, say, *n* = 10,000. Compare the size of the *γ*-compressed gap-encoded postings list if the distribution of the term is uniform (i.e., all gaps have the same size) versus its size when the distribution is not uniform. Which compressed postings list is smaller?

**SOLUTION**

Encoding the uniform gap list 13, 13 needs 14 bits: γ(13) = 1110 101, i.e., 7 bits per gap.

Encoding the non-uniform gap list 2, 24 needs 12 bits: γ(2) = 10 0 (3 bits) and γ(24) = 11110 1000 (9 bits). Both lists cover the same total span: 13 + 13 = 2 + 24 = 26.

In this case the non-uniform list is smaller, because γ encodes small gaps very efficiently.
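A one-line check of the bit counts, using the fact that γ(G) occupies 2⌊log_{2} G⌋ + 1 bits (unary length, terminating 0, and offset):

```python
import math

def gamma_bits(gap):
    """Bits in the gamma code of a gap: 2 * floor(log2(gap)) + 1."""
    return 2 * int(math.log2(gap)) + 1

print(gamma_bits(13) + gamma_bits(13))   # 14 bits for the uniform list 13, 13
print(gamma_bits(2) + gamma_bits(24))    # 12 bits for the non-uniform list 2, 24
```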

**QUESTION NO.25**

Exercise 6.14

If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.

**SOLUTION**

If jealousy and jealous are stemmed to a common term t, then their tf’s and their df’s would be added together in computing the tf-idf weights.

**QUESTION NO.26**

**Exercise 3.3**

**If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would you do the lookup on?**

__SOLUTION:__

ng$s*
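A small sketch of the permuterm rotation behind this answer (function names are ours):

```python
def permuterm_keys(term):
    """All rotations of term$ stored in the permuterm index for `term`."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_key(query):
    """For a single-* wildcard, rotate query$ so the * lands at the end."""
    t = query + "$"
    star = t.index("*")
    return t[star + 1:] + t[:star] + "*"

print(wildcard_key("s*ng"))    # ng$s* -- prefix-matches rotation ng$si of sing$
print(permuterm_keys("sing"))  # ['sing$', 'ing$s', 'ng$si', 'g$sin', '$sing']
```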

**Question No.27**

**Exercise 3.11**

Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?

__SOLUTION:__

**6 × 6 × 6 × 6 = 1296**