Information Retrieval : Chapter 2

Harsimar Singh
5 min read · Aug 29, 2024

--

In this post we will talk about the term vocabulary and postings lists. In the previous post, on Chapter 1, I wrote notes about the very fundamentals of information retrieval. Here we will go through some techniques that levelled up the game.

The first idea of discussion revolves around document structure. A document can be thought of as a piece of information stored as bits and bytes and "decoded" in a defined manner so that the output is human-readable, interpretable language. Some common formats are MS Word, XML, PDF, etc. Many of these formats are vendor specific: the content is encoded according to a set of rules and later decoded. At the character level, one widely used encoding is UTF-8, which is a variable-length encoding (a topic for a separate discussion).

https://en.wikipedia.org/wiki/UTF-8
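To see what "variable length" means in practice, here is a tiny Python sketch (my own illustration, using only the standard library): different characters occupy different numbers of bytes under UTF-8, and decoding turns the bytes back into text.

```python
# A minimal sketch of UTF-8's variable-length property: each character
# may occupy anywhere from 1 to 4 bytes.
for ch in ["a", "é", "€", "𝄞"]:
    encoded = ch.encode("utf-8")      # str -> bytes under UTF-8
    print(ch, len(encoded), encoded)  # 'a' is 1 byte, '€' is 3 bytes, etc.

# Decoding reverses the process: raw bytes -> human-readable text.
raw = "Computerlinguistik".encode("utf-8")
print(raw.decode("utf-8"))
```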

Once we have decoded the incoming data stream, the next step is to chunk up the document. We have seen the inverted index working over single words, but a phrase like United States of America represents a single entity. We also need the concept of indexing granularity: do you want an index over whole documents, or an index over pages, which can be considered as mini documents? Later on we will see that this is a knob the user can control based on the performance trade-offs it creates.

The focus from here onwards is on determining the vocabulary. We will start with the token.

Token — an instance of a sequence of characters in a document that are grouped together as a useful semantic unit for processing.

Type — the class of all tokens containing the same character sequence.

Term — a (possibly normalized) type that is included in the IR system's dictionary.
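To make the distinction concrete, here is a tiny sketch on the book's example sentence (the one-word stop list is an assumption for illustration):

```python
# Tokens, types and terms on a toy sentence.
sentence = "to sleep perchance to dream"

tokens = sentence.split()           # every occurrence counts: 5 tokens
types = set(tokens)                 # distinct character sequences: 4 types
stop_words = {"to"}                 # assume a tiny stop list for illustration
terms = types - stop_words          # types kept in the dictionary: 3 terms

print(len(tokens), len(types), len(terms))  # 5 4 3
```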

The big question now is: how do we tokenize the document's text? Some candidate approaches (a small tokenizer sketch follows this list):

Whitespace — trivial, but a big no [ it would chop up United States of America ].

Remove punctuation — works, but not for words like aren't [ stripping the apostrophe leaves arent or aren t ].

Based on language — language identification itself needs a layer on top of the tokenization algorithm [ e.g. how should we tokenize the IP address 192.168.1.1? ].

Hyphens and special characters — it is very difficult to decide whether certain character sequences form one single entity or not [ e.g. a phone number like (800) 234–2333 is hard to handle with simple rules ].

Compound words — German compound nouns are written as a single word, e.g. Computerlinguistik. Here we need a compound splitter that breaks the compound into parts that each exist in the dictionary. Segmentation does work, but it doesn't guarantee a consistent, unique tokenization.
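To make the trade-offs above concrete, here is a rough regex-based tokenizer sketch (my own illustration, not an algorithm from the book); the patterns for IP addresses, phone numbers and contractions are assumptions that would need tuning for a real collection.

```python
import re

# A rough tokenizer sketch: plain whitespace splitting would mishandle
# "aren't", "192.168.1.1" and "(800) 234-2333", so we add a few patterns.
TOKEN_PATTERN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"      # IP addresses such as 192.168.1.1
    r"|\(\d{3}\)\s*\d{3}-\d{4}"     # US-style phone numbers such as (800) 234-2333
    r"|\w+(?:'\w+)?"                # ordinary words, keeping contractions like aren't
)

def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Aren't the servers at 192.168.1.1 up? Call (800) 234-2333."))
```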

Stop words processing — stop words don't add much value to the index because they appear in almost every document. The most common way to remove them is to sort terms by collection frequency and drop the most frequent ones using a threshold. Thus we save on space.
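A minimal sketch of the frequency-threshold idea (the toy documents and the cut-off are made up for illustration):

```python
from collections import Counter

# Toy collection; in practice this would be the whole corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog",
]

# Collection frequency of every token.
freq = Counter(tok for doc in docs for tok in doc.split())

# Treat the most frequent tokens as stop words (threshold chosen arbitrarily).
THRESHOLD = 4
stop_words = {tok for tok, count in freq.items() if count >= THRESHOLD}
print(stop_words)  # {'the'} for this toy data
```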

Normalization — normalization is the process of converting a word into its canonical form. E.g. Words, words and word will all map to word, its root form. Hence we use the term equivalence class to bind them into a single entity. Normalization can be done on the basis of the following (a small code sketch follows this list):

  1. Accents and diacritics — the system should match cliché and cliche. In English this has quite low priority.
  2. Capitalization/case-folding — the simplest step is to map all uppercase letters to lowercase. But we have to be careful with proper nouns. A cleverer heuristic is to lowercase only the first word of a sentence or query and keep mid-sentence capitalised words as they are. When this is done through machine learning, it is called truecasing.
  3. Some rare cases — there are rarer cases like idioms and phrases, and non-English content on the internet with foreign and complex words.
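A small sketch of accent removal and naive case-folding using only the Python standard library (a real truecaser would be smarter about proper nouns; lowercasing everything is the assumption here):

```python
import unicodedata

def normalize(term: str) -> str:
    # Strip accents/diacritics: cliché -> cliche.
    decomposed = unicodedata.normalize("NFKD", term)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Naive case-folding: lowercase everything.
    return without_accents.lower()

# All of these land in the same equivalence class.
print({normalize(w) for w in ["cliché", "Cliche", "CLICHÉ"]})  # {'cliche'}
```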

Stemming and lemmatisation — we want to group words that are derivationally related, like organize, organizes and organizing. The aim of stemming and lemmatisation is to let a search for any of them match all three, since they carry similar meaning. The difference is that stemming simply chops off word endings in the hope of achieving this goal, e.g. organizes - s = organize, whereas lemmatisation uses a vocabulary and does morphological analysis of the word before reducing it to its base form. The Porter stemmer is one such algorithm: it applies suffix-rewriting rules gated by a measure of the word's length. The effect of stemming is that it increases recall but hurts precision.
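The real Porter stemmer runs several passes of suffix rules gated by a measure of the stem; the snippet below is only a toy suffix stripper I wrote to show the flavour of stemming, not the actual algorithm (in practice you would use a library implementation such as NLTK's PorterStemmer).

```python
# A toy suffix stripper: NOT the Porter algorithm, just the general idea of
# chopping common endings and hoping the remaining prefix conflates related words.
SUFFIXES = ["izing", "izes", "ized", "ize", "ings", "ing", "es", "ed", "s"]

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["organize", "organizes", "organizing"]:
    print(w, "->", crude_stem(w))  # all three reduce to "organ"
```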

There are trade-offs in every normalization step, and which one is best is again up to the data analyst, depending on which gives the best performance for the use case.

After the above discussion the authors come back to the postings lists (the linked lists) we saw in Chapter 1. A minor modification is to add skip pointers to the plain linear linked list to decrease merge time. This hopping mechanism ensures that we use fewer comparisons on average and improve on the linear scan. E.g. in the book's Brutus and Caesar example, after merging both lists up to 8, the Brutus pointer can be advanced to 28 without comparing against 19 and 23: standing at 16 we see two pathways, 28 (the skip pointer) and 19, and since 28 < 41, the next item on the Caesar list, we can safely take the skip and jump over those two numbers.

Skip pointers help AND queries but not OR queries. There is also a trade-off in where and how many skip pointers to place: for a postings list of length P, a common heuristic is to use √P evenly spaced skip pointers.
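Here is a sketch of the intersection-with-skips idea in Python (the book gives it as pseudocode; this rendering and the √P skip spacing are my own):

```python
import math

def intersect_with_skips(p1: list[int], p2: list[int]) -> list[int]:
    # Intersect two sorted postings lists of docIDs. Skip pointers are placed
    # every ~sqrt(P) positions and are followed only when they don't overshoot.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Take the skip on list 1 if we are at a skip position and the
            # skip target is still <= the current docID on list 2.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# Toy Brutus/Caesar lists in the spirit of the book's figure.
brutus = [2, 4, 8, 16, 19, 23, 28, 43]
caesar = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(brutus, caesar))  # [2, 8]
```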

Positional postings and phrase queries — putting double quotes around a phrase states that results should match the phrase exactly as quoted. E.g. "Stanford University" is a two-word phrase, so we need algorithms that hit this phrase and return results accordingly, not Stanford or University as individual terms.

One way is to create a biword index, where each pair of consecutive words becomes a vocabulary term of its own; extending this to variable-length word sequences gives a phrase index. Our indexing then involves both single words and biwords. Answering longer phrases this way has a high chance of false positives, and another downside is the much larger vocabulary.
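A minimal sketch of building a biword index (the dictionary layout, term -> set of docIDs, and the toy documents are my own assumptions):

```python
from collections import defaultdict

def build_biword_index(docs: dict[int, str]) -> dict[str, set[int]]:
    # Every pair of consecutive tokens becomes a vocabulary term of its own.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

docs = {1: "Stanford University is in California", 2: "the university of California"}
index = build_biword_index(docs)
print(index["stanford university"])  # {1}: the phrase hits doc 1 only
```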

Positional indexes — here the postings list stores, for each document, the positions at which the given word occurs. E.g. the word to occurs in docID 1 six times, at positions 7, 18, 33, 72, 86, 231. This makes it possible to locate words in each other's vicinity while doing the merge for AND queries. More sophisticated queries like k-word proximity search ask whether, for two given words, |pos(word1) - pos(word2)| ≤ k, i.e. within k positions on either side. The downside? More comparisons during the merge operation.
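A rough sketch of the k-word proximity check on top of a positional index (the layout term -> {docID: positions} is an assumption; the book's positional-intersect pseudocode does the same walk more efficiently by exploiting the sorted position lists):

```python
def proximity_match(pos_index, word1, word2, k):
    # Return docIDs where word1 and word2 occur within k positions of each
    # other, on either side: |pos(word1) - pos(word2)| <= k.
    postings1 = pos_index.get(word1, {})
    postings2 = pos_index.get(word2, {})
    hits = set()
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(p1 - p2) <= k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.add(doc_id)
    return hits

# Toy positional index: term -> {docID: positions of occurrence}.
pos_index = {
    "to": {1: [7, 18, 33, 72, 86, 231]},
    "be": {1: [17, 25]},
}
print(proximity_match(pos_index, "to", "be", k=1))  # {1}: positions 17 and 18 are adjacent
```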

Combination schemes — the best of both worlds is to combine a positional index with a biword index. For some popular biwords we save only the docIDs during indexing, while the positional index handles the rest; the most expensive phrase queries tend to be those whose individual words are pretty common but whose combination is rare.

So in this chapter we took a dive into vocabulary building, saw how the vocabulary is built from scratch and what indexes are created on top of it!
