Information Retrieval : Chapter 1
I didn’t think of this, but this started as a series of short notes on Information Retrieval by Prof. Christopher D. Manning, Prabhakar Raghavan, and Dr. Hinrich Schütze. I am hoping to get some insights, as I am looking forward to enjoying this. The scope of this article is to put out short notes to help readers understand various aspects of the book. Please buy and read the book and support honest tech work. Short notes:
Let’s talk about data. Most of the data in modern times is unstructured for the given problem. E.g. Getting information from email and deciding if it’s a spam or not, requires machine enabled information retrieval (IR) and analysis to take an automated action. Hence, emails are not structured.
IR is like asking a query on a piece of information. Either we need some heuristics to answer the query or we have to find a method and apply it on the chunk of data to find the right answers. The naive way is to look at each sentence to identify it’s presence in the query or not. So we need —
- Parallel processing or high processing rate for the documents (book docs, web docs, pdfs, etc.)
- Ability to apply operations flexibly on the given dataset
- Retrieval of queries based on the rank or quality of data or some way to identify a good chunk of info.
If you use Linux based systems do —
python -m pip freeze | grep requests
Output will be —
requests==2.31.0
requests-toolbelt==1.0.0py
This is very basic form of IR where we search for a python package of name requests. Grep searches through text, read line by line and if it gets similar match, returns result. ( Do experiment with wild cards in search queries)
We will in brief cover some primitive techniques —
Boolean search — Here the system creates a table ( matrix ) with rows as search terms and columns as text doc. It will be table of 0’s and 1’s.
Here query is — Antony and Brutus will be AND operation between the boolean values of both. As we can see, sparse matrix is a problem in scale since it is around 99% 0s and also fail for variable context queries.
Inverted Index — In undergrad, we studied array is of contiguous memory and there were lot of memory potholes when allocated a fixed memory. So came up with linked list using pointers. Same thing is called as Inverted Index. It is like created a list of documents where the corresponding word is present.
Processing the queries in lists includes merging the lists (intersection of list or common occuring documents) when an AND operation is performed. For the OR boolean operation, we have to add both the lists. Both these operations are performed on increasing order of lists. Since we have to add or intersect the lists and store intermediate lists, we can add a layer of query optimisation. For example, use heuristics to store size of list and perform operation on the increasing order of list size. We can claim that this method is linearly dependant on the number of documents.
In conclusion, the methods were used primitively in beginning of unix systems, dot net era and earlier document search engines before the advent of Google. We will continue studying the next chapter for more advanced stuff.