Divide and Conquer Strategies for Effective Information Retrieval
Authors: Jie Chen and Yousef Saad
URL: http://www.siam.org/proceedings/datamining/2009/dm09_043_chenj.pdf
Abstract:
The standard application of Latent Semantic Indexing (LSI), a well-known technique for information retrieval, requires the computation of a partial Singular Value Decomposition (SVD) of the term-document matrix. This computation becomes infeasible for large document collections, since it is very demanding both in terms of arithmetic operations and in memory requirements. This paper discusses two divide and conquer strategies, with the goal of alleviating these difficulties. Both strategies divide the document collection in subsets, perform relevance analysis on each subset, and conquer the analysis results to form the query response. Since each sub-problem resulting from the recursive division process has a smaller size, the processing of large scale document collections requires much fewer resources. In addition, the computation is highly parallel and can be easily adapted to a parallel computing environment. To reduce the computational cost, we perform the analysis on the subsets by using the Lanczos vectors instead of singular vectors as in the standard LSI method. This technique is far more efficient than the computation of the truncated SVD, while its accuracy is comparable. Experimental results confirm that the proposed divide and conquer strategies are effective for information retrieval problems.
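To make the "standard LSI" step in the abstract more concrete, here is a minimal sketch in Python with NumPy. It is only an illustration, not the authors' code: the function name lsi_scores, the tiny five-term matrix, and the choice of k are all my own assumptions, and it computes a full SVD before truncating, which is exactly the expensive step the paper tries to avoid on large collections.

```python
import numpy as np

def lsi_scores(A, q, k):
    """Cosine similarity between a query q and each document (column of A)
    after replacing A by its rank-k truncated SVD approximation."""
    # Thin SVD, then keep only the k leading singular triplets.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ (s[:k, None] * Vt[:k, :])
    # Cosine of the angle between q and each column of the rank-k matrix.
    denom = np.linalg.norm(q) * np.linalg.norm(Ak, axis=0)
    return (q @ Ak) / np.where(denom > 0, denom, 1.0)

# Hypothetical toy collection. Rows are the terms
# car, automobile, engine, flower, petal; columns are documents d0..d3.
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)
q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "car"

scores = lsi_scores(A, q, k=2)
print(scores)
```

In this toy example document d1 never uses the word "car", so a plain word match would score it zero. After the projection it receives the same positive score as d0, because both documents share the term "engine". This is the kind of word usage problem LSI is meant to handle.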
Three (3) things I learned from my reading:
1. I learned that Latent Semantic Indexing (LSI) is a well-known method that was developed to deal with common problems of word usage, such as synonymy and polysemy. LSI projects the original term-document matrix into a reduced-rank subspace by means of the Singular Value Decomposition (SVD). The query is then compared with the documents in this subspace, which produces a more meaningful result.
2. The authors propose two divide and conquer strategies for effective information retrieval. Divide and conquer belongs to the family of multilevel approaches, a paradigm known for its effectiveness in solving very large scale scientific problems. It lessens the computational burden because its primary goal is to reduce cost at the expense of a minimal loss in accuracy.
3. The divide and conquer strategies were proposed to retrieve relevant documents in text mining problems, along with efficient techniques that serve as alternatives to the classical truncated Singular Value Decomposition (SVD) approach of Latent Semantic Indexing (LSI).
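The divide, analyze, and conquer idea from the three points above can be sketched roughly as follows. This is my own simplified illustration under several assumptions: the documents are split into contiguous blocks of columns rather than with the partitioning schemes from the paper, each block is analyzed with a small truncated SVD instead of the Lanczos vectors the authors use, and all names (subset_scores, divide_and_conquer_scores, top_documents) and the toy matrix are hypothetical.

```python
import numpy as np

def subset_scores(A_sub, q, k):
    """Rank-k LSI cosine scores for one subset of documents (columns)."""
    U, s, Vt = np.linalg.svd(A_sub, full_matrices=False)
    k = min(k, len(s))
    Ak = U[:, :k] @ (s[:k, None] * Vt[:k, :])
    denom = np.linalg.norm(q) * np.linalg.norm(Ak, axis=0)
    return (q @ Ak) / np.where(denom > 0, denom, 1.0)

def divide_and_conquer_scores(A, q, k, n_subsets):
    """Divide the documents into subsets, score each subset on its own,
    then put the scores back into one array for the whole collection."""
    n_docs = A.shape[1]
    scores = np.zeros(n_docs)
    # Divide: contiguous blocks of columns (an assumption for illustration).
    for idx in np.array_split(np.arange(n_docs), n_subsets):
        if idx.size == 0:
            continue
        # Each subset is independent, so this loop could run in parallel.
        scores[idx] = subset_scores(A[:, idx], q, k)
    return scores

def top_documents(scores, n):
    """Conquer: rank all documents by their merged scores, best first."""
    return [int(i) for i in np.argsort(-scores)[:n]]

# Hypothetical toy collection. Rows are the terms
# car, automobile, engine, flower, petal; columns are documents d0..d3.
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)
q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "car"

scores = divide_and_conquer_scores(A, q, k=1, n_subsets=2)
print(top_documents(scores, 2))
```

Each call to subset_scores only sees a smaller matrix, which is where the savings in computation and memory come from. How the subsets are chosen affects the accuracy, so the contiguous split here should be read as a placeholder only.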
Application / implication of new things I learned to my work or to me as a person.
Actually, I am not that familiar with some of the words and terms used in information science, which is why I have difficulty understanding the articles I have read. But after reading the article “Divide and Conquer Strategies for Effective Information Retrieval”, I realized that I need to be patient when reading such articles, even though it is really hard for me to understand the unfamiliar terms.
We always say that it is easier to get information using the internet, but we don’t really know how that information is stored and retrieved. After reading this article, I have at least a little idea of how it works.
Next time, you may choose an article that is less technical. I want you to enjoy what you are reading for this class.