Information Retrieval and Search Engines (B-KUL-H02C8A)

4 ECTS | English | 30 | Second term | Cannot be taken as part of an examination contract
POC Artificial Intelligence

The aim of the course is to study the techniques and algorithms currently used in information retrieval, Web search and Web mining, and the challenges of these fields. The theoretical insights are the basis for discussions of commercial systems and ongoing research projects. After completing this course, the student should be able to: 1) describe and understand fundamental concepts and algorithms in information retrieval, Web search and Web mining; 2) design and evaluate an information retrieval system.

The exercise sessions give students the opportunity to gain an in-depth understanding of the algorithms discussed during the lectures.

The course is aimed at students who are interested in the theory and applications of the processing, storage and retrieval of information. Elementary knowledge of statistics, probability theory and linear algebra is required. Familiarity with machine learning methods is recommended.

Activities

3 ECTS. Information Retrieval and Search Engines: Lecture (B-KUL-H02C8a)

3 ECTS | English | Format: Lecture | 20 | Second term
POC Artificial Intelligence

The motivation for the course lies in the urgent need for computer programs that assist people in digesting masses of unstructured information composed of text and other media. We need information retrieval technology when, for instance, we search for information on the World Wide Web, in repositories of news and blogs, in biomedical document bases, or in governmental and company archives. Moreover, emails, tweets, other messages and advertisements are searched and filtered. Various techniques of content recognition, recommendation and linking play an increasing role and make it possible to generate content models of documents or messages that effectively match the personalized information needs of users. There is current interest in capturing dynamic changes in the data and in modeling dynamic interactions with users. The proliferation of wireless and mobile devices such as mobile phones has additionally created a demand for effective and robust techniques to index, retrieve and summarize information.

 
The lectures treat the following topics:
 

1. Introduction
 

2. Advanced representations

Zipf's law

Matrix factorization, latent semantic analysis (LSA), training with singular value decomposition

Probabilistic latent semantic analysis (pLSA), latent Dirichlet allocation (LDA), training with Expectation Maximization (EM) algorithms, Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling, and with variational inference

Embeddings obtained with neural networks

 

3. Retrieval and search models

 

Algebraic models: vector space models

Probabilistic models: language retrieval models and Bayesian networks

Neural network models

 

4. Learning to rank

 

Relevance feedback, personalized and contextualized information needs, user profiling

Pointwise, pairwise and listwise approaches

Structured output support vector machines, loss functions, most violated constraints

End-to-end neural network models

Optimization of retrieval effectiveness and of diversity of search results

 

5. Dynamic retrieval and recommendation

 

Static versus dynamic models

Markov decision processes

Multi-armed bandit models

Modelling sessions

Online advertising

 

6. Multimedia information retrieval

 

Multimedia data types and features

Concept detection

Cross-modal indexing of content: latent Dirichlet allocation and deep learning methods

Cross-modal and multimodal retrieval and recommendation models

Illustrations with spoken document, image, video and music search

 

7. Web search

 

Web search engines, crawler-indexer architecture, query processing

Link analysis retrieval models: PageRank, HITS, personalized PageRank and variants

Behavior- and credibility-based retrieval models

Social search, mining and searching user-generated content

 

8. Scalability of Web search 

 

Data structures and search techniques

Inverted files, nextword indices, taxonomy indices, distributed indices

Compression

Learning of hashing functions, cross-modal hashing

Scalability and efficiency challenges

Architectural optimizations

 

9. Clustering

 

Distance and similarity functions in Euclidean and hyperbolic spaces, proximity functions

Sequential and hierarchical clustering algorithms, algorithms based on cost-function optimization, number of clusters

Term clustering for query expansion, document clustering, multiview clustering

 

10. Categorization

 

Feature selection, naive Bayes model, support vector machines, (approximate) k-nearest neighbor models

Deep learning methods

Multilabel and hierarchical categorization

Convolutional neural network (CNN) based hierarchical categorization

 

11. Summarization

 

Document segmentation, maximum marginal relevance

Summarization based on latent Dirichlet allocation models and long short-term memory (LSTM) networks

Abstractive summarization with attention models

Multidocument summarization, search results fusion and visualization 

 

12. Question answering and conversational agents in search and recommendation

 

Retrieval-based question answering

Deep learning methods including attention models

Cross-modal question answering

E-commerce search and recommendation

 

13. Evaluation measures and methodology

 

Recall, precision, F-measure, mean average precision, discounted cumulative gain, mean reciprocal answer rank, accuracy, confusion matrix, ROC curve, normalized mutual information, mean absolute error, root mean square error, pyramid method, inter-annotator agreement, test collections

 

14. Discussion of interesting research projects


15. Invited lecture by a representative of an important company

 

  • 2006-2007: Thomas Hofmann, Director of Engineering, Google Zurich European Engineering Centre, Switzerland
  • 2007-2008: Ronny Lempel, director of Yahoo! Research, Israel
  • 2008-2009: Stephen Robertson, senior researcher at Microsoft Research Cambridge, UK and one of the founders of probabilistic modeling in information retrieval
  • 2009-2010: Gregory Grefenstette, Chief Science Officer, Exalead, France
  • 2010-2011: Mounia Lalmas, visiting senior researcher at Yahoo! Labs Barcelona, Spain
  • 2011-2012: Jakub Zavrel, CEO and founder of TextKernel, The Netherlands
  • 2012-2013: Massimiliano Ciaramita, senior research scientist at Google, Zürich, Switzerland
  • 2013-2014: Alex Graves, senior research scientist at Google DeepMind, London, UK
  • 2014-2015: Fabrizio Silvestri, Senior Scientist at Yahoo Labs, Barcelona
  • 2015-2016: Roi Blanco, Senior Scientist at Yahoo Labs, London
  • 2016-2017: Holger Schwenk, research scientist at Facebook AI Research, France, and Dani Yogatama, research scientist at Google DeepMind, London, UK
  • 2017-2018: Enrique Alfonseca, research tech leader at Google AI, Zurich
  • 2020-2021: Florian Strub, senior researcher at Google DeepMind
  • 2021-2022: Rylan Conway, applied scientist at Amazon Seattle

Course material is available on the Toledo platform of KU Leuven. The following books offer background to the course material:
Baeza-Yates, R. & Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology behind Search (2nd edition). Harlow, UK: Pearson.
Büttcher, S., Clarke, C.L.A. & Cormack, G.V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, MA: MIT Press. 
Manning, C.D., Raghavan, P. & Schütze, H. (2009). Introduction to Information Retrieval. Cambridge University Press.
Moens, M.-F. (2006). Information Extraction: Algorithms and Prospects in a Retrieval Context (International Series on Information Retrieval 21). Berlin: Springer.

 

Interactive lectures.

1 ECTS. Information Retrieval and Search Engines: Exercises (B-KUL-H00G9a)

1 ECTS | English | Format: Practical | 10 | Second term
POC Artificial Intelligence

  • Exercise session on latent semantic models, probabilistic and vector models
  • Exercise session on learning to rank
  • Exercise session on dynamic retrieval
  • Exercise session on compression
  • Exercise session on categorization and clustering
  • Exercise session on link-based and multimodal models

Exercises and answers are available via the Toledo platform. 

Interactive exercise sessions in small groups.

Evaluation

Evaluation: Information Retrieval and Search Engines (B-KUL-H22C8a)

Type: Exam during the examination period
Description of evaluation: Written
Type of questions: Open questions, Closed questions
Learning material: Calculator, Course material


Theory exam (grading: 50 %): Written, open book.

Exercise exam (grading: 50 %): Written, open book.