Information Retrieval and Search Engines (B-KUL-H02C8A)
Aims
The course studies the techniques and algorithms commonly used in information retrieval, Web search and Web mining, and the challenges these fields face. The theoretical insights form the basis for discussions of commercial systems and ongoing research projects. After completing the course, the student should be able to 1) describe and understand the fundamental concepts and algorithms in information retrieval, Web search and Web mining; and 2) design and evaluate an information retrieval system.
The exercise sessions offer the opportunity to gain an in-depth understanding of the algorithms discussed in the lectures.
Previous knowledge
The course is aimed at students interested in the theory and applications of the processing, storage and retrieval of information. Elementary knowledge of statistics, probability theory and linear algebra is required. Familiarity with machine learning methods is recommended.
Is included in these courses of study
- Master in de toegepaste informatica (programme for students started before 2024-2025) (Leuven) (Artificial Intelligence) 60 ects.
- Master handelsingenieur in de beleidsinformatica (Leuven) 120 ects.
- Master handelsingenieur in de beleidsinformatica (Leuven) (Minor: Data science) 120 ects.
- Master of Artificial Intelligence (Leuven) (Specialisation: Big Data Analytics (BDA)) 60 ects.
- Master of Artificial Intelligence (Leuven) (Specialisation: Engineering and Computer Science (ECS)) 60 ects.
- Master of Bioinformatics (Leuven) (Bioscience Engineering) 120 ects.
- Master of Bioinformatics (Leuven) (Engineering) 120 ects.
- Master of Information Management (Leuven) 60 ects.
- Master of Biomedical Engineering (Programme for students started before 2021-2022) (Leuven) 120 ects.
- Master of Business and Information Systems Engineering (Leuven) 120 ects.
- Master of Business and Information Systems Engineering (Leuven) (Minor: Data Science) 120 ects.
Activities
3 ects. Information Retrieval and Search Engines: Lecture (B-KUL-H02C8a)
Content
The motivation for the course lies in the urgent need for computer programs that help people digest masses of unstructured information composed of text and other media. We need information retrieval technology when, for instance, we search for information on the World Wide Web, in repositories of news and blogs, in biomedical document bases, or in governmental and company archives. Emails, tweets, other messages and advertisements are likewise searched and filtered. Various techniques of content recognition, recommendation and linking play an increasing role; they allow generating content models of documents or messages that effectively match the personalized information needs of users. There is currently interest in capturing dynamic changes in the data and in modeling dynamic interactions with users. The proliferation of wireless and mobile devices such as mobile phones has additionally created a demand for effective and robust techniques to index, retrieve and summarize information.
The lectures treat the following topics:
1. Introduction
2. Advanced representations
- Zipf's law
- Matrix factorization, latent semantic analysis (LSA), training with singular value decomposition
- Probabilistic latent semantic analysis (pLSA), latent Dirichlet allocation (LDA), training with Expectation Maximization (EM) algorithms, Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling, and with variational inference
- Embeddings obtained with neural networks
3. Retrieval and search models
- Algebraic models: vector space models
- Probabilistic models: language retrieval models and Bayesian networks
- Neural network models
4. Learning to rank
- Relevance feedback, personalized and contextualized information needs, user profiling
- Pointwise, pairwise and listwise approaches
- Structured output support vector machines, loss functions, most violated constraints
- End-to-end neural network models
- Optimization of retrieval effectiveness and of diversity of search results
5. Dynamic retrieval and recommendation
- Static versus dynamic models
- Markov decision processes
- Multi-armed bandit models
- Modelling sessions
- Online advertising
6. Multimedia information retrieval
- Multimedia data types and features
- Concept detection
- Cross-modal indexing of content: latent Dirichlet allocation and deep learning methods
- Cross-modal and multimodal retrieval and recommendation models
- Illustrations with spoken document, image, video and music search
7. Web search
- Web search engines, crawler-indexer architecture, query processing
- Link analysis retrieval models: PageRank, HITS, personalized PageRank and variants
- Behavior- and credibility-based retrieval models
- Social search, mining and searching user-generated content
8. Scalability of Web search
- Data structures and search techniques
- Inverted files, nextword indices, taxonomy indices, distributed indices
- Compression
- Learning of hashing functions, cross-modal hashing
- Scalability and efficiency challenges
- Architectural optimizations
9. Clustering
- Distance and similarity functions in Euclidean and hyperbolic spaces, proximity functions
- Sequential and hierarchical cluster algorithms, algorithms based on cost-function optimization, number of clusters
- Term clustering for query expansion, document clustering, multiview clustering
10. Categorization
- Feature selection, naive Bayes model, support vector machines, (approximate) k-nearest neighbor models
- Deep learning methods
- Multilabel and hierarchical categorization
- Convolutional neural network (CNN) based hierarchical categorization
11. Summarization
- Document segmentation, maximum marginal relevance
- Summarization based on latent Dirichlet allocation models and long short-term memory (LSTM) networks
- Abstractive summarization with attention models
- Multidocument summarization, search results fusion and visualization
12. Question answering and conversational agents in search and recommendation
- Retrieval-based question answering
- Deep learning methods including attention models
- Cross-modal question answering
- E-commerce search and recommendation
13. Evaluation measures and methodology
- Recall, precision, F-measure, mean average precision, discounted cumulative gain, mean reciprocal answer rank, accuracy, confusion matrix, ROC curve, normalized mutual information, mean absolute error, root mean square error, pyramid method, inter-annotator agreement, test collections
14. Discussion of interesting research projects
15. Invited lecture by a representative of an important company
- 2006-2007: Thomas Hofmann, Director of Engineering, Google Zurich European Engineering Centre, Switzerland
- 2007-2008: Ronny Lempel, director of Yahoo! research, Israel
- 2008-2009: Stephen Robertson, senior researcher at Microsoft Research Cambridge, UK and one of the founders of probabilistic modeling in information retrieval
- 2009-2010: Gregory Grefenstette, Chief Science Officer, Exalead, France
- 2010-2011: Mounia Lalmas, visiting senior researcher at Yahoo! Labs Barcelona, Spain
- 2011-2012: Jakub Zavrel, CEO and founder of TextKernel, The Netherlands
- 2012-2013: Massimiliano Ciaramita, senior research scientist at Google, Zürich, Switzerland
- 2013-2014: Alex Graves, senior research scientist at Google DeepMind, London, UK
- 2014-2015: Fabrizio Silvestri, Senior Scientist at Yahoo Labs, Barcelona
- 2015-2016: Roi Blanco, Senior Scientist at Yahoo Labs, London
- 2016-2017: Holger Schwenk, research scientist at Facebook AI Research, France and Dani Yogatama, research scientist at Google DeepMind, London, UK
- 2017-2018: Enrique Alfonseca, research tech leader at Google AI, Zurich
- 2020-2021: Florian Strub, senior researcher at Google DeepMind
- 2021-2022: Rylan Conway, applied scientist at Amazon Seattle
Course material
Course material is available on the Toledo platform of KU Leuven. The following books offer background to the course material:
Baeza-Yates, R. & Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology behind Search (2nd edition). Harlow, UK: Pearson.
Büttcher, S., Clarke, C.L.A. & Cormack, G.V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, MA: MIT Press.
Manning, C.D., Raghavan, P. & Schütze, H. (2009). Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press.
Moens, M.-F. (2006). Information Extraction: Algorithms and Prospects in a Retrieval Context (International Series on Information Retrieval 21). Berlin: Springer.
Format
Interactive lectures.
1 ects. Information Retrieval and Search Engines: Exercises (B-KUL-H00G9a)
Content
- Exercise session on latent semantic models, probabilistic and vector models
- Exercise session on learning to rank
- Exercise session on dynamic retrieval
- Exercise session on compression
- Exercise session on categorization and clustering
- Exercise session on link-based and multimodal models
Course material
Exercises and answers are available via the Toledo platform.
Format
Interactive exercise sessions in small groups.
Evaluation
Evaluation: Information Retrieval and Search Engines (B-KUL-H22C8a)
Explanation
Theory exam (grading: 50 %): Written, open book.
Exercise exam (grading: 50 %): Written, open book.