The course studies the techniques and algorithms commonly used in text-based information retrieval, as well as the challenges of this field. These theoretical insights form the basis for discussions of commercial systems and ongoing research projects.
After completing this course, the student should be able to 1) describe and understand fundamental concepts and algorithms in information retrieval and text mining; 2) design, partially implement, and evaluate a text-based information retrieval system.
The course is aimed at students who are interested in the theory and applications of the processing, storage and retrieval of textual information. Elementary knowledge of statistics, probability theory and linear algebra is required. Familiarity with machine learning methods is recommended.
Examples and samples
Toledo / e-platform
Articles and literature
Order of Enrolment
This course unit is a prerequisite for taking the following course units:
G0J98A : Current Trends in Databases
Is also included in other courses
- Study Abroad Programme in European Culture and Society (PECS)
- Master in de toegepaste informatica (Artificial Intelligence and Databases) 60 ECTS.
- Master in de toegepaste economische wetenschappen: handelsingenieur in de beleidsinformatica 120 ECTS.
- Master in de ingenieurswetenschappen: biomedische technologie 120 ECTS.
- Master of Artificial Intelligence 60 ECTS.
- Master in de informatica (phasing out, 2nd phase only) (Specialisation: Artificial Intelligence) 120 ECTS.
- Master in de informatica (phasing out, 2nd phase only) (Specialisation: Databases) 120 ECTS.
- Master in de informatica (phasing out, 2nd phase only) (Specialisation: Multimedia) 120 ECTS.
- Master of Information Management 60 ECTS.
- Master in de ingenieurswetenschappen: computerwetenschappen (Specialisation: Artificial Intelligence) 120 ECTS.
- Master of Engineering: Biomedical Engineering 120 ECTS.
1. Exercise session on latent semantic models
2. Exercise session on probabilistic retrieval models
3. Exercise session on feature selection and hierarchical text categorization
4. Exercise session on text clustering
5. Exercise session on link based retrieval models
The exercise sessions give the opportunity to gain an in-depth understanding of the algorithms discussed during the lectures.
Description of learning activities
Interactive exercise sessions in small groups.
Exercises and answers are available via the Toledo platform.
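As an illustration of the kind of algorithm covered in the exercise session on link based retrieval models, the following is a minimal sketch of basic PageRank computed by power iteration. It is not taken from the course material: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; the ranks sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the uniform "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split the page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page c collects links from both a and b and therefore ends up with the highest rank, while b, which is linked only from a, ends up with the lowest.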
The motivation for the course lies in the urgent need for computer programs that help people digest large volumes of unstructured information, including text. We need information retrieval technology when, for instance, we search for information on the World Wide Web, in repositories of news and blogs, in biomedical document bases, or in governmental and company archives. Emails, other messages and advertisements are likewise searched and filtered. Techniques for content recognition and linking play an increasing role: they make it possible to generate content models of documents or messages that effectively match the personalized information needs of users. The proliferation of wireless and mobile devices such as mobile phones has additionally created a demand for effective and robust techniques to index, retrieve and summarize information.
The lectures treat the following topics:
- Text and its characteristics
- Statistical and machine learning techniques
- Link and graph based algorithms
3. Representation of documents and information needs
- Natural language versus controlled language index terms
- Tokenization, detection of collocations, term weighting
- Relevance feedback, personalized and contextualized information needs
4. Latent semantic models
- Dimensionality reduction, Latent Semantic Analysis, probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation
5. Retrieval models
- Set theoretic models: Boolean and extended Boolean models
- Algebraic models: vector space models
- Probabilistic models: Robertson and Sparck Jones model, language models, Bayesian networks
6. Web information retrieval
- Web search engines, crawler-indexer architecture
- Link analysis retrieval models: HITS, extended PageRank, personalized PageRank
- Behavior and credibility based retrieval models
- Web interfaces, social search, searching user generated content
7. Introduction to data structures and search techniques
- Sequential searching
- Inverted files, nextword indices, taxonomy indices, distributed indices
- Compression and searching
8. Text categorization
- Feature selection, chi square statistic, information gain, decision rules and trees, naive Bayes models, support vector machines
- Hierarchical text categorization
- Document filtering
9. Text clustering
- Feature selection, distance and similarity functions, proximity functions
- Sequential and hierarchical cluster algorithms, algorithms based on cost-function optimization, non-negative matrix factorization, number of clusters
- Term clustering for query expansion and thesaurus construction, document clustering, clustering of heterogeneous content
10. Information extraction and question answering
- Feature selection, rule and frame based methods, context-dependent classification
- Named entity recognition and relation extraction
- Retrieval based on extracted facts, entities and temporal information
11. Text summarization
- Feature selection, machine learning
- Text segmentation, maximum marginal relevance, multi-document summarization
- Summarization with latent semantic models
- Word and sentence conflation
12. Cross-language information retrieval
- Query translation, query expansion, probabilistic language models, relevance feedback
- Cross-lingual latent semantic models, bilingual Latent Dirichlet Allocation
13. Multimedia information retrieval
- Multimedia data types, query formats, indexing, cross-media alignment, MPEG-7, retrieval models
- Illustrations with spoken document, image, video and music retrieval
14. Evaluation measures and methodology
- Recall, precision, F-measure, mean average precision, discounted cumulative gain, mean reciprocal answer rank, accuracy, confusion matrix, ROC curve, pyramid method, inter-annotator agreement, test collections
15. Discussion of interesting research projects
16. Invited lecture by a representative of an important company
- 2006-2007: Thomas Hofmann, Director of Engineering, Google Zurich European Engineering Centre, Switzerland
- 2007-2008: Ronny Lempel, Director of Yahoo! Research, Israel
- 2008-2009: Stephen Robertson, senior researcher at Microsoft Research Cambridge, UK and one of the founders of probabilistic modeling in information retrieval
- 2009-2010: Gregory Grefenstette, Chief Science Officer, Exalead, Paris
- 2010-2011: Mounia Lalmas, visiting senior researcher at Yahoo! Labs Barcelona
- 2011-2012: Jakub Zavrel, CEO and founder of TextKernel, The Netherlands
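Several of the evaluation measures listed under topic 14 can be made concrete with a few lines of code. The sketch below computes precision, recall, F-measure (the balanced F1 variant) and average precision for a single query. The function names, document identifiers and relevance judgments are invented for illustration and are not part of the course material.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision(ranking, relevant):
    """Average of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero, so the sum
    is divided by the total number of relevant documents.
    """
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked result list and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}

p, r, f = precision_recall_f1(ranking, relevant)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
print(f"average precision={average_precision(ranking, relevant):.3f}")
```

Mean average precision is then the mean of `average_precision` over a set of queries from a test collection.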
The aim is to acquire fundamental insights into the theory of text based information retrieval and text mining.
Description of learning activities
Course material is available on the Toledo platform of the K.U.Leuven. The following books offer background to the course material:
Baeza-Yates, R. & Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology behind Search (2nd edition). Harlow, UK: Pearson.
Büttcher, S., Clarke, C.L.A. & Cormack, G.V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, MA: MIT Press.
Manning, C.D., Raghavan, P. & Schütze, H. (2009). Introduction to Information Retrieval. Cambridge University Press.
Moens, M.-F. (2006). Information Extraction: Algorithms and Prospects in a Retrieval Context (International Series on Information Retrieval 21). Berlin: Springer.
The assignment is a choice between a short paper and a small programming assignment on a given problem of current research interest.
The aim is to design, possibly implement and evaluate a text-based information retrieval system.
Description of learning activities
The student works individually on a project, but receives guidance when needed.
The project assignment and additional documentation are available on the Toledo platform of the K.U.Leuven.
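To give a sense of what a very small text-based retrieval system involves, the sketch below builds an inverted file and ranks documents with a simple tf-idf score. It is only an illustration under stated assumptions, not the actual project assignment: whitespace tokenization, the scoring formula (term frequency times the logarithm of inverse document frequency, without length normalization), the function names and the three toy documents are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection.
docs = {
    "d1": "information retrieval with inverted files",
    "d2": "text clustering and text categorization",
    "d3": "web information retrieval and link analysis",
}
index = build_index(docs)
for doc_id, score in search("text retrieval", index, len(docs)):
    print(doc_id, round(score, 3))
```

A fuller system would add the components treated in the lectures, such as better tokenization and term weighting, index compression, and an evaluation against a test collection.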
- An assignment (grading: 33.3%): At the start of the course (week 7) the student can choose an assignment (paper or programming exercise) that addresses a specific problem in information retrieval. The assignment is due during week 17. A score of 50% or more on this assignment carries over to the second exam session.
- Theory exam (grading: 33.3%): Oral with written preparation, closed book.
- Exercise exam (grading: 33.3%): Written, open book.