Big Data Analytics Programming (B-KUL-H00Y4A)
Aims
The goal of this course is to familiarize students with the different types of programming environments they may encounter or need to use when analyzing large-scale data sets. The course consists of three parts, or modules. Each module begins with background lectures that introduce the relevant topics. The key concepts are reinforced during practical exercise sessions. Students are then expected to use these skills to complete programming projects.
Note that this class runs over both semesters. That is, there will be projects and lectures in both the first and second semester. It is not possible to follow this class for only one of the semesters.
Previous knowledge
Strong background and experience with advanced data structures and algorithms, including topics such as hash tables/maps/sets, sorting algorithms, queues, search trees, etc. Understanding of time and space complexity of algorithms. Excellent programming ability in Java, C++, C, or a similar language. General familiarity with relational databases.
Order of Enrolment
Mixed prerequisite:
You may only take this course if you comply with the prerequisites. Prerequisites can be strict or flexible, or can imply simultaneity. A degree level can also be a prerequisite.
Explanation:
STRICT: You may only take this course if you have passed or applied tolerance for the courses for which this condition is set.
FLEXIBLE: You may only take this course if you have previously taken the courses for which this condition is set.
SIMULTANEOUS: You may only take this course if you also take the courses for which this condition is set (or have taken them previously).
DEGREE: You may only take this course if you have obtained this degree level.
(SIMULTANEOUS(H02C1A) OR SIMULTANEOUS(H0E96A) OR SIMULTANEOUS(H0E98A)) AND SIMULTANEOUS(H02C6A)
The codes of the course units mentioned above correspond to the following course descriptions:
H02C1A : Machine Learning and Inductive Inference
H0E96A : Beginselen van machine learning
H0E98A : Principles of Machine Learning
H02C6A : Data Mining
Is included in these courses of study
- Master of Artificial Intelligence (Leuven) (Specialisation: Big Data Analytics (BDA)) 60 ects.
- Master in de ingenieurswetenschappen: computerwetenschappen (Leuven) (Hoofdoptie Artificiële intelligentie) 120 ects.
- Courses for Exchange Students Faculty of Engineering Science (Leuven)
- Master of Engineering: Computer Science (Leuven) (Option Artificial Intelligence) 120 ects.
- Master in de ingenieurswetenschappen: artificiële intelligentie (Leuven) 120 ects.
Activities
2.5 ects. Big Data Analytics Programming: Lecture (B-KUL-H00Y4a)
Content
Note that the order in which the topics are covered can vary from year to year.
Part I: Basics
1. Introduction and overview
2. Background on hashing, computer organization, complexity basics, etc.
3. Databases basics: SQL, join algorithms, index structures
4. Advanced topics: Fancy indexes, column stores, data warehouses, NoSQL
Part II: Structures and techniques for efficiency
1. Introduction and overview
2. Learning from data streams
3. Fast nearest neighbors algorithms
4. Implementation tricks
5. Approximation methods (e.g., sketches, sampling)
6. Advanced topics?
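As an illustration of the sampling techniques listed under approximation methods, the following is a minimal sketch of reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. It is not taken from the course material; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Hypothetical helper for illustration only; uses O(k) memory and a
    single pass over the stream.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))
```

The sample is produced in one pass, so the stream never has to fit in memory.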
Part III: Parallel Architectures
1. Introduction and overview
2. Types of parallelism (e.g., shared memory, shared nothing)
3. Concurrency
4. Parallel programming bugs (e.g., data races, deadlock)
5. Map-reduce
6. Cloud computing
7. Condor?
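To make the map-reduce topic concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It is an assumption-laden illustration, not the course's implementation and not the API of Hadoop or Spark; all function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially over an iterable of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data streams"]))
```

In a real framework the map and reduce calls would run in parallel on different machines; here they run sequentially so that only the data flow is visible.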
Course material
Lecture slides, readings, and online resources
0.5 ects. Big Data Analytics: Exercises (B-KUL-H00Y5a)
Content
1. Part I: Query Languages
1. Introduction and overview
2. SQL: selection, projection, select-project-join, group-by, aggregates, subqueries, nested queries
3. XQuery
4. SPARQL
5. Writing applications that can interface with a DBMS
6. Indexing?
2. Part II: Scripting Languages
1. Introduction and overview
2. Perl/Python
3. Part III: Parallel Architectures
1. Introduction and overview
2. Types of parallelism (e.g., shared memory, shared nothing)
3. Concurrency
4. Parallel programming bugs (e.g., data races, deadlock)
5. Map-reduce
6. Cloud computing
7. Condor?
Course material
Exercise slides
3 ects. Big Data Analytics: Assignments (B-KUL-H00Y6a)
Content
Examples of the types of assignments
1. Given a set of verbal queries, translate them into a query language
2. Write queries to extract information needed for a machine learning task
3. Implement advanced machine learning algorithms (e.g., for learning from streaming data)
4. Implement an advanced data mining algorithm
5. A project that uses Hadoop, Spark, etc.
Course material
Assignment sheets
Evaluation
Evaluation: Big Data Analytics Programming (B-KUL-H20Y4a)
Explanation
The evaluation of the course will be based on multiple programming assignments. Solutions are evaluated in terms of correctness, efficiency and generalizability.
Independent projects must be completed individually. Thus, using outside sources (e.g., publicly available code) or working together (e.g., working with someone else to solve the assignment, or getting substantial help from someone else to solve it) is strictly forbidden. If you have questions about what is and is not permitted, please consult the instructor.
Information about retaking exams
For project assignments with a failing result, the student will have an opportunity to complete an alternative assignment.