TNM033: Data Mining
Fall 2011




People often collect and store data because they believe valuable information is implicitly hidden in it. For instance, in business, data may capture information about customers, competitors, critical markets, fraud, etc. Although most of this information is stored in large databases, SQL queries are not always adequate for analyzing the data. Many interesting queries, such as

cannot be stated in SQL.

See this interesting recent article Data Mining hottest skills, as cited by respondents to Computerworld's annual Forecast survey.

Data mining is the part of the knowledge discovery process that consists of algorithms for finding patterns or models hidden in the data. Hence, data mining algorithms help answer many interesting queries that traditional database techniques cannot cope with. Modern data mining techniques are nowadays used by most web search engines (e.g. Google) and can also be used to reveal patterns hidden in the vast amount of data on the World Wide Web. For instance, have you heard of Wal*Mart's success?

Data mining connects with several other important fields. Some are listed below.

The major topics discussed in this course are the following.

Course plan from the study handbook (studiehandbok).

Organization

The course is based on


Literature

Other course material, such as papers, will be suggested during the course. You are also free to use any other book(s) of your choice.


Examination

You can pass this course with grade 3, 4, or 5. To pass the course you must

  1. participate actively in all seminars (seminar attendance is mandatory);
  2. prepare a data mining related topic, write a short paper about it, and present the topic during one seminar;
  3. work on a practical problem and analyze a related dataset using the techniques studied. You should then write a report describing the problem, how you applied data mining techniques to tackle it, and what knowledge you discovered.

You should form groups of 3 students and work together with your group on tasks 2 and 3 above. All groups should be formed no later than October 29th, and each group should inform the course responsible by e-mail of its members.

The presentation for point 2 above should be divided among the group members, and at the beginning of the presentation it should be clearly indicated what each member is responsible for. The final practical problem should also be divided into sub-tasks that are assigned to the individual group members. The final report must also indicate the contribution of each person in the group.

You can read more about the seminars and the practical problem.


Lectures

The following concrete topics are presented during lectures.

Lecture slides are based on the slides of the following books

Notes for the lectures (including the slides) will be posted on this web page before each lecture. The lecture slides below are from the course given in 2010. If any lecture slides are updated, the new version will be posted on this web page.

  1. Lecture 1: Presentation of the course. What is data mining about, its relations to other fields. Survey of the major techniques and applications.
  2. Lecture 2: What data looks like. Exploring the data. Data preprocessing: aggregation, sampling, discretization, attribute selection.
  3. Lecture 3: Decision trees: what they are, how to select the best split, how to handle continuous values and missing attributes, overfitting.
  4. Lecture 4: Rule-based classifiers: what is a classification rule, evaluating a rule, algorithms (e.g. PRISM, RIPPER, C4.5rules, PART).
  5. Lecture 5: Evaluating the performance of a classifier: precision, recall, TPR, FPR, TNR, FNR, sensitivity, specificity. Taking into account misclassification costs. The class imbalance problem.
  6. Lecture 6: Evaluating the performance of a model (cont.): cross-validation; bootstrap. Comparing models. Association rules (intro).
  7. Lecture 7: Association rule mining: Apriori algorithm.
  8. Lecture 8: Cluster Analysis. Basic concepts. K-means algorithm.
  9. Lecture 9: Cluster Analysis (cont.). Similarity and dissimilarity measures. Cluster validation. Hierarchical clustering. Cobweb algorithm.
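
To make the measures named for Lecture 5 concrete, here is a minimal Python sketch (not part of the original course material) that derives them from the four cells of a binary confusion matrix. The function name `classifier_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Derive common evaluation measures from binary confusion-matrix counts.

    tp/fn: positive examples classified correctly/incorrectly
    tn/fp: negative examples classified correctly/incorrectly
    """
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class has no examples.
        return numerator / denominator if denominator else float("nan")

    tpr = ratio(tp, tp + fn)        # true positive rate (recall, sensitivity)
    fnr = ratio(fn, tp + fn)        # false negative rate = 1 - TPR
    tnr = ratio(tn, tn + fp)        # true negative rate (specificity)
    fpr = ratio(fp, tn + fp)        # false positive rate = 1 - TNR
    precision = ratio(tp, tp + fp)  # fraction of predicted positives that are correct

    return {
        "precision": precision,
        "recall": tpr,
        "TPR": tpr,
        "FNR": fnr,
        "TNR": tnr,
        "FPR": fpr,
        "sensitivity": tpr,
        "specificity": tnr,
    }


# Hypothetical counts: 50 positive and 100 negative test examples.
example = classifier_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Note that recall, TPR, and sensitivity are three names for the same quantity, as are TNR and specificity. With imbalanced classes, precision can be low even when TPR and TNR both look good, which is one reason the class imbalance problem is treated separately.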

Seminars

Three seminars of 2 hours each have been booked in week 49. During each seminar, two groups of students present their selected topic. The topic can be a new topic not discussed in the lectures, a complement to a topic discussed in the lectures, or a practical application of a technique discussed in the lectures. In week 46, a list of seminar topics will be made available on the course web page. Each group must choose a topic and inform the course responsible of its choice by November 18th. If more than one group expresses interest in a topic, I apply the rule "first come, first served". To inform the course responsible of your choice, send an e-mail indicating your group number and your chosen topic.

Seminars and presentations follow the rules below.

You are welcome to discuss any issues concerning your presentation with me in advance.

You can find a suggestion about how to structure your paper here.

Note that I take the following aspects into account when evaluating your presentation.

Your grade for this part will take into account both the presentation and the short paper you submitted about your topic.

Below, you can find a list of proposed topics and some references. You are free to find your own references.

You can get the recommended book sections from me.

Seminar Topics

  1. Data Mining Applications: churn prediction

  2. Data Mining Applications: web mining

  3. Classification: grafted decision trees

  4. Clustering: DBSCAN algorithm

  5. Clustering and visualization: Self-organizing maps (SOM)

Student Groups

Group Number Group Members
Group 1: Web mining [Slides] [Paper]
7 of Dec., 8:00-9:00, TP31
  • Sandra Stendahl
  • Andreas Andersson
  • Gustav Strömberg
Group 2: Grafted decision trees [Slides] [Paper]
7 of Dec., 9:00-10:00, TP31
  • Kajsa Eriksson
  • Emil Brissman
Group 3: Self-organizing maps (SOM) [Slides] [Paper]
9 of Dec., 15:00-16:00, TP31
  • Carl Claesson
  • Lucas Correia
  • David Jonsson
Group 4: DBSCAN algorithm [Slides] [Paper]
9 of Dec., 16:00-17:00, TP31
  • Anders Hedblom
  • Niklas Nejman
  • Henrik Bäcklund


Labs and Practical Project

The software you are going to use is WEKA, which is installed in the lab rooms. WEKA is open source software, so you can also install it on your own machine. You can download it and access the WEKA manuals and tutorials. Some more information about WEKA is available at

Use the following simple lab exercises to get acquainted with WEKA.

Practical Project

In week 49, during your seminar presentation, every group must also present a practical problem of its own choice. This part of the presentation should take at most 10 minutes. The following points must be clearly addressed in your presentation.

Finally, you must write a report that describes in detail the points above in the context of your concrete practical problem. The structure and issues to be addressed in your report are described here.

The following aspects are taken into consideration when evaluating your report.

The deadline for submitting the final report is January 20th, 2012. You must send a PDF file with your report together with the data set you worked with.

Interesting Links


Important Dates

Task                                                            Deadline
Course start                                                    Week 43, Monday, 13h-15h
Form group and inform course responsible by e-mail              October 29th, week 43
List of seminar topics available                                Week 46
Choose a seminar topic and inform course responsible by e-mail  November 18th, week 46
Deliver slides and paper for the seminar presentation           December 1st, week 48, 15h
Present the practical problem for the final course project      Seminar in week 49
Deliver report for the practical problem                        January 20th, week 3, 2012


Schedule

You can find the schedule for the lectures (Fö), seminars (SE), and labs (LA) of the course here.


Staff

Course Responsible
Aida Nordman. Office: Kåkenhus 1411.

Course administrator
Britt Wirmark. Office: Kåkenhus 1407.