Navigation Warnings
- Course starts on week 43, Monday.
Often people collect and store data because they think some valuable information is implicitly hidden in it. For instance, in business, data may capture information about customers, competitors, critical markets, fraud, etc.
Although most of the information is stored in large databases, SQL queries are not always suitable for analyzing the data. Many interesting queries, such as
- ''find all records indicating fraud'', or
- ''describe the customers likely to buy product X'',
cannot be stated in SQL.
See this interesting recent article
Data Mining hottest skills, as cited by respondents to Computerworld's annual Forecast survey.
Data mining is the part of the knowledge discovery process that consists of algorithms aimed at finding patterns or models hidden in the data.
Hence, data mining algorithms help to find answers for many interesting queries with which traditional database techniques cannot cope.
Modern data mining techniques are nowadays used by most web search engines (e.g. Google) and can also be used to reveal patterns hidden in the vast amount of data on the World Wide Web.
For instance, have you heard of the
Wal*Mart success?
Data mining connects with several other important fields. Some are listed below.
- Artificial intelligence, e.g. search algorithms originating from AI are used in many data mining algorithms.
- Visualization. A stand-alone knowledge discovery system may not be very useful unless novel visualization tools are integrated with the system.
- Databases, as important sources of huge amounts of information.
- Data structures and algorithms. These techniques are required to create non-trivial and efficient algorithms that dig out information from vast amounts of data.
- Statistics. Many of its techniques are incorporated in data mining algorithms and used in the knowledge discovery process,
for instance sampling and evaluating the quality of the discovered patterns.
The major topics discussed in this course are the following.
- The Knowledge Discovery in Data process (KDD).
- Data preprocessing.
- Main data mining algorithms (e.g. decision trees, Bayesian classifiers, association rules, clustering, etc.) and their practical applications.
- Techniques to evaluate the ``degree of interest'' of the discovered knowledge.
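As a rough illustration of the last topic, the sketch below computes three measures that are commonly used to judge how interesting an association rule is: support, confidence, and lift. Treating these as the "degree of interest" measures is an assumption made here for illustration, and the toy transactions are invented, not course data.

```python
from typing import FrozenSet, List

# Hypothetical toy transactions, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"butter", "milk"}),
    frozenset({"bread", "butter"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(antecedent | consequent, transactions) / base


def lift(antecedent: FrozenSet[str], consequent: FrozenSet[str],
         transactions: List[FrozenSet[str]]) -> float:
    """Confidence divided by the consequent's support; 1.0 means independence."""
    base = support(consequent, transactions)
    if base == 0:
        raise ValueError("consequent never occurs, lift is undefined")
    return confidence(antecedent, consequent, transactions) / base


if __name__ == "__main__":
    a, c = frozenset({"bread"}), frozenset({"milk"})
    print("support:", support(a | c, TRANSACTIONS))
    print("confidence:", confidence(a, c, TRANSACTIONS))
    print("lift:", lift(a, c, TRANSACTIONS))
```

For the rule bread -> milk on this toy data, the lift is below 1, which suggests that the rule is not interesting even though its confidence is fairly high.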
Course plan from studiehandbok.
The course is based on
- 10 lectures (FÖ). Each lecture is 2h.
- 3 seminars (SE). Each seminar is 2h.
- 2 lab sessions. Each lab session is 2h.
Other course material, such as papers, will be suggested during the course. You are also free to choose any other book(s) of your preference.
You can pass this course with grade 3, 4, or 5. To pass this course you must
- participate actively in all seminars: seminar attendance is mandatory;
- prepare a data mining related topic, write a short paper about it, and present the topic during one seminar;
- work on a practical problem and analyze a related dataset using the techniques studied.
You should then write a report describing the problem, how you applied data mining techniques to tackle the problem,
and what knowledge you discovered.
You should form groups of 3 students and work together with your group for tasks 2 and 3 above.
No later than October 29th, all groups should be formed and each group should inform the course responsible, by e-mail,
about the members of the group.
The presentation, for point 2 above, should be divided among the group members, and at the beginning of the presentation it should be clearly indicated what is under the responsibility of each member.
The final practical problem should also be divided into different sub-tasks that are then assigned to each group member. The final report must also indicate what the contribution of each person in the group was.
You can read more about seminars and practical problem.
The following concrete topics are presented during lectures.
- Welcoming the students and presentation of the course.
- Introducing the KDD process.
- Data preprocessing.
- Decision tree induction.
- Rule-based classifiers.
- Bayesian classifiers.
- Association analysis and Apriori algorithm.
- Clustering.
- Evaluating discovered knowledge and comparing classifiers.
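To give a flavour of the decision tree induction topic listed above, the following minimal sketch scores a candidate split on a categorical attribute by its information gain. Information gain is only one common splitting criterion and is an assumption here; the lectures may use other measures. The function names and the toy data in the usage part are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in entropy obtained by splitting `labels` on the attribute `values`."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    total = len(labels)
    # Group the class labels by the attribute value of each record.
    partitions: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["yes", "yes", "no", "no"]
    print(information_gain(["x", "x", "y", "y"], labels))  # attribute separates the classes
    print(information_gain(["x", "y", "x", "y"], labels))  # attribute tells nothing
```

A tree learner built on this idea would evaluate every candidate attribute at a node and split on the one with the highest gain.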
Lecture slides are based on the slides of the following books
- Introduction to Data Mining, P. Tan, M. Steinbach, V. Kumar, ISBN: 0-321-32136-7
- Data Mining: Practical Machine Learning Tools and Techniques, I. Witten, E. Frank, ISBN-13: 978-0-12-088407-0
Notes for the lectures (including the slides) will be posted on this web page before each lecture.
The lecture slides below are from the course given in 2010.
If some lecture slides are updated, the new version will be posted on this web page.
- Lecture 1:
Presentation of the course. What is data mining about, its relations to other fields. Survey of the major techniques and applications.
- Lecture 2:
What data looks like. Exploring the data. Data preprocessing: aggregation, sampling, discretization, attribute selection.
- Lecture 3:
Decision trees: what they are, how to select the best split, how to handle continuous values and missing attributes, overfitting.
- Lecture 4:
Rule-based classifiers: what is a classification rule, evaluating a rule, algorithms (e.g. PRISM, RIPPER, C4.5rules, PART).
- Lecture 5:
Evaluating the performance of a classifier: precision, recall, TPR, FPR, TNR, FNR, sensitivity, specificity.
Taking into account misclassification costs. The class imbalance problem.
- Lecture 6:
Evaluating the performance of a model (cont.): cross-validation; bootstrap. Comparing models. Association rules (intro).
- Lecture 7:
Association rule mining: Apriori algorithm.
- Lecture 8:
Cluster Analysis. Basic concepts. K-means algorithm.
- Lecture 9:
Cluster Analysis (cont.). Similarity and dissimilarity measures. Cluster validation. Hierarchical clustering. Cobweb algorithm.
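The classifier evaluation measures named for Lecture 5 can all be derived from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the class and function names, and the toy labels in the usage part, are hypothetical and not taken from the course material.

```python
from dataclasses import dataclass
from typing import Hashable, Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positives predicted as positive
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative
    fn: int  # positives predicted as negative

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        """Also called true positive rate (TPR) or sensitivity."""
        return self.tp / (self.tp + self.fn)

    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)

    def tnr(self) -> float:
        """Also called specificity."""
        return self.tn / (self.tn + self.fp)

    def fnr(self) -> float:
        return self.fn / (self.fn + self.tp)


def confusion_matrix(actual: Sequence[Hashable], predicted: Sequence[Hashable],
                     positive: Hashable) -> ConfusionMatrix:
    """Count the four outcomes, treating `positive` as the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp, fp, tn, fn)


if __name__ == "__main__":
    actual = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0, 0]
    cm = confusion_matrix(actual, predicted, positive=1)
    print(cm, cm.precision(), cm.recall(), cm.fpr())
```

Note that the rate methods divide by zero when a class is absent from the data; a more careful version would decide how to report such cases.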
Three seminars of 2 hours each, on week 49, have been booked. During each seminar two groups of students present their selected topic.
The topic can either be a new topic not discussed in the lectures, a complement to a topic discussed in the lectures,
or a practical application of a technique discussed in the lectures.
On week 46 a list of topics for the seminars will be made available from the course web page.
Each group must choose a topic and inform the course responsible about the group's choice by November 18th.
If more than one group expresses interest in a topic then I apply the rule ''first come, first served''.
You inform the course responsible about your choice by sending an e-mail indicating your group number and your chosen topic.
Seminars and presentations follow the rules below.
- Each presentation takes at most 20 minutes, followed by 15 minutes discussion/questions.
- Prepare a pdf file of your presentation, with 2 slides per page, and write a short paper, not more than 8 pages,
summarizing the most relevant ideas about your topic.
Your paper and presentation must be sent to the course responsible in pdf format not later than 1st of December, 15h.
I will make your files available to all other students from the course web page.
- For each seminar, you are then advised to read the related short papers and the proposed material in advance.
- Do not forget to add a slide with your bibliographic references to your presentation.
- Each group (except the one making the presentation) must submit a valid question about the seminar's topic.
In the beginning of the seminar, I'll collect all questions for discussion.
- After presentation and discussion of your seminar topic,
you have 10 minutes to present the practical problem you are going to work with in the final course project.
For more details see final practical project.
- Barring tragedy, your presence in all seminars is mandatory.
You are welcome to discuss any issues of your presentation with me in advance.
You can find a suggestion about how to structure your paper
here.
Note that I take the following aspects into account when evaluating your presentation.
- Clarity of your exposition.
Your presentation (and paper) must clearly address the following questions.
- What is the problem about?
- What techniques are used to tackle it? How do they compare to each other? What are their weaknesses and strengths?
- What are possible applications?
- Respect for time constraints (i.e. make sure that you can make your presentation in at most 30 minutes) and deadlines.
- Technical content.
- Student evaluations. Each student is asked to fill in an evaluation form for each presentation.
Your grade for this part will take into account both the presentation and the short paper you submitted about your topic.
Below, you can find a list of proposed topics and some references. You are free to find your own references.
You can get from me the sections of the books recommended.
Seminar Topics
- Data Mining Applications: churn prediction
- Data Mining Applications: web mining
- Classification: grafted decision trees
- Clustering: DBSCAN algorithm
- Clustering and visualization: Self-organizing maps (SOM)
Student Groups
Group Number | Group Members |
Group 1: Web mining [Slides] [Paper]. 7 of Dec., 8:00-9:00, TP31 | Sandra Stendahl, Andreas Andersson, Gustav Strömberg |
Group 2: Grafted decision trees [Slides] [Paper]. 7 of Dec., 9:00-10:00, TP31 | Kajsa Eriksson, Emil Brissman |
Group 3: Self-organizing maps (SOM) [Slides] [Paper]. 9 of Dec., 15:00-16:00, TP31 | Carl Claesson, Lucas Correia, David Jonsson |
Group 4: DBSCAN algorithm [Slides] [Paper]. 9 of Dec., 16:00-17:00, TP31 | Anders Hedblom, Niklas Nejman, Henrik Bäcklund |
The software you are going to use is WEKA and it is installed in the lab rooms.
WEKA is open source software and you can also install it on your own machines.
You can download it and have access to WEKA manuals and tutorials.
Some more information about WEKA is available at
Use the following simple lab exercises to get acquainted with WEKA.
On week 49, during your seminar presentation, every group must also present a practical problem of their own choice.
This part of the presentation should take at most 10 minutes.
The following points must be clearly addressed in your presentation.
- Problem description
- Your data set and any preprocessing techniques you may need
- Which data mining techniques you plan to use (e.g. classification, association rule mining, clustering, etc.)
- Which quality measures you are going to use for the extracted patterns.
Finally, you must write a report that describes in detail the points above in the context of your concrete practical problem.
The structure and issues to be addressed in your report are described
here.
The following aspects are taken into consideration when evaluating your report.
- Clarity and organization of your written report. Please do not underestimate these two aspects.
A well-written report is of utmost importance for making a positive impression of your work.
- Technical content (i.e. description of the techniques used to tackle the problem).
- All points mentioned here.
The deadline to submit the final report is January 20th, 2012.
You must send a pdf file with your report together with the data set you worked with.
Interesting Links
Task | Deadline |
Course start | Week 43, Monday, 13h-15h |
Form group and inform course responsible by e-mail | October 29th, week 43 |
List of seminar topics available | Week 46 |
Choose a seminar topic and inform course responsible by e-mail | November 18th, week 46 |
Deliver slides and paper for the seminar presentation | December 1st, week 48, 15h |
Present the practical problem for the final course project | Seminar on week 49 |
Deliver report for the practical problem | January 20th, week 3, 2012 |
You can find the schedule for the lectures (Fö), seminars (SE),
and labs (LA) of the course here.
- Course responsible: Aida Nordman. Office: Kåkenhus 1411.
- Course administrator: Britt Wirmark. Office: Kåkenhus 1407.