
Homework 1: Decision Tree Classification
Deadline: November 27th, 2007 (Azar 6th, 1386)
Send your files (source code, report, etc.) in a zipped folder
to arash.ashari@gmail.com.
Include your student ID and the phrase "HW1-ML" in the
subject of your email, e.g. “84200986-HW1-ML”.
Part 1)
In this assignment we give you a data set, and you should construct all decision
trees that the ID3 algorithm can produce from it. An important question in
decision tree construction is "which attribute should be chosen for the current
node?". ID3 answers this question with the information gain heuristic. However,
there are many cases where two (or more) attributes have the same information
gain, so the result of ID3 is not unique. In this assignment you should
construct all of these trees and show them. The format of the inputs and
outputs is described below.
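The tie situation described above can be made concrete with a small sketch. The helper names (`entropy`, `information_gain`, `best_attributes`) and the representation of a sample as an `(attributes, label)` pair are assumptions made for illustration; they are not prescribed by the assignment. Gains are compared with a small tolerance, since exact floating-point equality could hide a genuine tie.

```python
from collections import Counter
from math import isclose, log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(samples, attr):
    """Gain of splitting `samples` (list of (attributes, label)) on index `attr`."""
    groups = {}
    for attrs, label in samples:
        groups.setdefault(attrs[attr], []).append(label)
    remainder = sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return entropy([label for _, label in samples]) - remainder


def best_attributes(samples, candidates):
    """Return every candidate attribute whose gain ties for the maximum."""
    gains = {a: information_gain(samples, a) for a in candidates}
    top = max(gains.values())
    return [a for a in candidates if isclose(gains[a], top, abs_tol=1e-12)]


# The four samples of the sample input below that have a2 = 1.
subset = [((0, 1, 1, 0, 0), 1), ((0, 0, 1, 2, 0), 0),
          ((0, 2, 1, 2, 0), 0), ((0, 2, 1, 1, 0), 0)]
tied = best_attributes(subset, [0, 1, 3, 4])
```

Here `tied` contains both a1 and a3, which is why the sample output lists two trees that differ only below the `a2 = 1` branch.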
1.1 Input
All inputs are in the file "input.txt", which your program should read to obtain
the number of attributes, the number of values of each attribute, and all of the
training data. Every sample consists of a set of attribute values and one output.
The first line of input.txt is a positive integer n, the number of attributes of
every sample. It is followed by n lines, each giving the number of values m_i
that attribute a_i can take, where i ∈ {0, 1, 2, ..., n − 1}. Then comes one
blank line, and the next line holds a number O, the number of outputs (classes).
After another blank line comes l, the total number of training samples.
Each training sample consists of n+1 numbers on one line, separated by single
spaces. The first n numbers are the values of the sample's attributes, and the
last one is the corresponding output (class) for that sample.
So far we have explained how many values the attributes and the output can take;
now we specify which values they are. For simplicity, all values start from
zero. For example, the attribute a_i can take the values {0, 1, 2, …, m_i − 1},
and an output (class) is a number chosen from the set {0, 1, 2, …, O − 1}.
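A minimal sketch of a reader for this format follows. It assumes the values on each sample line are separated by whitespace, as the description states, and it simply ignores blank lines by splitting the whole file into tokens. The function name `parse_input` and the returned tuple layout are hypothetical choices, not part of the assignment.

```python
def parse_input(text):
    """Parse the input format into (sizes, num_classes, samples)."""
    tokens = iter(text.split())  # splitting on whitespace drops blank lines
    n = int(next(tokens))                           # number of attributes
    sizes = [int(next(tokens)) for _ in range(n)]   # m_i for each attribute a_i
    num_classes = int(next(tokens))                 # O
    count = int(next(tokens))                       # l
    samples = []
    for _ in range(count):
        row = [int(next(tokens)) for _ in range(n + 1)]
        attrs, label = tuple(row[:n]), row[n]
        # Values must lie in {0, ..., m_i - 1}; labels in {0, ..., O - 1}.
        if any(not 0 <= v < m for v, m in zip(attrs, sizes)):
            raise ValueError(f"attribute value out of range in {row}")
        if not 0 <= label < num_classes:
            raise ValueError(f"class label out of range in {row}")
        samples.append((attrs, label))
    return sizes, num_classes, samples


# A tiny made-up input: 2 attributes (2 and 3 values), 2 classes, 2 samples.
sizes, num_classes, samples = parse_input("2\n2\n3\n\n2\n\n2\n0 1 1\n1 2 0\n")
```

In a real run the text would come from the file, for example `parse_input(open("input.txt").read())`.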
1.2 Output
Because trees are awkward to print, you are free to present them in any format
you like. One option is a text file, in which case the tree is usually laid out
as in the Sample Output section. Alternatively, you may write the results to an
XML file and display it with a tree-view component, which is usually available
in visual programming languages. If you write your output to a file, name the
file ‘output.xml’ or ‘output.txt’.
1.3 Sample Input
Only numbers appear in the input file. Comments are just for clarification.
5               // there are 5 attributes for each sample: a0, a1, a2, a3, and a4
2               // attribute a0 can take 2 different values: {0, 1}
3               // attribute a1 can take 3 different values: {0, 1, 2}
4               // attribute a2 can take 4 different values: {0, 1, 2, 3}
3               // attribute a3 can take 3 different values: {0, 1, 2}
2               // attribute a4 can take 2 different values: {0, 1}

2               // this test case deals with a 2-class problem: {0, 1}

9               // there are 9 different samples for the decision tree training phase
0 0 0 0 0 1     // sample 1: attributes=<0, 0, 0, 0, 0>, class=1
0 1 0 2 0 1     // sample 2: attributes=<0, 1, 0, 2, 0>, class=1
1 2 2 0 0 0     // sample 3: attributes=<1, 2, 2, 0, 0>, class=0
0 0 3 0 0 1     // sample 4: attributes=<0, 0, 3, 0, 0>, class=1
0 1 1 0 0 1     // sample 5: attributes=<0, 1, 1, 0, 0>, class=1
0 0 2 2 0 0     // sample 6: attributes=<0, 0, 2, 2, 0>, class=0
0 0 1 2 0 0     // sample 7: attributes=<0, 0, 1, 2, 0>, class=0
0 2 1 2 0 0     // sample 8: attributes=<0, 2, 1, 2, 0>, class=0
0 2 1 1 0 0     // sample 9: attributes=<0, 2, 1, 1, 0>, class=0
1.4 Sample Output
Tree 1
a2 = 0: 1
a2 = 1:
    a1 = 0: 0
    a1 = 1: 1
    a1 = 2: 0
a2 = 2: 0
a2 = 3: 1
Tree 2
a2 = 0: 1
a2 = 1:
    a3 = 0: 1
    a3 = 1: 0
    a3 = 2: 0
a2 = 2: 0
a2 = 3: 1
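One possible way to produce the text layout above is sketched here. The tree representation is an assumption made for illustration: a leaf is a plain class number, and an internal node is a pair `(attribute_index, {value: subtree})`. Indenting nested branches by four spaces is also a choice of this sketch, since the assignment leaves the format open.

```python
def format_tree(tree, indent=0):
    """Return the lines of a text rendering of `tree`.

    A leaf is an int (the class); an internal node is a tuple
    (attribute_index, {attribute_value: subtree}).
    """
    pad = "    " * indent
    if not isinstance(tree, tuple):      # the whole tree is a single leaf
        return [f"{pad}{tree}"]
    attr, branches = tree
    lines = []
    for value in sorted(branches):
        child = branches[value]
        prefix = f"{pad}a{attr} = {value}:"
        if isinstance(child, tuple):     # subtree: print it one level deeper
            lines.append(prefix)
            lines.extend(format_tree(child, indent + 1))
        else:                            # leaf: class goes on the same line
            lines.append(f"{prefix} {child}")
    return lines


# Tree 1 of the sample output, written in the assumed representation.
tree1 = (2, {0: 1, 1: (1, {0: 0, 1: 1, 2: 0}), 2: 0, 3: 1})
rendered = format_tree(tree1)
```

Writing the result to a file is then a matter of joining the lines, for example `"\n".join(["Tree 1"] + rendered)`.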
[Figure: Tree 1, Tree 2]
1.5 Documentation
Do not explain the source code in detail. In your report you should discuss the
main algorithm of your work (i.e. the signatures of the main functions and the
important loops or recursions). After testing your program with different test
cases, explain your observations.
Part 2)
Does the ID3 algorithm always produce an optimal decision tree?
If your answer is "Yes", prove the claim mathematically. Otherwise give a
counter-example and discuss the pros and cons of such greedy heuristics.
(A decision tree with minimum height and a minimum number of comparisons is
considered an optimal decision tree.)
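To compare candidate trees against this criterion, both quantities can be measured directly. The sketch below rests on two assumptions: trees are stored as nested `(attribute_index, {value: subtree})` pairs with plain class numbers as leaves, and "comparisons" is read as the number of internal attribute-test nodes. Other readings, such as the average number of tests per classified sample, are equally possible.

```python
def height(tree):
    """Length of the longest root-to-leaf path; a lone leaf has height 0."""
    if not isinstance(tree, tuple):
        return 0
    _, branches = tree
    return 1 + max(height(child) for child in branches.values())


def count_tests(tree):
    """Number of internal (attribute-test) nodes in the tree."""
    if not isinstance(tree, tuple):
        return 0
    _, branches = tree
    return 1 + sum(count_tests(child) for child in branches.values())


# Tree 1 of the sample output: a2 at the root, a1 below the a2 = 1 branch.
tree1 = (2, {0: 1, 1: (1, {0: 0, 1: 1, 2: 0}), 2: 0, 3: 1})
h, tests = height(tree1), count_tests(tree1)
```

For the sample tree this gives a height of 2 and two attribute tests.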
Part 3)
Suppose that in a Concept Learning problem (2 classes: {+, -}) the “ID3” and
“Candidate Elimination” algorithms have produced 'h1' and 'h2' respectively.
Now we exchange the labels of all training data (i.e. every label that was +
becomes – and vice versa). On the relabeled data the two algorithms produce
H1 and H2 respectively.
Describe the relation among h1, h2, H1 and H2.