Homework 1: Decision Tree Classification
Deadline: November 27th, 2007 (Azar 6th, 1386)

Send your files (source code, report, etc.) in a zipped folder to arash.ashari@gmail.com. Mention your student ID and the phrase "HW1-ML" in the subject of your email, e.g. "84200986-HW1-ML".

PART 1

In this assignment you are given a set of data and must construct every decision tree that the ID3 algorithm can produce. An important question in decision tree construction is which attribute should be chosen for the current node; ID3 answers it with the information gain heuristic. In many cases, however, two (or even more) attributes have the same information gain, so the result of ID3 is not unique. In this assignment you should construct and display all of these trees. The formats of the inputs and outputs are detailed below.

1.1 Input

All input is in the file "input.txt", which contains the number of attributes, the number of classes, and all of the training data. Every sample consists of a set of attribute values and one output. The first line of input.txt is a positive integer n, the number of attributes of each sample, followed by n lines, each giving the number of values m_i that attribute a_i can take, where i ∈ {0, 1, 2, ..., n-1}. Then comes one blank line, and the next line holds a number O, the number of outputs (classes). After another blank line comes l, the total number of training samples. Each training sample occupies one line of n+1 numbers separated by single spaces: the first n numbers are the values of the sample's attributes, and the last one is the corresponding output (class) for that sample.

So far we have explained how many values the attributes and the output can take; now we specify which values they are. For simplicity, all values start from zero: attribute a_i takes its values from {0, 1, 2, ..., m_i-1}, and an output (class) is a number chosen from {0, 1, 2, ..., O-1}.

1.2 Output

Because trees are awkward to format in print, you are free to display them in any format you want. One option is a text file laid out as shown in the Sample Output section. Alternatively, you may write the results to an XML file and use a tree-view component, which is available in most visual programming environments. If you write your output to a file, name it 'output.xml' or 'output.txt'.
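As a starting point (not a required design), the following minimal Python sketch shows one way to parse input.txt as specified in section 1.1 and to detect ties in information gain; every attribute returned by best_attributes can serve as the root of a distinct ID3 subtree. The names parse_input, entropy, info_gain, and best_attributes are illustrative assumptions, not part of the assignment.

import math

def parse_input(path="input.txt"):
    # Read every whitespace-separated integer; blank lines carry no data,
    # so token order alone recovers the structure described in section 1.1.
    tokens = iter(int(t) for t in open(path).read().split())
    n = next(tokens)                          # number of attributes
    m = [next(tokens) for _ in range(n)]      # m[i] = number of values of a_i
    O = next(tokens)                          # number of outputs (classes)
    l = next(tokens)                          # number of training samples
    samples = [[next(tokens) for _ in range(n + 1)] for _ in range(l)]
    return n, m, O, samples                   # each sample: n values, then its class

def entropy(samples, O):
    # Entropy(S) = -sum_c p_c * log2(p_c), over the classes present in S.
    counts = [0] * O
    for s in samples:
        counts[s[-1]] += 1
    return -sum(c / len(samples) * math.log2(c / len(samples))
                for c in counts if c)

def info_gain(samples, attr, m_attr, O):
    # Gain(S, a) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    gain = entropy(samples, O)
    for v in range(m_attr):
        subset = [s for s in samples if s[attr] == v]
        if subset:
            gain -= len(subset) / len(samples) * entropy(subset, O)
    return gain

def best_attributes(samples, attrs, m, O, eps=1e-9):
    # Return ALL attributes tying for the maximum gain (within a small
    # tolerance); each of them roots a distinct ID3 tree, which is exactly
    # why the assignment asks for every tree rather than just one.
    gains = {a: info_gain(samples, a, m[a], O) for a in attrs}
    best = max(gains.values())
    return [a for a in attrs if gains[a] >= best - eps]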
1.3 Sample Input

Only numbers appear in the input file; the comments are for clarification only.

5            // there are 5 attributes for each sample: a0, a1, a2, a3, and a4
2            // attribute a0 can take 2 different values: {0, 1}
3            // attribute a1 can take 3 different values: {0, 1, 2}
4            // attribute a2 can take 4 different values: {0, 1, 2, 3}
3            // attribute a3 can take 3 different values: {0, 1, 2}
2            // attribute a4 can take 2 different values: {0, 1}

2            // this test case deals with a 2-class problem: {0, 1}

9            // there are 9 samples for the decision tree training phase
0 0 0 0 0 1  // sample 1: attributes = <0, 0, 0, 0, 0>, class = 1
0 1 0 2 0 1  // sample 2: attributes = <0, 1, 0, 2, 0>, class = 1
1 2 2 0 0 0  // sample 3: attributes = <1, 2, 2, 0, 0>, class = 0
0 0 3 0 0 1  // sample 4: attributes = <0, 0, 3, 0, 0>, class = 1
0 1 1 0 0 1  // sample 5: attributes = <0, 1, 1, 0, 0>, class = 1
0 0 2 2 0 0  // sample 6: attributes = <0, 0, 2, 2, 0>, class = 0
0 0 1 2 0 0  // sample 7: attributes = <0, 0, 1, 2, 0>, class = 0
0 2 1 2 0 0  // sample 8: attributes = <0, 2, 1, 2, 0>, class = 0
0 2 1 1 0 0  // sample 9: attributes = <0, 2, 1, 1, 0>, class = 0

1.4 Sample Output

Tree 1
a2 = 0: 1
a2 = 1:
    a1 = 0: 0
    a1 = 1: 1
    a1 = 2: 0
a2 = 2: 0
a2 = 3: 1

Tree 2
a2 = 0: 1
a2 = 1:
    a3 = 0: 1
    a3 = 1: 0
    a3 = 2: 0
a2 = 2: 0
a2 = 3: 1

[Figure: graphical renderings of Tree 1 and Tree 2]

(A minimal sketch of printing trees in this indented format appears at the end of this handout.)

1.5 Documentation

Do not explain the source code in detail. In your report, discuss the main algorithm of your work (i.e., the signatures of the main functions and the important loops or recursions). After testing your program on different test cases, describe your observations.

Part 2)

Does the ID3 algorithm always produce an optimal decision tree? If your answer is "Yes", prove the claim mathematically. Otherwise, give a counter-example and discuss the pros and cons of such greedy heuristics. (A decision tree with minimum height and minimum number of comparisons is considered an optimal decision tree.)

Part 3)

Suppose that in a concept learning problem (2 classes: {+, -}) the ID3 and Candidate Elimination algorithms have produced hypotheses h1 and h2, respectively. Now we exchange the labels of all training data (i.e., class labels previously known as + become -, and vice versa), so that the above algorithms now produce H1 and H2, respectively. Describe the relation among h1, h2, H1, and H2.
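Closing note (referenced in section 1.4): the sketch below shows one way to print a tree in that indented text format, assuming a hypothetical Node representation in which an internal node stores the index of the attribute it tests plus one child per attribute value, and a leaf stores only a class label. This is an illustration, not a required design.

class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr          # attribute index tested here (internal nodes)
        self.children = children  # dict: attribute value -> child Node
        self.label = label        # class label (leaves only)

def print_tree(node, depth=0):
    # One line per branch: a leaf's label goes on the same line as its
    # branch, while an internal child continues on deeper, indented lines.
    for value, child in sorted(node.children.items()):
        prefix = "    " * depth + "a%d = %d:" % (node.attr, value)
        if child.label is not None:
            print(prefix, child.label)
        else:
            print(prefix)
            print_tree(child, depth + 1)

For example, Tree 1 of section 1.4 could be reproduced with:

tree1 = Node(attr=2, children={
    0: Node(label=1),
    1: Node(attr=1, children={0: Node(label=0),
                              1: Node(label=1),
                              2: Node(label=0)}),
    2: Node(label=0),
    3: Node(label=1),
})
print_tree(tree1)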