Say all of the examples in S are positive and none are negative. Then, %P = 1, and %N = 0. So,
I(1,0) = -1·log₂(1) - 0·log₂(0) = -0 - 0 = 0 => information content is zero, the minimum possible. (Here we take 0·log₂(0) = 0 by convention, since x·log₂(x) → 0 as x → 0.)
Low information content is desirable for building the smallest tree: it means most of the examples at the node already share the same classification, so we expect the subtree rooted at this node to be small, since little work remains to separate the two classes.
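As a minimal sketch of this measure (the function name information_content is my own, not from the notes), here is I(%P, %N) in Python, with the 0·log₂(0) = 0 convention handled explicitly:

```python
import math

def information_content(p_frac, n_frac):
    """I(%P, %N) in bits: -p*log2(p) - n*log2(n), with 0*log2(0) taken as 0."""
    def term(x):
        return -x * math.log2(x) if x > 0 else 0.0
    return term(p_frac) + term(n_frac)

# All examples positive: I(1, 0) = 0 bits (no uncertainty left).
print(information_content(1.0, 0.0))  # 0.0
# Even split: I(0.5, 0.5) = 1 bit (maximum uncertainty).
print(information_content(0.5, 0.5))  # 1.0
```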
Now, measure the information gained by testing a given attribute. That is, measure the difference between the information content at a node and the information content remaining after the node splits the examples according to a selected attribute's possible values. To do this, we need a measure of the information content after "splitting" a node's examples into its children based on a hypothesized attribute.
Given a node that tests attribute A, define Remainder(A) as the weighted sum of the information content of each subset of the examples, one subset per value of A:

Remainder(A) = Σ_v ((p_v + n_v) / (p + n)) · I(p_v / (p_v + n_v), n_v / (p_v + n_v))

where p and n count the positive and negative examples at the node, and p_v and n_v count those whose attribute A takes value v.
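A companion sketch of Remainder(A), under the assumption that each subset is given as a (positive, negative) pair of counts; it reuses the information_content helper from the sketch above:

```python
import math

def information_content(p_frac, n_frac):
    # As in the previous sketch; 0*log2(0) taken as 0.
    term = lambda x: -x * math.log2(x) if x > 0 else 0.0
    return term(p_frac) + term(n_frac)

def remainder(subsets):
    """Remainder(A): weighted sum of the information content of the subsets
    produced by splitting on A. `subsets` holds (positive, negative) counts,
    one pair per value of A."""
    total = sum(p + n for p, n in subsets)
    r = 0.0
    for p, n in subsets:
        size = p + n
        if size > 0:  # skip empty subsets; they contribute nothing
            r += (size / total) * information_content(p / size, n / size)
    return r

# Splitting 12 examples (6 pos, 6 neg) into subsets (4, 0) and (2, 6):
print(remainder([(4, 0), (2, 6)]))  # ~0.54 bits remain after the split
```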