or, equivalently,
where %P is the % of positive examples in S = |P|/|S| = p/(p+n) and
%N is the % of negative examples in S = |N|/|S| = n/(p+n).
I measures the information content, in bits (i.e., the number of yes/no questions that must be asked), of a set S of examples, which consists of the subset P of positive examples and the subset N of negative examples.
Note: 0 ≤ I(P,N) ≤ 1, where 0 means there is no information (all examples belong to one class) and 1 means there is maximum information (the examples are evenly split between the classes).
Half the examples in S are positive and half are negative. Hence, %P = %N = 1/2. So,
I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = -1/2 log2(2^-1) - 1/2 log2(2^-1) = -1/2 (-1) - 1/2 (-1) = 1/2 + 1/2 = 1 => information content is large.

Since the information content is 1, all the information is still in the examples, so we have not reduced it at all by classifying the examples. The information gain is 1 - 1 = 0.
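The computation above can be sketched in Python; this is a minimal illustration of the formula, and the function name `information` is my own choice, not from the notes:

```python
import math

def information(p, n):
    """Entropy I(P, N) in bits for a set with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0  # a pure set (all one class) carries no remaining information
    fp = p / (p + n)  # %P, the fraction of positive examples
    fn = n / (p + n)  # %N, the fraction of negative examples
    return -fp * math.log2(fp) - fn * math.log2(fn)

# Half positive, half negative: maximum information content of 1 bit.
print(information(3, 3))  # 1.0
```

With an uneven split such as 1 positive and 3 negatives, `information(1, 3)` falls below 1, matching the claim that 1 bit is the maximum for two classes.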