Information Gain Method for Selecting the Best Attribute
Use information theory to estimate the size of the subtrees rooted
at each child, for each possible attribute. That is, try each attribute,
evaluate the split it produces, and pick the best one.
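
As a rough sketch of that loop, here is a small Python example. The dataset
format, the helper names (expected_questions, best_attribute, split_score),
and the use of the expected-question count derived below as the per-child
score are all assumptions made for illustration, not part of these notes.

    import math
    from collections import defaultdict

    def expected_questions(p, n):
        # Expected number of yes/no questions for a set with p elements of P
        # and n elements of N, using the formula derived below:
        # (p/(p+n)) * log2 p + (n/(p+n)) * log2 n   (log2 of 0 treated as 0).
        total = p + n
        if total == 0:
            return 0.0
        cost_p = math.log2(p) if p > 0 else 0.0
        cost_n = math.log2(n) if n > 0 else 0.0
        return (p / total) * cost_p + (n / total) * cost_n

    def best_attribute(examples, attributes):
        # examples: list of (feature_dict, label) pairs; label True means x ∈ P,
        # False means x ∈ N. Score each attribute by the size-weighted expected
        # work in the children its split produces (an estimate of the subtree
        # sizes rooted at each child), and pick the attribute with the lowest score.
        def split_score(attr):
            children = defaultdict(lambda: [0, 0])     # attribute value -> [p, n]
            for features, label in examples:
                children[features[attr]][0 if label else 1] += 1
            total = len(examples)
            return sum((p + n) / total * expected_questions(p, n)
                       for p, n in children.values())
        return min(attributes, key=split_score)

Choosing the attribute with the lowest weighted expected work corresponds to
choosing the one whose children are estimated to need the smallest subtrees.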
- How much (expected) work is it to identify an element x in a set S of size |S|?
log2|S| yes/no questions.
That is, at each step we can ask a yes/no question whose answer is guaranteed
to eliminate at most half of the remaining elements, so about log2|S| questions
are needed; splitting the remaining candidates in half with each question
achieves this.
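
A small Python illustration of this halving argument (the guessing-game setup
and the helper name questions_to_identify are made up for illustration):

    import math

    def questions_to_identify(candidates, target):
        # Repeatedly ask the yes/no question "is the target in the first half
        # of the remaining candidates?" and keep whichever half contains it,
        # counting the questions asked along the way.
        questions = 0
        while len(candidates) > 1:
            half = candidates[:len(candidates) // 2]
            questions += 1
            candidates = half if target in half else candidates[len(candidates) // 2:]
        return questions

    S = list(range(16))                      # |S| = 16
    print(questions_to_identify(S, 11))      # 4 questions
    print(math.log2(len(S)))                 # log2|S| = 4.0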
- Given S = P ∪ N, where P and N are two disjoint sets,
how hard is it to identify an element x of S?
if x ∈ P, then log2|P| = log2 p questions are needed to locate x within P, where p = |P|
if x ∈ N, then log2|N| = log2 n questions are needed to locate x within N, where n = |N|
So, the expected number of questions that have to be asked is:
(Pr(x ∈ P) * log2 p) + (Pr(x ∈ N) * log2 n)
or, equivalently (since Pr(x ∈ P) = p/(p+n) and Pr(x ∈ N) = n/(p+n)),
(p/(p+n)) log2 p + (n/(p+n)) log2 n
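
As a quick numeric check of this formula in Python (the sizes p = 12 and
n = 4 are arbitrary, chosen only for illustration):

    import math

    p, n = 12, 4                       # |P| = 12, |N| = 4 (arbitrary example)
    total = p + n
    expected = (p / total) * math.log2(p) + (n / total) * math.log2(n)
    print(round(expected, 3))          # (12/16)*log2(12) + (4/16)*log2(4) ≈ 3.189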