Written Assignment 3: Probability and Classification
Due 3/2 at 11:59pm
Question 1 (2 points). Consider the following
3-sided dice with the given side values. Assume the dice are all fair
and all rolls are independent.
A: 2, 2, 5
B: 1, 4, 4
C: 3, 3, 3
- What is the expected value of each die?
- Consider the function better(X, Y), which has value 1 if X > Y and
value -1 if X < Y (no two of these dice can tie). What are the expected values of better(A, B), better(B, C), and better(C, A)? Why
are these sometimes called non-transitive dice?
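To sanity-check the expectations by brute force, here is a minimal Python sketch that enumerates the nine equally likely side pairings for each pair of dice; expected_better is a helper name of our own, not part of the assignment.

    from itertools import product

    A = [2, 2, 5]
    B = [1, 4, 4]
    C = [3, 3, 3]

    def expected_better(X, Y):
        # For fair, independent dice every side pairing is equally likely,
        # so E[better(X, Y)] is a plain average over all pairs.
        return sum(1 if x > y else -1 for x, y in product(X, Y)) / (len(X) * len(Y))

    print(expected_better(A, B), expected_better(B, C), expected_better(C, A))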
Question 2 (2 points). On a day when an assignment
is due (A=a), the newsgroup tends to be busy (B=b), and the computer lab tends
to be full (C=c). Consider the following
conditional probability tables for the domain.
P(A)

P(B | A)

    B     A     P
    ----------------
    b     a     0.80
    ¬b    a     0.20
    b     ¬a    0.40
    ¬b    ¬a    0.60

P(C | A)

    C     A     P
    ----------------
    c     a     0.70
    ¬c    a     0.30
    c     ¬a    0.50
    ¬c    ¬a    0.50
- Construct the joint distribution from these conditional probability
tables, assuming B and C are independent given A (a construction sketch follows this question).
- What is the marginal distribution P(B,C)? Are these two variables
absolutely independent in this model? Justify your answer using the
actual probabilities, not your intuitions.
- What is the posterior distribution over A given that B=b, P(A | B=b)? What is
the posterior distribution over A given that C=c, P(A | C=c)? What about P(A | B=b, C=c)? Explain the pattern among these posteriors and why it holds.
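Constructing the joint is mechanical once you note that, under the stated assumption, it factors as P(A) P(B | A) P(C | A). Below is a minimal Python sketch; the prior P(a) is passed in as a parameter p_a (take its value from the P(A) table), and all names here are our own.

    P_B_GIVEN_A = {True: 0.80, False: 0.40}  # P(b | a), P(b | ¬a)
    P_C_GIVEN_A = {True: 0.70, False: 0.50}  # P(c | a), P(c | ¬a)

    def joint(p_a):
        """Return P(A, B, C) as a dict keyed by (a, b, c) truth values."""
        table = {}
        for a in (True, False):
            pa = p_a if a else 1.0 - p_a
            for b in (True, False):
                pb = P_B_GIVEN_A[a] if b else 1.0 - P_B_GIVEN_A[a]
                for c in (True, False):
                    pc = P_C_GIVEN_A[a] if c else 1.0 - P_C_GIVEN_A[a]
                    # B and C are conditionally independent given A,
                    # so each entry is P(A) * P(B | A) * P(C | A).
                    table[(a, b, c)] = pa * pb * pc
        return table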
Question 3 (2 points). Sometimes, there is traffic
on the freeway (C=c). This could either be because of a ball game (B=b) or because of an accident (A=a). Consider the following
joint probability table for the domain A = {a, ¬a}, B = {b, ¬b}, C = {c, ¬c}.
P(A, B, C)

    A     B     C     P
    ----------------------
    a     b     c     0.018
    a     b     ¬c    0.002
    a     ¬b    c     0.144
    a     ¬b    ¬c    0.036
    ¬a    b     c     0.056
    ¬a    b     ¬c    0.024
    ¬a    ¬b    c     0.144
    ¬a    ¬b    ¬c    0.576
- What is the distribution P(A,B)? Are A and B
independent in this model given no evidence? Justify your answer using
the actual probabilities, not your intuitions.
- What is the marginal distribution over A given no evidence?
- How does this change if we observe that C=c; what is the posterior distribution P(A | C=c)? Does this change intuitively make sense? Why or why not?
- What is the conditional distribution over A if we then learn there is a ball
game, P(A | B=b, C=c)?
Does it make sense that observing B should cause this update to A (called
explaining-away)? Why
or why not?
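To check these posteriors, a generic conditioning helper suffices: sum the rows of the joint table consistent with the evidence, then normalize. A minimal Python sketch (posterior_a is our own helper name):

    # Keys are (A, B, C) with 1 = true and 0 = false, matching the table above.
    JOINT = {
        (1, 1, 1): 0.018, (1, 1, 0): 0.002,
        (1, 0, 1): 0.144, (1, 0, 0): 0.036,
        (0, 1, 1): 0.056, (0, 1, 0): 0.024,
        (0, 0, 1): 0.144, (0, 0, 0): 0.576,
    }

    def posterior_a(b=None, c=None):
        """P(A = a | evidence): sum the consistent rows, then normalize."""
        rows = [(k, p) for k, p in JOINT.items()
                if (b is None or k[1] == b) and (c is None or k[2] == c)]
        num = sum(p for k, p in rows if k[0] == 1)
        den = sum(p for _, p in rows)
        return num / den

    print(posterior_a())          # prior P(a)
    print(posterior_a(c=1))       # P(a | C=c)
    print(posterior_a(b=1, c=1))  # P(a | B=b, C=c)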
Question 4 (2 points). Often we need to carry out
reasoning over some pair of variables X, Y conditioned on the value of a
third variable E=e.
- Using the definitions of conditional probabilities, prove the
conditionalized version of the product rule: P(X, Y | e)
= P(X | Y, e) P(Y | e)
- Prove the conditionalized version of Bayes' rule: P(Y | X,
e) = P(X | Y, e) P(Y | e) / P(X | e)
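A hint for both parts (not the full proof): by the definition of conditional probability applied to compound events,

    P(X, Y | e) = P(X, Y, e) / P(e)    and    P(X | Y, e) = P(X, Y, e) / P(Y, e).

Expanding each side of the identities above into ratios of this form makes both proofs short.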
Question 5 (2 points). Suppose we wish to calculate
P(C | A, B).
- If we have no conditional independence information, which of the
following tables are sufficient to calculate P(C | A, B)?
- P(A, B), P(C), P(A | C), P(B | C)
- P(A, B), P(C), P(A, B | C)
- P(C), P(A | C), P(B | C)
- What if we know that A and B are conditionally independent given C?
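One way to start (a hint, not the full answer): apply Bayes' rule with the compound evidence (A, B),

    P(C | A, B) = P(A, B | C) P(C) / P(A, B),

and check, for each candidate set of tables, whether every factor on the right can be obtained, both with and without the conditional independence assumption.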
Question 6 (3 points). A common, simple way of
improving classification accuracy is to train an ensemble of multiple
classifiers for a single task, then to classify new instances by the majority
output of the individual classifiers. Suppose we have M binary
classifiers, C_1 ... C_M, each of which has an error rate ε. Assume M is odd.
- Assume that the classifiers make independent errors; that is, for any
input x, the chance of classifier C_i making a mistake is independent of
whether the other classifiers were correct. In this case, give a
formula for the error rate of the ensemble in terms of ε and M (a numerical check appears after this question).
If ε = 0.1 and M = 3, what is the error rate of the ensemble?
- For a fixed component error rate ε < 0.5, what happens to the ensemble error
rate as M gets very large and why? Why does this not solve the world's
classification problems?
- If we do not know that the errors are independent, is it possible that
the ensemble will have a higher error rate than its members? Again,
assume all classifiers share the same error rate ε < 0.5.
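To check a closed-form answer numerically: under the independence assumption, a majority vote errs exactly when more than half of the M classifiers err, which is a binomial tail probability. A minimal Python sketch (ensemble_error is our own helper name):

    import math

    def ensemble_error(eps, M):
        # The majority vote is wrong iff more than M // 2 of the M
        # independent classifiers are wrong: a binomial tail.
        return sum(math.comb(M, k) * eps**k * (1 - eps)**(M - k)
                   for k in range(M // 2 + 1, M + 1))

    print(ensemble_error(0.1, 3))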
Question 7 (4 points). Consider building a naive Bayes and a perceptron
classifier on the following example, where X and Y are evidence variables and Z is the target class variable.
Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=1: X=1,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1
- For each data point, construct a feature vector [X, Y, 1], where the final element is the bias feature. What parameters will a two-class perceptron learn on this training set if trained
until convergence (a training-loop sketch appears after this question)? Assume that ties result in updates. What will
its accuracy be when tested on the same set?
- What parameters will a naive Bayes classifier learn on this training
set, assuming no smoothing? What will its accuracy be if the
test set is identical to the training set?
- Instead of a naive Bayes model, consider predicting P(Z|X,Y) directly
from the full empirical joint distribution over X, Y, and Z. What will the accuracy of this model be
if the test set is identical to the training set?
- Explain the difference between the accuracies of the two Bayesian
classifiers in terms of conditional independence assumptions.
- Give a general disadvantage of each of these three classifiers.
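To verify a hand-run perceptron trace, here is a minimal Python sketch of the two-class update rule on the data above. Labels are mapped to ±1, a score of exactly zero is treated as a tie that triggers an update (per the question's assumption), the epoch cap is only a safeguard, and all names are our own.

    DATA = [([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1),
            ([1, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([0, 1, 1], 0)]

    def train_perceptron(data, max_epochs=100):
        w = [0, 0, 0]
        for _ in range(max_epochs):
            mistakes = 0
            for f, z in data:
                y = 1 if z == 1 else -1       # map classes {0, 1} to {-1, +1}
                score = sum(wi * fi for wi, fi in zip(w, f))
                if y * score <= 0:            # wrong, or a tie (score == 0)
                    w = [wi + y * fi for wi, fi in zip(w, f)]
                    mistakes += 1
            if mistakes == 0:                 # a full clean pass = convergence
                return w
        return w

    print(train_perceptron(DATA))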
Question 8 (3 points). In this exercise, you will
show that the relative frequency estimates discussed in class are the maximum
likelihood estimates for a single random variable's distribution. Consider
a random variable X with values x_1, x_2, ..., x_n. Suppose we observe a
data set of k samples from X, e.g. [x_7, x_3, x_6, ...], in
which each value x_i occurs c_i times. Let p_i be the probability of x_i.
- To begin with, let X = {h, t} be a coin flip. Assume the data is
[h, t, h, h, t]. Write the likelihood of this exact sample in terms of
p_h and p_t (you may wish to use 1 - p_h instead of p_t). Find the
maximizing parameter by differentiating the likelihood function and setting
the derivative to zero (a symbolic check appears after this question).
- Extend this argument to arbitrary data sets over two-valued random
variables. Does the order of the data matter?
- Show that the relative frequency estimate holds in the general case
where n > 2. You may assume that the likelihood has a unique maximum and no local extrema.
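For the coin-flip part, the calculus can be checked symbolically; this is a minimal sketch using sympy, which is our tool choice rather than something the assignment requires.

    import sympy as sp

    p = sp.symbols('p_h', positive=True)
    # Likelihood of the exact sample [h, t, h, h, t]: three heads, two tails.
    L = p**3 * (1 - p)**2
    # At an interior maximum, differentiating log L gives the same critical
    # point as differentiating L, and the algebra is easier.
    critical = sp.solve(sp.diff(sp.log(L), p), p)
    print(critical)  # should be the relative frequency of heads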