Dakshina Murthy Gandikota's Blog: Bayes

“It is likely to rain,” is a common refrain in the rainy season. What does it mean really though? One answer is, it means there is a high probability that the rain will occur where probability is a measure of prior occurrences of rain given the atmospheric conditions. But where does probability come from? Some opine that it is based on frequency. They are called frequentists. For example, every one understands coin flips. The probability of heads or tails on an unbiased coin flip are equal. There are 2 assumptions here: (1) a coin can land on either head or tails but not on its curved surface and (2) the probabilities add up to 1. The probability of an outcome is always a positive real number below 1. If using an unbiased coin, the probabilities of heads or tails should be the same (0.5). In reality a coin is biased and flipped enough times will favor either heads or tails. The number of times a coin lands on heads or tails is called frequency. In other words, probability in most cases is based on the frequency of occurrences. It is interesting to note in Physics that when light is viewed as a stream of photons spinning in circles, the reflection from a surface is based on the probability. Prof.Feynmann actually talked about viewing photon motion as a spinning arrow and the reflection probability is based on the square of the length of the arrow connecting the tail of arriving photon and head of the reflected proton, also called square of amplitude because the probabilistic region is a circle --whose area is proportional to the square of radius--with the radius given by the amplitude. So going back to the statement that “it is likely to rain”: it can be based on the speaker's observations of prior rainy days. An informed speaker takes in to account all of the variables that predict the rain: presence of clouds, prevailing winds, season of the year, etc.

Most often we don't have frequency measurements to base the probability. For instance, the probability of fever cannot exist in isolation because fever is often the symptom of underlying causes. Flu can cause fever. And so can common cold. So the probability of fever is conditional on the underlying causes. If fever is non-conditional, then it could be based on frequency. Suppose one can say 20% of population, at any given time, is afflicted with fever. That is prior probability of fever, P(Fever)=0.2. Similarly prior probability of Flu is given as, say, P(Flu)=0.3. If the conditional probability of fever given flu is needed P(Fever|Flu) then P(Fever) and P(Flu) should be related to it. This is the leap of faith Reverand Bayes, an 18th century clergyman, made. Using his theorem we can say P(Flu|Fever) = P(Flu)*P(Fever|Flu)/P(Fever). This doesn't seem like a great way to predict flu given fever. As already stated fever has many causes and flu is just one of them. The fever could manifest because of microbes attacking the immune system or the temperature of the surroundings. Not to mention stress and work activity. Therefore, a physician might not adhere to Bayes' theorem as religiously as Rev. Bayes would like.

A Classical Example of Bayes' Theorem

Stated in words, Bayes' theorem is simply starting with a provisional hypothsis we compute the prior probability of the hypothesis. Then in light of new evidence we revise the probability of the hypothesis called it posterior probability. To illustrate, let’s suppose that we were getting tested for HIV. Most of the medical tests have 4 out comes

True Positive: Both the test and presence of HIV in the patient are true
True Negative: The test and presence of HIV agree; no HIV
False Positive: Test and presence of HIV don't match. The test is positive on HIV but the patient actually doesn't have HIV
False Negative: The test detects no HIV even though the patient has HIV

The test manufacturer typically gives the accuracy, say, 99.5%. This is the first conditional probability we have, i.e. P(+ve|HIV)=0.995. And P(Test-ve|HIV)=0.995 which is the conditional probability that the patient doesn't have HIV based on the test accuracy. Knowing that the HIV prevalence in the general population is 2%, which is the prior probability P(HIV), using Bayes theorem we can say:

${P(Cause|Effect)} = {{P(Effect)*P(Effect | Cause)} \over P(Effect)}$

${P(HIV|+ve)} = {{P(+ve|HIV) * P(HIV)} \over P(+ve) }$

Substituting the known probabilities:

${P(HIV|+ve)} = {{0.995 * 0.02} \over P(+ve)}$

The denominator P(+ve) is the overall +ve probability of the test being +ve which will be the case when True Positive and False Positive. Therefore:

$P(+ve) = P(+ve|HIV)*P(HIV) + P(+ve|No HIV)*P(No HIV)$

We know the prior probability of P(HIV)=0.02. So P(No HIV)= 1-0.02=0.98 because the probabilities have to add up to 1.

Similarly, P(+ve|No HIV) = 1 – P(-ve|No HIV) = 1-0.995=0.005

We already know from the manufacturer P(+ve|HIV)=0.995

Substituting

${P(HIV|+ve)} = {{0.995 * 0.02} \over {0.995 * 0.02 + 0.005 * 0.98}}$ = 0.8024 or 80.24%

The patient has potentially a better outcome because the conditional probability of HIV given a positive test results is much lower than advertised 99.5%

Certainty Factors

The traditional alternative to the concept of physical probability or what frequentists describe is to think of probabilities as reporting our subjective degrees of belief. This is what Bayesians prefer for the prior probabilities. The modern computer scientists take it further by adding disbelief. In the '80s and '90s when medical knowledge was codified as rules in the knowledge-based systems—also called expert systems-- the contribution to a hypothesis based on new evidence is given as certainty factors. The reason certainty factors were developed has to do with the difficulties with the Bayesian method. Bayes’ theorem’s accuracy in predicting likelihoods depends on knowing prior probabilities. For example, to determine the probability of a disease:

${P(D_i| E)} = {{P(E | D_i) * P(D_i)} \over \sum{(P(E | D_j)*P(D_j))}}$

Where

$D_i$ is the i'th disease,
E is the evidence,
$P(D_i)$ is the prior probability of the patient having the Disease i before any evidence is known
$P(E | D_i)$ is the conditional probability that the patient will exhibit evidence E, given that disease $D_i$ is present

If we are able to calculate the $P(D_i | E)$, somehow, when incremental evidence is presented the equations turns more complicated like this:

${P(D_i | E)} = {{P(E_2 | D_i \bigcap E_1) * P(D_i | E_1)} \over \sum{(P(E2 | D_j \bigcap E_1) * P(D_j | E_1))}}$

where $E_2$ is the new evidence added to yield the new augmented evidence:

$E = E_1 \bigcap E_2$

In general if we are dealing with N symptoms with each symptom having the values true or false, there are $2^N$ possibilities. This is called a combinatorial problem by the computer scientists.

In other words certainty factors enable the scientists to elicit the belief from human experts.

0 means no evidence
values greater than 0 favor the hypothesis
values less than 0 favor the negation of the hypothesis

$CF = {MB – {MD \over { (1 – min(MB, MD))}}}$

Where

CF = certainty factor
MB =measure of belief
MD=measure of disbelief

When 2 rules support the same hypothesis the combined certainty factor

CF = CF1(1-CF2) If both > 0

=CF1 – $CF2 \over {(1-min(|CF1|, |CF2|)}$ if one < 0

= CF1 + CF2 (1+CF1) if both < 0

CF's are found to be convenient in enabling computer programs to diagnose diseases using rules to supplant human physicians. One criticism of CF's is that while they capture the increased belief in a hypothesis given evidence, they assume the belief and disbelief are independent of each other. For example: an expert might say the presence of clouds raises the belief in imminent rain to say 0.6; another expert might say if the clouds are cirrus then there is disbelief say 0.8. So the experts are inconsistent if not contradicting one another.

Naive Bayes

Let us illustrate Naive Bayes with an example. Suppose we want to classify emails as Spam or No-Spam. The Spam detection is based on the occurrences of words like “free”, “won”, “lottery”, etc. Using Bayes' theorem we can write:

$P(Spam | w_1, w_2...w_n) = P(Spam) * P(w_1,w_2...w_n | Spam) / P(w_1, w_2...w_n)$ where w represents a word.

$P(w_1,w_2....w_n | Spam) = P(Spam | w_1, w_2...w_n) * P(w_1, w_2...w_n) / P(Spam)$

The left hand side can be further expanded to: $P(w_1 | w_2...w_n, Spam) * P(w_2 | w_1, w_3...w_n, Spam)...$

Then we make the assumption that $P(w_1 | w_2...w_n,Spam) = P(w_1 | Spam)$ which is central to the Naive Bayes.

That is, we assume the occurrence of word $w_1$ is independent of the occurrences of rest of the words. So “won lottery” is interpreted as two spam words “won” and “lottery” even though they occur together in the email.

Therefore, $P(Spam | w_1, w_2...w_n) = P(Spam)*P(w_1 | Spam)*P(w_2 | Spam)...P(w_n | Spam) / P(w_1,w_2...w_n)$

To calculate $P(w_1 | Spam)$ we can use frequency of occurrence of word $w_1$ in the Spam email. As for the denominator, lot of applications of Naive Bayes ignore it, making the final form

$P(Spam | w_1, w_2...w_n) \propto P(Spam)*P(w_1 | Spam) * P(w_2 | Spam)...P(w_n | Spam)$

In the virtual world of computers each of the probabilities on the right hand side are so small (less than 1 each) that their product will be rounded off to zero! To overcome this, we apply log to both sides

$log(P(Spam | w_1, w_2...w_n)$ = $log(P(Spam)) + \sum_{i=0}^n{log P(w_i | Spam)}$

What if $P(w_i | Spam)=0$ ? To avoid this, we will use Laplace smoothing or assume that each word occurs at least once.

Bayes Net

A Bayes Net is a model of the world we would like to make predictions about. Consider the following causalities for Dyspnea (shortness of breath)

Suppose we ask what is the probability that Dyspnea is caused by a trip to a foreign country? For that we need to traverse the following subtree of the Bayes net:

Visit to Foreign Country->Tuberculosis->Dyspnea

If we know that patient has TB, then the prior visit, if any, is irrelevant. So we have:

P(Dyspnea | Visit ^ TB) = P(Dyspnea | TB)

Cancer and Bronchitis have a common cause in Smoking.

P(Bronchitis | Cancer ^ Smoking) = P(Bronchitis | Smoking)

This is easier to understand as Bronchitis has no direct cause to Cancer. We could've also written

P(Cancer | Bronchitis ^ Smoking) = P (Cancer | Smoking)

Common effect is when Dyspnea can be caused by Bronchitis or Cancer

P (Dyspnea | Cancer ^ Bronchitis) $\ne$ P(Dyspnea | Cancer) $\ne$ P(Dyspnea | Bronchitis)

One observation we can make here is, if we rule out cancer, then the conditional probability P(Dyspnea | Bronchitis) increases.

Most commonly, Bayesian nets are considered to be representations of joint probability distributions. Sparse Bayesian networks (those with relatively few arcs, which means few parents for each node) represent probability distributions in a computationally tractable way.

Modeling this way from first principles is the preferred approach in various applications of Bayesian nets. Obviously it is a tedious procedure to create Bayes net especially in a world where everything is inter-connected and prior probabilities don't exist. After all the statement that a butterfly flapping its wings in Asia could cause a tornado in US midwest is not far fetched for the Bayesians.

Decision Tree

A decision tree is a special case of Bayes' net where at each node we ask the question like “Does the patient smoke?” If true, then ask another question, “Has the patient contracted TB or cancer?”. If the answer to the first question is false then we pursue foreign visit, “Has the patient visited a foreign country?” And so on.

Given these 6 variables we have $2^6$ or 64 possibilities as follows:

Visit	Smoking	Cancer	TB	Bronchitis	Dyspnea
1	0	0	0	0	0
1	0	0	0	0	1
1	1	0	0	0	1

A decision tree makes it easier to handle this truth table by pruning some sub trees based on how the patient answers the questions. For example, if a patient never smoked, then we can eliminate bronchitis from consideration. A variation of decision tree for classification was given as a generic task. It is believed that: (a) we manage complexity by hierarchically representing hypotheses and (b) we can prune branches in the hierarchical structure based on evidence. An algorithm called establish-refine has been suggested. When we establish the probability or truth of a hypothesis, only then we refine it by entertaining all of the sub-hypotheses connected to it. It has been well known that rule-based systems that use certainty factors adopt a similar strategy but it is implicit in the way rules are processed. For example: the rules for bronchitis will fire after it has been established that the patient smokes. In the generic task the establish process can be as complex as one can make. To establish smoking, one can ask “how many packs a day”, “exposure to secondary smoke”, “any cigars”, etc.

Markov Chains

A Markov chain is a state-transition diagram subject to rules of probability. It assumes that no matter which states were traversed in the past, the set of possible future states are fixed. In other words, the current state determines the probability of transitioning to a future state.

For example, if you made a Markov chain model of an infant's activities, you might include "playing," "eating", "sleeping," and "crying" as states, which with other activities could form a 'state space': a list of all possible states. In addition to states, Markov chain tells you the probability of transitioning from one state to any other state-e.g., the chance that an infant currently playing will fall asleep in the next five minutes without crying first.

Consider another example based on web surfing. No matter what pages you have browsed earlier to arrive at the current page, your ability to follow links on the current page determines the future state with the assumption that you are not entering an all together new link in the browser.

In a paper written in 1913, Markov chose a sequence of 20,000 letters from Pushkin's Eugene Onegin to see if this sequence can be approximately considered a simple chain. He obtained the Markov chain with transition matrix:

$$ \begin{pmatrix} 0.128 & 0.872 \\ 0.663 & 0.337 \end{pmatrix} $$ The transition probabilities are given as

vowel->vowel = .128
vowel->consonant=.872
consonant->vowel=.663
consonant->consonant->.337

Note that the probabilities on each row add up to 1.

The fixed vector for this chain is (.568, .432), indicating that we should expect about 43.2 percent vowels and 56.8 percent consonants in the novel, which was borne out by the actual count.

Claude Shannon considered an interesting variation of this idea in his book The Mathematical Theory of Communication. Shannon considers a series of Markov chain approximations to English prose. He does this first by chains in which the states are letters and then by chains in which the states are words. For example, for the case of words he presents first a simulation where the words are chosen independently but with appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

If this looks like a drunkard's rambling, consider the following:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Which sounds almost like valid text because the words were chosen as a Markov chain. As to how the text was generated: a simulation is carried out by opening a book and choosing the first word, say it is “the”. Then the book is read until the word “the” appears again and the word after this is chosen as the second word, which turned out to be “head”. The book is then read until the word “head” appears again and the next word, “and”, is chosen, and so on.

Mathematically, a Markov chain is a collection of random variables ${X_t}$ where the index t runs through (0, 1, ...) having the property that, given the present, the future is conditionally independent of the past. In other words,

$P(X_t=j |X_0=i_0,X_1=i_1,...,X_{t-1}=i_{t-1})=P(X_t=j|X_{t-1}=i_{t-1})$

If a Markov sequence of random variates $X_n$ take the discrete values $a_1, ..., a_N$, then

$P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}}),...,x_1=a_{i_1})=P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}}), $

Notice the similarity with Naive Bayes where we didn't need all of the probabilities in the spam/no-spam classification.

Hidden Markov Model

HMM's are yesterday's neural nets. So far their best known application is speech recognition where it is required to map a set of sounds to a set of written words. What we now take for granted in any phone system, where the navigation menu is driven by spoken words (“yes” or “no” or something else), is made possible by HMMs.

Speech recognition aside, let us say we have a model of a stock market that has these 3 states: bull, bear or even. And we monitor our stock portfolio as: up, down or even. Suppose we have untethered ourselves from modern world for 3 days and would like to know the probability of our portfolio moving from up to down to up (since we don't expect it to be up all the time), how do we go reasoning about it? This is where HMM's come handy.

Consider another situation where the accurate temperature records from a century ago were not available. All we know is whether the year was hot or cold. And we also have access to tree rings, that are supposed to be formed with each passing year varying in size by the weather condition, as small, medium and large. Suppose we want to know the climatic conditions in which the tree rings observed as small-medium-small-large were formed. Once again the answer is HMMs.

Put succinctly, a HMM is one in which you observe a sequence of observations( called emissions), but do not know the sequence of states the model went through to generate the emissions. Analyses of hidden Markov models seek to recover the sequence of states from the observed data. Besides speech recognition HMMs are frequently used for the statistical analysis of multiple DNA sequence alignments.

HMM Math

S = states= $\{s_1, s_2,,,s_n\}$

V=observations =$\{v_1, v_2...v_m\}$

Q=fixed state sequence of length T and O the corresponding observations

$Q=q_1,q_2,,,q_T$

$O=o_1,o_2...o_T$

A=transition array storing the probability of state j following state i

$A=[a_{ij}]$ where $a_{ij}=P(q_t=s_j|q_{t-1}=s_i)$

B=observation array which is the probability of observation k being produced from state j independent of t

$B=[b_i(k)] where b_i(k)=P(o_t=v_k|q_t=s_i)$

U = initial probability array =$[u_i] where u_i=P(q_1=s_1)$

We make the following assumptions:

Markov Assumption: current state is dependent only on the previous state
Independence Assumpton: the output observation $o_t$ at time t is dependent on the current state $q_t$

Given L=model

Which is simply $P(o_1|q_1,L)= b(o_1)*b(o_2)...b(o_t)....b(o_T)$

Note that $b(o_1) = P(o_1|q_1)$ the conditional probability of observation $o_1$ belongs to state $q_1$

$P(Q|L) = Uq_1*a_{q_1q_2}*a_{q_3q_4}... a_{q_{T-1}*q_T}$

Note $a_{q_1q_2}$ = probability of transition from state $q_1$ to state $q_2$ which is provided in the input

Therefore $P(O|L) = \sum_1^Q{P(O|Q,L)*P(Q|L)}$ = $\sum_{q_1}^{q_T}{(Uq_1*b(o_1))*a_{q_1q_2}...a_{q_{T-1}q_T}b(o_T))}$

Note U doesn't repeat because it is the initial probability

Kalman Filter

The states in HMM are discrete and won't apply to non-linear systems. When HMM's states are made continuous, then it will be a Kalman filter that can be used for a variety of applications including robotic walk, missile guidance, etc. Kalman filter essentially fills the gaps between the states resulting in a smooth transition from one state to another.

Critique on Bayes

In New York Times, John Allen Poulos critiques Bayes' for its steadfast adherence to prior probabilities. Where do they come from?And why should they influence our beliefs in current events? He takes the example of linkage between vaccination and autism. The presence of thimerosal in vaccines was suspected to cause autism. However, studies, based on statistics and probability theory, proved that there is no signficant linkage between the two. In other words, the conditional probability that autism caused by thimerosal is too insignificant. Let us say P(Autism|Thimerosal)=0.0000001. That is 1 in 10 million can be affected by it. Assuming a population of 5 billion who have been vaccinated, there will be 5000000000/10000000=5000 people with autism! Is that a good outcome? Some contend that there will be many more who will die from tuberculosis, rubella, etc., the diseases for which vaccines are available, if all are not vaccinated.

Another critique of Baye's can be called the inference problem. In our dyspnea diagnosis example, we stated that there are $2^N$ possibilities where N is the number of observed causes. This is a combinatorial problem. For a disease like cancer the N is very large if we assume everything that is artificially made can cause cancer. The Bayes' net worked in the dyspnea diagnosis because we cleverly modeled the net to consider the most relevant causes (like Visit to a foreign country, TB, smoking, etc.). We could have added a few more causes like pollen in the air, air pollution from chemicals, etc. With each cause the number of possibilities doubles making it impossible to compute the probabilities even by the fastest computers.

Of course, the final criticism of Bayes' is that it is as old as religion itself and it is orthodox. Bayes' friend Richard Price, a mathematician, was the actual developer of Bayes' theorem. And if it is rooted in religious beliefs, then it wouldn't have done a good job of speech recognition, search for nuclear weapons, devise actuarial tables, improve low-resolution computer images, judge the authorship of the disputed Federalist papers and determine the false positive rate of mammograms.

Summary

Bayes' probabilities give us a mathematical framework that is rooted in conditional probabilities and prior probabilities. They have predictive power, entirely based on probability theory. In Bayes' nets and decision trees, a model of the universe of interest is created and probabilities are estimated on each node. If there is sufficient evidence, then the hypothesis represented by the node is considered for further examination. We cannot compare Bayes' with any other machine learning algorithm, viz. neural nets, evolutionary programming or traditional logic, as they are based on Baye's theorem and not biologically inspired. It makes sense for statisticians and frequentists whose compilations of natural phenomena provide the grist for the Bayes'.

Online References

http://read.pudn.com/downloads134/ebook/572067/Bayesian%20Artificial%20Intelligence%20-%20Korb%20Nicholson.pdf

http://www1.se.cuhk.edu.hk/~seem5750/Lecture_5.pdf for a discusson of MYCIN certainty factors

http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/

https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/

https://www.norsys.com/tutorials/netica/secA/tut_A1.htm tutorial on Bayesian nets

http://cs229.stanford.edu/section/cs229-hmm.pdf Hidden Markov Models

https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

https://www.sciencedirect.com/topics/neuroscience/hidden-markov-models

http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf talks about bear bull market

http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/

https://www.nytimes.com/2011/08/07/books/review/the-theory-that-would-not-die-by-sharon-bertsch-mcgrayne-book-review.html

https://www2.isye.gatech.edu/~brani/isyebayes/bank/pvalue.pdf

Dakshina Murthy Gandikota's Blog

Thursday, April 19, 2018

Bayes