Saturday, April 21, 2018

Logic

The history of western logic starts with the Greeks, even though Indians and Chinese had evolved their own logic. In this chapter we stick to Greek logic starting with Aristotle. But first let us study the logic behind jihad. All major religions (Hinduism, Christianity, Islam) have the concept of heaven. Hindus created a hierarchical structure of heavens such as Indra loka (Lord Indra's abode), vaikuntham (Lord Vishnu's abode), kailasam (Lord Shiva's abode), etc. So for Hindus, heaven means a grand tour of these abodes with a stopover in each of them. The concept can be easily explained as a tourist visa. With good karma comes a tourist visa with an expiration date. The soul is free to reside in heaven until the visa expires, at which time it returns to earth (or an earth-like planet) to start all over again. The Christian and Islamic concept of heaven is much different from the Hindu one in assuring a permanent visa status in heaven. This gives some Muslims a motivation to undertake suicidal missions (called jihad) in anticipation of a permanent visa in heaven. So let us summarize:

All religions offer a visa to Heaven
Hinduism offers a temporary visa to Heaven based on law of karma
Islam and Christianity offer a permanent visa to Heaven 

So logically we can derive that

All religions must follow law of karma

There is no traditional logical validity to the above statement. That is, the logical methods like syllogism, induction, deduction, etc. are not applicable. One can call it abduction, because there is a bit of generalization with forced conversion (you can call it imposition) of the law of karma. Hindus believe that Adi Sankaracharya (788 AD) used the anviksiki and tarka schools of logic to unify the various dualist (dvaita) religions that existed in the Indian subcontinent into non-dualism (advaita). This is like a soft war. One can say the Greeks similarly converted the barbarians in Europe by teaching them logic. The famous Greek logician Aristotle was the inventor of the syllogism, which goes like this:

All men are mortal
Plato is a man
Therefore, Plato is mortal

A syllogism consists of three propositions. The first two, the premises, share exactly one term (men or man in the above example), and they logically entail the third proposition, called the conclusion, which contains the two non-shared terms of the premises (Plato and mortal).

All birds fly
Parrots are birds
Parrots fly

This seemingly correct syllogism has a false premise. Not all birds fly. Penguins, kiwis, birds with broken wings, etc. can't fly.

What do people mean when they ask: “Where is the logic in this?”

We often see people question an author with “where is the logic?” or “what is the logic you used?” The author can be a playwright or a blogger. We are expected to follow certain norms of discourse. We are supposed to “deduce” (which is a logical construct) rather than preach (based on belief or a religious text). In deduction we say:

Dogs chase Cats
Cats chase mice
Therefore dogs chase mice

In general: A implies B, B implies C, therefore A implies C. Once again this is prone to wrong implications. Consider this:

Fire starts sprinklers
Sprinklers start water-flow
Therefore Fire starts water-flow

The fallacy here is that fire and water flow have no necessary linkage: if a sprinkler malfunctions, water can flow without any fire. Consider this syllogism:

All Indians are from India
Chief Nightfox is an Indian
Therefore, Chief Nightfox is from India

The fallacy is revealed by the clarification that Chief Nightfox is a Native American Indian living on the American continent: the word “Indian” is being used in two different senses, a person from India versus a Native American.

So what is logic good for?

There are many occasions where playwrights used logic to entertain their audience. Consider the following dialogue:

Actor 1: Have you forgotten me?

Actor 2: I have forgotten all men; if you are a man, I forgot you too

Logic was used by authors like Sir Arthur Conan Doyle to give us the famous detective Sherlock Holmes, who deduced from facts who the culprit might be rather than conjecturing. In ancient India deduction was developed by the Samkhya school. Samkhya considered Pratyakṣa or Dṛṣṭam (direct sense perception), Anumāna (inference), and Śabda or Āptavacana (verbal testimony) to be the only valid means of knowledge, or pramana. Unlike some other schools, Samkhya did not consider the following three pramanas to be epistemically proper: Upamāṇa (comparison and analogy), Arthāpatti (postulation, deriving from circumstances), or Anupalabdi (non-perception, negative/cognitive proof). One can say Conan Doyle was influenced by the Samkhya.

Boolean logic was extensively used by computer scientists. Boole's logic has two interpretations: class logic and propositional logic. Boole made a distinction between "primary propositions" which are the subject of syllogistic theory, and "secondary propositions", which are the subject of propositional logic, and showed how under different "interpretations" the same algebraic system could represent both. An example of a primary proposition is "All inhabitants are either Europeans or Asiatics,” which can be represented as

(x) [ I(x) -> (E(x) v A(x)) ]

An example of a secondary proposition is "Either all inhabitants are Europeans or they are all Asiatics."

(x) (I(x) -> E(x)) v (x) (I(x) -> A(x))

These are easily distinguished in modern propositional calculus, where it is also possible to show that the first follows from the second, but there is no way of representing this in the Boolean system. But more famously Boolean logic is used in processing inputs to AND, OR, XOR, NAND, NOT, etc. gates in the computer hardware.
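To make the gate-level picture concrete, here is a minimal Python sketch of the gates just mentioned; the function names and the 0/1 encoding are illustrative choices, not any particular hardware library.

# Minimal sketch of the basic Boolean gates, with inputs and outputs encoded as 0/1.
def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return 1 - a

def NAND(a, b):
    return NOT(AND(a, b))

def XOR(a, b):
    return a ^ b

# Print a combined truth table for the gates.
print("a b AND OR XOR NAND")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b), NAND(a, b))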

Over the centuries since Aristotle, logic remained exotic, within the confines of the elites who were interested in such philosophical topics as epistemology and ontology. Only recently has first-order predicate logic gained influence in AI programs called rule-based systems, which model some area of expertise. Consider the following rules:

If systole > 130 OR diastole > 90 then conclude high BP

If systole < 110 OR diastole < 80 then conclude low BP

If high BP then conclude stroke

If low BP then conclude heart disease

We know from visual inspection that the rules should be processed from top to bottom. But how can a computer, which doesn't have a representation of top or bottom, process these rules? A number of algorithms were proposed, such as the famous Rete algorithm, to do inference based on recency, working memory and other criteria that don't have a biological basis. Since they don't withstand the test of logical scrutiny, they remain academic pursuits. We still need to understand why they have failed and what can be done to fix them.
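To see what mechanically processing such rules can look like, here is a minimal forward-chaining sketch in Python. It simply rescans the rule list until a full pass adds no new conclusion; it is emphatically not the Rete algorithm, and the thresholds are copied from the rules above.

# Naive forward-chaining over the blood-pressure rules above.
# Each rule is a (condition over working memory, conclusion) pair.
rules = [
    (lambda m: m.get("systole", 0) > 130 or m.get("diastole", 0) > 90, "high BP"),
    (lambda m: m.get("systole", 999) < 110 or m.get("diastole", 999) < 80, "low BP"),
    (lambda m: "high BP" in m["findings"], "stroke"),
    (lambda m: "low BP" in m["findings"], "heart disease"),
]

def infer(memory):
    memory.setdefault("findings", set())
    changed = True
    while changed:                      # keep scanning until nothing new is added
        changed = False
        for condition, conclusion in rules:
            if condition(memory) and conclusion not in memory["findings"]:
                memory["findings"].add(conclusion)
                changed = True
    return memory["findings"]

print(infer({"systole": 150, "diastole": 85}))   # contains 'high BP' and 'stroke'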

Historically though, the Stoic school of logic (278-206 BC) dealt with predicates that entail the conditional statement if...then, as well as AND and OR, which were precursors to Boolean logic and modern rule-based systems.

P vs. NP

If a problem can be solved in time that is a polynomial function of the input size (for example $P(n)=n^{20} + 1$), mathematicians say it is in P. If a proposed solution to a problem can be verified in polynomial time, then the problem is in NP. In other words, for an NP problem, checking a given solution must not take more than polynomial time, even if finding one might. Unfortunately, no one has yet been able to answer the question: is $P=NP$, or is $P \ne NP$?

How much memory is available to a computer program?

We often hear that memory is cheap. But really, how much memory is available? Scientists estimate there are $10^{80}$ particles in the universe. If we can store a bit of information in an electron, then that is a hard limit on the number of bits a computer has access to. Similarly, the design of circuits on which computer programs run is spatially limited by the Planck length, which is the distance traveled by light in about $10^{-44}$ seconds!

Deductive vs Inductive Reasoning

According to Ashley Crossman, deductive reasoning and inductive reasoning are two different approaches to conducting scientific research. With deductive reasoning, a researcher tests a theory by collecting and examining empirical evidence to see if it is true. With inductive reasoning, a researcher first gathers and analyzes data, then constructs a theory to explain findings.

Suppose you want to prove that all marsupials originated in Australia. You can start by locating all marsupials in Australia and checking that not even one is native to another continent. But this is a lengthy process: you would need to study every nook and corner of the earth to rule out a counterexample. After you gather enough evidence pertaining to marsupials in Australia, you can conclude by induction that marsupials indeed originated in Australia.

By nature, inductive reasoning is more open-ended and exploratory, especially during the early stages. Deductive reasoning is more narrow and is generally used to test or confirm hypotheses. Most social research, however, involves both inductive and deductive reasoning throughout the research process.

Turing Machine basis for AI

We ask if there is a procedure that, given any formal mathematical statement, would algorithmically determine whether the statement is true. Church and Turing proved there is no such procedure; Turing introduced the halting problem as a key example of a mathematical problem without an algorithmic solution. To understand the halting problem consider the pseudo-code:

Procedure A
  I = 1
  Loop while I <= 100
    print I
    I = I + 1
  End Loop

From a visual inspection we can see that the code will print integers from 1 to 100. Now let us modify it as follows:

Procedure B(Input: I)
print I
J=I+1
Call Procedure B(J)

The above is an example of a recursive procedure that will never halt. But there is no easy general way to verify or prove that. So here come the empiricists, who test algorithms by writing computer programs and seeing whether the inputs and outputs match up. The opposing camp comprises the rationalists (yes, as you have guessed, armchair philosophers), who conclude that it is better not to even try implementing a Procedure B that is so poorly coded.

Another way to understand the halting problem: say you have an app on your smartphone that frequently crashes, requiring you to reboot the device. Is there some way to check whether the app will terminate and leave your device in its original state? Fortunately there is a way, if you are able to write a PulsecheckApp that simulates your inputs on any app that you have downloaded. Only if the PulsecheckApp is satisfied that the app will not crash your device will it allow you to use it. Now a clever programmer can spoof the PulsecheckApp and make an app pass the test. A similar thing happened with Volkswagen cars, which adjusted their computers' programs when the emission check was being conducted so as to meet the federal standards. That is, the car's computer knew when the car was being tested and when to adjust the numbers. This is also called gaming the system.

Besides spoofing, Turing actually proved, well before computers had been manufactured, that you can't write a program to determine whether an arbitrary program will halt. If such a halting checker were possible, then, in simple terms, it could be fed as input to itself (wrapped so as to do the opposite of whatever it predicts), and a contradiction follows. This is similar to Russell's paradox, which says: consider the set of all sets that are not members of themselves. Such a set appears to be a member of itself if and only if it is not a member of itself.
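The self-reference at the heart of Turing's argument can be sketched in a few lines of Python. The function halts below is a hypothetical oracle assumed for the sake of argument; the point of the sketch is precisely that no real implementation of it can exist.

# Sketch of the halting-problem argument. "halts" is a hypothetical oracle
# that supposedly returns True if program(arg) eventually stops.
def halts(program, arg):
    raise NotImplementedError("assumed oracle; cannot actually be written")

def paradox(program):
    # Do the opposite of whatever the oracle predicts about
    # the program applied to its own source.
    if halts(program, program):
        while True:      # loop forever
            pass
    else:
        return           # halt immediately

# Feed paradox to itself: if halts(paradox, paradox) were True, paradox(paradox)
# would loop forever; if it were False, paradox(paradox) would halt.
# Either way the oracle is wrong, so it cannot exist.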

Some sets, such as the set of all teacups, are not members of themselves. Other sets, such as the set of all non-teacups, are members of themselves. Call the set of all sets that are not members of themselves “R.” If R is a member of itself, then by definition it must not be a member of itself. Similarly, if R is not a member of itself, then by definition it must be a member of itself.

Or put another way: suppose a village barber cuts the hair of everyone in the village who does not cut their own hair, and there is only one barber in the village. Does “everyone” include the barber himself? If he cuts his own hair, then he is cutting the hair of someone who cuts his own hair, which he must not do; if he does not, then he belongs to the very group whose hair he must cut. Either way we reach a contradiction, so it is a paradoxical statement.

Some view Turing's halting problem as an offshoot of Gödel's incompleteness theorem. Turing and Gödel were contemporaries who both spent time at Princeton, yet apparently they never met. Gödel's incompleteness theorem applies to arithmetic: there will always be some statements that can't be proven using the axioms and theorems of arithmetic. Consider the joke that in marriage 1 + 1 = 3.

No matter what the rationalists had to say, the empiricists, who would like to try things out, marched on. The Turing machine became a standard model for a general-purpose computing device. These results led to the Church–Turing thesis that any deterministic algorithm that can be carried out by a human can be carried out by a Turing machine. This is in general applicable to arithmetic, algebra, etc. But what about human visual processing, motor skills and other biological faculties for which algorithms are continually being developed and evaluated? Granted, neural nets for computer vision could be implemented on a Turing machine, as they have a halting criterion (i.e. the error falls below a threshold). But they are better suited to parallel machines, with each neuron running on its own processor and computing its output from weighted inputs passed through an activation; a general-purpose Turing machine is not the best choice for that. To understand it better, consider a Swiss Army knife versus a kitchen knife. To cut an onion you can use either one, but the kitchen knife is straightforward and doesn't need the extra step, as with the Swiss Army knife, of figuring out which of its tools can do the job.

Case-based reasoning

Case-based reasoning is a special case of analogical reasoning in which the algorithm looks for instances in the past to find the best match based on some criteria. Suppose you are acting like a doctor and have access to all the case files in a physician's office; this is a sub-plot of Steven Spielberg's movie “Catch Me If You Can”. When a new patient walks into your office, you gather the symptoms and go through the case files until you find a match. If there is no exact match, you make the best possible guess with a previous case and come up with a diagnosis. This, in essence, is the case-based reasoning approach. The program doesn't have a domain model, nor does it have any logic other than pattern matching. It manages to find a satisfying response whether you are looking for a food recipe or a legal case.
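A minimal sketch of that retrieval step, with a made-up case base and symptom names chosen purely for illustration, could look like this:

# Case-based reasoning as pattern matching: pick the past case whose
# symptoms overlap most with the new patient's symptoms.
case_base = [
    ({"fever", "cough", "fatigue"}, "flu"),
    ({"sneezing", "runny nose"}, "common cold"),
    ({"rash", "itching"}, "allergy"),
]

def best_match(symptoms):
    # Score each stored case by the number of shared symptoms
    # and return the diagnosis of the closest one.
    return max(case_base, key=lambda case: len(case[0] & symptoms))[1]

print(best_match({"fever", "cough"}))   # -> flu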

Even though case-based reasoning can be explained as pattern matching, its enormous popularity with help desks is unprecedented and has made it one of the most successful applications of AI. When you call a manufacturer about a broken dishwasher, chances are the company already has a similar incident reported by another customer and has a record of it. So they gather your description of the problem, plug it into a computer program and read out the remedy. Where do new cases come from? If they can't find a match among existing cases, they create a new one and assign it to a technician. The technician offers a solution and updates the case knowledge base, tagging it with keywords that can be used in a future search to resolve another customer's problem that is the same as yours.

Explaining inference

Experts have become what they are by being able to come up with a hypothesis to fit the data and explain why it is the best possible hypothesis. Some call it a “logical” explanation. As the old story goes, a carved image is what is left after chiseling away the stone that didn't match the image. We use all the tools in our chest, so to speak, to explain, including, if necessary, Newton's laws and Einstein's relativity. Expert testimony in courtrooms is replete with logical induction, deduction, analogical and mathematical reasoning, physical and chemical science, etc. What about AI programs? In rule-based systems the best the program can do is give a trace of all the rules that were “fired” or considered for matching with the input. With the rules given earlier, if someone wants to know what causes of stroke and heart disease there are besides BP, the expert system will have no answer, as such rules are not programmed. Further, with every new finding added to a rule's premises (left-hand side), the search process grows exponentially. Consider the following:

If

systole > 130 AND diastole > 90 AND blood-glucose > 120 AND no exercise

Then

conclude heart disease from diabetes

To test each of the premises in the above rule we need $2^4=16$ entries in the truth table, such as these (1 means the condition holds, 0 means it does not; only a partial listing is shown):

systole > 130   diastole > 90   blood-glucose > 120   exercise
1               0               0                     0
1               0               0                     1
1               0               1                     1
1               1               1                     1
1               1               0                     0
1               1               1                     0
1               0               1                     0
1               1               0                     1

With each additional symptom to consider (e.g. pulse > 100), the table doubles in size. For example, if we were to add heart rate, then we would have to account for $2^5=32$ entries. In other words, the search space grows exponentially.

But if someone wanted to know why a particular rule came into existence in the first place, the program is woefully inadequate to explain it. IBM's Watson computer is supposed to overcome this by capturing all of the knowledge that we as humans have ever created. Even if such a task is possible, it is proving to be very expensive and a massive undertaking. Imagine scanning all the knowledge ever printed on paper, capturing the folklore that is transmitted only in verbal communication, and making it available in a program the computer can process, only for it to fumble at answering the simple question: “where do the Hindu vedas come from?”.

Is Memorization and Rote Learning a component of Intelligence?

Children are taught to memorize multiplication tables, among other things, even in this age of the world-wide web, not to mention the calculators that have since been replaced by smartphones. There is a reason for that. When you are standing in a queue at a store to check out some items, it would help the others waiting in line if you can figure out the expected amount to pay rather than ask the clerk and hold everyone up. Suppose you bought 5 cartons of milk, each costing \$2.30; a quick multiplication would help you be prepared at the checkout. Why do we think the point-of-sale machines will always be right? Sometimes they add local taxes and recycling fees that over-state the bill you anticipated to be \$11.50 (for the 5 cartons of milk).

Another place rote learning helps is in the areas of computer science where certifications are required. Say an Oracle database has a maximum length for a variable of 256 characters (a limit that, by the way, is $2^8$ and made sense when lengths were stored in a single byte; on modern 64-bit machines there is no reason to keep it unless old code is being reused). We would rather remember that while programming than wait for the compiler to throw an error at us, if it ever does, so as to minimize the debugging effort. Salesforce, a popular CRM platform, is notorious for restricting the available resources because it offers a multi-tenant environment where a single resource, be it CPU, database or disk, is shared among multiple organizations. For every resource in Salesforce there is a limit, and being a certified programmer in Salesforce means knowing all about these limits.

Not so surprisingly, AI scientists have not addressed rote learning itself. It is not about whether they have done it themselves, but whether they know how it is done. Eidetic memory may have something to do with it.

Muslim children studying in madrasas are prime examples of how rote learning applies to religious texts. Almost all priests have eidetic memory, without which they could not possibly recite the ancient texts accurately. Scientific studies done on vedic pundits showed that when they missed a phoneme, they would restart from the beginning, possibly because they didn't know at how many other places they had missed it. Kekulé, who proposed the ring structure of benzene, was said to have imagined it while riding on a bus. It seems there are several people amongst us with extraordinary memory who are not categorized as literate based on prevailing academic standards. This makes a strong case that memorization is an evolutionary strategy to cope with increased information, or information explosion, for those times when retrieving from memory on demand, what is sometimes called lazy loading, is either too time consuming or simply tiring.

Decision Trees

Part of any good learning involves abstracting classes from a data set. Suppose we have the following data (not all rows shown, for brevity) about our decision to go to the gym or jog outside. We would like to create a learner based on the data to help us decide whether to jog or go to the gym. Central to the classifier is the concept called entropy. If the rows are equally divided between jog and gym, then the entropy is one. If all the rows map to either jog or gym, then the entropy is zero. This is the case for a binary classification.

The algorithm itself can very simply be given as:

  • compute the entropy for data-set
  • for every attribute/feature:
    1. calculate entropy for all categorical values
    2. take average information entropy for the current attribute
    3. calculate gain for the current attribute
  • pick the highest gain attribute
  • Repeat until we get the tree we desired
Sunny   Snow    Wind    Temperature   Humidity   Gym     Jog
False   True    False   False         True       True    False
False   True    False   True          True       True    False
False   True    False   True          True       False   True
False   True    False   True          True       False   False

Suppose we want to know if we should go to the gym today; we look for rows where Gym=True and Gym=False; let their counts be 30 and 98 respectively.

$-P(Gym=True)*\log_2 P(Gym=True) = -{30 \over {30 + 98}}*\log_2({30 \over {30 + 98}})=0.49$ and $-P(Gym=False)*\log_2 P(Gym=False) = -{98 \over {30 + 98}}*\log_2({98 \over {30 + 98}})=0.29$

Entropy $E(S)=\sum_c {-p(c)*\log_2 p(c)}$, where c ranges over true and false, so $E(S) = 0.49 + 0.29 = 0.78$

$f_{sunny=true, gym=true}$ = frequency of occurrence of (Sunny=True, Gym=True) = 14

$f_{sunny=true, gym=false}$ = frequency of occurrence of (Sunny=True, Gym=False) = 16

Entropy $E(Weather=Sunny) = -{14 \over {14+16}}*\log_2({14 \over {14+16}}) - {16 \over {14+16}}*\log_2({16 \over {14+16}}) = 0.99$. Similarly, $E(Weather=Rain) = -{15 \over {15+15}}*\log_2({15 \over {15+15}}) - {15 \over {15+15}}*\log_2({15 \over {15+15}})=1.00$

$E(Weather=Snow) = -{7 \over {7+23}}*\log_2({7 \over {7+23}}) - {23 \over {7+23}}*\log_2({23 \over {7+23}}) = 0.78$

Average entropy information for weather: $I(Weather) = {{30} \over {128}}*0.99 + {{30} \over {128}}*1.00 + {{30} \over {128}}*0.78=0.65$ (each of the three weather values covers 30 of the 128 rows)

$Gain(Weather) = E(S) - I(Weather) = 0.78 - 0.65 = 0.13$

$f_{gym=true,snow=True}$ = frequency of occurrence of (Gym=True and Snow=True) = 7

$f_{gym=true,snow=False}$ = frequency of occurrence of (Gym=True and Snow=False) = 23

Therefore, $E(Gym=True|Snow) = -{7 \over {7 + 23}}*\log_2({7 \over {7 + 23}}) - {23 \over {7+23}}*\log_2({23 \over {7+23}}) = 0.78$

$f_{gym=false,snow=True}$ = frequency of occurrence of (Gym=False and Snow=True) = 25

$f_{gym=false, snow=False}$ = frequency of occurrence of (Gym=False and Snow=False) = 73

Therefore $E(Gym=False|Snow) = -{25 \over {25 + 73}}*\log_2({25 \over {25+73}}) - {73 \over {25+73}}*\log_2({73 \over {25+73}})=0.82$

Average entropy information $I(Gym) = {{7+23} \over {128}}*0.78 + {{25+73} \over {128}}*0.82=0.81$

$Gain(Gym) = E(S) - I(Gym) = 0.78 - 0.81 = -0.03$

And so on and so forth we calculate the gain for each of the attributes. With a bit of arithmetic, our decision tree might look like this:

The interesting thing about the tree is that we can now create rules by traversing from the top to the leaf nodes, such as these:

  • If weather=snowy and temperature=low then go to gym
  • If weather=snowy and temperature=high then go to jog
  • If weather=rainy and temperature=low then go to gym
  • If weather=rainy and temperature=high then go for jog
  • If weather=sunny and temperature is high then go for jog
  • If weather=sunny and temperature is low then go to gym
  • If weather=sunny and humidity is high then go to gym
  • If weather=sunny and humidity is low then go for jog

Unlike biologically inspired neural nets and evolutionary programs, a decision tree lends itself to a detailed description of how it arrived at a conclusion starting from a set of facts. While this is its most appealing aspect, the downside is managing the complexity. As shown above, the entropy and information gain calculations come with a computational complexity that is combinatorial. If we throw in another option, say go ski if the weather is snowy and the temperature is low, then we have to redo the whole procedure.

One has to admit this example is a little bit contrived: we started with a state space that had these rules built in to begin with. The idea is to demonstrate the procedure for creating the decision tree using entropy and information gain.
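For readers who want to see the entropy and information-gain steps as code, here is a minimal Python sketch. The four toy rows are made up for illustration and are not the 128-row dataset behind the worked numbers above.

from collections import Counter
from math import log2

# Toy dataset: decide Gym (the class) from three attributes.
rows = [
    {"Sunny": False, "Snow": True,  "Temperature": "low",  "Gym": True},
    {"Sunny": False, "Snow": True,  "Temperature": "high", "Gym": False},
    {"Sunny": True,  "Snow": False, "Temperature": "high", "Gym": False},
    {"Sunny": True,  "Snow": False, "Temperature": "low",  "Gym": True},
]

def entropy(rows, target="Gym"):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows, attribute, target="Gym"):
    # Information gain = dataset entropy minus the size-weighted
    # average entropy of the subsets produced by splitting on attribute.
    total = len(rows)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r for r in rows if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

for attr in ("Sunny", "Snow", "Temperature"):
    print(attr, round(gain(rows, attr), 3))
# Temperature has the highest gain here, so it would become the root of the tree.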

In Conclusion

We know that logic plays a vital role in our daily interactions with people. Logic was supposed to have led to the development of arithmetic and algebra but never achieved their rigor. Since Aristotle's time the development of logic has happened in spurts. Gödel's incompleteness theorems and other paradoxes are worth studying but won't by themselves make us better at anything other than logic. AI has a branch called natural language processing that posits the existence of a universal grammar. It was thought that the ancient language Sanskrit might provide the basis for it, at least for the Indo-European family of languages. Some linguists like Noam Chomsky believe language is innate. Others believe it is nurture, and that competence in any natural language can be achieved without prior exposure. Children make the best test subjects for the memorization that helps them cope with the information explosion. Rote learning forms a major component of religious learning. If computers are meant to learn the way we do, then they need to be equipped with all of these facets to pass the Turing test.

Online References

http://math.wikia.com/wiki/History_of_logic

https://plato.stanford.edu/entries/logic-ancient/

http://www.individual.utoronto.ca/pking/miscellaneous/history-of-logic.pdf

https://en.wikipedia.org/wiki/Samkhya

https://en.wikipedia.org/wiki/Rete_algorithm

https://www.thoughtco.com/deductive-vs-inductive-reasoning-3026549

http://language.worldofcomputing.net/tag/chomsky

https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1

Thursday, April 19, 2018

Bayes

“It is likely to rain” is a common refrain in the rainy season. What does it really mean, though? One answer is that there is a high probability that rain will occur, where probability is a measure of prior occurrences of rain given the atmospheric conditions. But where does probability come from? Some opine that it is based on frequency; they are called frequentists. For example, everyone understands coin flips. The probabilities of heads and tails on an unbiased coin flip are equal. There are two assumptions here: (1) a coin can land on either heads or tails but not on its curved surface, and (2) the probabilities add up to 1. The probability of an outcome is always a non-negative real number no greater than 1. With an unbiased coin, the probabilities of heads and tails should be the same (0.5). In reality a coin is slightly biased and, flipped enough times, will favor either heads or tails. The number of times a coin lands on heads or tails is called its frequency. In other words, probability in most cases is based on the frequency of occurrences. It is interesting to note that in physics, when light is viewed as a stream of photons spinning in circles, the reflection from a surface is based on probability. Prof. Feynman talked about viewing photon motion as a spinning arrow, with the reflection probability based on the square of the length of the arrow connecting the tail of the arriving photon's arrow to the head of the reflected photon's arrow, also called the square of the amplitude, because the probabilistic region is a circle (whose area is proportional to the square of its radius) with the radius given by the amplitude. So going back to the statement that “it is likely to rain”: it can be based on the speaker's observations of prior rainy days. An informed speaker takes into account all of the variables that predict rain: presence of clouds, prevailing winds, season of the year, etc.

Most often we don't have frequency measurements on which to base the probability. For instance, the probability of fever cannot exist in isolation because fever is often the symptom of underlying causes. Flu can cause fever, and so can the common cold. So the probability of fever is conditional on the underlying causes. If fever is treated unconditionally, then its probability could be based on frequency: suppose one can say that 20% of the population, at any given time, is afflicted with fever. That is the prior probability of fever, P(Fever)=0.2. Similarly the prior probability of flu is given as, say, P(Flu)=0.3. If the conditional probability of fever given flu, P(Fever|Flu), is known, then P(Fever) and P(Flu) should be related to it. This is the leap of faith Reverend Bayes, an 18th century clergyman, made. Using his theorem we can say P(Flu|Fever) = P(Flu)*P(Fever|Flu)/P(Fever). This doesn't seem like a great way to predict flu given fever. As already stated, fever has many causes and flu is just one of them. The fever could manifest because of microbes attacking the immune system or the temperature of the surroundings, not to mention stress and work activity. Therefore, a physician might not adhere to Bayes' theorem as religiously as Rev. Bayes would like.

A Classical Example of Bayes' Theorem

Stated in words, Bayes' theorem says that we start with a provisional hypothesis and its prior probability; then, in light of new evidence, we revise the probability of the hypothesis, and the revised value is called the posterior probability. To illustrate, let's suppose that we were getting tested for HIV. Most medical tests have four outcomes:

  • True Positive: the test is positive and the patient does have HIV
  • True Negative: the test is negative and the patient does not have HIV
  • False Positive: the test is positive but the patient actually doesn't have HIV
  • False Negative: the test is negative even though the patient has HIV

The test manufacturer typically gives the accuracy, say 99.5%. This is the first conditional probability we have, i.e. P(+ve|HIV)=0.995. Likewise P(-ve|No HIV)=0.995, the conditional probability that the test is negative given that the patient doesn't have HIV. Knowing that the HIV prevalence in the general population is 2%, which is the prior probability P(HIV), using Bayes' theorem we can say:

${P(Cause|Effect)} = {{P(Cause)*P(Effect | Cause)} \over P(Effect)}$

${P(HIV|+ve)} = {{P(+ve|HIV) * P(HIV)} \over P(+ve) }$

Substituting the known probabilities:

${P(HIV|+ve)} = {{0.995 * 0.02} \over P(+ve)}$

The denominator P(+ve) is the overall probability of the test being positive, which covers both the True Positive and False Positive cases. Therefore:

$P(+ve) = P(+ve|HIV)*P(HIV) + P(+ve|No HIV)*P(No HIV)$

We know the prior probability of P(HIV)=0.02. So P(No HIV)= 1-0.02=0.98 because the probabilities have to add up to 1.

Similarly, P(+ve|No HIV) = 1 – P(-ve|No HIV) = 1-0.995=0.005

We already know from the manufacturer P(+ve|HIV)=0.995

Substituting

${P(HIV|+ve)} = {{0.995 * 0.02} \over {0.995 * 0.02 + 0.005 * 0.98}}$ = 0.8024 or 80.24%

The patient potentially has a better outlook than feared, because the conditional probability of HIV given a positive test result, about 80%, is much lower than the advertised 99.5% accuracy.
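The arithmetic above can be reproduced in a few lines of Python, which makes it easy to try other prevalences and test accuracies:

# Bayes' theorem for the HIV test example above.
sensitivity = 0.995      # P(+ve | HIV), from the test manufacturer
specificity = 0.995      # P(-ve | No HIV)
prior = 0.02             # P(HIV) in the general population

p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)   # P(+ve)
posterior = sensitivity * prior / p_positive                         # P(HIV | +ve)

print(round(posterior, 4))   # 0.8024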

Certainty Factors

The traditional alternative to the concept of physical probability, or what the frequentists describe, is to think of probabilities as reporting our subjective degrees of belief. This is what Bayesians prefer for the prior probabilities. Modern computer scientists take it further by adding disbelief. In the '80s and '90s, when medical knowledge was codified as rules in knowledge-based systems (also called expert systems), the contribution of new evidence to a hypothesis was given as a certainty factor. The reason certainty factors were developed has to do with the difficulties of the Bayesian method. Bayes' theorem's accuracy in predicting likelihoods depends on knowing prior probabilities. For example, to determine the probability of a disease:

${P(D_i| E)} = {{P(E | D_i) * P(D_i)} \over \sum{(P(E | D_j)*P(D_j))}}$

Where

  • $D_i$ is the i'th disease,
  • E is the evidence,
  • $P(D_i)$ is the prior probability of the patient having the Disease i before any evidence is known
  • $P(E | D_i)$ is the conditional probability that the patient will exhibit evidence E, given that disease $D_i$ is present

If we are able to calculate $P(D_i | E)$ somehow, then when incremental evidence is presented the equation turns more complicated, like this:

${P(D_i | E)} = {{P(E_2 | D_i \bigcap E_1) * P(D_i | E_1)} \over \sum_j{(P(E_2 | D_j \bigcap E_1) * P(D_j | E_1))}}$

where $E_2$ is the new evidence added to yield the new augmented evidence:

$E = E_1 \bigcap E_2$

In general if we are dealing with N symptoms with each symptom having the values true or false, there are $2^N$ possibilities. This is called a combinatorial problem by the computer scientists.

Certainty factors, in contrast, enable scientists to elicit degrees of belief and disbelief directly from human experts:

  • 0 means no evidence
  • values greater than 0 favor the hypothesis
  • values less than 0 favor the negation of the hypothesis

$CF = {{MB - MD} \over {1 - min(MB, MD)}}$

Where

  • CF = certainty factor
  • MB =measure of belief
  • MD=measure of disbelief
When two rules support the same hypothesis, the combined certainty factor is

CF = CF1 + CF2*(1 - CF1) if both CF1 and CF2 > 0

= (CF1 + CF2) / (1 - min(|CF1|, |CF2|)) if CF1 and CF2 have opposite signs

= CF1 + CF2*(1 + CF1) if both CF1 and CF2 < 0

CFs were found to be convenient in enabling computer programs to diagnose diseases using rules to supplant human physicians. One criticism of CFs is that while they capture the increased belief in a hypothesis given evidence, they assume that belief and disbelief are independent of each other. For example, one expert might say the presence of clouds raises the belief in imminent rain to, say, 0.6; another expert might say that if the clouds are cirrus there is a disbelief of, say, 0.8. So the experts are inconsistent, if not contradicting one another.
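A minimal sketch of combining two certainty factors that bear on the same hypothesis, using the combination rules listed above and the clouds-versus-cirrus numbers from this example:

# Combine two certainty factors for the same hypothesis.
def combine(cf1, cf2):
    if cf1 > 0 and cf2 > 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))   # opposite signs

# Clouds raise belief in rain to 0.6; cirrus clouds contribute a disbelief of 0.8.
print(round(combine(0.6, -0.8), 2))   # -0.5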

Naive Bayes

Let us illustrate Naive Bayes with an example. Suppose we want to classify emails as Spam or No-Spam. The Spam detection is based on the occurrences of words like “free”, “won”, “lottery”, etc. Using Bayes' theorem we can write:

$P(Spam | w_1, w_2...w_n) = P(Spam) * P(w_1,w_2...w_n | Spam) / P(w_1, w_2...w_n)$ where w represents a word.

$P(w_1,w_2....w_n | Spam) = P(Spam | w_1, w_2...w_n) * P(w_1, w_2...w_n) / P(Spam)$

The left-hand side can be further expanded by the chain rule to: $P(w_1 | w_2...w_n, Spam) * P(w_2 | w_3...w_n, Spam)...P(w_n | Spam)$

Then we make the assumption that $P(w_1 | w_2...w_n,Spam) = P(w_1 | Spam)$ which is central to the Naive Bayes.

That is, we assume the occurrence of word $w_1$ is independent of the occurrences of rest of the words. So “won lottery” is interpreted as two spam words “won” and “lottery” even though they occur together in the email.

Therefore, $P(Spam | w_1, w_2...w_n) = P(Spam)*P(w_1 | Spam)*P(w_2 | Spam)...P(w_n | Spam) / P(w_1,w_2...w_n)$

To calculate $P(w_1 | Spam)$ we can use the frequency of occurrence of the word $w_1$ in spam emails. As for the denominator, a lot of applications of Naive Bayes ignore it, making the final form

$P(Spam | w_1, w_2...w_n) \propto P(Spam)*P(w_1 | Spam) * P(w_2 | Spam)...P(w_n | Spam)$

In the floating-point world of computers, each of the probabilities on the right-hand side is so small (each less than 1) that their product may underflow and be rounded off to zero! To overcome this, we take the logarithm of both sides:

$\log P(Spam | w_1, w_2...w_n) \propto \log P(Spam) + \sum_{i=1}^n{\log P(w_i | Spam)}$

What if $P(w_i | Spam)=0$? To avoid this, we use Laplace smoothing, i.e. we assume that each word occurs at least once.
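Putting the pieces together, here is a minimal Naive Bayes sketch using Laplace smoothing and log-probabilities. The tiny training corpus and the equal priors are assumptions made purely for illustration.

from collections import Counter
from math import log

spam_docs = ["free lottery won", "won free prize"]
ham_docs  = ["meeting at noon", "project status report"]

def train(docs):
    counts = Counter(word for d in docs for word in d.split())
    total = sum(counts.values())
    vocab = len(counts)
    # Laplace smoothing: pretend every word occurs at least once.
    return lambda w: (counts[w] + 1) / (total + vocab)

p_word_spam = train(spam_docs)
p_word_ham  = train(ham_docs)
p_spam, p_ham = 0.5, 0.5            # assumed equal priors

def classify(text):
    words = text.split()
    log_spam = log(p_spam) + sum(log(p_word_spam(w)) for w in words)
    log_ham  = log(p_ham)  + sum(log(p_word_ham(w))  for w in words)
    return "spam" if log_spam > log_ham else "not spam"

print(classify("you won a free lottery"))   # -> spam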

Bayes Net

A Bayes Net is a model of the world we would like to make predictions about. Consider the following causalities for Dyspnea (shortness of breath)

Suppose we ask what is the probability that Dyspnea is caused by a trip to a foreign country? For that we need to traverse the following subtree of the Bayes net:

Visit to Foreign Country->Tuberculosis->Dyspnea

If we know that the patient has TB, then the prior visit, if any, is irrelevant. So we have:

P(Dyspnea | Visit ^ TB) = P(Dyspnea | TB)

Cancer and Bronchitis have a common cause in Smoking.

P(Bronchitis | Cancer ^ Smoking) = P(Bronchitis | Smoking)

This is easier to understand since Bronchitis has no direct causal link to Cancer. We could've also written

P(Cancer | Bronchitis ^ Smoking) = P (Cancer | Smoking)

A common effect is when Dyspnea can be caused by either Bronchitis or Cancer:

P (Dyspnea | Cancer ^ Bronchitis) $\ne$ P(Dyspnea | Cancer) $\ne$ P(Dyspnea | Bronchitis)

One observation we can make here is that if dyspnea is observed and cancer is ruled out, then the probability of bronchitis given the evidence increases; this is often called the explaining-away effect.

Most commonly, Bayesian nets are considered to be representations of joint probability distributions. Sparse Bayesian networks (those with relatively few arcs, which means few parents for each node) represent probability distributions in a computationally tractable way.

Modeling this way from first principles is the preferred approach in various applications of Bayesian nets. Obviously it is a tedious procedure to create a Bayes net, especially in a world where everything is inter-connected and prior probabilities don't exist. After all, the statement that a butterfly flapping its wings in Asia could cause a tornado in the US midwest is not far-fetched for Bayesians.
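As a concrete illustration of the Visit -> Tuberculosis -> Dyspnea chain from earlier, here is a minimal sketch that spells out the joint distribution and answers queries by brute-force enumeration. All the numbers in the conditional probability tables are made up; the output simply confirms that once TB is known, the visit adds nothing.

from itertools import product

# Tiny Bayes net for the chain Visit -> TB -> Dyspnea.
# The joint distribution factorizes as P(V) * P(T|V) * P(D|T).
P_V = {True: 0.01, False: 0.99}                       # P(Visit)
P_T_given_V = {True: {True: 0.05, False: 0.95},       # P(TB | Visit)
               False: {True: 0.01, False: 0.99}}
P_D_given_T = {True: {True: 0.70, False: 0.30},       # P(Dyspnea | TB)
               False: {True: 0.10, False: 0.90}}

def joint(v, t, d):
    return P_V[v] * P_T_given_V[v][t] * P_D_given_T[t][d]

def prob(query_var, query_val, evidence):
    # P(query | evidence) by enumerating the full joint distribution.
    num = den = 0.0
    for v, t, d in product([True, False], repeat=3):
        world = {"Visit": v, "TB": t, "Dyspnea": d}
        if all(world[k] == val for k, val in evidence.items()):
            p = joint(v, t, d)
            den += p
            if world[query_var] == query_val:
                num += p
    return num / den

# P(Dyspnea | Visit, TB) equals P(Dyspnea | TB): the visit is irrelevant once TB is known.
print(round(prob("Dyspnea", True, {"TB": True, "Visit": True}), 3))   # 0.7
print(round(prob("Dyspnea", True, {"TB": True}), 3))                  # 0.7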

Decision Tree

A decision tree is a special case of a Bayes net where at each node we ask a question like “Does the patient smoke?” If true, we ask another question: “Has the patient contracted TB or cancer?” If the answer to the first question is false, then we pursue the foreign visit: “Has the patient visited a foreign country?” And so on.

Given these 6 variables we have $2^6$ or 64 possibilities as follows:

Visit   Smoking   Cancer   TB   Bronchitis   Dyspnea
1       0         0        0    0            0
1       0         0        0    0            1
1       1         0        0    0            1

A decision tree makes it easier to handle this truth table by pruning some subtrees based on how the patient answers the questions. For example, if a patient never smoked, then we can eliminate bronchitis from consideration. A variation of the decision tree for classification was given as a generic task. The idea is that (a) we manage complexity by hierarchically representing hypotheses, and (b) we can prune branches in the hierarchical structure based on evidence. An algorithm called establish-refine has been suggested: only when we establish the probability or truth of a hypothesis do we refine it by entertaining all of the sub-hypotheses connected to it. It is well known that rule-based systems that use certainty factors adopt a similar strategy, but there it is implicit in the way rules are processed. For example, the rules for bronchitis will fire only after it has been established that the patient smokes. In the generic task, the establish process can be as complex as one cares to make it. To establish smoking, one can ask “how many packs a day”, “any exposure to secondary smoke”, “any cigars”, and so on.

Markov Chains

A Markov chain is a state-transition diagram subject to the rules of probability. It assumes that, no matter which states were traversed in the past, the set of possible future states is fixed. In other words, the current state alone determines the probability of transitioning to a future state.

For example, if you made a Markov chain model of an infant's activities, you might include “playing”, “eating”, “sleeping” and “crying” as states, which with other activities could form a 'state space': a list of all possible states. In addition to the states, a Markov chain gives you the probability of transitioning from one state to any other state, e.g., the chance that an infant currently playing will fall asleep in the next five minutes without crying first.

Consider another example based on web surfing. No matter what pages you have browsed earlier to arrive at the current page, your ability to follow links on the current page determines the future state with the assumption that you are not entering an all together new link in the browser.

In a paper written in 1913, Markov chose a sequence of 20,000 letters from Pushkin's Eugene Onegin to see if this sequence can be approximately considered a simple chain. He obtained the Markov chain with transition matrix:

$$ \begin{pmatrix} 0.128 & 0.872 \\ 0.663 & 0.337 \end{pmatrix} $$ The transition probabilities are given as

  • vowel->vowel = .128
  • vowel->consonant=.872
  • consonant->vowel=.663
  • consonant->consonant=.337

Note that the probabilities on each row add up to 1.

The fixed vector for this chain is (.432, .568), indicating that we should expect about 43.2 percent vowels and 56.8 percent consonants in the novel, which was borne out by the actual count.
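The fixed vector is easy to verify numerically: repeatedly applying the transition matrix to any starting distribution converges to it. A minimal sketch using numpy:

import numpy as np

# Rows and columns are ordered (vowel, consonant) as in the matrix above.
P = np.array([[0.128, 0.872],
              [0.663, 0.337]])

v = np.array([1.0, 0.0])           # start entirely in the "vowel" state
for _ in range(50):                # power iteration: v <- v P
    v = v @ P

print(np.round(v, 3))              # [0.432 0.568]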

Claude Shannon considered an interesting variation of this idea in his book The Mathematical Theory of Communication. Shannon considers a series of Markov chain approximations to English prose. He does this first by chains in which the states are letters and then by chains in which the states are words. For example, for the case of words he presents first a simulation where the words are chosen independently but with appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

If this looks like a drunkard's rambling, consider the following:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

It sounds almost like valid text because the words were chosen as a Markov chain. As to how the text was generated: the simulation is carried out by opening a book and choosing the first word, say it is “the”. The book is then read until the word “the” appears again, and the word after it is chosen as the second word, which turned out to be “head”. The book is then read until the word “head” appears again and the next word, “and”, is chosen, and so on.
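The same word-level construction can be sketched in a few lines of Python. The short sample text below stands in for the book Shannon opened; it is made up for illustration.

import random
from collections import defaultdict

# After emitting a word, pick the next word from the words that follow it
# somewhere in the source text (a first-order word-level Markov chain).
text = ("the head of the line came to the front and the head and the "
        "character of this point is the problem for the letters").split()

follows = defaultdict(list)
for current, nxt in zip(text, text[1:]):
    follows[current].append(nxt)

word = "the"
output = [word]
for _ in range(15):
    word = random.choice(follows[word]) if follows[word] else random.choice(text)
    output.append(word)

print(" ".join(output))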

Mathematically, a Markov chain is a collection of random variables ${X_t}$ where the index t runs through (0, 1, ...) having the property that, given the present, the future is conditionally independent of the past. In other words,

$P(X_t=j |X_0=i_0,X_1=i_1,...,X_{t-1}=i_{t-1})=P(X_t=j|X_{t-1}=i_{t-1})$

If a Markov sequence of random variates $X_n$ takes the discrete values $a_1, ..., a_N$, then

$P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}},...,x_1=a_{i_1})=P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}})$

Notice the similarity with Naive Bayes where we didn't need all of the probabilities in the spam/no-spam classification.

Hidden Markov Model

HMMs are yesterday's neural nets. So far their best-known application is speech recognition, where it is required to map a set of sounds to a set of written words. What we now take for granted in any phone system, where the navigation menu is driven by spoken words (“yes” or “no” or something else), is made possible by HMMs.

Speech recognition aside, let us say we have a model of a stock market that has these 3 states: bull, bear or even. And we monitor our stock portfolio as: up, down or even. Suppose we have untethered ourselves from the modern world for 3 days and would like to know the probability of our portfolio moving from up to down to up (since we don't expect it to be up all the time). How do we go about reasoning about it? This is where HMMs come in handy.

Consider another situation where accurate temperature records from a century ago are not available. All we know is whether each year was hot or cold. We also have access to tree rings, which form with each passing year and vary in size with the weather conditions, categorized as small, medium and large. Suppose we want to know the climatic conditions in which the tree rings observed as small-medium-small-large were formed. Once again the answer is HMMs.

Put succinctly, an HMM is one in which you observe a sequence of observations (called emissions), but do not know the sequence of states the model went through to generate the emissions. Analyses of hidden Markov models seek to recover the sequence of states from the observed data. Besides speech recognition, HMMs are frequently used for the statistical analysis of multiple DNA sequence alignments.

HMM Math

S = states= $\{s_1, s_2,,,s_n\}$

V=observations =$\{v_1, v_2...v_m\}$

Q=fixed state sequence of length T and O the corresponding observations

$Q=q_1,q_2,,,q_T$

$O=o_1,o_2...o_T$

A=transition array storing the probability of state j following state i

$A=[a_{ij}]$ where $a_{ij}=P(q_t=s_j|q_{t-1}=s_i)$

B = observation array storing the probability of observation k being produced from state i, independent of t

$B=[b_i(k)]$ where $b_i(k)=P(o_t=v_k|q_t=s_i)$

U = initial probability array = $[u_i]$ where $u_i=P(q_1=s_i)$

We make the following assumptions:

  • Markov Assumption: current state is dependent only on the previous state
  • Independence Assumption: the output observation $o_t$ at time t depends only on the current state $q_t$

Given L=model

$P(O|Q,L) = P(o_1|q_1,L)*P(o_2|q_2,L)...P(o_t|q_t,L)...P(o_T|q_T,L)$

This is simply $P(O|Q,L)= b(o_1)*b(o_2)...b(o_t)...b(o_T)$, where each $b$ is evaluated at the corresponding state $q_t$

Note that $b(o_1) = P(o_1|q_1)$, the conditional probability that observation $o_1$ is produced in state $q_1$

$P(Q|L) = U_{q_1}*a_{q_1q_2}*a_{q_2q_3}...a_{q_{T-1}q_T}$

Note $a_{q_1q_2}$ = probability of transition from state $q_1$ to state $q_2$ which is provided in the input

Therefore $P(O|L) = \sum_Q{P(O|Q,L)*P(Q|L)} = \sum_{q_1...q_T}{U_{q_1}*b(o_1)*a_{q_1q_2}*b(o_2)...a_{q_{T-1}q_T}*b(o_T)}$

Note U doesn't repeat because it is the initial probability
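The summation above can be implemented directly by enumerating every state sequence, which is feasible only for short sequences (the forward algorithm avoids this exponential enumeration, but the brute-force version matches the formula as written). The two-state hot/cold weather model, the tree-ring observations and all of the numbers below are made-up assumptions.

from itertools import product

states = ["Hot", "Cold"]
U = {"Hot": 0.6, "Cold": 0.4}                               # initial probabilities
A = {"Hot": {"Hot": 0.7, "Cold": 0.3},                      # transition probabilities
     "Cold": {"Hot": 0.4, "Cold": 0.6}}
B = {"Hot": {"small": 0.1, "medium": 0.4, "large": 0.5},    # emission probabilities
     "Cold": {"small": 0.7, "medium": 0.2, "large": 0.1}}

O = ["small", "medium", "small", "large"]                   # observed tree rings

total = 0.0
for Q in product(states, repeat=len(O)):                    # every state sequence
    p = U[Q[0]] * B[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
    total += p

print(total)                                                # P(O | L)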

Kalman Filter

The states in an HMM are discrete and won't apply to systems whose state varies continuously. When the HMM's states are made continuous (with linear dynamics and Gaussian noise), we get the Kalman filter, which can be used for a variety of applications including robotic walking, missile guidance, etc. The Kalman filter essentially fills the gaps between states, resulting in a smooth transition from one state to another.
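A minimal one-dimensional sketch conveys the idea: each step blends a prediction with a noisy measurement, weighted by the Kalman gain. The process noise, measurement noise and readings below are all made up for illustration.

# One-dimensional Kalman filter: estimate a slowly drifting value from noisy readings.
process_var = 1e-3        # how much the true value is expected to drift per step
measure_var = 0.5         # how noisy each measurement is

estimate, error = 0.0, 1.0                   # initial guess and its uncertainty
for z in [1.2, 0.9, 1.1, 1.3, 0.8, 1.0]:     # fake noisy readings around 1.0
    error += process_var                     # predict: uncertainty grows
    gain = error / (error + measure_var)     # update: compute the Kalman gain
    estimate += gain * (z - estimate)        # blend prediction and measurement
    error *= (1 - gain)
    print(round(estimate, 3), round(gain, 3))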

Critique on Bayes

In The New York Times, John Allen Paulos critiques Bayes' theorem for its steadfast adherence to prior probabilities. Where do they come from? And why should they influence our beliefs about current events? He takes the example of the alleged linkage between vaccination and autism. The presence of thimerosal in vaccines was suspected to cause autism. However, studies based on statistics and probability theory showed that there is no significant linkage between the two. In other words, the conditional probability that autism is caused by thimerosal is insignificant. Let us say P(Autism|Thimerosal)=0.0000001, that is, 1 in 10 million can be affected by it. Assuming a population of 5 billion who have been vaccinated, there would be 5,000,000,000/10,000,000 = 500 people with autism! Is that a good outcome? Some contend that many more would die from tuberculosis, rubella, etc., the diseases for which vaccines are available, if everyone were not vaccinated.

Another critique of Bayes' can be called the inference problem. In our dyspnea diagnosis example, we stated that there are $2^N$ possibilities where N is the number of observed causes. This is a combinatorial problem. For a disease like cancer, N is very large if we assume that everything artificially made can cause cancer. The Bayes net worked in the dyspnea diagnosis because we cleverly modeled the net to consider only the most relevant causes (like a visit to a foreign country, TB, smoking, etc.). We could have added a few more causes, like pollen in the air or air pollution from chemicals. With each added cause the number of possibilities doubles, making it impossible to compute the probabilities even with the fastest computers.

Of course, the final criticism of Bayes' is that it is as old as religion itself and that it is orthodox. Bayes' friend Richard Price, a mathematician, was arguably the actual developer of Bayes' theorem. But if the method were merely rooted in religious beliefs, it wouldn't have done such a good job of recognizing speech, searching for lost nuclear weapons, devising actuarial tables, improving low-resolution computer images, judging the authorship of the disputed Federalist papers and determining the false positive rate of mammograms.

Summary

Bayes' probabilities give us a mathematical framework that is rooted in conditional probabilities and prior probabilities. They have predictive power based entirely on probability theory. In Bayes nets and decision trees, a model of the universe of interest is created and probabilities are estimated at each node. If there is sufficient evidence, then the hypothesis represented by a node is considered for further examination. It is hard to compare Bayes' methods with machine learning algorithms such as neural nets and evolutionary programming, since Bayesian methods are based on Bayes' theorem rather than being biologically inspired. They make sense for statisticians and frequentists whose compilations of natural phenomena provide the grist for the Bayesian mill.

Online References

http://read.pudn.com/downloads134/ebook/572067/Bayesian%20Artificial%20Intelligence%20-%20Korb%20Nicholson.pdf

http://www1.se.cuhk.edu.hk/~seem5750/Lecture_5.pdf for a discussion of MYCIN certainty factors

http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/

https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/

https://www.norsys.com/tutorials/netica/secA/tut_A1.htm tutorial on Bayesian nets

http://cs229.stanford.edu/section/cs229-hmm.pdf Hidden Markov Models

https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

https://www.sciencedirect.com/topics/neuroscience/hidden-markov-models

http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf talks about bear bull market

http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/

https://www.nytimes.com/2011/08/07/books/review/the-theory-that-would-not-die-by-sharon-bertsch-mcgrayne-book-review.html

https://www2.isye.gatech.edu/~brani/isyebayes/bank/pvalue.pdf