Saturday, April 21, 2018

Logic

The history of Western logic starts with the Greeks, even though the Indians and the Chinese developed logics of their own. In this chapter we stick to Greek logic, starting with Aristotle. But first let us study the logic behind jihad. All major religions (Hinduism, Christianity, Islam) have the concept of heaven. Hindus created a hierarchical structure of heavens, such as Indra loka (Lord Indra's abode), Vaikuntham (Lord Vishnu's abode), Kailasam (Lord Shiva's abode), etc. So for Hindus, heaven means a grand tour of these abodes with a stopover in each of them. The concept can easily be explained as a tourist visa: with good karma comes a tourist visa with an expiration date. The soul is free to reside in heaven until the visa expires, at which time it returns to earth (or an earth-like planet) to start all over again. The Christian and Islamic concept of heaven differs greatly from the Hindu one by assuring a permanent visa status in heaven. This gives some Muslims a motivation to undertake suicidal missions, called jihad, in anticipation of a permanent visa in heaven. So let us summarize:

All religions offer a visa to Heaven
Hinduism offers a temporary visa to Heaven based on the law of karma
Islam and Christianity offer a permanent visa to Heaven

So logically we can derive that

All religions must follow the law of karma

There is no traditional logical validity to the above statement. That is, logical methods like syllogism, induction, deduction, etc. are not applicable. One can call it abduction, because there is a bit of generalization with a forced conversion (you could call it an imposition) of the law of karma. Hindus believe that Adi Sankaracharya (788 AD) used the anviksiki and tarka schools of logic to unify the various dualist (dvaita) religions that existed in the Indian subcontinent into non-dualism (advaita). This is like a soft war. One can similarly say the Greeks converted the barbarians of Europe by teaching them logic. The famous Greek logician Aristotle invented the syllogism, which goes like this:

All men are mortal
Plato is a man
Therefore, Plato is mortal

A syllogism consists of three propositions. The first two, the premises, share exactly one term (men or man in the above example), and they logically entail the third proposition, called the conclusion, which contains the two non-shared terms of the premises (Plato and mortal).

All birds fly
Parrots are birds
Parrots fly

This seemingly correct syllogism has a false premise. Not all birds fly. Penguins, kiwis, birds with broken wings, etc. can't fly.

What do people mean when they ask: “Where is the logic in this?”

We often see people ask an author “where is the logic?” or “what logic are you using?” The author can be a playwright or a blogger. We are expected to follow certain norms of discourse. We are supposed to “deduce” (which is a logical construct) rather than preach (based on belief or a religious text). In deduction we say:

Dogs chase Cats
Cats chase mice
Therefore dogs chase mice

In general: A implies B, B implies C, therefore A implies C. Once again this is prone to wrong implications. Consider this:

Fire starts sprinklers
Sprinklers start water-flow
Therefore Fire starts water-flow

The fallacy here is that fire and water-flow have no necessary linkage: if a sprinkler malfunctions, water can flow without any fire. Consider this syllogism:

All Indians are from India
Chief Nightfox is an Indian
Therefore, Chief Nightfox is from India

The fallacy is one of equivocation: “Indian” is used in two senses. It is revealed by the clarification that Chief Nightfox is a Native American residing on the American continent, not a person from India.

So what is logic good for?

There are many occasions where playwrights have used logic to entertain their audiences. Consider the following dialogue:

Actor 1: Have you forgotten me?

Actor 2: I have forgotten all men; if you are a man, I forgot you too

Logic was used by authors like Sir Arthur Conan Doyle to give us the famous detective Sherlock Holmes, who deduced from facts who the culprit might be rather than conjecturing. In ancient India deduction was developed by the Samkhya school. Samkhya considered Pratyakṣa or Dṛṣṭam (direct sense perception), Anumāna (inference), and Śabda or Āptavacana (verbal testimony) to be the only valid means of knowledge, or pramana. Unlike some other schools, Samkhya did not consider the following three pramanas epistemically proper: Upamāṇa (comparison and analogy), Arthāpatti (postulation, deriving from circumstances), and Anupalabdi (non-perception, negative/cognitive proof). One could say Conan Doyle was influenced by Samkhya.

Boolean logic is used extensively by computer scientists. Boole's logic has two interpretations: class logic and propositional logic. Boole made a distinction between "primary propositions", which are the subject of syllogistic theory, and "secondary propositions", which are the subject of propositional logic, and showed how under different "interpretations" the same algebraic system could represent both. An example of a primary proposition is "All inhabitants are either Europeans or Asiatics,” which can be represented as

$\forall x \, [\, I(x) \rightarrow (E(x) \vee A(x)) \,]$

An example of a secondary proposition is "Either all inhabitants are Europeans or they are all Asiatics."

$\forall x \, (I(x) \rightarrow E(x)) \vee \forall x \, (I(x) \rightarrow A(x))$

These are easily distinguished in modern predicate calculus, where it is also possible to show that the first follows from the second, but there is no way of representing this in the Boolean system. More famously, Boolean logic is used in processing the inputs to AND, OR, XOR, NAND, NOT, etc. gates in computer hardware.
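As a quick illustration (a sketch, not a description of any particular hardware), the gate truth tables can be written directly with Python's boolean operators:

def AND(a, b): return a and b
def OR(a, b): return a or b
def XOR(a, b): return a != b
def NAND(a, b): return not (a and b)
def NOT(a): return not a

# Print the truth table for the two-input gates.
print("a      b      AND    OR     XOR    NAND")
for a in (False, True):
    for b in (False, True):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b), NAND(a, b))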

Over the centuries since Aristotle, logic remained an exotic pursuit confined to elites interested in such philosophical topics as epistemology and ontology. Only recently has first-order predicate logic gained influence, in the AI programs called rule-based systems that model some area of expertise. Consider the following rules:

If systole > 130 OR diastole > 90 then conclude high BP

If systole < 110 OR diastole < 80 then conclude low BP

If high BP then conclude stroke

If low BP then conclude heart disease

We know from visual inspection that the rules should be processed from top to bottom. But how can a computer, which doesn't have a representation of top or bottom, process these rules? A number of algorithms were proposed, such as the famous Rete algorithm, to do inference based on recency, working memory, and other criteria that don't have a biological basis. Since they don't withstand the test of logical scrutiny, they remain academic pursuits. We still need to understand why they have failed and what can be done to fix them.
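To make the ordering question concrete, here is a minimal forward-chaining sketch in Python (a toy, not the Rete algorithm): rules fire repeatedly until no new conclusion appears, so the outcome does not depend on their textual order.

def forward_chain(measurements):
    # Fire rules until a fixpoint: no pass adds a new conclusion.
    concluded = set()
    holds = lambda name: name in concluded
    rules = [
        (lambda: measurements["systole"] > 130 or measurements["diastole"] > 90, "high BP"),
        (lambda: measurements["systole"] < 110 or measurements["diastole"] < 80, "low BP"),
        (lambda: holds("high BP"), "stroke"),
        (lambda: holds("low BP"), "heart disease"),
    ]
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise() and conclusion not in concluded:
                concluded.add(conclusion)
                changed = True
    return concluded

print(forward_chain({"systole": 140, "diastole": 85}))  # {'high BP', 'stroke'}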

Historically though, the Stoic school of logic (278-206 BC) dealt with predicates that entail the conditional statement if...then, as well as AND and OR, which were precursors to Boolean logic and modern rule-based systems.

P vs. NP

If you can solve a problem in time bounded by a polynomial in the input size, say $P(n)=n^{20} + 1$, mathematicians say the problem is in P. If you can verify a proposed solution to a problem in polynomial time, then the problem is in NP. In other words, checking a solution should take no more than polynomial time, even when finding one might. Unfortunately, no one has yet been able to answer the question: is $P=NP$ or is $P \ne NP$?
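To make "easy to check, possibly hard to find" concrete, here is a minimal Python sketch of a certificate verifier for the subset-sum problem (the function name and sample inputs are invented for this illustration):

# Subset sum is in NP: given a certificate (a claimed subset), checking it
# takes polynomial time, even though finding such a subset may require
# examining up to 2**n candidates.
def verify_subset_sum(numbers, target, certificate):
    remaining = list(numbers)
    for x in certificate:          # every certificate element must come from the input
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(certificate) == target   # and the elements must hit the target

print(verify_subset_sum([3, 9, 8, 4, 5, 7], 15, [3, 8, 4]))  # True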

How much memory is available to a computer program?

We often hear that memory is cheap. But really, how much memory is available? Scientists estimate there are $10^{80}$ particles in the universe. If we could store a bit of information in each electron, that would be the hard limit on the number of bits a computer has access to. Similarly, the circuits on which computer programs run are spatially limited by the Planck length, the distance traveled by light in about $10^{-44}$ seconds!

Deductive vs Inductive Reasoning

According to Ashley Crossman, deductive reasoning and inductive reasoning are two different approaches to conducting scientific research. With deductive reasoning, a researcher tests a theory by collecting and examining empirical evidence to see if it is true. With inductive reasoning, a researcher first gathers and analyzes data, then constructs a theory to explain findings.

Suppose you want to prove that all marsupials originated in Australia. You can start by locating marsupials in Australia and checking that not even one is native to another continent. But this is a lengthy process: you would need to search every nook and corner of the earth to find a counterexample. After you gather enough evidence of marsupials being confined to Australia, you can induce that marsupials indeed originated in Australia.

By nature, inductive reasoning is more open-ended and exploratory, especially during the early stages. Deductive reasoning is more narrow and is generally used to test or confirm hypotheses. Most social research, however, involves both inductive and deductive reasoning throughout the research process.

Turing Machine basis for AI

We ask if there is a procedure that, given any formal mathematical statement, would algorithmically determine whether the statement is true. Church and Turing proved there is no such procedure; Turing introduced the halting problem as a key example of a mathematical problem without an algorithmic solution. To understand the halting problem consider the pseudo-code:

Procedure A
Loop for I from 1 to 100
    print I
End Loop

From a visual inspection we can see that the code will print integers from 1 to 100. Now let us modify it as follows:

Procedure B(Input: I)
print I
J=I+1
Call Procedure B(J)

The above is an example of a recursive procedure that will never halt. But there is no easy general way to verify or prove such a property. So here come the empiricists, who will test algorithms by writing computer programs and seeing if the inputs and outputs match up. The opposing camp is comprised of rationalists (yes, as you have guessed, armchair philosophers) who will conclude that it is better not to even try implementing a Procedure B that is so poorly coded.
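For comparison, here is Procedure B rendered in Python; in principle it never halts, and in practice CPython cuts the recursion off at the interpreter's limit:

import sys

def procedure_b(i):
    # Python rendering of Procedure B: print I, then recurse with I + 1 forever.
    print(i)
    procedure_b(i + 1)

# procedure_b(1) never halts in principle; in practice CPython raises
# RecursionError after roughly sys.getrecursionlimit() nested calls.
print(sys.getrecursionlimit())   # typically 1000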

Another way to understand the halting problem: say you have an app on your smartphone that frequently crashes, requiring you to reboot the device. Is there some way to check whether the app will terminate and leave your device in its original state? Fortunately there is a way, if you are able to write a PulsecheckApp that simulates your inputs on any app that you have downloaded. Only if the PulsecheckApp is satisfied that the app will not crash your device will it allow you to use it. Now a clever programmer can spoof the PulsecheckApp and make an app pass the test. A similar thing happened with Volkswagen cars, which adjusted their computers' programs when an emissions check was being conducted so as to meet the federal standards. That is, the car's computer knew when the car was being tested and when to adjust the numbers. This is also called gaming the system.

Besides the possibility of spoofing, Turing actually proved, well before computers were manufactured, that you can't write a program that determines whether an arbitrary program will halt. In simple terms, if such a checker were possible, it could be fed a program built from the checker itself and rigged to do the opposite of whatever the checker predicts, which is a contradiction. This is similar to Russell's paradox, which says: consider the set of all sets that are not members of themselves. Such a set appears to be a member of itself if and only if it is not a member of itself.

Some sets, such as the set of all teacups, are not members of themselves. Other sets, such as the set of all non-teacups, are members of themselves. Call the set of all sets that are not members of themselves “R.” If R is a member of itself, then by definition it must not be a member of itself. Similarly, if R is not a member of itself, then by definition it must be a member of itself.

Or put another way: suppose a village barber cuts the hair of everyone in the village who doesn't cut their own hair, and there is only one barber in the village. Does "everyone" include the barber himself? If he cuts his own hair, he belongs to those who cut their own hair, whom he is not supposed to cut; if he doesn't, the rule says he must. It is a paradoxical statement.
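Returning to the halting problem, Turing's diagonal argument can be sketched in a few lines of Python, assuming a hypothetical checker halts() existed (the stub below returns a fixed answer only to show why any fixed answer fails):

def halts(func):
    # Hypothetical halting checker; it must answer True or False,
    # and either answer is refuted by paradox() below.
    return True   # suppose it predicts that paradox() halts

def paradox():
    if halts(paradox):
        while True:   # the checker said "halts", so loop forever: contradiction
            pass
    # had halts() returned False, we would fall through and halt: contradiction

# We never call paradox() here; the point is that no implementation of
# halts() can be correct on this input, so no such checker can exist.
print("halts(paradox) =", halts(paradox))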

Some view Turing's halting problem as an offshoot of Gödel's incompleteness theorem. Turing and Gödel were contemporaries and both worked at Princeton; apparently they never met. Gödel's incompleteness theorem applies to arithmetic, where there will always be some statements that can't be proven using the axioms and theorems of arithmetic. Consider the example in marriage: 1 + 1 = 3.

No matter what the rationalists had to say, the empiricists, who like to try things out, marched on. The Turing machine became a standard model for a general-purpose computing device. These results led to the Church–Turing thesis: any deterministic algorithm that can be carried out by a human can be carried out by a Turing machine. This is in general applicable to arithmetic, algebra, etc. But what about human visual processing, motor skills, and other biological features for which algorithms are continually being developed and evaluated? Granted, neural nets for computer vision could be implemented on a Turing machine, as they have a halting criterion (i.e., the error falls below a threshold). But they are better developed on parallel machines, with each neuron running on its own processor, since each neuron computes its output from weighted inputs after activation; a general-purpose Turing machine is not the best fit for that. To understand it better, consider a Swiss Army knife versus a kitchen knife. To cut an onion you can use either one, but the kitchen knife is straightforward and doesn't need the extra step, as with the Swiss Army knife, of figuring out which of its tools can do the job.

Case-based reasoning

Case-based reasoning is a special case of analogical reasoning where the algorithm looks for instances in the past to find the best match based on some criteria. Suppose you are acting like a doctor and have access to all the case files in a physician's office; this is a sub-plot of Steven Spielberg's movie “Catch Me If You Can”. When a new patient walks into your office, you gather the symptoms and go through the case files until you find a match. If there is no exact match, you make the best possible guess with a previous case and come up with a diagnosis. This, in essence, is the case-based reasoning approach. The program has neither a domain model nor any logic other than pattern matching, yet it manages to find a satisfying response whether you are looking for a food recipe or a legal case.

Even though case-based reasoning can be explained as pattern matching, its enormous popularity with help desks is unprecedented and has made it the most successful application of AI. When you call a manufacturer about a broken dishwasher, chances are the company already has a similar incident reported by another customer and has a record of it. So they gather your description of the problem, plug it into a computer program, and read out the remedy. Where do new cases come from? If the staff can't find a match with existing cases, they create a new one and assign it to a technician. The technician will offer a solution and update the case knowledge base, tagging it with keywords that are used in a future search to resolve another customer's problem like yours.
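A minimal sketch of case retrieval in Python, assuming cases are stored as symptom sets paired with a remedy (the case base and names here are invented for illustration): the best match is simply the case sharing the most symptoms with the new problem.

# A toy case base for a dishwasher help desk.
case_base = [
    ({"no power", "door open"}, "close the door latch fully"),
    ({"no power", "breaker tripped"}, "reset the circuit breaker"),
    ({"leaking", "hose loose"}, "tighten the drain hose clamp"),
]

def retrieve(symptoms):
    # Score each case by the number of shared symptoms; best match wins.
    score = lambda case: len(case[0] & symptoms)
    best = max(case_base, key=score)
    return best[1] if score(best) > 0 else "create a new case for a technician"

print(retrieve({"no power", "breaker tripped"}))  # reset the circuit breaker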

Explaining inference

Experts have become what they are by being able to come up with a hypothesis that fits the data and to explain why it is the best possible hypothesis. Some call it a “logical” explanation. As the old story goes, a carved image is what is left after chiseling away the stone that didn't match the image. We use all the tools in our chest, so to say, to explain, including, if necessary, Newton's laws and Einstein's relativity. Expert testimony in courtrooms is replete with logical induction, deduction, analogical and mathematical reasoning, physical and chemical science, etc. What about AI programs? In rule-based systems, the best the program can do is give a trace of all the rules that were “fired” or considered for matching with the input. With the rules given earlier, suppose someone wants to know what other causes of stroke and heart disease there are besides BP; the expert system will have no answer, as such rules are not programmed. Further, with every new finding added to the rule premises (the left-hand side), the search space grows exponentially. Consider the following:

If

systole > 130 AND diastole > 90 AND blood-glucose > 120 AND no exercise

Then

conclude heart disease from diabetes

To test each of the premises in the above rule we need $2^4=16$ entries in the truth table, such as these (8 of the 16 shown):

systole > 130   diastole > 90   blood-glucose > 120   exercise
      1               0                  0                 0
      1               0                  0                 1
      1               0                  1                 1
      1               1                  1                 1
      1               1                  0                 0
      1               1                  1                 0
      1               0                  1                 0
      1               1                  0                 1

With each additional symptom to consider (e.g., pulse > 100), the table doubles in size. For example, if we were to add heart rate, we would have to account for $2^5=32$ entries. In other words, the search space grows exponentially.
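A quick Python check of this doubling, enumerating every truth assignment of the premises:

from itertools import product

# Each premise is either true (1) or false (0), so n premises give 2**n rows.
premises = ["systole > 130", "diastole > 90", "blood-glucose > 120", "exercise"]
print(len(list(product([0, 1], repeat=len(premises)))))       # 16 == 2**4
print(len(list(product([0, 1], repeat=len(premises) + 1))))   # 32 after adding heart rate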

But if someone wanted to know why a particular rule came into existence in the first place, the program is woefully inadequate to explain it. IBM's Watson computer is supposed to overcome this by capturing all of the knowledge that we as humans have ever created. Even if such a task is possible, it is proving to be a very expensive and massive undertaking. Imagine scanning all the knowledge ever printed on paper, capturing the folklore that is transmitted only in verbal communication, and making it available in a program the computer can process, only to fumble at the simple question: “where do the Hindu vedas come from?”.

Is Memorization and Rote Learning a component of Intelligence?

Children are taught to memorize multiplication tables, among other things, even in this age of the world wide web, not to mention calculators, which have been replaced by smartphones. There is a reason for that. When you are standing in a queue at a store to check out some items, it would help others waiting in line if you could figure out the expected amount to pay rather than ask the clerk and hold everyone up. Suppose you bought 5 cartons of milk, each costing \$2.30; a quick multiplication would help you be prepared at the checkout. Why do we think the point-of-sale machines will always be right? Sometimes they add local taxes and recycling fees that overstate the bill you anticipated to be \$11.50 (for the 5 cartons of milk).

Another place rote learning helps is in areas of computer science where certifications are required. Say an Oracle database caps the length of a variable at 256 characters (which, by the way, is $2^8$, the number of values one byte can represent; on modern 64-bit machines there is no reason to keep such a limit unless old code is being reused). We would rather remember this while programming than wait for the compiler to throw an error, if it ever does, so as to minimize the debugging effort. Salesforce, a popular CRM platform, is notorious for restricting the available resources because it offers a multi-tenant environment where a single resource, be it CPU, database, or disk, is shared among multiple organizations. For every resource in Salesforce there is a limit, and being a certified Salesforce programmer means knowing all about these limits.

Not so surprisingly, AI scientists have not addressed rote learning itself. It is not about whether they have done it themselves, but whether they know how it is done. Eidetic memory may have something to do with it.

Muslim children studying in madrasas are prime examples of how rote learning applies to religious texts. Almost all priests have eidetic memory, without which they could not possibly recite the ancient texts accurately. Scientific studies done on vedic pundits showed that when they missed a phoneme, they would restart from the beginning, possibly because they didn't know at how many other places they had missed it. Kekulé, who discovered the benzene ring, was said to have imagined it while riding on a bus. It seems there are several people amongst us with extraordinary memory who are not categorized as literate by prevailing academic standards. This makes a strong case that memorization is an evolutionary strategy to cope with increased information, or information explosion, when retrieving from memory, what is sometimes called lazy loading, is either too time-consuming or simply tiring.

Decision Trees

Part of any good learning involves abstracting classes from a data set. Suppose we have the following data (not all shown, for brevity) about our decision to go to the gym or jog outside. We would like to create a learner based on the data to help us decide whether to jog or go to the gym. Central to the classifier is the concept called entropy. If the rows are equally divided between jog and gym, the entropy is one; if all the rows map to either jog or gym alone, the entropy is zero. This is the case for binary classification.

The algorithm itself can very simply be given as:

  • compute the entropy for data-set
  • for every attribute/feature:
    1. calculate entropy for all categorical values
    2. take average information entropy for the current attribute
    3. calculate gain for the current attribute
  • pick the highest gain attribute
  • Repeat until we get the desired tree
Sunny   Snow   Wind    Temperature   Humidity   Gym     Jog
False   True   False   False         True       True    False
False   True   False   True          True       True    False
False   True   False   True          True       False   True
False   True   False   True          True       False   False

Suppose we want to know if we should go to the gym today; we look for rows where Gym=True and where Gym=False; let their counts be 30 and 98 respectively.

The entropy contribution of each class is $-p(c) \log_2 p(c)$:

$-{30 \over {30 + 98}} \log_2 \left({30 \over {30 + 98}}\right) = 0.49$ for Gym=True

$-{98 \over {30 + 98}} \log_2 \left({98 \over {30 + 98}}\right) = 0.29$ for Gym=False

Entropy $E(S)=\sum_c {-p(c) \log_2 p(c)} = 0.49 + 0.29 = 0.78$

$f_{sunny=true,\, gym=true}$ = frequency of occurrence of (Sunny=True, Gym=True) = 14

$f_{sunny=true,\, gym=false}$ = frequency of occurrence of (Sunny=True, Gym=False) = 16

Entropy $E(Weather=Sunny) = -{14 \over {14+16}} \log_2 \left({14 \over {14+16}}\right) - {16 \over {14+16}} \log_2 \left({16 \over {14+16}}\right) = 0.99$

Similarly, $E(Weather=Rain) = -{15 \over {15+15}} \log_2 \left({15 \over {15+15}}\right) - {15 \over {15+15}} \log_2 \left({15 \over {15+15}}\right) = 1.00$

$E(Weather=Snow) = -{7 \over {7+23}} \log_2 \left({7 \over {7+23}}\right) - {23 \over {7+23}} \log_2 \left({23 \over {7+23}}\right) = 0.78$

Average entropy information for weather (each branch covers 30 of the 128 rows): $I(Weather) = {30 \over 128} \times 0.99 + {30 \over 128} \times 1.00 + {30 \over 128} \times 0.78 = 0.65$

$Gain(Weather) = E(S) - I(Weather) = 0.78 - 0.65 = 0.13$

$f_{gym=true,\, snow=true}$ = frequency of occurrence of (Gym=True and Snow=True) = 7

$f_{gym=true,\, snow=false}$ = frequency of occurrence of (Gym=True and Snow=False) = 23

Therefore, $E(Gym=True|Snow) = -{7 \over {7 + 23}} \log_2 \left({7 \over {7 + 23}}\right) - {23 \over {7+23}} \log_2 \left({23 \over {7+23}}\right) = 0.78$

$f_{gym=false,\, snow=true}$ = frequency of occurrence of (Gym=False and Snow=True) = 25

$f_{gym=false,\, snow=false}$ = frequency of occurrence of (Gym=False and Snow=False) = 73

Therefore $E(Gym=False|Snow) = -{25 \over {25 + 73}} \log_2 \left({25 \over {25+73}}\right) - {73 \over {25+73}} \log_2 \left({73 \over {25+73}}\right) = 0.82$

Average entropy information: $I(Gym) = {{7+23} \over 128} \times 0.78 + {{25+73} \over 128} \times 0.82 = 0.81$

$Gain(Gym) = E(S) - I(Gym) = 0.78 - 0.81 = -0.03$

And so on, we calculate the gain for each of the attributes. With a bit of arithmetic, our decision tree takes shape.

The interesting thing about the tree is that we can now create rules by traversing it from the root to the leaf nodes, such as these:

  • If weather=snowy and temperature=low then go to gym
  • If weather=snowy and temperature=high then go for jog
  • If weather=rainy and temperature=low then go to gym
  • If weather=rainy and temperature=high then go for jog
  • If weather=sunny and temperature=high then go for jog
  • If weather=sunny and temperature=low then go to gym
  • If weather=sunny and humidity=high then go to gym
  • If weather=sunny and humidity=low then go for jog

Unlike biologically inspired neural nets and evolutionary programs, a decision tree lends itself to a detailed description of how it arrived at a conclusion starting from a set of facts. While this is its most appealing aspect, the downside is managing the complexity. As shown above, the entropy and information-gain calculations come with a computational complexity that is combinatorial. If we throw in another outcome, say go ski if the weather is snowy and the temperature is low, then we have to redo the whole procedure.

One has to admit this example is a little contrived: we started with a state space that had these rules built in to begin with. The idea is to demonstrate the procedure of creating the decision tree using entropy and information gain.
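For readers who want to reproduce the arithmetic, here is a short Python sketch of the entropy and gain calculations, using the counts from this example (the exact values round slightly differently than the figures above):

from math import log2

def entropy(pos, neg):
    # Binary entropy of a pos/neg count split: 0 when pure, 1 at a 50/50 split.
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

# Overall entropy of the Gym decision (30 True rows vs. 98 False rows).
e_s = entropy(30, 98)

# Entropy of each Weather branch, then the weighted average I(Weather).
e_sunny, e_rain, e_snow = entropy(14, 16), entropy(15, 15), entropy(7, 23)
i_weather = (30 * e_sunny + 30 * e_rain + 30 * e_snow) / 128

print(round(e_s, 2), round(e_s - i_weather, 2))   # 0.79 0.13: E(S) and Gain(Weather)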

In Conclusion

We know that logic plays a vital role in our daily interactions with people. Logic is supposed to have led to the development of arithmetic and algebra but never achieved their rigor. Since Aristotle's time the development of logic has happened in spurts. Gödel's incompleteness theorems and other paradoxes are worth studying but won't by themselves make us better at anything other than logic. AI has a branch called natural language processing that posits the existence of a universal grammar. It was thought the ancient language Sanskrit might provide the basis for it, at least for the Indo-European family of languages. Some linguists like Noam Chomsky believe language is innate. Others believe it is nurtured, and that competence in any natural language can be achieved without prior exposure. Children make the best test subjects for the memorization that helps them cope with the information explosion. Rote learning forms a major component of religious learning. If computers are meant to learn the way we do, then they need to be equipped with all of these facets to pass the Turing test.

Online References

http://math.wikia.com/wiki/History_of_logic

https://plato.stanford.edu/entries/logic-ancient/

http://www.individual.utoronto.ca/pking/miscellaneous/history-of-logic.pdf

https://en.wikipedia.org/wiki/Samkhya

https://en.wikipedia.org/wiki/Rete_algorithm

https://www.thoughtco.com/deductive-vs-inductive-reasoning-3026549

http://language.worldofcomputing.net/tag/chomsky

https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
