Friday, May 11, 2018

Reinforcement Learning

Learners based on neural nets, Bayesian methods, logical inference, and genetic algorithms have one thing in common: their learning is supervised. This is like holding a child's hand all the way to adulthood. In the real world, the child is supervised only initially, until it learns to manoeuvre on its own based on feedback from the environment. For example, while learning to ride a bike, a child might learn that no matter how the handlebars are turned left or right, once the bike tilts past a certain angle the fall is inevitable. This lesson could be based on the number of painful falls while tilted at an angle, or on the endorphins and dopamine released in the brain as the pleasure of riding without falling. Such unfettered learning is the subject matter of reinforcement learning, where the learner tries to maximize a feedback signal --called the reinforcement reward-- from the environment, without supervision. This has long been a holy grail of AI. We will see whether it is possible, let alone achievable, within the limited means of memory and computational speed.

RL is best explained with a robot that is left in a location for the first time and programmed to map the territory. The robot evaluates its current position and takes the step with the best evaluation. For example, let us say the robot can step left, right, forward, and backward. The goal is to reach a charging station for an energy refill, whose GPS location has been made available to the robot. After each step, the robot compares its GPS location with that of the charging station to calculate the distance; the objective is to minimize this distance. There may be hurdles on the way that the robot is unaware of, so the robot may have to step back at times. With reinforcement learning, the robot is guaranteed to reach the charging station, after some trial and error, provided there is a path without hurdles. The interesting thing is that once the robot attains the goal, it has learned a complete path from the starting point to the charging station --and the intermediate paths it tried and backtracked from also contributed to the learning. In other words, the robot received negative reinforcement for some locations, resulting in backtracking, and positive reinforcement for others, leading to the realization of the goal. Do kids learn this way? According to researchers studying child psychology, if left to their own devices, this is as close as it gets.
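
The scenario above can be sketched in a few lines of Python. Everything here --the grid size, the obstacle positions, and the Manhattan distance standing in for the GPS calculation-- is an illustrative assumption, not a real robot API, and the greedy strategy is only guaranteed to work on a toy layout like this one:

```python
# Toy version of the robot scenario: greedily step toward the charger,
# and remember cells that earned negative reinforcement (obstacles).
CHARGER = (4, 4)
OBSTACLES = {(2, 2), (3, 2), (2, 3)}
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # left, right, forward, backward

def distance(pos):
    # Manhattan distance stands in for the GPS-based distance calculation
    return abs(pos[0] - CHARGER[0]) + abs(pos[1] - CHARGER[1])

def walk(start=(0, 0), max_steps=100):
    pos, punished = start, set()
    path = [pos]
    for _ in range(max_steps):
        if pos == CHARGER:
            break
        # evaluate every candidate step that is on the grid and not yet punished
        candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
        candidates = [c for c in candidates
                      if c not in punished and 0 <= c[0] <= 4 and 0 <= c[1] <= 4]
        nxt = min(candidates, key=distance)
        if nxt in OBSTACLES:
            punished.add(nxt)   # negative reinforcement: avoid this cell from now on
        else:
            pos = nxt           # positive reinforcement: a step closer to the goal
            path.append(pos)
    return path
```

Running `walk()` returns the learned path from the start to the charging station, with the punished cells never appearing in it.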

Biological Basis

While neural nets are based on the neuronal structure of the brain and genetic algorithms are inspired by evolution, RL is inspired by behavior --specifically, behavior viewed as the modification of a state based on the reward received from the environment. We know about the Pavlovian dog that responded to a stimulus. Behavioral modification in response to reward goes one step further: it is in human nature, especially among children, to modify behavior based on the reward or the lack of it.

Exploration vs. Exploitation

Exploration refers to the acquisition of knowledge, such as a robot trying to map a territory by wandering around and getting lost at times. Exploitation, on the other hand, is the application of previously acquired knowledge in a new situation, with some modification. Neither approach alone is optimal; we have to find a proper balance between them to get the maximum reward. This is the exploration vs. exploitation dilemma of reinforcement learning.
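
One common way to strike that balance is epsilon-greedy selection: with a small probability the learner explores a random action, and otherwise it exploits its current value estimates. A minimal sketch (the function name and the 0.1 default are illustrative assumptions):

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Pick an action from a dict mapping action -> estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: wander, maybe get lost
    return max(q_values, key=q_values.get)     # exploit: use what is already known
```

With `epsilon=0` the choice is purely greedy; raising epsilon trades more wandering for a better chance of discovering unexplored territory.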

The math behind reinforcement learning

Much of the math behind reinforcement learning is based on Bellman equation that goes like this:

$V^{*}(x_t) = r(x_t) + \gamma V^{*}(x_{t+1})$

where $V^*$ represents the ideal value function at state $x$, $t$ and $t+1$ are time steps, $\gamma$ is the discount factor in the range $[0,1]$ that causes immediate reinforcement to be weighted more heavily than future reinforcement, and $r(x_t)$ is the reinforcement from the environment. The difference between the actual value and the ideal value is represented as an error $e$ that can be shown to satisfy

$e(x_t) = \gamma e(x_{t+1})$

Note that the error decays exponentially with time and is 0 by definition at the goal state, when learning is considered complete. This is much simpler than what one encounters with neural nets. But one can use a neural net to maintain an approximation $V(x, w)$ of $V^{*}(x)$, updating the weights as

$\Delta w(t) = -\alpha [ \max_u (r(x_t, u) + \gamma V(x_{t+1}, w_t)) - V(x_t, w_t)] {{\partial {V(x_t, w_t)}} \over {\partial w_t}}$

where $\alpha$ is the learning rate.
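
In the tabular case, the Bellman equation can be solved directly by sweeping the backup repeatedly --the classic value-iteration procedure. The five-state chain below, with the goal at one end and a reward of 1 for entering it, is an illustrative toy, not an example from the text:

```python
# Value iteration: repeatedly apply the Bellman backup
#   V(x) <- max_u [ r(x, u) + gamma * V(next(x, u)) ]
# on a deterministic chain of states 0..4, where state 4 is the goal.
GAMMA = 0.9
STATES = range(5)
GOAL = 4

def step(x, u):
    # actions: -1 (left) or +1 (right); walls clamp the move
    return min(max(x + u, 0), GOAL)

def reward(x, u):
    return 1.0 if step(x, u) == GOAL else 0.0

def value_iteration(sweeps=50):
    V = [0.0] * len(STATES)
    for _ in range(sweeps):
        for x in STATES:
            if x == GOAL:
                continue  # terminal state: the reward was paid on entry
            V[x] = max(reward(x, u) + GAMMA * V[step(x, u)] for u in (-1, +1))
    return V
```

The resulting values fall off geometrically with distance from the goal (1, 0.9, 0.81, ...), exactly the exponential decay noted for the error term above.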

Terminology in Reinforcement Learning

Policy - a mapping from states to actions. For the robot above, the actions are: move left, move right, move forward, and move backward.

Reinforcement - a scalar variable that communicates the change in the environment to the reinforcement learning system. For example, if an RL system is a controller for a missile, the reinforcement signal might be the distance between the missile and the target (In which case, the RL system would learn to minimize reinforcement).

Markov decision process - An MDP consists of a set of states X; a set of start states S that is a subset of X; a set of actions A; a reinforcement function R where R(x,a) is the expected immediate reinforcement for taking action a in state x; and an action model P where P(x'|x,a) gives the probability that executing action a in state x will lead to state x'. Note: It is a requirement that the choice of action be dependent solely on the current state observation x. If knowledge of prior actions or states affects the current choice of action then the decision process is not Markov.

Deterministic - In the context of states, there is a one-to-one mapping from state/action pairs to successor states. In other words, with probability one, the transition from state x after performing action a will always result in state x'.

Non-deterministic - In the context of states, there exists a probability distribution function P(x'|x,a) that gives the probability that executing action a in state x will lead to state x'.

State - The condition of a physical system as specified by a set of appropriate variables.

Unbiased estimate - An estimate whose expected (mean) error is zero.
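
The MDP ingredients defined above --states X, actions A, a reward function R(x,a), and a transition model P(x'|x,a)-- can be written down concretely. The two-state robot-charging layout and all the numbers below are made-up illustrations:

```python
# A toy encoding of an MDP: states, actions, expected immediate
# reinforcement R(x, a), and transition probabilities P(x'|x, a).
X = ["low_charge", "charged"]
A = ["seek_charger", "wander"]

R = {  # expected immediate reinforcement for taking action a in state x
    ("low_charge", "seek_charger"): 1.0,
    ("low_charge", "wander"): -1.0,
    ("charged", "seek_charger"): 0.0,
    ("charged", "wander"): 0.5,
}

P = {  # P[(x, a)][x'] = probability that executing a in x leads to x'
    ("low_charge", "seek_charger"): {"charged": 0.8, "low_charge": 0.2},
    ("low_charge", "wander"): {"low_charge": 1.0},
    ("charged", "seek_charger"): {"charged": 1.0},
    ("charged", "wander"): {"charged": 0.7, "low_charge": 0.3},
}

# Markov property: each distribution depends only on the current state and
# action, and each must sum to one.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```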

Applications

RL has been used for packet routing in dynamic networks, improving elevator performance, channel allocation in wireless networks, job-shop scheduling, missile guidance, etc. It is not clear whether RL is applicable to higher cognitive functions like planning and diagnosis, where reward and feedback from the environment are not readily available.

Conclusion

  • Let us say our robot simulates the path to the charging station even before taking a step. In this case the robot needs to evaluate each and every state and compute the value function, which is so memory intensive and time consuming that the robot may never make a move.
  • Assuming the robot successfully met the goal, how do we determine which move made it possible? To illustrate, let us say the path involves forks where the robot has to choose between right and left. How do we determine which of these forks was crucial? This is the credit assignment problem.
  • A number of alternatives to exhaustive search have been proposed, such as Q-learning and Advantage learning, which approximate the value functions based on simplifying assumptions.
  • Reinforcement learning with policy gradients --i.e. using a neural net to map states to action probabilities-- has been successfully applied to Atari games (https://en.wikipedia.org/wiki/Atari_Games), where the value of a state can be evaluated with respect to a win or loss; the decisions leading up to a win or loss can then be used in gradient computation for the given set of inputs, much as in supervised learning.
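
Q-learning, mentioned above as an alternative to exhaustively evaluating every state, can be sketched in tabular form. The chain environment, learning rate, and reward scheme below are illustrative assumptions, reusing the five-state toy from earlier:

```python
import random

# Tabular Q-learning on a chain of states 0..4 with the goal at 4.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2
GOAL, ACTIONS = 4, (-1, +1)

def step(x, u):
    nxt = min(max(x + u, 0), GOAL)       # walls clamp the move
    return nxt, (1.0 if nxt == GOAL else 0.0)

def train(episodes=500, seed=0):
    random.seed(seed)
    Q = {(x, u): 0.0 for x in range(5) for u in ACTIONS}
    for _ in range(episodes):
        x = 0
        while x != GOAL:
            # epsilon-greedy action selection
            u = random.choice(ACTIONS) if random.random() < EPS \
                else max(ACTIONS, key=lambda a: Q[(x, a)])
            nxt, r = step(x, u)
            best_next = 0.0 if nxt == GOAL else max(Q[(nxt, a)] for a in ACTIONS)
            # the Q-learning update: move the estimate toward r + gamma * max Q
            Q[(x, u)] += ALPHA * (r + GAMMA * best_next - Q[(x, u)])
            x = nxt
    return Q
```

After training, the greedy policy at every state points toward the goal, without the agent ever having enumerated all paths in advance.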

One can sing the following ditty:

Hinton set the baseline for neural networks
LeCun set the baseline for convolutional neural networks
ImageNet set the baseline for visual recognition competitions
DeepMind set the baseline for deep RL algorithms
OpenAI set the baseline for deep RL environments

Online References

http://old.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf for a basic RL tutorial

http://karpathy.github.io/2016/05/31/rl/ for ATARI GAMES

Saturday, May 5, 2018

test

Dagre Interactive Demo

Analogical Reasoning

An analogy is commonly understood as finding a similarity between two arguments or groups of arguments. Besides similarity, according to Aristotle, an example can be used to draw an analogy. For instance, we can say the sun, like other stars, has a life cycle: we are saying the sun is a star and, like other stars, it will pass through various life-cycle phases. A more straightforward analogy is: the inverse square law of gravitation is analogous to the inverse square law of the magnetic field. Here we are referring to the syntax of the equation that defines gravitational force versus magnetic or electro-static force. It goes without saying that the target of an analogy is not a complete replica of the source. Let us suppose the source of an analogy is "The theory of thermodynamics predicts increased entropy in the universe", and its target is "The theory of evolutionary programming is based on the entropy of the search space." Here the source and target are as far apart as they can be until an analogy is created.

Material and Causal Similarities

Suppose we have these comparisons between the forces of gravitation and electro-magnetism:

   Gravitation                              Electro-magnetism
1. Applies to large objects                 Applies to charged particles
2. Predicts force as inverse square law     Predicts force as inverse square law
3. Uses mass as a measure                   Uses charge as a measure

1 and 3 imply a similarity between material properties, and 2 implies a possible causal similarity. However, neither of these similarities is necessary or sufficient. For instance, in mathematics an analogy can be created between a square and a cube without any causal similarity.

Let us say the source has a set of attributes A1*, A2*...An*. The target has a set of attributes B1*, B2*...Bn* that can be mapped to their counterparts (i.e. Bn* -> An*). Then we can say a positive analogy between the source and target has been drawn. A negative analogy is when the A* fail to hold in the target and the B* fail to hold in the source. A single sufficiently strong negative analogy can rule out any number of positive analogies. This is the case with electrical conduction and fluid flow, which share many similarities, like the correspondence between Ohm's law and Poiseuille's law, but which do not match when it comes to conservation, resulting in the negation of the analogy.
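
The positive/negative analogy test can be caricatured as simple set operations on attribute labels: shared attributes count as positive analogy, while a mismatch on a critical attribute vetoes the whole mapping. The attribute names and the "critical" set below are illustrative assumptions:

```python
# Toy positive/negative analogy check over attribute labels.
def analogy(source, target, critical):
    positives = source & target              # attributes both domains share
    negatives = (source ^ target) & critical # critical attributes they disagree on
    return bool(positives) and not negatives

electricity = {"inverse_law", "flow", "potential_difference", "conserved_current"}
fluids = {"inverse_law", "flow", "potential_difference", "conserved_volume"}

# the conservation mismatch vetoes the analogy despite three shared attributes
print(analogy(electricity, fluids,
              critical={"conserved_current", "conserved_volume"}))  # prints False
```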

Where is analogy used?

Analogy can be used to generate a hypothesis a priori for further testing. If we knew under what conditions Dow is bearish, we can look for similar conditions in a different stock market like Nifty. Suppose, a presidential speech made the Dow bearish, we can offer a hypothesis like "When the premier of a country makes a negative speech about a group of companies traded in that country, then that country's stock market will go bearish". This is a convenient way to generate hypotheses for testing among the universe of hypotheses.

Plausibility vs. Probability

Plausibility implies that a strong analogy exists between a source and a target based on a cursory examination. For example, you can explain the earnings of an average athlete using a bell curve to signify that there will be diminishing returns as age progresses. The plausibility is derived from the extant knowledge about bell curves, which show a statistical distribution of one variable with respect to another (income and age in this case). The distinction between plausibility and probability is that the former need not be conditional on a priori probabilities but has its own distinguishing criterion --what can be called default reasoning.

Hume's 'no free lunch'

Hume (1711–1776) pointed out that 'even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience'. This applies to induction, which takes analogical reasoning into the unknown territory of over-generalization. Suppose we study the spending behavior of consumers in a given economy and come up, prima facie, with an analogy to the rate of depression among clinical cases. The induced version of it is: the greater the rate of clinical depression, the greater (or lesser) the spending of consumers in that economy. The induced rule is not tenable until we have additional knowledge about consumers, clinical depression, and the relationship between the two. Suppose, in the absence of a valid theory, we corroborate the two categories of people statistically; there is still no way to tell that they share other properties (e.g. socio-economic class, height, or weight).

Design Patterns and Analogy

Design patterns are a favorite theme of computer programmers across continents who stumble into the same lines of code that can be explained as a structural analogy. When we write code to feed computers, regardless of the programming language used, we tend to follow a familiar pattern of logic. Suppose someone wrote in Java a program to connect to a database, retrieve some data, process it and close the connection. If they were to write the same in another programming language, say C++, the logic is going to be the same. The important thing to note here is the release of the connection, if it is drawn from a pool of connections, without which the program will run out of connections after a few tries. In analogous terms, if we have a limited resource that we would like to share among multiple users, we should ensure that the resource is freed before someone else can use it. One can illustrate this with a hotel example where tenants possess a key to enter a room and must return the key to make the room available for another tenant's use.

Analogical Reasoning and Induction

Normally, induction is when you infer a general proposition from many particular observations. One can observe monkeys across the continents and induce the proposition that monkeys are prehensile. The drawback is that if there is even one monkey species whose tail does not match our observations, we have to draw an exception. This is the problem we run into when inducing rules in AI: for every induced rule we have to record the exceptions. With analogical reasoning we bring two disparate domains into focus and can avoid over-generalization. This is the theory behind case-based reasoning. When we call customer service about our broken dish-washer, chances are the company already knows about this problem. Based on our account of the problem, they can search their database and try to find a case that is analogous to the current one. In practice, they may not find an exact match, in which case they tweak the matched case to explain all of the symptoms we are experiencing with our dish-washer. Suppose we notice that our dish-washer is making too much noise, and the best match our database search yielded was a case about a dish-washer that was leaking. Putting two and two together, one can say the rubber lining of the door, which also acts as a sound barrier, is not tight enough. While this is not as grandiose as Priestley's analogy of the electro-static force with gravitation to derive the inverse square law, it is nevertheless a common use case of analogical reasoning. When computers draw analogies they do so based on the structure or syntax of the similarities. It would be more powerful if our case-based learner could retrieve a case about noise-proofing windows and apply it to the dish-washer by using an attribute the two hold in common: a hermetic seal.

Analogy as precursor to induction

Consider the following syllogism

All mammals deliver children in situ
Human is a mammal
So humans deliver children in situ

Where does one get the premise "All mammals deliver children in situ"? Many suggest that it is through induction. Suppose we see elephants deliver baby elephants, horses deliver baby horses, and so on. We give them a new class called "mammals" and classify all mammals in a theoretical framework. Some argue that it is in the definition of mammals that we capture the essence. This has the advantage that an animal which delivers live young may still not be a mammal if it does not, as mammals typically do, feed milk to its newborns. This can be explained more easily with the syllogism "All birds fly; Crow is a bird; So crows fly". What happens when we find a bird like the kiwi that doesn't fly? It leads to two possibilities: either our inductive method needs a rework or our definition of a bird is wrong. It is easier to fix the latter, as induction requires collecting a lot of evidence and then coming up with a theory that explains all of it.

Scientists frequently use induction --often based on analogies-- to create new theories; then, to answer a specific question, they use deduction. Contrast this with case-based reasoning, where there is no inductive step but cases are tweaked to explain a set of observations. This relieves legal scholars and computer scientists alike, when using case-based reasoning, of the time-consuming evidence gathering of the inductive step.

Analogies as Explanations

Aesop's fables are examples of allegories and metaphors used to draw analogies between humans and other domains. The Panchatantra tales are Indian allegories that serve a similar purpose. Used this way, analogies provide elucidation of profound concepts that are otherwise hard to explain. How can we explain the meaning of creativity to a child? Tell them a tale of a fox that talked its way out of a lion's den and they will understand, albeit within the scope of the tale. It is held by some scientists that every new theory should have some analogies to its predecessors. Viewed this way, scientific pursuit is a large continuum with one analogy piled over another. If Bohr's model of the atom was inspired by Kepler's description of planetary motion, then it makes sense. But where does the analogy end? Obviously the atomic nucleus doesn't behave like the sun, which is not only the center of the solar system but also itself in motion around the galaxy, and so on.

Analogies as reminders and synergies

Often we hear people say, “this reminds me of ...” What is the trigger for this? Among many viable options, analogical reasoning is the most common. One can tell a long story of how hindu children are initiated into wearing a sacred thread to a person who summarizes it as: “that reminds me of my grandson's bar mitzvah!” What is the link here? They are exchanging stories of coming of age. Different cultures use different means to mark coming of age. In some Amazonian tribes, coming of age involves getting bitten by venomous ants; the Sateré-Mawé, for example, require roughly 20 tocandira (bullet) ant "initiations" before a boy is considered a full warrior.

Analogies in Law

Lawyers use precedents to settle cases before they ever see the daylight of a courtroom, the reason being that either all of the available evidence is going to reinforce the verdict, or there is no evidence to begin with. When we have some evidence to the effect that a burglar broke a glass window and entered our house, there is likely more than one precedent where a burglary happened in a similar way at a different house; the only issue is the cost of the material damage and its recovery. In cases where there is no hard evidence but circumstantial evidence is available, a precedent can be a powerful way to argue the case. Take, for instance, someone getting sick after eating a store-bought green vegetable. Even without hard evidence of the purchase at a particular store, or of the presence of some bacteria --like E. coli-- to cause the illness, the case can be settled out of court using a precedent, assuming there is no nation-wide epidemic of E. coli.

Machine Learners for Analogical Reasoning

Sowa and Majumdar describe a machine learner for analogical reasoning that came up with the following similarities between cats and cars:

Analogy of Cat to Car

  Cat        Car
  head       hood
  eye        headlight
  cornea     glass plate
  mouth      fuel cap
  stomach    fuel tank
  bowel      combustion chamber
  anus       exhaust pipe
  skeleton   chassis
  heart      engine
  paw        wheel
  fur        paint

Each concept, i.e. cat and car in this case, is represented within a conceptual graph. Represented this way, the concept cat belongs in a conceptual graph rooted at animal, to which a dog also belongs. This conceptual graph in turn can be represented as a sub-graph of living beings, and so on. Then they use three methods to draw analogies:

  • match by labels: if cat and dog were to be compared, then it is straightforward as they share many labels, viz. number of eyes, tail, etc.
  • matching subgraphs: the sub-graph of living beings can be plants and animals which can then be compared
  • matching transformations: involves relating sub-graphs of one graph to sub-graphs of the other. For example, the sub-graphs of animals can be compared to sub-graphs of automobiles

While the learner is based on the work of the American logician Charles Peirce, it depends on the WordNet knowledge base, a collection of concepts containing over 100,000 nodes. The time it takes to search among these nodes depends on the algorithm used: while their algorithm of N log N complexity takes a few seconds, an algorithm of N^3 complexity, presumably making an exhaustive search, could take more than 30 years.

Peirce invented a geometric-topological notation called existential graphs. Like Venn diagrams for sets, existential graphs serve as an underlying implementation of logic, and presumably this has a biological basis. Existential graphs allow one to chart logical reasoning in its finest detail, making visible every single step in the reasoning process (as opposed to notations aimed at quick, results-oriented heuristics). An existential graph for the assertion "Tom believes that Mary wants to marry a sailor" is shown to illustrate the due diligence required in representing concepts.

Representing concepts in a graph notation this way is a time-exacting process. It might better be accomplished by a neural net; still, one has to come up with a training set comprising a list of concepts and their interconnections. But once that is done, an efficient algorithm can traverse the graph, with the additional advantage of being able to visually represent the inference process. On the other hand, some think analogy in computers rests on structural representations of concepts that are predisposed to analogies. Viewed this way, almost all analogical representations in computers share this defect: the representation is the key to a successful analogiser. So it is not clear whether the computer is any better than a child using Lego sets to create complex objects. Just as one can build any complex structure with Lego bricks, the syntax of computer representations can capture any similarities between concepts.

Data is to a Machine Learner as History is to Democracy

With the rise of social media there is a plethora of data to be mined and understood. A machine learner is perfectly suited for this task, where the data can be used, say, in training neural nets. If enough people share their data, it is theoretically possible to cure rare diseases and explain social behaviors alike. What the machine learners need, in effect, is an open society where everything about the life of a person can be known. Thus each person --their likes, dislikes, favorites, preferences, academic background, credit history, and skills-- serves as a datum for the learner. Democracies have come about because the history of failed nations is well known, where each failed nation provides a datum for creating a more robust democracy. However, unlike history, which is painstakingly reconstructed with archaeological studies and logic, the data harvested from social media is often too trivial to be of use beyond generating sales of products. Yet it raises concerns over the privacy of the individuals participating on social platforms.

Online References

http://www.jfsowa.com/pubs/analog.htm

https://plato.stanford.edu/entries/reasoning-analogy/

https://wiki.eecs.yorku.ca/course_archive/2014-15/W/6339/_media/conceptual_graph_examples.pdf

Thursday, May 3, 2018

D3 Collapsible Tree With Labels That Scale

Tree Example

Note that the tree is collapsible


Saturday, April 21, 2018

Logic

The history of western logic starts with the Greeks, even though Indians and Chinese had evolved their own logic. In this chapter we stick to Greek logic starting with Aristotle. But first let us study the logic behind jihad. All major religions --hinduism, christianity, islam-- have the concept of heaven. Hindus created a hierarchical structure of heavens like Indra loka (Lord Indra's abode), vaikuntham (Lord Vishnu's abode), kailasam (Lord Shiva's abode), etc. So for hindus, heaven means a grand tour of these abodes with a stopover in each of them. The concept can be easily explained as a tourist visa: with good karma comes a tourist visa with an expiration date. The soul is free to reside in heaven until the visa expires, at which time it returns to earth (or an earth-like planet) to start all over again. The christian and islamic concept of heaven differs from the hindu one in assuring a permanent visa status in heaven. This gives some muslims the motivation to commit suicidal missions (called jihad) in anticipation of a permanent visa in heaven. So let us summarize:

All religions offer a visa to Heaven
Hinduism offers a temporary visa to Heaven based on the law of karma
Islam and Christianity offer a permanent visa to Heaven

So logically we can derive that

All religions must follow law of karma

There is no traditional logical validity to the above statement; that is, logical methods like syllogism, induction, and deduction are not applicable. One could call it abduction, because there is a bit of generalization with a forced conversion (you can call it an imposition) of the law of karma. Hindus believe that Adi Sankaracharya (788 AD) used the anviksiki and tarka schools of logic to unify the various dual (dwita) religions that existed in the Indian subcontinent into non-dual (adwita). This was like a soft war. One can say the Greeks similarly converted the barbarians of Europe by teaching them logic. The famous Greek logician Aristotle was the inventor of the syllogism, which goes like this:

All men are mortal
Plato is a man
Therefore, Plato is mortal

A syllogism consists of three propositions. The first two, the premises, share exactly one term (men or man in the above example), and they logically entail the third proposition, called the conclusion, which contains the two non-shared terms of the premises (Plato and mortal).

All birds fly
Parrots are birds
Parrots fly

This seemingly correct syllogism has a false premise. Not all birds fly. Penguins, kiwis, birds with broken wings, etc. can't fly.

What do people mean when they ask: “Where is the logic in this?”

We often see people ask an author --whether a playwright or a blogger-- where the logic is, or what logic was used. We are expected to follow certain norms of discourse: we are supposed to “deduce” (a logical construct) rather than preach (based on belief or a religious text). In deduction we say:

Dogs chase Cats
Cats chase mice
Therefore dogs chase mice

In general: A implies B, B implies C, therefore A implies C. Once again this is prone to wrong implications. Consider this:

Fire starts sprinklers
Sprinklers start water-flow
Therefore Fire starts water-flow

The fallacy here is that fire and water-flow have no direct causal linkage: if a sprinkler malfunctions, water can flow without fire. Consider this syllogism:

All Indians are from India
Chief Nightfox is an Indian
Therefore, Chief Nightfox is from India

The fallacy is revealed by the clarification that Chief Nightfox is a Native American residing on the American continent: the term "Indian" is used in two different senses in the premises.

So what is logic good for?

There are many occasions where playwrights used logic to entertain their audience. Consider the following dialogue:

Actor 1: Have you forgotten me?

Actor 2: I have forgotten all men; if you are a man, I forgot you too

Logic was used by authors like Sir Arthur Conan Doyle to give us the famous detective character Sherlock Holmes, who deduced from facts who the culprit might be rather than conjecturing. In ancient India deduction was developed by the Samkhya school. Samkhya considered Pratyakṣa or Dṛṣṭam (direct sense perception), Anumāna (inference), and Śabda or Āptavacana (verbal testimony) to be the only valid means of knowledge, or pramana. Unlike some other schools, Samkhya did not consider the following three pramanas to be epistemically proper: Upamāṇa (comparison and analogy), Arthāpatti (postulation, deriving from circumstances), or Anupalabdi (non-perception, negative/cognitive proof). One could speculate that Doyle was influenced by the Samkhya tradition.

Boolean logic was extensively used by computer scientists. Boole's logic has two interpretations: class logic and propositional logic. Boole made a distinction between "primary propositions" which are the subject of syllogistic theory, and "secondary propositions", which are the subject of propositional logic, and showed how under different "interpretations" the same algebraic system could represent both. An example of a primary proposition is "All inhabitants are either Europeans or Asiatics,” which can be represented as

(x) [ I(x) -> (E(x) v A(x)) ]

An example of a secondary proposition is "Either all inhabitants are Europeans or they are all Asiatics."

(x) (I(x) -> E(x)) v (x) (I(x) -> A(x))

These are easily distinguished in modern propositional calculus, where it is also possible to show that the first follows from the second, but there is no way of representing this in the Boolean system. But more famously Boolean logic is used in processing inputs to AND, OR, XOR, NAND, NOT, etc. gates in the computer hardware.
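
The gates listed above can be written directly as Boolean functions on bits; a quick sketch:

```python
# The hardware logic gates expressed as functions on bits (0 or 1).
AND  = lambda a, b: a & b
OR   = lambda a, b: a | b
XOR  = lambda a, b: a ^ b
NAND = lambda a, b: 1 - (a & b)
NOT  = lambda a: 1 - a

# print the truth table for XOR
for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))
```

Every other gate can be built from NAND alone, which is why it is a staple of hardware design.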

Over the centuries since Aristotle, logic remained exotic, within the confines of elites interested in such philosophical topics as epistemology and ontology. Only recently has first-order predicate logic gained influence, in AI programs called rule-based systems that model some area of expertise. Consider the following rules:

If systole > 130 OR diastole > 90 then conclude high BP

If systole < 110 OR diastole < 80 then conclude low BP

If high BP then conclude stroke

If low BP then conclude heart disease

We know from visual inspection that the rules should be processed from top to bottom. But how can a computer, which doesn't have a representation of top or bottom, process these rules? A number of algorithms have been proposed, such as the famous Rete algorithm, to do inference based on recency, working memory, and other criteria that don't have a biological basis. Since they don't withstand the test of logical scrutiny, they remain academic pursuits. We still need to understand why they have failed and what can be done to fix them.
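
The blood-pressure rules above can be processed by a minimal forward-chaining loop that sweeps the rules until no new conclusion is derived. The encoding below (conditions as functions over a fact set) is an illustrative sketch, not the Rete algorithm:

```python
# Forward chaining over the blood-pressure rules: fire any rule whose
# condition holds, add its conclusion, and repeat until a fixed point.
RULES = [
    (lambda f: f["systole"] > 130 or f["diastole"] > 90, "high BP"),
    (lambda f: f["systole"] < 110 or f["diastole"] < 80, "low BP"),
    (lambda f: "high BP" in f["conclusions"], "stroke"),
    (lambda f: "low BP" in f["conclusions"], "heart disease"),
]

def infer(systole, diastole):
    facts = {"systole": systole, "diastole": diastole, "conclusions": set()}
    changed = True
    while changed:          # keep sweeping until no rule adds anything new
        changed = False
        for condition, conclusion in RULES:
            if condition(facts) and conclusion not in facts["conclusions"]:
                facts["conclusions"].add(conclusion)
                changed = True
    return facts["conclusions"]
```

Note that the fixed-point sweep makes the rule ordering irrelevant: the "stroke" rule fires on a later pass once "high BP" has been derived, which is one answer to the top-to-bottom question above.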

Historically though, the stoics school of logic (278-206 BC) dealt with predicates that entail the conditional statements if...then, AND and OR that were precursors to boolean logic and modern rule-based systems.

P vs. NP

If you can solve a problem in polynomial time (e.g. in $P(n)=n^{20} + 1$ steps), mathematicians call it P. If you can verify a proposed solution to a problem in polynomial time, the problem is in NP. In other words, the verification of a candidate solution must itself be efficient, even if finding one is not. Unfortunately, no one has so far been able to answer the question: is $P=NP$ or is $P \ne NP$?
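
Subset-sum makes the distinction concrete: verifying a proposed certificate takes polynomial time, while the obvious way of finding one enumerates exponentially many subsets. A sketch:

```python
from itertools import combinations

# Subset-sum: does some subset of `numbers` add up to `target`?
def verify(numbers, target, subset):
    # polynomial-time check of a proposed certificate
    return set(subset) <= set(numbers) and sum(subset) == target

def find(numbers, target):
    # brute-force search over all 2^n subsets
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None
```

Whether the exponential search in `find` can always be replaced by a polynomial-time algorithm is exactly the open P vs. NP question.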

How much memory is possible to a computer program?

We often hear that memory is cheap. But really, how much memory is available? Scientists estimate there are $10^{80}$ particles in the universe. If we could store a bit of information in each electron, that would be a hard limit on the number of bits a computer has access to. Similarly, the circuits on which computer programs run are spatially limited by the Planck length, roughly the distance traveled by light in $10^{-44}$ seconds!

Deductive vs Inductive Reasoning

According to Ashley Crossman, deductive reasoning and inductive reasoning are two different approaches to conducting scientific research. With deductive reasoning, a researcher tests a theory by collecting and examining empirical evidence to see if it is true. With inductive reasoning, a researcher first gathers and analyzes data, then constructs a theory to explain findings.

Suppose you want to establish that all marsupials originated in Australia. You can start by locating marsupials in Australia while failing to find any native to the other continents. But this is a lengthy process: you would need to search every nook and corner of the earth for a counterexample. After you gather enough evidence about marsupials in Australia, you can induce that marsupials did indeed originate there.

By nature, inductive reasoning is more open-ended and exploratory, especially during the early stages. Deductive reasoning is more narrow and is generally used to test or confirm hypotheses. Most social research, however, involves both inductive and deductive reasoning throughout the research process.

Turing Machine basis for AI

We ask if there is a procedure that, given any formal mathematical statement, would algorithmically determine whether the statement is true. Church and Turing proved there is no such procedure; Turing introduced the halting problem as a key example of a mathematical problem without an algorithmic solution. To understand the halting problem consider the pseudo-code:

Procedure A
Loop for I from 1 to 100
    print I
End Loop

From a visual inspection we can see that the code will print integers from 1 to 100. Now let us modify it as follows:

Procedure B(Input: I)
print I
J=I+1
Call Procedure B(J)

The above is an example of a recursive procedure that will never halt. But there is no general mechanical way to verify or prove this for arbitrary programs. So here come the empiricists, who test algorithms by writing computer programs and seeing if the inputs and outputs match up. The opposing camp comprises the rationalists (yes, as you have guessed, armchair philosophers), who would conclude that it is better not even to try implementing a Procedure B that is so poorly coded.
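Rendered in Python, Procedure A plainly halts while Procedure B recurses without bound; the interpreter does not decide the question, it simply runs out of stack:

```python
def procedure_a():
    """Collects the integers 1 through 100, then halts."""
    out = []
    for i in range(1, 101):
        out.append(i)
    return out

def procedure_b(i, trace):
    """Recurses forever in principle; Python's stack limit cuts it short."""
    trace.append(i)
    procedure_b(i + 1, trace)

trace = []
try:
    procedure_b(1, trace)
except RecursionError:
    pass  # aborted by the stack limit, not because the procedure halted
```

The `RecursionError` is the empiricist's verdict: the experiment was interrupted, but nothing about the program's halting behavior was actually proved.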

Another way to understand the halting problem: say you have an app on your smartphone that frequently crashes, requiring you to reboot the device. Is there some way to check whether the app will terminate and leave your device in its original state? There would be, if you could write a PulsecheckApp that simulates your inputs on any app you have downloaded; only if the PulsecheckApp is satisfied that the app will not crash your device will it allow you to use it. But a clever programmer can spoof the PulsecheckApp and make an app pass the test. A similar thing happened with Volkswagen cars, whose computers adjusted the engine's behavior when an emissions test was being conducted so as to meet federal standards. That is, the car's computer knew when it was being tested and when to flatter the numbers. This is also called gaming the system.

Besides spoofing, Turing actually proved, well before computers were ever manufactured, that you can't write a program to determine whether an arbitrary program will halt. If it were possible, the halting checker could, in simple terms, be fed as input to itself, and there is no end to how many times you have to recurse. This is similar to Russell's paradox, which says: consider the set of all sets that are not members of themselves. Such a set is a member of itself if and only if it is not a member of itself.

Some sets, such as the set of all teacups, are not members of themselves. Other sets, such as the set of all non-teacups, are members of themselves. Call the set of all sets that are not members of themselves “R.” If R is a member of itself, then by definition it must not be a member of itself. Similarly, if R is not a member of itself, then by definition it must be a member of itself.

Or put another way: suppose the village barber, the only barber in the village, cuts the hair of everyone in the village who does not cut their own hair. Does "everyone" include the barber himself? If he cuts his own hair, he is cutting the hair of someone who cuts his own hair, which the rule forbids; if he doesn't, then by the rule he must. It is a paradoxical statement.

Some view Turing's halting problem as an offshoot of Gödel's incompleteness theorem. Turing and Gödel were contemporaries and both spent time at Princeton, though apparently they never met. Gödel's incompleteness theorem applies to arithmetic: there will always be some true statements that can't be proven using the axioms and theorems of arithmetic. Consider the example in marriage: 1 + 1 = 3.

No matter what the rationalists had to say, the empiricists, who would like to try things out, marched on. The Turing machine became a standard model for a general-purpose computing device. These results led to the Church-Turing thesis that any deterministic algorithm that can be carried out by a human can be carried out by a Turing machine. This applies straightforwardly to arithmetic, algebra, etc. But what about human visual processing, motor skills and other biological faculties for which algorithms are continually being developed and evaluated? Granted, neural nets for computer vision can be implemented on a Turing machine, as they have a halting criterion (i.e. the error is minimal). But they are better realized on parallel machines, with each neuron running on its own processor, computing its output from weighted inputs after activation; a general-purpose Turing machine is not the best fit for that. To understand it better, consider a Swiss Army knife versus a kitchen knife. To cut an onion you can use either one, but the kitchen knife is straightforward and doesn't need the extra step, as with the Swiss Army knife, of figuring out which of its tools can do the job.

Case-based reasoning

Case-based reasoning is a special case of analogical reasoning where the algorithm looks for instances in the past that best match the present situation based on some criteria. Suppose you are acting like a doctor and have access to all the case files in a physician's office; this is a sub-plot of Steven Spielberg's movie “Catch Me If You Can”. When a new patient walks into your office, you gather the symptoms and go through the case files until you find a match. If there is no exact match, you make the best possible guess from a previous case and come up with a diagnosis. This, in essence, is the case-based reasoning approach. The program has no domain model, nor any logic other than pattern matching, yet it manages to find a satisfying response whether you are looking for a food recipe or a legal case.

Even though case-based reasoning amounts to pattern matching, its enormous popularity with help desks is unprecedented and has made it one of the most successful applications of AI. When you call a manufacturer about a broken dishwasher, chances are the company already has a record of a similar incident reported by another customer. So they take your description of the problem, plug it into a computer program and read out the remedy. Where do new cases come from? If no match is found among existing cases, a new case is created and assigned to a technician. The technician offers a solution and updates the case knowledge base, tagging the new case with keywords to be used in a future search to resolve another customer's problem like yours.
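A bare-bones case matcher of the kind a help desk might use can be sketched with set overlap. The sample cases and the Jaccard-similarity scoring are illustrative assumptions, not a description of any particular product:

```python
def jaccard(a, b):
    """Similarity of two symptom sets: size of overlap over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_case(case_base, symptoms):
    """Return the stored case whose symptoms best match the new report."""
    return max(case_base, key=lambda case: jaccard(case["symptoms"], symptoms))

# A tiny hypothetical knowledge base of dishwasher incidents
case_base = [
    {"symptoms": {"no power", "blank display"}, "remedy": "check the fuse"},
    {"symptoms": {"leaking", "standing water"}, "remedy": "clean the drain filter"},
    {"symptoms": {"noisy", "vibration"}, "remedy": "level the feet"},
]

match = best_case(case_base, {"leaking", "standing water"})
```

No domain model is involved: the program neither knows what a drain filter is nor why it would help; it only knows which past case looks most like the present one.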

Explaining inference

Experts have become what they are by being able to come up with a hypothesis that fits the data and to explain why it is the best possible hypothesis. Some call it a “logical” explanation. As the old story goes, a carved image is what is left after chiseling away the stone that didn't match the image. We use all the tools in our chest, so to say, to explain, including, if necessary, Newton's laws and Einstein's relativity. Expert testimony in courtrooms is replete with logical induction, deduction, analogical and mathematical reasoning, physical and chemical science, etc. What about AI programs? In rule-based systems the best the program can do is give a trace of all the rules that were “fired” or considered for matching with the input. In the rules given earlier, suppose someone wants to know what causes stroke and heart disease besides BP; the expert system will have no answer, as such rules are not programmed. Further, with every new finding added to the rule premises (the left-hand side), the search process grows exponentially. Consider the following:

If

systole > 130 AND diastole > 90 AND blood-glucose > 120 AND no exercise

Then

conclude heart disease from diabetes

To test each of the premises in the above rule we need $2^4=16$ entries in the truth table such as these:

systole > 130   diastole > 90   blood-glucose > 120   exercise
      1               0                  0                0
      1               0                  0                1
      1               0                  1                1
      1               1                  1                1
      1               1                  0                0
      1               1                  1                0
      1               1                  0                1

With each additional symptom to consider (e.g. pulse > 100), the table doubles in size. For example, if we were to add heart rate, we would have to account for $2^5=32$ entries. In other words, the search space grows exponentially.
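The doubling is easy to reproduce by enumerating the truth assignments directly (premise names are taken from the rule above):

```python
from itertools import product

premises = ["systole > 130", "diastole > 90", "blood-glucose > 120", "exercise"]

# every row of the truth table is one assignment of 0/1 to the premises
table = list(product([0, 1], repeat=len(premises)))       # 2^4 = 16 rows

# one more symptom, e.g. pulse > 100, doubles the table
table5 = list(product([0, 1], repeat=len(premises) + 1))  # 2^5 = 32 rows
```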

But if someone wanted to know why a particular rule came into existence in the first place, the program is woefully inadequate to explain it. IBM's Watson computer is supposed to overcome this by capturing all of the knowledge that we humans have ever created. Even if such a task is possible, it is proving to be very expensive and a massive undertaking. Imagine scanning all the knowledge ever printed on paper, capturing the folklore transmitted only by word of mouth, and making it available in a program the computer can process, only to fumble at the simple question: “where do the Hindu vedas come from?”.

Is Memorization and Rote Learning a component of Intelligence?

Children are taught to memorize multiplication tables, among other things, even in this age of the world wide web, not to mention the calculators that have been replaced by smart phones. There is a reason for that. When you are standing in a checkout queue at a store, it would help the others waiting in line if you could figure out the expected amount to pay rather than quiz the clerk and hold everyone up. Suppose you bought 5 cartons of milk, each costing \$2.30; a quick multiplication would help you be prepared at the checkout. Why do we think the point-of-sale machines will always be right? Sometimes they add local taxes and recycling fees that overstate the bill you anticipated to be \$11.50 (for the 5 cartons of milk).

Another place rote learning helps is in areas of computer science where certifications are required. Say an Oracle database caps the length of a variable at 256 characters (a limit that is, by the way, $2^8$, the number of values one byte can represent; on modern 64-bit machines there is no reason to keep it unless old code is being reused). We would rather remember this while programming than wait for the compiler to throw an error at us, if it ever does, so as to minimize the debugging effort. Salesforce, a popular CRM platform, is notorious for restricting the available resources because it offers a multi-tenant environment in which a single resource, be it CPU, database or disk, is shared among multiple organizations. For every resource in Salesforce there is a limit, and being a certified programmer in Salesforce means knowing all about these limits.

Not so surprisingly, AI scientists have not addressed rote learning itself. It is not about whether they have done it themselves, but whether they know how it is done. Eidetic memory may have something to do with it.

Muslim children studying in madrasas are prime examples of how rote learning applies to religious texts. Almost all priests have eidetic memory, without which they could not possibly recite the ancient texts accurately. Scientific studies done on vedic pundits showed that when they missed a phoneme, they would restart from the beginning, possibly because they didn't know how many other places they had missed. Kekulé, who discovered the ring structure of benzene, was said to have imagined it while riding on a bus. It seems there are several people amongst us with extraordinary memory who are not categorized as literate by prevailing academic standards. This makes a strong case that memorization is an evolutionary strategy to cope with increased information, or information explosion, when retrieving from memory, sometimes called lazy loading, is either too time consuming or simply tiring.

Decision Trees

Part of any good learning involves abstracting classes from a data set. Suppose we have the following data (not all rows shown, for brevity) about our decision to go to the gym or jog outside. We would like to create a learner, based on the data, to help us decide whether to jog or go to the gym. Central to the classifier is the concept of entropy. For a binary classification, if the rows are equally divided between jog and gym, the entropy is one; if all the rows map to either jog or gym alone, the entropy is zero.

The algorithm itself can very simply be given as:

  • compute the entropy for data-set
  • for every attribute/feature:
    1. calculate entropy for all categorical values
    2. take average information entropy for the current attribute
    3. calculate gain for the current attribute
  • pick the highest gain attribute
  • Repeat until we get the tree we desired
Sunny   Snow   Wind    Temperature   Humidity   Gym     Jog
False   True   False   False         True       True    False
False   True   False   True          True       True    False
False   True   False   True          True       False   True
False   True   False   True          True       False   False
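The bulleted procedure above can be sketched as a pair of functions. Here they are applied only to the four sample rows, so the numbers come out smaller than those computed from the full 128-row data set below; the dictionary keys are assumptions about how the table would be encoded:

```python
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(v) / total * log2(labels.count(v) / total)
                for v in set(labels))

def gain(rows, attr, target):
    """Information gain of splitting `rows` on `attr` w.r.t. `target`."""
    base = entropy([r[target] for r in rows])
    avg = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        avg += len(subset) / len(rows) * entropy(subset)
    return base - avg

# The four sample rows, with Gym as the target and Temperature as the attribute
rows = [
    {"temperature": False, "gym": True},
    {"temperature": True,  "gym": True},
    {"temperature": True,  "gym": False},
    {"temperature": True,  "gym": False},
]
g = gain(rows, "temperature", "gym")  # base entropy is 1.0 on this tiny sample
```

Picking the attribute is then just `max` over the gains; the tree-building step recurses on each subset with the chosen attribute removed.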

Suppose we want to know if we should go to the gym today. Count the rows where Gym=True and Gym=False; let them be 30 and 98 respectively.

The entropy contributions of the two outcomes (using base-2 logarithms throughout) are $-{30 \over {30 + 98}}*log({30 \over {30 + 98}})=0.49$ and $-{98 \over {30 + 98}}*log({98 \over {30 + 98}})=0.29$.

Entropy $E(S)=\sum_c {-p(c)*log(p(c))}$, where $c$ = true or false, $= 0.49 + 0.29 = 0.78$

$f_{sunny=true, gym=true}$ = frequency of occurrence of (Sunny=True, Gym=True) = 14

$f_{sunny=true, gym=false}$ = frequency of occurrence of (Sunny=True, Gym=False) = 16

Entropy $E(Weather=Sunny) = -{14 \over {(14+16)}}*log({14 \over {(14+16)}}) - {16 \over {(14+16)}}*log({16 \over {(14+16)}}) = 0.99$. Similarly, $E(Weather=Rain) = -{15 \over {(15+15)}}*log({15 \over {(15+15)}}) - {15 \over {(15+15)}}*log({15 \over {(15+15)}})=1.00$

$E(Weather=Snow) = -{7 \over {(7+23)}}*log({7 \over {(7+23)}}) - {23 \over {(7+23)}}*log({23 \over {(7+23)}}) =0.78$

Avg entropy information for weather (each of Sunny, Rain and Snow covers 30 of the 128 rows): $I(Weather) = {{30} \over {30 + 98}}*0.99 + {{30} \over {30 + 98}}*1.00 + {{30} \over {128}}*0.78=0.65$

$Gain(Weather) = E(S) - I(Weather) = 0.78 - 0.65 = 0.13$

$f_{gym=true,snow=True}$ = frequency of occurrence of (Gym=True and Snow=True) = 7

$f_{gym=true,snow=False}$ = frequency of occurrence of (Gym=True and Snow=False) = 23

Therefore, $E(Gym=True|Snow) = -{ 7 \over {(7 + 23)}}* log ({7 \over {(7 + 23)}}) -{23 \over {(7+23)}}*log({23 \over {(7+23)}}) =0.78$

$f_{gym=false,snow=True}$ = frequency of occurrence of (Gym=False and Snow=True) = 25

$f_{gym=false,snow=False}$ = frequency of occurrence of (Gym=False and Snow=False) = 73

Therefore $E(Gym=False|Snow) = -{25 \over {(25 + 73)}}*log({25 \over {(25+73)}}) - {73 \over {(25+73)}}*log({73 \over {(25+73)}})=0.82$

Avg Entropy Information for $E(Gym) = {{(7+23)} \over {128}}*0.78 + {{(25+73)} \over {128}}*0.82=0.81$

$Gain(Gym) = E(S) - I(Gym) = 0.78-0.81=-0.03$
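The arithmetic can be checked in a few lines, using base-2 logarithms and the counts given in the text. Note that each Weather value (Sunny, Rain, Snow) covers 30 of the 128 rows, so every weighted term in the average uses 30/128:

```python
from math import log2

def H(*counts):
    """Entropy, in bits, of a distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

e_s = H(30, 98)       # overall entropy, ~0.79
e_sunny = H(14, 16)   # ~0.99
e_rain = H(15, 15)    # exactly 1.00
e_snow = H(7, 23)     # ~0.78

# weighted average information of the Weather attribute, ~0.65
i_weather = (30 * e_sunny + 30 * e_rain + 30 * e_snow) / 128
gain_weather = e_s - i_weather  # ~0.13
```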

And so on, we calculate the gain for each of the attributes. With a bit of arithmetic, our decision tree might look like this:

The interesting thing about the tree is that we can now read off rules by traversing from the root to the leaf nodes, such as these:

  • If weather=snowy and temperature=low then go to gym
  • If weather=snowy and temperature=high then go to jog
  • If weather=rainy and temperature=low then go to gym
  • If weather=rainy and temperature=high then go for jog
  • If weather=sunny and temperature is high then go for jog
  • If weather=sunny and temperature is low then go to gym
  • If weather=sunny and humidity is high then go to gym
  • If weather=sunny and humidity is low then go for jog

Unlike biologically inspired neural nets and evolutionary programs, a decision tree lends itself to a detailed description of how it arrived at a conclusion starting from a set of facts. While this is its most appealing aspect, the downside is managing the complexity. As shown above, the entropy and information gain calculations carry a computational cost that is combinatorial. If we throw in another class, say go ski if the weather is snowy and the temperature is low, we have to redo the whole procedure.

One has to admit this example is a little contrived: we started with a state space that had these rules built in to begin with. The idea is to demonstrate the procedure for creating a decision tree using entropy and information gain.

In Conclusion

We know that logic plays a vital role in our daily interactions with people. Logic was supposed to have led to the development of arithmetic and algebra but never achieved their rigor. Since Aristotle's time the development of logic has happened in spurts. Gödel's incompleteness theorems and the other paradoxes are worth studying but won't by themselves make us better at anything other than logic. AI has a branch called natural language processing that posits the existence of a universal grammar. It was thought the ancient language Sanskrit might provide the basis for it, at least for the Indo-European family of languages. Some linguists like Noam Chomsky believe language is innate. Others believe it is nurture and that competence in any natural language can be achieved without prior exposure. Children make the best test subjects for the memorization that helps them cope with the information explosion. Rote learning forms a major component of religious learning. If computers are meant to learn the way we do, they need to be equipped with all of these facets to pass the Turing test.

Online References

http://math.wikia.com/wiki/History_of_logic

https://plato.stanford.edu/entries/logic-ancient/

http://www.individual.utoronto.ca/pking/miscellaneous/history-of-logic.pdf

https://en.wikipedia.org/wiki/Samkhya

https://en.wikipedia.org/wiki/Rete_algorithm

https://www.thoughtco.com/deductive-vs-inductive-reasoning-3026549

http://language.worldofcomputing.net/tag/chomsky

https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1