Thursday, April 19, 2018

Bayes

“It is likely to rain,” is a common refrain in the rainy season. What does it really mean, though? One answer is that there is a high probability of rain, where probability is a measure of prior occurrences of rain given the atmospheric conditions. But where does probability come from? Some opine that it is based on frequency. They are called frequentists. For example, everyone understands coin flips. The probabilities of heads and tails on an unbiased coin flip are equal. There are 2 assumptions here: (1) a coin can land on either heads or tails but not on its curved surface and (2) the probabilities add up to 1. The probability of an outcome is always a real number between 0 and 1. If using an unbiased coin, the probabilities of heads and tails should be the same (0.5). In reality a coin is biased and, flipped enough times, will favor either heads or tails. The number of times a coin lands on heads or tails is called frequency. In other words, probability in most cases is based on the frequency of occurrences. It is interesting to note in physics that when light is viewed as a stream of photons spinning in circles, the reflection from a surface is based on probability. Prof. Feynman actually talked about viewing photon motion as a spinning arrow, with the reflection probability based on the square of the length of the arrow connecting the tail of the arriving photon and the head of the reflected photon. This is also called the square of the amplitude because the probabilistic region is a circle (whose area is proportional to the square of the radius) with the radius given by the amplitude. So going back to the statement that “it is likely to rain”: it can be based on the speaker's observations of prior rainy days. An informed speaker takes into account all of the variables that predict the rain: presence of clouds, prevailing winds, season of the year, etc.

Most often we don't have frequency measurements on which to base the probability. For instance, the probability of fever cannot exist in isolation because fever is often the symptom of underlying causes. Flu can cause fever. And so can the common cold. So the probability of fever is conditional on the underlying causes. If fever is treated as non-conditional, then it could be based on frequency: suppose one can say 20% of the population, at any given time, is afflicted with fever. That is the prior probability of fever, P(Fever)=0.2. Similarly the prior probability of Flu is given as, say, P(Flu)=0.3. If the conditional probability of fever given flu, P(Fever|Flu), is needed, then P(Fever) and P(Flu) should be related to it. This is the leap of faith the Reverend Bayes, an 18th-century clergyman, made. Using his theorem we can say P(Flu|Fever) = P(Flu)*P(Fever|Flu)/P(Fever). This doesn't seem like a great way to predict flu given fever. As already stated, fever has many causes, and flu is just one of them. The fever could manifest because of microbes attacking the immune system or the temperature of the surroundings, not to mention stress and work activity. Therefore, a physician might not adhere to Bayes' theorem as religiously as Rev. Bayes would like.

A Classical Example of Bayes' Theorem

Stated in words, Bayes' theorem says we start with a provisional hypothesis and compute its prior probability. Then, in light of new evidence, we revise the probability of the hypothesis; the revised value is called the posterior probability. To illustrate, let's suppose that we were getting tested for HIV. Most medical tests have 4 outcomes:

  • True Positive: Both the test and presence of HIV in the patient are true
  • True Negative: The test and presence of HIV agree; no HIV
  • False Positive: Test and presence of HIV don't match. The test is positive on HIV but the patient actually doesn't have HIV
  • False Negative: The test detects no HIV even though the patient has HIV

The test manufacturer typically gives the accuracy, say, 99.5%. This is the first conditional probability we have, i.e. P(+ve|HIV)=0.995. Likewise P(-ve|No HIV)=0.995, the conditional probability that the test is negative when the patient doesn't have HIV. Taking the HIV prevalence in the general population to be 2%, which is the prior probability P(HIV), and using Bayes' theorem, we can say:

${P(Cause|Effect)} = {{P(Cause)*P(Effect | Cause)} \over P(Effect)}$

${P(HIV|+ve)} = {{P(+ve|HIV) * P(HIV)} \over P(+ve) }$

Substituting the known probabilities:

${P(HIV|+ve)} = {{0.995 * 0.02} \over P(+ve)}$

The denominator P(+ve) is the overall probability of the test being positive, which covers both the True Positive and False Positive cases. Therefore:

$P(+ve) = P(+ve|HIV)*P(HIV) + P(+ve|No HIV)*P(No HIV)$

We know the prior probability of P(HIV)=0.02. So P(No HIV)= 1-0.02=0.98 because the probabilities have to add up to 1.

Similarly, P(+ve|No HIV) = 1 – P(-ve|No HIV) = 1-0.995=0.005

We already know from the manufacturer P(+ve|HIV)=0.995

Substituting

${P(HIV|+ve)} = {{0.995 * 0.02} \over {0.995 * 0.02 + 0.005 * 0.98}}$ = 0.8024 or 80.24%

The patient potentially has a better outlook than the test suggests because the conditional probability of HIV given a positive test result is much lower than the advertised 99.5% accuracy.
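The arithmetic above can be checked with a short calculation. This is a minimal sketch, assuming (as in the example) that sensitivity and specificity are both 0.995 and the prevalence is 0.02:

```python
# Hedged sketch of the HIV-test posterior computed above.

def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

p = posterior(prior=0.02, sensitivity=0.995, specificity=0.995)
print(round(p, 4))  # 0.8024
```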

Certainty Factors

The traditional alternative to the concept of physical probability, or what frequentists describe, is to think of probabilities as reporting our subjective degrees of belief. This is what Bayesians prefer for the prior probabilities. Modern computer scientists take it further by adding disbelief. In the '80s and '90s, when medical knowledge was codified as rules in knowledge-based systems (also called expert systems), the contribution of new evidence to a hypothesis was given as certainty factors. Certainty factors were developed because of difficulties with the Bayesian method: Bayes' theorem's accuracy in predicting likelihoods depends on knowing prior probabilities. For example, to determine the probability of a disease:

${P(D_i| E)} = {{P(E | D_i) * P(D_i)} \over \sum{(P(E | D_j)*P(D_j))}}$

Where

  • $D_i$ is the i'th disease,
  • E is the evidence,
  • $P(D_i)$ is the prior probability of the patient having the Disease i before any evidence is known
  • $P(E | D_i)$ is the conditional probability that the patient will exhibit evidence E, given that disease $D_i$ is present

If we are able to calculate $P(D_i | E)$ somehow, then when incremental evidence is presented the equation turns more complicated, like this:

${P(D_i | E)} = {{P(E_2 | D_i \bigcap E_1) * P(D_i | E_1)} \over \sum{(P(E_2 | D_j \bigcap E_1) * P(D_j | E_1))}}$

where $E_2$ is the new evidence added to yield the new augmented evidence:

$E = E_1 \bigcap E_2$

In general, if we are dealing with N symptoms, each of which can be true or false, there are $2^N$ possibilities. Computer scientists call this a combinatorial explosion.

Certainty factors sidestep this by letting scientists elicit degrees of belief directly from human experts:

  • 0 means no evidence
  • values greater than 0 favor the hypothesis
  • values less than 0 favor the negation of the hypothesis

$CF = {{MB - MD} \over {1 - min(MB, MD)}}$

Where

  • CF = certainty factor
  • MB =measure of belief
  • MD=measure of disbelief
When 2 rules support the same hypothesis, the combined certainty factor is

$CF = CF_1 + CF_2*(1-CF_1)$ if both > 0

$CF = {{CF_1 + CF_2} \over {1-min(|CF_1|, |CF_2|)}}$ if they have opposite signs

$CF = CF_1 + CF_2*(1+CF_1)$ if both < 0

CF's are found to be convenient in enabling computer programs to diagnose diseases using rules, supplanting human physicians. One criticism of CF's is that while they capture the increased belief in a hypothesis given evidence, they assume that belief and disbelief are independent of each other. For example, an expert might say the presence of clouds raises the belief in imminent rain to, say, 0.6; another expert might say that if the clouds are cirrus, there is disbelief of, say, 0.8. So the experts are inconsistent, if not contradicting one another.
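The combination rules above can be sketched as a small function, following the standard MYCIN-style formulas:

```python
# Hedged sketch of certainty-factor combination for two rules that
# bear on the same hypothesis.

def combine_cf(cf1, cf2):
    """Combine two certainty factors supporting the same hypothesis."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    # Opposite signs: scale the sum by the smaller magnitude
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

print(combine_cf(0.6, 0.4))   # two supporting rules reinforce: 0.76
print(combine_cf(0.6, -0.8))  # conflicting rules partially cancel
```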

Naive Bayes

Let us illustrate Naive Bayes with an example. Suppose we want to classify emails as Spam or No-Spam. The Spam detection is based on the occurrences of words like “free”, “won”, “lottery”, etc. Using Bayes' theorem we can write:

$P(Spam | w_1, w_2...w_n) = P(Spam) * P(w_1,w_2...w_n | Spam) / P(w_1, w_2...w_n)$ where w represents a word.

The likelihood term $P(w_1,w_2...w_n | Spam)$ can be expanded by the chain rule: $P(w_1 | w_2...w_n, Spam) * P(w_2 | w_3...w_n, Spam)...P(w_n | Spam)$

Then we make the assumption that $P(w_1 | w_2...w_n,Spam) = P(w_1 | Spam)$ which is central to the Naive Bayes.

That is, we assume the occurrence of word $w_1$ is independent of the occurrences of rest of the words. So “won lottery” is interpreted as two spam words “won” and “lottery” even though they occur together in the email.

Therefore, $P(Spam | w_1, w_2...w_n) = P(Spam)*P(w_1 | Spam)*P(w_2 | Spam)...P(w_n | Spam) / P(w_1,w_2...w_n)$

To calculate $P(w_1 | Spam)$ we can use the frequency of occurrence of word $w_1$ in spam emails. As for the denominator, many applications of Naive Bayes ignore it because it is the same for every class, making the final form

$P(Spam | w_1, w_2...w_n) \propto P(Spam)*P(w_1 | Spam) * P(w_2 | Spam)...P(w_n | Spam)$

On a computer, each of the probabilities on the right hand side is so small (each less than 1) that their product will underflow to zero! To overcome this, we take the log of both sides

$log(P(Spam | w_1, w_2...w_n)) = log(P(Spam)) + \sum_{i=1}^n{log(P(w_i | Spam))}$

What if $P(w_i | Spam)=0$? To avoid this, we use Laplace smoothing, i.e. we assume that each word occurs at least once.
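Putting the pieces together (log probabilities plus Laplace smoothing), here is a minimal sketch of a Naive Bayes spam scorer; the tiny training corpus and the 0.5 priors are invented for illustration:

```python
# Hedged sketch of Naive Bayes spam classification with add-one
# (Laplace) smoothing and log probabilities to avoid underflow.
import math
from collections import Counter

spam_docs = [["free", "lottery", "won"], ["free", "money"]]
ham_docs  = [["meeting", "tomorrow"], ["lottery", "results", "meeting"]]

def train(docs):
    counts = Counter(w for d in docs for w in d)
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts, ham_total = train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_score(words, counts, total, prior):
    score = math.log(prior)
    for w in words:
        # Laplace smoothing: every word counted at least once
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

email = ["free", "lottery"]
spam_score = log_score(email, spam_counts, spam_total, prior=0.5)
ham_score = log_score(email, ham_counts, ham_total, prior=0.5)
print("spam" if spam_score > ham_score else "ham")
```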

Bayes Net

A Bayes Net is a model of the world we would like to make predictions about. Consider the following causalities for Dyspnea (shortness of breath): a visit to a foreign country can lead to Tuberculosis; Smoking can cause Cancer or Bronchitis; and Tuberculosis, Cancer, and Bronchitis can each cause Dyspnea.

Suppose we ask what is the probability that Dyspnea is caused by a trip to a foreign country? For that we need to traverse the following subtree of the Bayes net:

Visit to Foreign Country->Tuberculosis->Dyspnea

If we know that the patient has TB, then the prior visit, if any, is irrelevant. So we have:

P(Dyspnea | Visit ^ TB) = P(Dyspnea | TB)

Cancer and Bronchitis have a common cause in Smoking.

P(Bronchitis | Cancer ^ Smoking) = P(Bronchitis | Smoking)

This is easier to understand because Bronchitis has no direct causal link to Cancer. We could've also written:

P(Cancer | Bronchitis ^ Smoking) = P (Cancer | Smoking)

Common effect is when Dyspnea can be caused by Bronchitis or Cancer

P (Dyspnea | Cancer ^ Bronchitis) $\ne$ P(Dyspnea | Cancer) $\ne$ P(Dyspnea | Bronchitis)

One observation we can make here is that, once Dyspnea is observed, ruling out Cancer increases the conditional probability of Bronchitis.

Most commonly, Bayesian nets are considered to be representations of joint probability distributions. Sparse Bayesian networks (those with relatively few arcs, which means few parents for each node) represent probability distributions in a computationally tractable way.

Modeling this way from first principles is the preferred approach in various applications of Bayesian nets. Obviously it is a tedious procedure to create a Bayes net, especially in a world where everything is inter-connected and prior probabilities don't exist. After all, the statement that a butterfly flapping its wings in Asia could cause a tornado in the US midwest is not far-fetched for the Bayesians.
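Inference over the Visit -> Tuberculosis -> Dyspnea chain described earlier can be sketched by marginalizing over TB; the conditional probability values below are hypothetical, not real medical figures:

```python
# Hedged sketch: P(Dyspnea | Visit) over the chain Visit -> TB -> Dyspnea.
# All CPT numbers are invented for illustration.

p_tb_given_visit = 0.05    # hypothetical P(TB | Visit)
p_dysp_given_tb = 0.7      # hypothetical P(Dyspnea | TB)
p_dysp_given_no_tb = 0.1   # hypothetical P(Dyspnea | no TB)

# Marginalize over TB; given TB, the visit is irrelevant.
p_dysp_given_visit = (p_dysp_given_tb * p_tb_given_visit
                      + p_dysp_given_no_tb * (1 - p_tb_given_visit))
print(round(p_dysp_given_visit, 3))  # 0.13
```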

Decision Tree

A decision tree is a special case of a Bayes net where at each node we ask a question like “Does the patient smoke?” If true, we ask another question: “Has the patient contracted TB or cancer?” If the answer to the first question is false, then we pursue the foreign visit: “Has the patient visited a foreign country?” And so on.

Given these 6 variables we have $2^6$ or 64 possibilities as follows:

Visit  Smoking  Cancer  TB  Bronchitis  Dyspnea
1      0        0       0   0           0
1      0        0       0   0           1
1      1        0       0   0           1

A decision tree makes it easier to handle this truth table by pruning some subtrees based on how the patient answers the questions. For example, if a patient never smoked, then we can eliminate bronchitis from consideration. A variation of the decision tree for classification was given as a generic task. The ideas are: (a) we manage complexity by hierarchically representing hypotheses and (b) we can prune branches in the hierarchical structure based on evidence. An algorithm called establish-refine has been suggested: only when we establish the probability or truth of a hypothesis do we refine it by entertaining all of the sub-hypotheses connected to it. It is well known that rule-based systems that use certainty factors adopt a similar strategy, but it is implicit in the way rules are processed. For example, the rules for bronchitis will fire only after it has been established that the patient smokes. In the generic task, the establish process can be as complex as one likes. To establish smoking, one can ask “how many packs a day”, “any exposure to secondary smoke”, “any cigars”, etc.

Markov Chains

A Markov chain is a state-transition diagram subject to the rules of probability. It assumes that no matter which states were traversed in the past, the probabilities of the possible future states depend only on the current state. In other words, the current state alone determines the probability of transitioning to a future state.

For example, if you made a Markov chain model of an infant's activities, you might include "playing," "eating," "sleeping," and "crying" as states, which with other activities could form a 'state space': a list of all possible states. In addition to states, a Markov chain tells you the probability of transitioning from one state to any other state, e.g., the chance that an infant currently playing will fall asleep in the next five minutes without crying first.

Consider another example based on web surfing. No matter what pages you browsed earlier to arrive at the current page, the links on the current page determine the possible future states, with the assumption that you are not typing an altogether new address into the browser.

In a paper written in 1913, Markov chose a sequence of 20,000 letters from Pushkin's Eugene Onegin to see if this sequence can be approximately considered a simple chain. He obtained the Markov chain with transition matrix:

$$ \begin{pmatrix} 0.128 & 0.872 \\ 0.663 & 0.337 \end{pmatrix} $$ The transition probabilities are given as

  • vowel->vowel = .128
  • vowel->consonant=.872
  • consonant->vowel=.663
  • consonant->consonant=.337

Note that the probabilities on each row add up to 1.

The fixed vector for this chain is (.432, .568), indicating that we should expect about 43.2 percent vowels and 56.8 percent consonants in the novel, which was borne out by the actual count.
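The fixed vector can be checked by repeatedly applying the transition matrix above (states ordered vowel, consonant) to any starting distribution; a minimal sketch in plain Python:

```python
# Hedged sketch: iterate Markov's vowel/consonant transition matrix
# until the distribution settles at the fixed (stationary) vector.

P = [[0.128, 0.872],
     [0.663, 0.337]]

pi = [0.5, 0.5]  # any starting distribution converges
for _ in range(100):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

print([round(x, 3) for x in pi])  # [0.432, 0.568]
```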

Claude Shannon considered an interesting variation of this idea in his book The Mathematical Theory of Communication. Shannon considers a series of Markov chain approximations to English prose. He does this first by chains in which the states are letters and then by chains in which the states are words. For example, for the case of words he presents first a simulation where the words are chosen independently but with appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

If this looks like a drunkard's rambling, consider the following:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

This sounds almost like valid text because the words were chosen as a Markov chain. As to how the text was generated: a simulation is carried out by opening a book and choosing the first word, say “the”. Then the book is read until the word “the” appears again, and the word after it is chosen as the second word, which turned out to be “head”. The book is then read until the word “head” appears again and the next word, “and”, is chosen, and so on.
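Shannon's book procedure can be sketched as follows; the sample text is made up for illustration, and any real text could be substituted:

```python
# Hedged sketch of Shannon-style word-level Markov text generation:
# after emitting a word, the next word is drawn from the words that
# follow it somewhere in the source text.
import random
from collections import defaultdict

text = ("the head and the hand of the writer told the problem "
        "for the reader and the writer").split()

followers = defaultdict(list)
for a, b in zip(text, text[1:]):
    followers[a].append(b)

random.seed(0)
word = "the"
output = [word]
for _ in range(8):
    word = random.choice(followers[word])
    output.append(word)
print(" ".join(output))
```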

Mathematically, a Markov chain is a collection of random variables ${X_t}$ where the index t runs through (0, 1, ...) having the property that, given the present, the future is conditionally independent of the past. In other words,

$P(X_t=j |X_0=i_0,X_1=i_1,...,X_{t-1}=i_{t-1})=P(X_t=j|X_{t-1}=i_{t-1})$

If a Markov sequence of random variates $X_n$ take the discrete values $a_1, ..., a_N$, then

$P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}},...,x_1=a_{i_1})=P(x_n=a_{i_n}|x_{n-1}=a_{i_{n-1}})$

Notice the similarity with Naive Bayes where we didn't need all of the probabilities in the spam/no-spam classification.

Hidden Markov Model

HMM's are yesterday's neural nets. So far their best known application is speech recognition where it is required to map a set of sounds to a set of written words. What we now take for granted in any phone system, where the navigation menu is driven by spoken words (“yes” or “no” or something else), is made possible by HMMs.

Speech recognition aside, let us say we have a model of a stock market that has these 3 states: bull, bear, or even. And we monitor our stock portfolio as: up, down, or even. Suppose we have untethered ourselves from the modern world for 3 days and would like to know the probability of our portfolio moving from up to down to up (since we don't expect it to be up all the time). How do we go about reasoning it out? This is where HMMs come in handy.

Consider another situation where accurate temperature records from a century ago are not available. All we know is whether each year was hot or cold. We also have access to tree rings, which form with each passing year and vary in size (small, medium, or large) with the weather conditions. Suppose we want to know the climatic conditions under which the tree rings observed as small-medium-small-large were formed. Once again the answer is HMMs.

Put succinctly, an HMM is one in which you observe a sequence of observations (called emissions) but do not know the sequence of states the model went through to generate them. Analyses of hidden Markov models seek to recover the sequence of states from the observed data. Besides speech recognition, HMMs are frequently used for the statistical analysis of multiple DNA sequence alignments.

HMM Math

S = states = $\{s_1, s_2...s_n\}$

V=observations =$\{v_1, v_2...v_m\}$

Q=fixed state sequence of length T and O the corresponding observations

$Q=q_1,q_2...q_T$

$O=o_1,o_2...o_T$

A=transition array storing the probability of state j following state i

$A=[a_{ij}]$ where $a_{ij}=P(q_t=s_j|q_{t-1}=s_i)$

B=observation array which is the probability of observation k being produced from state i, independent of t

$B=[b_i(k)]$ where $b_i(k)=P(o_t=v_k|q_t=s_i)$

U = initial probability array = $[u_i]$ where $u_i=P(q_1=s_i)$

We make the following assumptions:

  • Markov Assumption: current state is dependent only on the previous state
  • Independence Assumption: the output observation $o_t$ at time t depends only on the current state $q_t$

Given L=model

$P(O|Q,L) = P(o_1|q_1,L)*P(o_2|q_2,L)...P(o_t|q_t,L)...P(o_T|q_T,L)$

Which is simply $P(O|Q,L) = b_{q_1}(o_1)*b_{q_2}(o_2)...b_{q_t}(o_t)...b_{q_T}(o_T)$

Note that $b_{q_1}(o_1) = P(o_1|q_1)$, the conditional probability of observation $o_1$ given state $q_1$

$P(Q|L) = U_{q_1}*a_{q_1q_2}*a_{q_2q_3}... a_{q_{T-1}q_T}$

Note $a_{q_1q_2}$ = probability of transition from state $q_1$ to state $q_2$ which is provided in the input

Therefore $P(O|L) = \sum_Q{P(O|Q,L)*P(Q|L)} = \sum_Q{U_{q_1}*b_{q_1}(o_1)*a_{q_1q_2}*b_{q_2}(o_2)...a_{q_{T-1}q_T}*b_{q_T}(o_T)}$

Note U doesn't repeat because it is the initial probability
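For a small model, $P(O|L)$ can be computed exactly by brute force, summing over all state sequences Q as in the formula above. The two-state model and all of its numbers below are invented for illustration:

```python
# Hedged sketch: evaluate P(O|L) by enumerating every state sequence.
from itertools import product

states = ["hot", "cold"]
U = {"hot": 0.6, "cold": 0.4}                       # initial probabilities
A = {("hot", "hot"): 0.7, ("hot", "cold"): 0.3,     # transitions a_ij
     ("cold", "hot"): 0.4, ("cold", "cold"): 0.6}
B = {("hot", "small"): 0.1, ("hot", "large"): 0.9,  # emissions b_i(k)
     ("cold", "small"): 0.8, ("cold", "large"): 0.2}

O = ["small", "large"]  # observed emission sequence

total = 0.0
for Q in product(states, repeat=len(O)):
    p = U[Q[0]] * B[(Q[0], O[0])]           # U_{q1} * b_{q1}(o_1)
    for t in range(1, len(O)):
        p *= A[(Q[t-1], Q[t])] * B[(Q[t], O[t])]
    total += p
print(round(total, 4))
```

For long sequences this enumeration is exponential in T; the forward algorithm computes the same sum efficiently.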

Kalman Filter

The states in an HMM are discrete, which doesn't suit systems that evolve continuously. When the HMM's states are made continuous, we get the Kalman filter, which can be used for a variety of applications including robotic walking, missile guidance, etc. A Kalman filter essentially fills the gaps between the states, resulting in a smooth transition from one state to another.
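A minimal one-dimensional sketch gives the flavor: a scalar state estimate is repeatedly predicted and then corrected toward noisy measurements. The noise variances and the measurements below are invented:

```python
# Hedged sketch of a scalar Kalman filter (random-walk state model).

def kalman_step(x, p, z, q=0.01, r=0.5):
    """One predict/update cycle.
    x, p: state estimate and its variance; z: new measurement;
    q, r: assumed process and measurement noise variances."""
    p = p + q                # predict: uncertainty grows
    k = p / (p + r)          # Kalman gain
    x = x + k * (z - x)      # update: move estimate toward measurement
    p = (1 - k) * p          # uncertainty shrinks after the update
    return x, p

x, p = 0.0, 1.0  # initial guess with high uncertainty
for z in [0.9, 1.1, 1.0, 0.95]:
    x, p = kalman_step(x, p, z)
print(round(x, 2), round(p, 2))
```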

Critique on Bayes

In the New York Times, John Allen Paulos critiques Bayes' theorem for its steadfast adherence to prior probabilities. Where do they come from? And why should they influence our beliefs in current events? He takes the example of the suspected linkage between vaccination and autism. The presence of thimerosal in vaccines was suspected to cause autism. However, studies based on statistics and probability theory showed that there is no significant linkage between the two. In other words, the conditional probability that autism is caused by thimerosal is insignificant. Let us say P(Autism|Thimerosal)=0.0000001. That is, 1 in 10 million can be affected by it. Assuming a population of 5 billion who have been vaccinated, there will be 5,000,000,000/10,000,000 = 500 people with autism! Is that a good outcome? Some contend that many more would die from tuberculosis, rubella, etc., the diseases for which vaccines are available, if all were not vaccinated.

Another critique of Bayes' can be called the inference problem. In our dyspnea diagnosis example, we stated that there are $2^N$ possibilities where N is the number of observed causes. This is a combinatorial problem. For a disease like cancer, N is very large if we assume everything that is artificially made can cause cancer. The Bayes net worked in the dyspnea diagnosis because we cleverly modeled the net to consider only the most relevant causes (like a visit to a foreign country, TB, smoking, etc.). We could have added a few more causes like pollen in the air, air pollution from chemicals, etc. With each added cause the number of possibilities doubles, making it impossible to compute the probabilities even with the fastest computers.

Of course, the final criticism of Bayes' is that it is as old as religion itself and that it is orthodox. Bayes' friend Richard Price, a mathematician, was the actual developer of Bayes' theorem. But if it were merely rooted in religious beliefs, it wouldn't have done such a good job of recognizing speech, searching for nuclear weapons, devising actuarial tables, improving low-resolution computer images, judging the authorship of the disputed Federalist papers, and determining the false positive rate of mammograms.

Summary

Bayes' probabilities give us a mathematical framework rooted in conditional probabilities and prior probabilities. They have predictive power based entirely on probability theory. In Bayes nets and decision trees, a model of the universe of interest is created and probabilities are estimated at each node. If there is sufficient evidence, then the hypothesis represented by a node is considered for further examination. Bayesian methods are hard to compare with machine learning algorithms such as neural nets or evolutionary programming, as they are grounded in probability theory rather than biological inspiration. They make sense to statisticians and frequentists, whose compilations of natural phenomena provide the grist for the Bayesian mill.

Online References

http://read.pudn.com/downloads134/ebook/572067/Bayesian%20Artificial%20Intelligence%20-%20Korb%20Nicholson.pdf

http://www1.se.cuhk.edu.hk/~seem5750/Lecture_5.pdf for a discussion of MYCIN certainty factors

http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/

https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/

https://www.norsys.com/tutorials/netica/secA/tut_A1.htm tutorial on Bayesian nets

http://cs229.stanford.edu/section/cs229-hmm.pdf Hidden Markov Models

https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

https://www.sciencedirect.com/topics/neuroscience/hidden-markov-models

http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf talks about bear bull market

http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/

https://www.nytimes.com/2011/08/07/books/review/the-theory-that-would-not-die-by-sharon-bertsch-mcgrayne-book-review.html

https://www2.isye.gatech.edu/~brani/isyebayes/bank/pvalue.pdf

Wednesday, April 11, 2018

Neural Nets

The common household thermostat functions by sampling the room temperature and comparing it with the set temperature. The result is called an error. For example, if the set temperature is 70F and the sampled room temperature is 68F, then the error is 2F. A corrective action is invoked, which in this case is heating the house until the set temperature and sampled temperature are close enough. In control theory this is called a feedback controller. The variable fed back is the error. If it is positive then the house is heated; else the house is cooled. If there is a feedback controller, then there must be a feedforward controller. Unlike the feedback controller that adjusts the output to minimize the error, the feedforward controller has no error; instead it has a model to estimate the current output and simulates the model to arrive at the desired output. Thus a feedforward controller is model-based, and a simulation is often required. Suppose we have a simple requirement that the house has to be cooled after 9am if it is sunny; otherwise it has to be heated. How to determine "sunny" is the challenge for the model-based controller. Scale it up to a big office space where we may have the following rules: if it is Sunday and sunny outside, then turn off the air-conditioning; if it is a working day and sunny outside, then cool the office to 70F; if it is snowing outside, heat the office to 70F; and so on. Let us suppose we have modeled all this successfully and the office air-conditioning is humming as desired. We forgot about public holidays, not to mention rainy days, while modeling the controller. So we dutifully add more rules to handle the exceptions. If the EPA (Environmental Protection Agency), which has temperature guidelines for commercial buildings, changes the desired temperature from 70F to 72F in winter months, we are faced with the challenge of rewriting all of our rules.

As you can see, feedforward controllers are much more complicated. Actually they aren't for a neural net, which is a biologically inspired controller that has the ability to learn and has no complicated model for us to maintain. At its simplest, a neural net samples the input and propagates the sampled input to the nearest neuron, which computes the error by matching it with the desired output and takes a corrective action. This is similar to the way neurons operate.

A neuron has an axon that passes the signal, in the form of voltage, to the nearest neuron (it could be farther too, making the axon long) and a dendrite to receive the signal; scientists call the junctions synapses. The cell body of the neuron has ions that are polarized by the voltage of the input signal and carry out chemical reactions to generate the output voltage. Since multiple axons can terminate at a dendrite (a kind of junction, as in roadways), each voltage is weighted, and if the weighted sum exceeds a threshold, the neuron fires. With v=voltage and w=weight, the rule for 3 inputs is: if v1w1 + v2w2 + v3w3 > threshold, the neuron fires; otherwise nothing happens. That is not the end of the story. The weights can be positive (excitatory) or negative (inhibitory). It has been hypothesized that the brain is principally composed of about 10 billion such neurons, each connected to about 10,000 other neurons. So this is really huge!
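The firing rule just described can be sketched directly; the voltages, weights, and threshold below are invented for illustration:

```python
# Hedged sketch of the weighted-sum-and-threshold firing rule
# (a McCulloch-Pitts style neuron).

def fires(voltages, weights, threshold):
    """Fire (return 1) if the weighted input exceeds the threshold."""
    total = sum(v * w for v, w in zip(voltages, weights))
    return 1 if total > threshold else 0

# Three inputs: two excitatory (positive weights), one inhibitory.
print(fires([1.0, 1.0, 1.0], [0.5, 0.4, -0.3], threshold=0.5))  # 1
print(fires([1.0, 1.0, 1.0], [0.5, 0.4, -0.6], threshold=0.5))  # 0
```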

If a neuron fires repeatedly, its neighbor can ignore the signal. Take, for instance, facing a cheetah all by yourself when a fight-or-flight decision has to be made. The neurons will be firing, resulting in the decision that you pretend to be dead. It is the most pragmatic decision because a cheetah has claws to injure you and it can out-run you, so neither fight nor flight is the best approach. The neurons firing in the fight mode and the ones firing in the flight mode could be sending the strongest signals, but the brain region processing the signals, say the prefrontal cortex, will decide what course of action to take. It need not be a life-threatening situation: it can happen when you have to decide in a sultry house whether to turn on the air-conditioning, but a decision is taken to open the windows instead, not only because it saves energy but also for reasons of health (fresh air).

A bit of history

Way back in 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts modeled a simple neural network using electrical circuits. The neural nets described by them had thresholds and weights, but there were no hidden layers, and there was no training mechanism. What they showed was that the neural net could compute any function that a digital computer could. As early as the '50s, IBM researchers tried to simulate a neural net under the supervision of Nathaniel Rochester, rather unsuccessfully. That didn't stop others from trying. With the advent of phone communication, early telephone workers were faced with the daunting task of reducing or eliminating echoes. So echo cancellation software based on a neural network (MADALINE) was developed by Bernard Widrow and Marcian Hoff of Stanford University in 1959, which seems to be the first practical application. In the '60s and '70s, research teams like Widrow & Hoff (1962) and Kohonen & Anderson (1972) tried different approaches like weighted connections and digital circuits. There are a number of publications, like those of Yann LeCun, Leon Bottou, Genevieve Orr and Klaus-Robert Muller of AT&T Labs (1998), on topics ranging from how to initialize the weights to alternatives to sigmoid functions for modeling neuron activation. Their main theme seems to be faster applications with a low memory footprint, which is not of much concern these days. It is to be noted that the USA was not the only country developing neural nets. In 2010 Ganesan et al. applied neural nets to diagnosing cancer using demographic data from India. Their goal was to "check for lung cancer disease" -- primarily among tobacco smokers -- "quickly and painlessly and thus detecting the disease at an early stage". More recently neural nets are being developed for monitoring credit card fraud, determining the credit-worthiness of loan applicants, a "chemical nose" that combines spectrometry with neural nets, and so on.
It may be noted that credit-worthiness used to be determined by human adjudicators before AI came along and largely replaced the bias of humans with more objective data. However, neural nets don't explain how they decide; they merely give a faster response based on their training set. Where humans have bias, AI systems such as neural nets have inductive bias (more later).

The learning net

The difference between a feedforward controller and a neural net is in the ability of the latter to learn. This is particularly exciting because we can then train the net to learn any complex activity. Neural nets that propagate the error backwards (called backprop) from the output layer to the middle layer, then to the input layer, are the winners of the learning contest; scientists first tried nets without layers and failed in their attempts. So a canonical neural net can be defined as one with at least one neuron in the input layer connected to another neuron in the middle layer, culminating in a neuron in the output layer. It took scientists, led by David Rumelhart, several decades after the conception of the Perceptron to come up with the canonical model. A Perceptron merely carried out boolean operations like AND, OR and NOT. The result of AND is 1 if all of its inputs are 1. An example: pass a resolution if all of the votes agree to it. The result of OR is 1 if at least one of the inputs is 1. An example: reject a resolution if there is at least one veto. The result of NOT is to flip the input from 1 to 0 and vice-versa. But the Perceptron could not handle XOR, which sets the output to 1 if the 2 inputs are not the same. Let us say your marketing team came up with a remarkable observation that old men and young women prefer tea. You can then look up XOR and pass it the ages of men and women as inputs, with the restriction that young men and old women don't prefer tea. What about middle-aged men and middle-aged women? So we run into semantic issues, and the Perceptron breaks down with all the confusion. Researchers like Marvin Minsky were the first to report that Perceptrons are computationally weak. So something needed to be done, and the messenger was David Rumelhart with the unveiling of backprop. As many things go, Rumelhart was the last one to stumble onto this. Paul Werbos of Harvard had proposed a similar algorithm in his PhD thesis in 1974.
Even earlier, in 1969, Arthur Bryson and Yu-Chi Ho, a couple of control theorists, had the same idea.
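The XOR limitation can be demonstrated directly. The sketch below (with a made-up helper name, find_linear_separator) exhaustively searches small integer weights for a single threshold unit: one exists for AND, but none exists for XOR, which is why a hidden layer is needed.

```python
from itertools import product

def find_linear_separator(truth_table, grid=range(-2, 3)):
    """Search small integer weights and bias for a single threshold
    unit (w1*x1 + w2*x2 + b > 0) that reproduces the truth table."""
    for w1, w2, b in product(grid, grid, grid):
        if all((w1*x1 + w2*x2 + b > 0) == bool(y)
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None  # no single unit computes this function

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print(find_linear_separator(AND))  # a separator exists, e.g. (1, 1, -1)
print(find_linear_separator(XOR))  # None: XOR is not linearly separable
```

The XOR failure is not an artifact of the small grid: the four constraints contradict each other for any weights, which is precisely Minsky's point.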

We had come to expect that neural nets can learn. What are they learning? In simple terms, they are learning the weights of the connections between the neurons. If the network is initialized with random weights and the errors are back-propagated, resulting in adjustments to the weights, then it is learning the weights. For instance, suppose we have two output neurons: drink water and drink orange juice. The input neurons are athlete and diabetic. We want the diabetic to drink water, and we don't care what the athlete drinks. So the connection between diabetic, via the middle layer, could be hard-wired to the drink-water neuron, and we hope the connections between athlete and the two outputs can similarly be hard-wired. But the problem is that the athlete is recommended both water and juice, which is confusing (assuming juice can't be diluted with water). There is a need to learn more about the athlete. While we adjust the weights of the connections from athlete to the outputs, we can break the stronger reinforcement between diabetic and one of the outputs. Therefore, what the net finally learns is not of the form: if you are a diabetic, drink water and avoid juice with complete certainty; if you are an athlete, drink water or juice. The former's strong association conflicts with the latter's weak association between inputs and outputs. This shows that neural nets are not amenable to traditional logic. What they learn is based on the inputs and outputs you present to them. If you give more examples of diabetics than athletes, including diabetics who are athletes, then the net could be biased towards diabetics. There is a fine balance to strike. Suppose we trained the net to handle all these cases and now want it to handle an athlete who is pre-diabetic; the net will be stumped, because we haven't taken into account pre-diabetics, whose mapping to water or juice mirrors that of an athlete who is not diabetic. But how is the net to know who is pre-diabetic?
Even expert physicians are not sure unless they measure blood glucose levels. This is why scientists split the available data into two unequal sets, one for training and the other for testing. Any unexpected inputs require augmentation of the training set and relearning.
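The unequal split can be sketched in a few lines. The patient records below are invented for illustration; only the shuffling and the 80/20 split mirror the text.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle labelled examples and split them into unequal
    training and testing sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: (blood_glucose mg/dL, heart_rate bpm) -> label
patients = [((95, 52), "athlete"), ((180, 80), "diabetic"),
            ((110, 75), "neither"), ((200, 68), "diabetic"),
            ((90, 48), "athlete"), ((150, 90), "diabetic"),
            ((100, 70), "neither"), ((170, 85), "diabetic"),
            ((85, 50), "athlete"), ((160, 78), "diabetic")]

train, test = train_test_split(patients)
print(len(train), len(test))  # 8 2
```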

Neural nets also need to avoid over-fitting. For example, a net trained on blood glucose and heart rate, instead of diabetic and athlete (assuming athletes have lower heart rates), could fall into the trap of concluding that lower blood glucose and a lower heart rate always mean an athlete. Yet it is possible for a diabetic to have both conditions. Nor can the net's learning be extrapolated to diabetics who are training to be athletes. A more egregious over-fit is when we try to mix a training set for dogs with one for their owners, thinking a dog and its owner share many things, when they obviously are not biologically identical.

Hidden Layers and Traps

We talked about a middle layer, which is sometimes called a hidden layer because it is neither an input nor an output. There are two aspects to it: there can be multiple hidden layers, and multiple neurons within each of them. How do we know how many of each we should start with? Researchers aren't sure what exactly the hidden layers do, except that they somehow handle the non-linearities that exist in the training set. A linear function can be represented mathematically as a straight line. In the net mentioned above, the relationship between diabetic and water is a linear one. A non-linear function can be a parabola or an exponential. This is what we encounter when the boundary between positive and negative examples is not a simple straight line.

Here is the classic nature-versus-nurture situation. All of us are born with a limited number of neurons, which cannot reproduce. That is the constraint imposed by nature. With constant conditioning of the brain, or nurture, we can make the most of them, even if we never employ 100% of the neurons that can never be replaced. Studies done on Alzheimer's patients show a battle between nature and nurture, with the plaques in the brain at issue. The plaques come about naturally. It is hoped that with reinforcement learning the natural hurdles can be overcome by caregivers constantly nurturing Alzheimer's patients.

How deep is Deep Learning?

In the 90's, the cascade-correlation (CC) algorithm was developed to determine the hidden layer topology. Imagine you have an emergency and must put out fires in a building. You are the fire chief and you command a battalion. One strategy is to send the entire battalion into the building to put out the fire. But that carries life-threatening risk, and in the worst case you lose the whole battalion. So instead you send one firefighter at a time, carefully evaluating the outcome after each. CC is that battalion chief. CC starts by training a net without any hidden neurons. If the learning is incomplete, as determined by the errors, it adds a hidden neuron. If the learning is now complete, we are done; otherwise CC adds another neuron. The process repeats until the desired accuracy is achieved. This is exactly what we need to handle non-linearities in our training set, but it is a slow way to learn.
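CC's one-firefighter-at-a-time loop can be sketched as follows. The train_and_measure_error function is a made-up stand-in that pretends each added hidden neuron halves the remaining error; a real implementation would train the candidate neuron and measure the error on the training set.

```python
def train_and_measure_error(num_hidden):
    # Hypothetical stand-in for training the net with this many hidden
    # neurons and measuring the resulting error.
    return 1.0 / (2 ** num_hidden)

def cascade_correlation(target_error=0.01, max_hidden=20):
    hidden = 0                                   # start with no hidden neurons
    error = train_and_measure_error(hidden)
    while error > target_error and hidden < max_hidden:
        hidden += 1                              # add one neuron at a time
        error = train_and_measure_error(hidden)  # retrain and re-measure
    return hidden, error

hidden, error = cascade_correlation()
print(hidden, error)  # 7 neurons, error 1/128 = 0.0078125
```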

With deep learning, it is envisaged that multiple hidden layers can be added to the net. This is the next step in the evolution of CC: while CC determines the optimal number of hidden nodes, deep learning takes it to the next level by adding an optimal number of hidden layers. So why did it take so long to figure this out? One reason is that cheap memory and processing power became available only recently. Another is what is known as an autoencoder, which works like a hidden layer that learns incrementally. By stacking autoencoders one can create a deep learning net. For example, to recognize the proverbial elephant, one autoencoder can learn to recognize the trunk, another the legs, etc., with the output emerging from the composition of all the partial results. With cloud computing all the rage, one can create servers on demand--called elastic computing--in the cloud without ever touching them. Elastic computing has been defined by Microsoft Azure as:

Elastic computing is the ability to quickly expand or decrease computer processing, memory, and storage resources to meet changing demands without worrying about capacity planning and engineering for peak usage. Typically controlled by system monitoring tools, elastic computing matches the amount of resources allocated to the amount of resources actually needed without disrupting operations.

Autoencoders are the elastic computers of backprop. We get the flexibility to model our net while keeping an eye on accuracy. If adding an autoencoder results in over-fit, undoing it is a simple matter of removing it from the configuration. What about the weights? Unlike with CC, researchers are not clear whether the weights have to be reset to random values and the learning repeated, in which case learning is impeded. If the trade-off is acceptable, it is a small price to pay for the final result: a network with optimal weights that won't over-fit.

Furthermore, autoencoders' learning is unsupervised. To understand this, imagine a 100x100 pixel image that needs to be learnt. You can set each pixel value as an input as well as an output, because an autoencoder is not expected to classify the image as a tree or an animal. What is expected of an autoencoder is compression of the image with a sufficiently small number of hidden nodes. Suppose you would like to verify whether an image is a tree: we just plug the pixels into the inputs of an autoencoder trained on trees, and if the outputs closely match the inputs, then it is likely a tree.
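A minimal sketch of the idea, using a toy linear autoencoder instead of images: 2-D points lying near a line are compressed to a single hidden value and reconstructed, with the weights learned by gradient descent on the reconstruction error. The data, initial weights and learning rate are all made up for illustration.

```python
import random

random.seed(0)
# Points near the line y = 2x: essentially one-dimensional data.
data = [(x, 2*x + random.uniform(-0.1, 0.1))
        for x in [0.1*i for i in range(1, 11)]]

e1, e2 = 0.5, 0.5   # encoder weights (2 inputs -> 1 hidden value)
d1, d2 = 0.5, 0.5   # decoder weights (1 hidden value -> 2 outputs)
lr = 0.05

def loss():
    total = 0.0
    for x, y in data:
        h = e1*x + e2*y                       # compress to one number
        total += (x - d1*h)**2 + (y - d2*h)**2
    return total / len(data)

initial = loss()
for _ in range(2000):
    for x, y in data:
        h = e1*x + e2*y
        rx, ry = d1*h - x, d2*h - y           # reconstruction errors
        gd1, gd2 = 2*rx*h, 2*ry*h             # gradients (chain rule)
        gh = 2*rx*d1 + 2*ry*d2
        ge1, ge2 = gh*x, gh*y
        d1 -= lr*gd1; d2 -= lr*gd2
        e1 -= lr*ge1; e2 -= lr*ge2

final = loss()
print(initial, final)  # reconstruction error shrinks as it learns
```

Because the inputs double as the targets, no labels are needed: that is the unsupervised part.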

One issue with deep neural nets is that the signal gets attenuated as it moves from layer to layer. A workaround has been to skip some layers. For instance, if the first layer identifies the input as a diabetic, then it can go directly to the output node for drinking water rather than juice. This is akin to hard-wiring, or injecting rules into the network. Some researchers have also used specialized hardware called Graphics Processing Units (GPUs), originally designed for video games, to simulate hundreds of layers. Such brute-force methods work for large corporations like Microsoft, Twitter and Google that can afford to throw expensive hardware, or even parallel processing, at the net. For hoi polloi, the best hope is to invent better algorithms.

Can we predict the stock market?

Using recurrent networks we can, in theory, predict which way the stock market moves based on previous values. It is like extrapolation in numerical analysis. Recurrent networks are defined by Wikipedia as:

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
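The feedback the quote describes can be sketched with a single recurrent cell. The weights here are fixed by hand for illustration, not learned; the point is only that the hidden state h carries a memory of earlier inputs forward in time.

```python
import math

def rnn_step(x, h, w_in=1.0, w_rec=0.5):
    # New state depends on the current input AND the previous state.
    return math.tanh(w_in * x + w_rec * h)

def run_sequence(xs):
    h = 0.0
    states = []
    for x in xs:          # the same cell is applied at every time step
        h = rnn_step(x, h)
        states.append(h)
    return states

# Two sequences ending in the same inputs produce different final
# states, because the earlier input is remembered through h:
a = run_sequence([1.0, 0.0, 0.0])
b = run_sequence([0.0, 0.0, 0.0])
print(a[-1], b[-1])
```

A feedforward net given only the last input would treat the two sequences identically; the recurrent state is what makes temporal prediction possible.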

So why aren't we all getting rich by predicting the stock market accurately? Here comes the credit assignment issue. Can we say which layer or node is responsible for correctly mapping an input? In most symbolic computation, such as rules for medical diagnosis, we can say a diagnosis follows from the symptoms and lab tests. The credit is assigned fairly accurately to a rule such as: if the fasting blood glucose level is high, then the patient is diabetic. We can't say the same about neural nets. A net is a holistic representation of the solution space that cannot be dissected. This is unlike the way the brain functions. Neurologists have shown that if a region of the brain has been impaired, say in an accident, other regions take over its functions. It has also been shown that in speech-impaired people, sign language learning can take place in the very region normally used for speech. In both cases, the brain works much like the world-wide web: if part of the web is cut off by a power outage, the rest of the web is still functional. This is not the case with neural nets, and they continue to be black boxes holding the secrets of learning.

Gradient Descent: is it up or down?

It can be hypothesized that learning the weights follows a path of least resistance. Imagine you are blindfolded and need to get to lower ground. You feel the ground, ascertain whether it slopes up or down, and step downhill. This is called gradient descent by the connectionists (the cognitive scientists who apply neural nets to their research). Why can't we just try different weights for each connection and test the error? Imagine a network with the topology 3 x 4 x 1: that is, 3 inputs, 4 hidden and 1 output. There are a total of 3 x 4 + 4 = 16 weights to be learnt. If we restrict the precision of each weight to 3 decimal places, then we need to generate and test 1000**16 values. That is 1 followed by 48 zeroes! Even for the fastest computers, this could take an eon. So until we can make faster hardware, we stick to gradient descent to find local optima, as the global optimum could be an over-fit. To understand why a global optimum could be an over-fit, consider a net for recognizing handwritten digits. We train it so that it is 100% accurate in recognizing X's handwriting, but it will fail to recognize the same digits in Y's handwriting. The reason is that the net has found a global optimum that won't apply to Y. If all we want from the net is to recognize X's handwriting, then that is fine. However, if we would like our net to be used in a bank ATM to recognize the amounts on checks, it is better to find a local optimum.
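The combinatorics can be checked directly:

```python
# Enumerating every weight setting for a 3 x 4 x 1 net at 3 decimal
# places of precision in [0, 1).
weights = 3*4 + 4*1           # 16 connections in total
settings_per_weight = 1000    # 0.000, 0.001, ..., 0.999
candidates = settings_per_weight ** weights

print(candidates == 10**48)   # True: 1 followed by 48 zeros
```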

Earlier it was implied that the sum of weighted voltages has to exceed a threshold for a neuron to fire. This is called activation. In backprop, a common activation function is the sigmoid, or S-curve. Early on, neurologists found that immediately repeated firing of adjacent neurons has no particular impact; the neuron temporarily ignores its inputs. Another time a neuron ignores its inputs is while the voltage is still building up. Putting these two together, scientists came up with the sigmoid, or S-curve: a smoothed step function whose shape flattens at low and high values--i.e., it runs parallel to the x-axis. The reason biological neurons ignore a low voltage could be that it is just noise. What about the higher values? It is probably a feature of evolution: the voltage beyond a certain threshold cannot be arbitrarily increased, given the sensitivity of the organism to fluctuations in the underlying chemical processes.

How does Backprop adjust the weights in the network?

We will simulate a backprop with 3 nodes arranged as follows

Input Node (I) ->Hidden node(J)->Output Node(K)

Let us say the

input is $x_{in}$ and the desired output is y

$w_{ij}$ is the weight of connection between nodes I and J

$w_{jk}$ is the weight of connection between nodes J and K

The activation function for each node is the sigmoid $$ \sigma(\alpha) = {1 \over (1+e^{-\alpha}) }$$ whose derivative with respect to its argument $\alpha$ is $\sigma (1-\sigma)$

Then we can write

$y_j = \sigma (x_{in}*w_{ij})$, abbreviated below as $\sigma _j$

$Error = 0.5(y - y_j)^2$

To adjust a weight, we take the derivative of Error with respect to the weight

${\partial (Error)\over\partial w_{ij}}$ = ${\partial (0.5(y-y_j)^2)\over\partial w_{ij}}$ = $-(y-y_j){\partial (y_j)\over\partial w_{ij}}$

since the desired output y does not depend on $w_{ij}$, i.e. ${\partial (y)\over\partial w_{ij}} = 0$

By the chain rule, ${\partial (y_j)\over\partial w_{ij}} = \sigma _j(1-\sigma _j)*{\partial (x_{in}*w_{ij})\over\partial w_{ij}} = \sigma _j(1-\sigma _j)*x_{in}$

Upon substitution we get

${\partial (Error)\over \partial w_{ij}} = -(y-y_j)*x_{in}*\sigma _j(1-\sigma _j)$

Now we want to minimize the error. At the minimal point of Error versus weight we assume the tangent is parallel to weight axis.

Therefore $(y-y_j)*x_{in}*\sigma _j(1-\sigma _j) = 0$
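Putting the gradient to work, here is a minimal sketch of gradient descent on the single input-to-hidden weight, with a made-up input, target and learning rate. The update follows $-(y-y_j)*x_{in}*\sigma _j(1-\sigma _j)$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x_in, y = 1.0, 0.8   # input and desired output (made up)
w, lr = 0.0, 0.5     # initial weight and learning rate (made up)

for _ in range(1000):
    s = sigmoid(x_in * w)                    # the node's output y_j
    grad = -(y - s) * x_in * s * (1 - s)     # dError/dw from above
    w -= lr * grad                           # step downhill

print(w, sigmoid(w))  # w approaches ln(0.8/0.2), about 1.386
```

At convergence the output sigmoid(w) sits at the target 0.8, which is exactly the zero-gradient condition stated above.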

A general form of backpropagation algorithm for adjusting weights can be given as follows:

    • Feed the training set forward through the net and calculate, for each node, its input (s) and its output (z, after applying the activation function f).
    • Calculate the error for each node:
      • For an output node, the error is simply the difference between the expected output and the actual output, multiplied by the derivative of the activation function.
      • For a hidden node, the error is the weighted sum of the errors of the nodes it feeds into, multiplied by the derivative of the activation function (this follows from the chain rule).
    • Calculate the change in each weight as the product of the downstream node's error and the upstream node's activation output, scaled by a learning rate.
    We have assumed that the error function is quadratic, i.e. (y-yj)**2, because we want backprop to do gradient descent to adjust the weights. Other methods exist, such as Newton's method and conjugate gradient. It is very likely that backprop will adjust the weights based on a local optimum. If we don't want to over-fit the net, then that is fine. M. Forouzanfar, H. R. Dajani, V. Z. Groza, M. Bolic & S. Rajan compared the estimation errors of ten different training algorithms belonging to three classes: steepest descent (with variable learning rate, with variable learning rate and momentum, resilient backpropagation), quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno, one-step secant, Levenberg-Marquardt) and conjugate gradient (Fletcher-Reeves update, Polak-Ribière update, Powell-Beale restart, scaled conjugate gradient). Their objective was to estimate blood pressure (BP) from oscillometric measurements. It turns out all the off-the-shelf BP cuffs use oscillometry, and a neural net was tasked with improving the errors. As to the winner, they say it was RBP--resilient backpropagation. Go figure!

    Some argue that backprop is computationally expensive and slow to converge to the global optimum. Obviously, computing the errors for each node and adjusting the weights can take a very long time, depending on the size of the training set. They prefer to use genetic algorithms (GA's) to specify the new weights with each iteration. It is possible to hierarchically specify parents and children in the net and use the parents to cross over, i.e. swap a neuron's weights with the cross-over results of its parents'. This can be done across layers as well. Mutation is also possible: adjusting a weight by a small percentage, changing the sign on the weight (positive for reinforcement, negative for inhibition), etc. The goal in all of these is to find an optimal solution that meets the required accuracy. A neural net that is supposed to diagnose cancer may do well to detect cancer with some accuracy rather than rule out cancer where there is one. Physicians call these false positives and false negatives. If the net says the patient has cancer when there is no cancer, that is a false positive. Or when an anti-virus program says a program is a virus when it isn't, that is a false positive. A false negative is when the net rules out cancer in a patient who, as the test results later confirm, really has it.
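Counting false positives and false negatives is straightforward; the labels below are invented for illustration (1 means the condition is present):

```python
def confusion(actual, predicted):
    """Count false positives and false negatives over paired labels."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp, fn

# Hypothetical screening results for eight patients.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

fp, fn = confusion(actual, predicted)
print(fp, fn)  # 1 false positive, 1 false negative
```

Which of the two counts matters more depends on the application: for cancer screening a false negative is far costlier than a false positive.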

    Brute force or taming the complexity?

    Neural nets have proven to be the most preferred algorithms for image and voice recognition. Why so, we may wonder? The simple answer is that these tasks are not predictive. YouTube videos have a sound track. You can train a neural net to recognize certain words and label the video as U (universal) or R (restricted). Spoken words come from a dictionary of finite size (we hope the user is not using made-up words). There is nothing to predict from the inputs. Similarly, you can train a net to classify email as spam or non-spam. Neural nets also do well with noise. A word spoken over a noisy phone line can be captured by a net if the training set has taken noise into consideration (this rules out the aliens trying to communicate with us from outer space).

    With all these wonderful features, aren't we using brute force? The answer depends on the complexity of learning. If memory is an issue, then neural nets make the best use of it. If lazy loading is required, where the net starts learning, as the word is being spoken, then faster processing will be required. At the moment both memory and processors are cheaply available.

    This leads to the conundrum of why the net can't find for itself the best topology: the number of hidden layers and the number of nodes per layer. In every AI solution there is always what we can call an inductive bias. With GA's it is the alleles. In symbolic learning it is the existence of logic. In Bayesian learning it is the eponymous rule. We don't seem to have a good handle on the inductive bias. "When you can't beat them, join them" seems to be the mantra. In a sense, that is inevitable, because inductive bias in humans is both inherited (genetic) and acquired (nurtured). We have heard sayings like: birds of a feather flock together.

    Yet another aspect of neural nets is called logistic regression, which in short means mapping real-world variables to inputs. Going back to our diabetic-athlete net, we said we could have used blood glucose levels and heart rate as inputs. But these are continuous variables. A blood glucose level of 100 mg/dL could be considered somewhat high, whereas a reading of 200 mg/dL is definitely on the higher side (ignoring post-prandial and fasting requirements). Add to the mix the measurement units, where the glucose level could just as well be expressed in ounces per gallon. How can we feed these numbers to the net? The answer again seems to be in the form of the sigmoid. The significance of the sigmoid, or S-curve, is that it provides a range of real numbers between 0 and 1, unlike its step-function brother which simply turns on or off. S-curves are used in many ways outside the realm of neural nets. For example, the expected income of a person over a lifetime, the performance of an athlete with each passing year, etc. can all be expressed as S-curves.
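A sketch of the squashing step, with a made-up midpoint (125 mg/dL) and scale (25 mg/dL) chosen purely for illustration:

```python
import math

def squash(mg_per_dl, midpoint=125.0, scale=25.0):
    """Map a raw blood-glucose reading onto (0, 1) with a sigmoid.
    The midpoint and scale here are illustrative, not clinical."""
    return 1.0 / (1.0 + math.exp(-(mg_per_dl - midpoint) / scale))

for reading in (100, 125, 200):
    print(reading, round(squash(reading), 3))  # 0.269, 0.5, 0.953
```

The ordering of the readings is preserved, but every value lands in the net's comfortable (0, 1) range regardless of the measurement unit.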

    Can AI put an end to Guest Workers?

    Every year the US invites nearly 100,000 guest workers from both academic and commercial sectors to assist American corporations with computer applications. Some of their tasks include writing code and testing software. Such tasks are considered routine by the AI community, who think they can be automated. Just as horse-carriages gave way to taxi-cabs, the guest workers may one day dwindle to the most competitive ones with AI smarts. This is not a forecast for the near future, when we will need as many good programmers as we can garner to implement the manifold AI algorithms ranging from GA's to deep learning nets. Rather, it is that when American workers are freed from routine coding and testing tasks, they can put on their thinking caps and do what they do best: create new algorithms for old problems.

    References

    http://www.cs.cornell.edu/boom/2004sp/ProjectArch/AppofNeuralNetworkCrystallography/NeuralNetworkCascadeCorrelation.htm

    http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

    http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4 has a detailed example of how weights and errors are calculated in a backprop

    https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network discusses the effects of different search methods such as Newton, Quasi-Newton, Conjugate Gradient, etc.

    https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Biology/index.html for a history of the neural networks research

    https://www.wired.com/2016/01/microsoft-neural-net-shows-deep-learning-can-get-way-deeper/ describes deep neural nets at Microsoft

    Genetic Algorithms

    Darwin is famously associated with the phrases "struggle for existence" and "survival of the fittest." We drop the first half and stick to the fittest in GA's. There is no dearth of literature on GA's. The following is a bird's-eye view of GA's, which mimic the way evolution happens among species more closely than any other algorithm, even though GA's are usually mentioned alongside Evolutionary Programming and Evolution Strategies--parallel explorations of species' evolution. The significance of GA's is that they not only provide a way to solve difficult problems such as engineering optimization--using strategies found in nature--but also enable biologists to simulate various ways in which species evolve. Take for example Island Models, where species "migrate" from one island to another. In the virtual world the islands are computers or CPU's, and the migration is carried out by exchanging the results of evolving some data via inter-process communication.

    Before we dive into GA's, let us understand what the developers of computers envisioned for their invention. Alan Turing and John von Neumann were among the earliest pioneers of computers. They wanted their invention to mimic human intelligence besides carrying out arithmetic, or number crunching. It is in this backdrop that the pioneer of GA's, John Holland, came into the limelight. His mentor Arthur Burks was a close associate of von Neumann, and they all shared the vision of computers learning and evolving just as humans do. This is not surprising, because tools played an important role in our evolution. Even before fire was discovered, Neanderthals used tools for hunting and gathering. Over the millennia these tools became more and more sophisticated. During the industrial revolution of the 19th century, mechanization and automation were introduced to assist with our physical tasks. For instance, automobiles freed us from the burden of riding horses or horse-drawn carriages. (Note that horses were domesticated by us, and trading them for automobiles might slow down their future evolution.) We similarly created genetic variants in agriculture to grow the sweetest mangoes and tastiest oranges.

    As mechanization and automation freed us from physical tasks, it is only natural that we looked forward to freeing our brains from routine tasks. The abacus played an important role in counting long before calculators and computers came along, but it was limited to manipulating a few numbers and didn't scale to large ones. With calculators we conquered the addition, subtraction, multiplication and division of large numbers. We don't consider these devices intelligent, though they are definitely faster than the fastest human computers like Shakuntala Devi. The ability of computers to play chess is impressive, but we still have grandmasters who can routinely beat them. In other words, computers are not particularly smart; they dutifully do the crunching dictated by the algorithms we feed them. The quest is on to program a computer to do things that we consider intelligent. The Turing test was one of the earliest means of testing a computer's ability to fool a human tester: if the tester, in a double-blind setting, posed questions for which the computer gave answers similar to ours, then the computer might be misidentified as a human. We are far from the vision of Alan Turing. We are, however, slowly getting there. GA's enable computers to replicate biological processes. Far from perfect, the field is a fledgling with a few decades of research and development.

    Biological Background

    Living organisms are made of cells that contain chromosomes, which are the vehicles that transfer genes from parents to offspring. Genes are master templates for proteins and represent traits. For example, the color of hair is encoded in a gene. Genes themselves are written in DNA, the basic building block of life. Offspring emerge from the cross-over of chromosomes from the parents. That is, parents' genes are passed on to the offspring in a non-deterministic manner, which results in diversity among life forms. Some old lifeforms wither away and are replaced by newer versions. We no longer see dinosaurs but see their evolved offspring: dinosaurs evolved into lifeforms like birds, crocodiles, etc.

    So where is the DNA in all of these?

    If Darwin's theory is a computing machine, DNA--whose structure was later discovered by Crick and Watson--is the nuts and bolts of it. Without DNA there is no life, let alone evolution. According to Wikipedia:

    "DNA is a thread-like chain of nucleotides carrying the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. Each nucleotide is composed of one of four nitrogen-containing nucleobases (cytosine [C], guanine [G], adenine [A] or thymine [T]), a sugar called deoxyribose, and a phosphate group."

    The GA researchers liken C, G, A and T to alleles. These basic building blocks are rearranged by the organism, inside an environment, to create complex behaviors and structures, just as the alleles 0 and 1 can represent anything from the most complex problem down to the simplest example.

    A GA Example

    Let us say we have 8 notes, as in Indian classical music, denoted: s, r, g, m, p, d, n, t. We want to evolve these notes into raagas--which are like scales in Western music. For illustration, we assume a raaga is these 8 notes arranged in some order. So we represent the notes as follows:

    001 s
    010 r
    100 g
    110 m
    111 p
    101 d
    011 n
    000 t
    

    Note that we used 0's and 1's (called alleles) to make the notes digestible by the computer in an efficient manner. Each of these strings--001, 010, etc.--is called a chromosome in the GA literature. Once we have defined the strings, the next step is to create an intermediate population selected by fitness, which is a measure of how well a string meets some specification. Scales starting with s-r-g-m-p-d-n-t, for example, could be considered more fit than those starting with any other notes. The fitness could also be based on playing the notes on a musical instrument and determining whether they are melodious. This is called simulation. In engineering optimization problems, the simulation can be model-based, or an actual artifact can be made to a specification. Say you have a tank of a certain volume to hold water and you want to connect a pump to the tank. How big the tank should be, or what horsepower the pump needs, is determined by some calculation. If the tank has 10 cubic meters of volume and the pump can handle 0.1 cubic meters per minute, then the tank will be emptied in 100 minutes. The bigger the pump, the faster the transfer. But there is a cost associated with it, as bigger pumps cost more. The construction of the tank itself is also subject to cost constraints.
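The tank-and-pump calculation can be sketched as a model-based fitness function; the cost formula below is made up purely to illustrate the trade-off between speed and expense:

```python
def fitness(pump_rate, tank_volume=10.0, cost_per_rate=50.0):
    """Illustrative model-based fitness: faster emptying is better,
    but bigger pumps cost more. Higher fitness is better."""
    minutes = tank_volume / pump_rate   # e.g. 10 / 0.1 = 100 minutes
    cost = cost_per_rate * pump_rate    # bigger pump, higher cost
    return -(minutes + cost)

for rate in (0.1, 0.5, 1.0):
    print(rate, fitness(rate))
```

Under these made-up numbers the middle-sized pump wins: the tiny pump is too slow and the big pump too expensive. A GA would evaluate exactly such a function for each candidate in its population.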

    Continuing with our musical notes and scales, we will now create chromosomes by pairing the notes

    001001 ss
    010010 rr
    100100 gg
    110110 mm
    111111 pp
    101101 dd
    011011 nn
    000000 tt
    

    Now to make it interesting, the GA's cross-over the chromosomes. So let us create the cross-over at the middle (it can be anywhere within the string) of our chromosomes taking two at a time.

    001010 sr
    010001 rs
    100110 gm
    110100 mg
    111101 pd
    101111 dp
    011000 nt
    000011 tn
    

    Now we have created paired notes that are different from our original population. Let us apply the fitness criterion. We agreed that scales starting with s-r are preferable, so we eliminate all but the fittest to arrive at:

    001010 sr.
    

    But this isn't a scale that we have defined as containing 8 notes. For that we have to cross-over 001010 (sr) with other chromosomes by pairing them. So now our population (a term used by GA's to indicate the set of strings we are currently working with) consists of:

    001010 010001 sr-rs
    001010 100110 sr-gm
    001010 111101 sr-pd
    001010 101111 sr-dp
    001010 011000 sr-nt
    001010 000011 sr-tn
    

    As you can see, we have prefixed sr (001010) to every other chromosome. Thus we have evolved the population to 4 notes. We also note that sr-gm is the one that most closely matches the fitness criterion. So we cross it over with the others and continue this process two more times to arrive at the perfect scale, s-r-g-m-p-d-n-t, that we are seeking.

    This example is perhaps the most elementary process in the world of music. But it is conceivable that our ancestors carried out similar reasoning when they "invented" the scales. Even if they didn't, this is one way to explain to a novice how scales could have evolved from notes. Most people are interested in a descriptive rather than a prescriptive solution. That is, we can tell a novice "a scale must have 8 notes," or we can say "a scale is 8 notes selected based on their fitness." Which explanation is more satisfying is up to the receiver. Of course, the explanation can be twisted to say "this is how you should learn" rather than "this is how it is learnt." Whichever explanation you choose, we have just made a GA for generating scales!
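The first cross-over and selection steps of the example can be reproduced in code, using the encodings given above. The prefix-match fitness is one concrete reading of the "scales starting with s-r" criterion:

```python
CODE = {"s": "001", "r": "010", "g": "100", "m": "110",
        "p": "111", "d": "101", "n": "011", "t": "000"}
DECODE = {v: k for k, v in CODE.items()}

def encode(notes):
    return "".join(CODE[n] for n in notes)

def decode(bits):
    return "".join(DECODE[bits[i:i+3]] for i in range(0, len(bits), 3))

def crossover(a, b):
    mid = len(a) // 2              # cross at the middle of the string
    return a[:mid] + b[mid:], b[:mid] + a[mid:]

def prefix_fitness(notes, target="srgmpdnt"):
    # count how many leading notes match the target scale
    n = 0
    while n < len(notes) and notes[n] == target[n]:
        n += 1
    return n

parents = [encode(n + n) for n in "srgmpdnt"]    # ss, rr, gg, ...
children = []
for a, b in zip(parents[0::2], parents[1::2]):   # two at a time
    c1, c2 = crossover(a, b)
    children += [decode(c1), decode(c2)]

print(children)   # ['sr', 'rs', 'gm', 'mg', 'pd', 'dp', 'nt', 'tn']
print(max(children, key=prefix_fitness))   # 'sr' survives selection
```

Repeating the cross-over and selection, as the text describes, extends the surviving prefix until the whole scale emerges.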

    Mutation vs. Cross-over

    Cross-over is how diploids like us evolve. Haploids, however, have a single set of unpaired chromosomes. So how do they evolve? One answer is mutation. In evolutionary biology a mutation happens in genes; it is widely believed that blue and green eyes are the result of mutations. In GA's, mutation is done by randomly flipping the alleles (0 or 1). Let us say we have the chromosome--from the music example--001010 010001, or sr-rs. Let us mutate the 4th bit from the right to arrive at 001010 011001. This is sr-ns, which could not have been generated by cross-over. Of course, we could instead have mutated the 3rd bit from the left and lost the s note.
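The mutation in the text can be replayed directly:

```python
DECODE = {"001": "s", "010": "r", "100": "g", "110": "m",
          "111": "p", "101": "d", "011": "n", "000": "t"}

def mutate(bits, pos_from_right):
    """Flip one allele, counting positions from the right."""
    i = len(bits) - pos_from_right
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i+1:]

chromosome = "010001"              # the rs half of sr-rs
mutant = mutate(chromosome, 4)
print(mutant)                      # 011001
notes = "".join(DECODE[mutant[i:i+3]] for i in (0, 3))
print(notes)                       # ns
```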

    The question is which is better: mutation or cross-over? Some GA researchers believe cross-over is a metaphor for exchanging genes via mating rituals, and that it is unnecessary for evolving the population, in the virtual world at least. Their opponents reason that cross-over results in a faster convergence to the desired solution by searching across hyper-planes (an n-dimensional representation of the search space that will be described next).

    For instance, each of the notes can be represented as a corner of a cube in three-dimensional space with its origin at 000. When we consider the scales, it will be a higher-dimensional hypercube that computers can search through even though we can't easily visualize it. The result is that cross-over transfers larger blocks of a chromosome at once, whereas mutation can limit the search to one or two alleles at a time. This can be seen when you decide to mutate a set of alleles based on some criteria: if we define the fitness function as notes starting with n-r-t, then it is possible to mutate the lower bits while keeping the higher bits the same.

    Search Space

    We can say the search space is a way to describe the number of generations the GA has to go through before arriving at the solution. In our example, we needed 4 generations to create scales. In theory though, we had 2 raised to the power of 24, or 16,777,216 (almost 17 million), strings to search! Even for computers this is a herculean task: if we randomly generated the strings and applied the fitness function to them, we could need up to 17 million trials. If a string has length L and there are 2 alleles, the search space is 2 raised to the power of L. For 3 alleles, it is 3 raised to the power of L. And so on.

    In general though, search space refers to the set of potential solutions and a notion of distance between them. We already looked at the size of the solution space, which for a GA is exponential in the string length. How about distance? There are several ways the distance between 2 strings can be measured. One is called the Hamming distance, which Wikipedia defines as:

    In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences.

    It is now clear that an algorithm that has the knack of figuring out the distance between the current solution and the required solution wins! And GA's will often get you there faster than blind search.
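    A minimal Hamming distance function matching the definition above (the class name is mine):

```java
// Hamming distance: the number of positions at which two
// equal-length strings differ.
public class Hamming {
    static int distance(String a, String b) {
        if (a.length() != b.length())
            throw new IllegalArgumentException("strings must be of equal length");
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    public static void main(String[] args) {
        // sr (001010) and gm (100110) differ in 3 positions
        System.out.println(distance("001010", "100110")); // 3
    }
}
```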

    Where do alleles come from?

    In the musical scale example we started with a set of notes. Where do the notes, or alleles, come from? We have at least two ways of explaining them: either the notes are primordial, or they are a subset of our speech taken from our musical repertoire. First we recognize that each note represents a frequency of sound. If they are primordial, we can conceive that the creator incorporated them into the creation at the beginning of time. String theorists believe that vibrating strings, manifesting as particles with mass, charge and other properties, created the universe. So, for example, an electron is a manifestation of a vibrational state of strings. Some theologians believe that the first documented human knowledge was in the Vedas, which state that the creator uttered aum, resulting in the creation of the universe. This observation is compatible with string theory, which of course makes no reference to the Vedas. We can hypothesize that at time=0 a primordial universe was formed from aum, which then took the course of the big bang and the evolution of life. There is no need to believe that creation and evolution are mutually exclusive. In GA's we take alleles as pre-existing; GA's can't invent them all by themselves. Computer scientists use binary strings made of 0 and 1 because that is how computers represent all data, including programs, in their memory, which is orchestrated by transistors that switch either on (1) or off (0). Engineers might use real numbers to solve optimization problems, which are also represented as 0's and 1's by the underlying computer architecture without the GA explicitly knowing it.

    Schema Theorem and Elitism

    In the GA world, a schema is a template over 3 symbols: 1, 0 and *. For example we could have defined our scales as s-r-*-*-p-*-*-*, where we use * to indicate we don't care which note appears there. So s-r-s-r-p-d-n-t could have been a candidate. Any template written with these 3 symbols is called a schema (plural: schemata). Now the schema theorem says:

    M(H,t+1) = M(H,t) f(H,t)/f (1-p) + p [(1-losses) M(H,t) f(H,t)/f + gains]

    Where M(H,t) = the number of strings sampling hyper-plane H (the current solution) at time t
    f(H,t) = average fitness of the strings sampling H
    f = average fitness of the whole population
    p = probability of cross-over
    

    After some simplification we get

    M(H,t+1)/M(H,t) = pf(H,t)/f + constant
    

    So how is it going to be useful? For that we need to divide M(H,t+1) and M(H,t) by M, the population size, to get proportions.

    P(t+1) = P(t) * constant1 + constant2

    where P is the probability of a generation. With some assumptions we can say that the probability of a generation is deterministically dependent on the previous generation. This is related to elitism in GA, where the fittest chromosomes are carried into the next generation unchanged, so a chromosome can cross over with its children and grandchildren and so on. Genghis Khan, whose genes can reportedly be traced in about one in 200 men, might be considered an elitist in this sense.

    Then we can ask: if it is deterministic, how can new species arise in nature? If dogs could evolve into seals by virtue of being able to swim, how can we explain it? Obviously GA is not going to help you with that. Reading Darwin's manuscripts might help.

    Natural Selection

    Darwin was not very clear on how nature selects a species. This is not to say he didn't know. We know that species evolved over millions of years. That means natural selection is a continuous process. Some evolutionists believe species would suddenly appear where none could be found before (say, after a volcanic explosion). This is as good as saying an alien ship brought them.

    In GA's, selection is, as already stated, by fitness. A chromosome selected for cross-over, by default, is fit. If a chromosome is totally dropped, it is because it is replaced by a child chromosome. This has a parallel in our world as generations of people bearing the same last name (patrilineal which Wikipedia defines as "the male line or agnatic kinship, a common kinship system in which an individual's family membership derives from and is traced through his or her father's lineage").

    Perhaps nature is random in selection. GA's that use roulette or tournament selection follow this precept. In roulette-wheel selection, a wheel divided into sections proportional to each chromosome's fitness is spun to select a chromosome for cross-over. In tournament selection two chromosomes are randomly selected from the population and the less fit of the two is eliminated.
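    The two schemes can be sketched as follows. Note that roulette-wheel selection as written assumes non-negative fitness values, and the fitness array in main is made up for illustration:

```java
import java.util.Random;

public class Selection {
    // Roulette wheel: each chromosome gets a slice of the wheel
    // proportional to its fitness; a random spin picks the winner.
    static int roulette(double[] fitness, Random rnd) {
        double total = 0;
        for (double f : fitness) total += f;
        double spin = rnd.nextDouble() * total;
        double cum = 0;
        for (int i = 0; i < fitness.length; i++) {
            cum += fitness[i];
            if (spin <= cum) return i;
        }
        return fitness.length - 1;
    }

    // Tournament: pick two chromosomes at random; the fitter one survives.
    static int tournament(double[] fitness, Random rnd) {
        int i = rnd.nextInt(fitness.length);
        int j = rnd.nextInt(fitness.length);
        return fitness[i] >= fitness[j] ? i : j;
    }

    public static void main(String[] args) {
        double[] fitness = {1.0, 2.0, 4.0, 8.0};
        Random rnd = new Random(42);
        int[] counts = new int[fitness.length];
        for (int t = 0; t < 10000; t++) counts[roulette(fitness, rnd)]++;
        // Selection counts should come out roughly in proportion 1:2:4:8
        for (int c : counts) System.out.println(c);
    }
}
```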

    Do GA's find Global Optimum?

    To understand an optimum solution, we need to understand the shape of the solution space first. In engineering an optimal solution is one that maximizes or minimizes a function. For example, in car manufacturing you want to maximize fuel economy and minimize the wear and tear of tires. In economics you want to maximize the yield of an instrument and minimize the risk associated with it. And so on. Let us say we have the following equation:

    a + 2b + 3c + 4d = 30

    where a, b, c and d are variables

    Suppose we want to find the values of these variables that satisfy the equation. One solution, just from inspecting it, is a=10, b=2, c=4, d=1. But how do we know if there are other solutions and whether the one we found is optimal? For that we have to visualize a landscape with valleys and peaks. The optimal solution is found either in a valley (minimum) or on a peak (maximum). For example, if we set b=c=d=0, that gives us a single-variable function, and we can identify a=30 as another feasible solution. Thus we have found 2 solutions that are feasible and perhaps optimal but not necessarily globally optimal. We can posit a global optimum because several solutions are feasible (a=b=d=0, c=10; a=c=d=0, b=15; etc.). Therefore a global optimum is one based on some criteria. Say we want to find a solution where a, b, c and d can take any nonzero integer value. One such solution, a=7, b=5, c=3, d=1, was found by a GA starting with random integer values such as these:

    Chromosome[1] = [a;b;c;d] = [12;05;23;08]  
    Chromosome[2] = [a;b;c;d] = [02;21;18;03]  
    Chromosome[3] = [a;b;c;d] = [10;04;13;14]  
    Chromosome[4] = [a;b;c;d] = [20;01;10;06]  
    Chromosome[5] = [a;b;c;d] = [01;04;13;19]  
    Chromosome[6] = [a;b;c;d] = [20;05;17;01] 
    

    Notice that we are not using bit strings. That the GA finds a solution at all is a mystery. Once it finds a solution whether it is globally optimum is yet another mystery. Some evolutionists suggest that one can use a gradient search to scale the peaks in the solution space. This is called Lamarckism (or Lamarckian inheritance) which from Wikipedia "is the idea that an organism can pass on characteristics that it has acquired during its lifetime to its offspring (also known as heritability of acquired characteristics or soft inheritance)". In other words, the GA can go back and forth between cross-over and Lamarckism in its quest to find the global optimum.
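    A GA for this equation can be sketched directly over integer chromosomes, with fitness measured as the absolute error of a + 2b + 3c + 4d from 30 (lower is fitter). The population size, mutation rate, gene range 1..30 and random seed below are my choices for illustration:

```java
import java.util.Arrays;
import java.util.Random;

public class EquationGA {
    static final Random RND = new Random(7);

    // How far a + 2b + 3c + 4d is from 30; 0 means the equation is satisfied.
    static int error(int[] g) {
        return Math.abs(g[0] + 2 * g[1] + 3 * g[2] + 4 * g[3] - 30);
    }

    // A random chromosome of four nonzero integers in 1..30.
    static int[] randomChromosome() {
        int[] g = new int[4];
        for (int i = 0; i < 4; i++) g[i] = 1 + RND.nextInt(30);
        return g;
    }

    // Tournament selection: of two random chromosomes, the lower-error one wins.
    static int[] select(int[][] pop) {
        int[] a = pop[RND.nextInt(pop.length)];
        int[] b = pop[RND.nextInt(pop.length)];
        return error(a) <= error(b) ? a : b;
    }

    static int[] solve() {
        int[][] pop = new int[40][];
        for (int i = 0; i < pop.length; i++) pop[i] = randomChromosome();
        for (int gen = 0; gen < 2000; gen++) {
            for (int[] g : pop) if (error(g) == 0) return g;
            int[][] next = new int[pop.length][];
            for (int i = 0; i < pop.length; i++) {
                int[] p1 = select(pop), p2 = select(pop);
                int cut = 1 + RND.nextInt(3);      // one-point crossover
                int[] child = new int[4];
                for (int k = 0; k < 4; k++) child[k] = k < cut ? p1[k] : p2[k];
                if (RND.nextDouble() < 0.2)        // mutation: re-randomize one gene
                    child[RND.nextInt(4)] = 1 + RND.nextInt(30);
                next[i] = child;
            }
            pop = next;
        }
        return pop[0]; // best effort; with these settings a solution is almost always found
    }

    public static void main(String[] args) {
        int[] g = solve();
        System.out.println(Arrays.toString(g) + " error=" + error(g));
    }
}
```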

    GA and Machine Learning

    After this tour de force, can we say GA is a kind of machine learning? What is being learnt here is a process similar to generate and test or means-ends analysis or a combination of both. In generate and test, scientists consider multiple hypotheses at the same time and test one at a time. It is a slow process, but once completed is guaranteed to find a hypothesis that explains the data if the set of initial hypotheses contains the correct one—which is not always the case. In means-ends analysis, an objective function (ends) is tested for optimality based on a set of operators (means) by evolving the initial set of variables. In both these cases we can achieve convergence, a term to describe success in searching for an optimal solution, by carefully selecting the variables and manipulating them.

    There is no mystery in GA's if the initial hypotheses (chromosomes) considered are potential solutions already. After all the selection, cross-over and mutation, the fitness is going to determine the final candidate solution. There are many ways to define fitness. In biology fitness is often equated with the number of progeny: the more prolific an organism in nature, the more fit it is. By this measure, we can say that if E. coli cell division went uninterrupted with an infinite food resource, they would fill up the whole earth and beyond. Luckily there is no infinite food source that E. coli can tap into. Nevertheless E. coli is perhaps the fittest organism nature ever created. We are taught early on by governments and scientists that population control is an imperative if everyone is to enjoy a higher standard of living. One reason why our ancestors chose to have multiple offspring was the prevalence of disease and famine. Not to mention Malthusianism, which is the idea that population growth is potentially exponential while the growth of the food supply is arithmetical at best; this completely contradicts the way we have evolved over millennia. With the advances in medicine and agriculture, these obstacles are being overcome.

    The prevailing trend in machine learning is that, where we fail to solve complex problems, such as robots identifying the enemy and annihilating the enemy while leaving the good guys out, computers can be coached. It assumes that we can define the problem fairly accurately; in most cases, we can't. Robots have come to be seen, for instance, as the ultimate machine learning tools. Suppose we want an office bot to bring us a stapler from a cabinet down the hall. There are many ways to do it. You could create a rail line between the supply room and your cube, and the robot is the train that goes to the cabinet, pushes a button, much like a vending machine, and brings the object you desire. The solution is feasible but not practical: we can't have rail lines crisscrossing offices. An acceptable solution is to create a robot fitted with a camera that moves its limbs to accomplish the mission. But this solution is much more complicated. Using the camera's visual inputs the robot may have to climb stairs, move its limbs carefully so as not to step on others' feet, and last but not least grab the stapler ever so gently so as not to crush it (while making sure there are enough staples). Even after accomplishing this task, can we say the robot has learned anything? All we know is that the robot is an accomplished stapler retriever. Can a robot understand what a staple is? So can we make robots that learn any retrieval task? For that scientists use the term “behavior.” If we can teach a set of behaviors to a robot, then we can expect the robot to function anywhere and do any task much like we do. An example of a behavior is a motor skill that involves extending arms, rotating wrists and grabbing an object with fingers. As you can see, the more generally we define our problem, the more compositional it becomes, and GA's cut through this complexity of compositions by representing each behavior as a chromosome.

    What is the goal of Evolution?

    If Darwin hadn't published his theories, humans would still be recounting mythologies rich with interesting characters that seem to emerge from nowhere to explain our existence. However, the evolutionists that followed, Darwin included, couldn't answer the question: what is the goal of evolution? If genes were independent of each other and didn't mutate, then we would all share the same genes and potentially remain on earth forever. Would that be a desirable outcome? If medicine could make us disease-free, does it mean we could become immortal? In reality genes are not independent of each other, and they mutate and cannot in their natural course make us super beings. It is probably not far from the truth to state that all humans are fallible because genes are not copied exactly from parent to offspring. In other words, nature makes mistakes as it marches forward. We, being subservient to nature, share all things natural. We are expected, naturally, to steer evolution towards a purpose: a just society, a literate nation, a planet free from hunger, etc. Those in the opposing camp believe in intelligent design, which has been stated as: certain features of the universe and of living things are best explained by an intelligent cause, not an undirected process such as natural selection. If you are an oncologist you would think cancer could not have manifested itself under intelligent design. So too auto-immune diseases, where an organism's immune system attacks its own cells. On the other hand, with machine learning, we are indeed adopting intelligent design by trying to make infallible robots. So evolution and intelligent design are not all that far apart. Computer scientists unwittingly embrace intelligent design in their inventions and compete for a grand prize that cannot be defined easily, elusive at best except by selection!

    Can natural selection be democratic?

    What if, in a GA, a set of observers voted on the best outcome? Suppose a GA evolves 10 scales of music. Can it ascertain for itself which one of the 10 is most melodious? After all, the computer on which the GA runs has no representation of melody, which is intrinsically human. Leaving GA aside, what about little children as they are growing up: can they identify melody? As children they are exposed to lullabies and animations that make them musical experts in their own right. It seems good taste for music is both natural and acquired. There is no indication that natural selection can be democratic. If it were, who would the voting observers be? Some are content to theorize that the elite, even in a democracy, decide what survives. So if there is a vote, it has to be among the elites, not the hoi polloi. For instance, the British royals exported the English language and their mannerisms to their colonies and shaped the evolution of the colonies. If you have to conduct a vote on the best computer scientist, obviously it can't be among common people without knowledge of computer science (otherwise Bill Gates would be chosen).

    Can evolution be stalled?

    Suppose the UN declared universal holidays round the year, with all play and no work except for people producing the basic necessities of life like food, housing and clothing. The rest of us, not producing basic necessities, would be making better tools for them. This is what can happen with machine learning. As machines carry on our tasks we would be designing better algorithms and feeding them to sustain the machines. Some of us would be working on programs that generate programs. It is not all that far-fetched, given that John Koza, a student of GA founder John Holland, created genetic programming, which evolves computer programs by rearranging their subroutines. This goes with the belief that all the ingredients needed to achieve a perfect machine learner have already been made. For those of us aware of classical music, the many melodious permutations and combinations of notes were invented by the likes of Mozart, Beethoven, et al. When music evolved into rock and roll, blues, jazz, pop, etc., they were not all that far from the classics: merely rearrangements of notes already developed by the classical musicians. Even when musicians invent a new musical instrument, the notes to play are still the same, though their arrangement varies from musician to musician. In a sense, with the advancement of science and technology, the evolution of natural processes has been out-paced by the evolution of tools and artifacts. So in a relativistic sense natural evolution has been stalled. We might make a super-smart computer in our lifetime, but nature cannot produce a bird that can type in a reasonable time. This leads to the topic of species extinction. Nature makes way for newly evolved species by making their parents extinct; case in point, the dinosaurs. But this happens over thousands of years, and these days we hear about more species going extinct than evolving.
For example, frog populations, considered harbingers of the earth's climate change, have been declining under the evolutionary pressure created by the human demand for land, which destroys their habitat. Some frog populations have also been declining because of the chytrid fungus. Whom do we blame for the decline? Some argue that frog species like the Macaya breast-spot frog are simply not fit to survive in the new man-made environment, by natural selection: either they evolve or, unfortunately, perish.

    DNA Computers and GA

    Many believe we have reached the limit of the silicon-based chips that power computers. The chip manufacturing industry for a long time followed Moore's law, which refers to the 1965 observation of Intel co-founder Gordon Moore that the number of transistors per square inch on integrated circuits would double every year. Fast forward to 2018 and we can note that the window has slipped from one year to about 18 months. It is conceivable that this follows the diminishing margin of returns that can be represented as an S-curve. At the beginning, when computers were made of vacuum tubes, the S-curve was nearly flat. After the invention of the transistor the growth became exponential as per Moore's law. Now we are at the top of the S-curve where the line flattens again. So where do we go next? Some scientists experimenting with DNA, the material our genes are made of, believe it is the fastest computer we can conceive of. This seems like great news for GA's, because the architecture on which they run could be made of biochips natively, boosting their performance. There is a caveat though. Aeroplanes don't function like birds, even though they both fly; they are based on aerodynamics and fluid mechanics. One of the earliest AI languages, LISP, was once run on LISP-specific silicon chips, which are now largely extinct. If the evolution of computers takes the course of biochips, for Moore's law to be sustainable, we will know whether GA's can survive based on what else but their fitness!

    Genetic Engineering with GA

    Genetic Engineering (GE) applies biotechnology to modify genes in vivo and create an organism with the altered genes. Ever since GE was developed in the 1970's, it has been applied to everything from animals to plants. For example, vegetables with a longer shelf-life are preferable to retail sellers. Also, animals with impaired insulin production, which causes diabetes, can be treated. And so on. The practical application of GA, besides engineering optimization, has been in electronic circuit design. With GE it was feared that one day scientists would create a monster no one could destroy; GA's success is modest, and the fear there was that it could become a mad technician, creating every conceivable electronic circuit and applying for patents. Neither has materialized, partly because of self-regulation by the scientific community.

    Putting it all together

    GA's are inspired by the biological aspects of nature and consider nature as primarily intelligent. They may mimic nature but cannot help us unravel all of its mysteries. So GA is a means to create a machine learner that may one day launch us into inter-planetary travel to meet alien species that are evolving like us in the vast universe. If Darwin's theories are universal, and there is no reason to believe otherwise, so are GA's. Evolutionary programming in general has made us appreciate natural selection and fitness. Not all people are judged this way, and a just society goes out of its way to accommodate people with special needs. Each life is but one data point in the sprawling graph of nature; its fitness is determined by criteria like height, weight, number of offspring, etc. GA's can process millions of permutations and combinations, on appropriate hardware, in a short time span. The ability to understand and improve GA's lies in studying the search space or hyper-planes. Most GA's are non-deterministic and provide new hypotheses for scientists to test. As a tool for scientific research and engineering optimization GA's are pre-eminent, and understanding how they work will enable us to create faster and better machine learners that will make our lives richer.

    References

    http://www.boente.eti.br/fuzzy/ebook-fuzzy-mitchell.pdf is an e-book by Melanie Mitchell

    https://arxiv.org/ftp/arxiv/papers/1308/1308.4675.pdf has a mathematical optimization problem solved with GA

    http://www.cs.colostate.edu/~genitor/MiscPubs/tutorial.pdf is a tutorial on GA's

    www.intelligentdesign.org

    Tuesday, March 28, 2017

    Extracting Excel Content as Text

    The following code extracts the text content of an xlsx file:

    package com.finra;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.PrintWriter;
    
    import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;
    
    public class TestExcelExtractor {
    
    	public static void main(String [] args) throws Exception {
    		
            FileInputStream file = new FileInputStream(new File("c:/users/dgandikota/Test.xlsx"));
    
            
            //Create Workbook instance holding reference to .xlsx file
            XSSFWorkbook workbook = new XSSFWorkbook(file);
        
            XSSFExcelExtractor excelExtractor = new XSSFExcelExtractor(workbook);
            String allTxt = excelExtractor.getText();
            System.out.println(allTxt);
            allTxt = allTxt.replaceAll("\t", "|");
            allTxt = allTxt.replace("null", "");
            if (new File("c:/users/dgandikota/excelextract.txt").exists()) {
            	new File("c:/users/dgandikota/excelextract.txt").delete();
            }
            PrintWriter pw = new PrintWriter(new FileOutputStream(new File("c:/users/dgandikota/excelextract.txt")));
            pw.print(allTxt);
            pw.close();
    //        System.out.println(allTxt);
            
    	}
    }
    

    The sheet names are output by default

    The end of a sheet is found by the absence of the field separator (e.g. line.indexOf("|") == -1)
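    Using that observation, splitting the extracted text back into per-sheet rows can be sketched as below. This assumes the pipe-delimited output produced by the extractor above, and that a sheet-name line itself never contains the separator (single-column rows would break this assumption):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SheetSplitter {
    // Group extracted lines into sheets: a line without the "|" separator
    // is taken to start a new sheet (the sheet-name line); the rows that
    // follow belong to that sheet.
    static Map<String, List<String>> split(String allTxt) {
        Map<String, List<String>> sheets = new LinkedHashMap<>();
        String current = null;
        for (String line : allTxt.split("\n")) {
            if (line.isEmpty()) continue;
            if (line.indexOf('|') == -1) {       // no field separator: sheet name
                current = line;
                sheets.put(current, new ArrayList<>());
            } else if (current != null) {
                sheets.get(current).add(line);   // data row of the current sheet
            }
        }
        return sheets;
    }

    public static void main(String[] args) {
        String sample = "Sheet1\na|b|c\n1|2|3\nSheet2\nx|y|z\n";
        System.out.println(split(sample));
    }
}
```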

    REST Example with Headers using TomEE Plus

    apache-tomee-plus-1.7.4 comes with out-of-the-box REST support


    Tools to test REST services

  • Firefox-Poster Add-On
  • Chrome - postman?

    Create the following class and deploy inside a war on TomEE-Plus

    package com.apache.rest;
    
    import java.util.Set;
    
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.Context;
    import javax.ws.rs.core.HttpHeaders;
    import javax.ws.rs.core.MediaType;
    
     @Path("/greeting")
     public class GreetingService {
         @GET
         @Produces( { MediaType.TEXT_XML })
          @Path("{id1}/{id2}")
         public String message(@PathParam("id1") String id1, @PathParam("id2") String id2,  @Context HttpHeaders httpHeaders) {
    //called as http://localhost/greeting/{id1}/{id2}
    //e.g. http://localhost/greeting/3/2
           /** how to get specific header info? **/
    //        String cacheControl = httpHeaders.getRequestHeader("Cache-Control").get(0);
    //         System.out.println("Cache-Control: "+cacheControl);
     /** get list of all header parameters from request **/
 Set<String> headerKeys = httpHeaders.getRequestHeaders().keySet();
     for(String header:headerKeys){
           if (httpHeaders.getRequestHeader(header)!= null && httpHeaders.getRequestHeader(header).size() > 0 )
              System.out.println(header+":"+httpHeaders.getRequestHeader(header).get(0));
     }
             return "Hi REST!" + id1 + " " + id2;
         }
    
         @POST
    //called as http://localhost/greeting?message=test
         public String lowerCase(final String message) {
             return "Hi REST!".toLowerCase();
         }
     }
    
    

    More at http://www.vogella.com/tutorials/REST/article.html


    Saturday, February 18, 2017

    TomEE JMS

    Why I wrote this up

    When I was trying to hook up JMS with TomEE-Plus, I looked for on-line resources. There are multiple tutorials on how to do this, but not all the information could be found on one page. So this tutorial is the result of gleaning from various sources. Hope you find it useful. I am not big on analysis, so let's go straight into coding.

    Tested this on TomEE-Plus-1.7.4; does not work with TomEE 7 or other servers

    First Configure conf/tomee.xml as follows

    <tomee>
      <!-- see http://tomee.apache.org/containers-and-resources.html -->
    
      <!-- activate next line to be able to deploy applications in apps -->
    
      <!-- <Deployments dir="apps" /> -->
    
    <Resource id="MyJmsResourceAdapter" type="ActiveMQResourceAdapter">
            BrokerXmlConfig =  broker:(tcp://localhost:61616)
            ServerUrl       =  tcp://localhost:61616
        </Resource>
    
        <Resource id="MyJmsConnectionFactory" type="javax.jms.ConnectionFactory">
            ResourceAdapter = MyJmsResourceAdapter
        </Resource>
    
        <Container id="MyJmsMdbContainer" ctype="MESSAGE">
            ResourceAdapter = MyJmsResourceAdapter
        </Container>
    
        <Resource id="barQueue" type="javax.jms.Queue"/>
        <Resource id="fooTopic" type="javax.jms.Topic"/>
    </tomee>
    

    Create a servlet thus

    package com.apache.jms;
    
    
    import java.io.IOException;
    
    import javax.annotation.Resource;
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.DeliveryMode;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;
    import javax.naming.InitialContext;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    @WebServlet(name="mytest", urlPatterns={"/myurl"}) 
     public class MyJmsServlet extends HttpServlet {
      
         //@Resource(name = "fooTopic")
         static private Topic fooTopic;
      
         //@Resource(name = "barQueue")
         static private Queue barQueue;
      
         //@Resource (name="MyJmsConnectionFactory" , type = ConnectionFactory.class)
         static private ConnectionFactory connectionFactory;
      
         @Override
         protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
             //...
      try {
        InitialContext ctx = new InitialContext();
        connectionFactory = (ConnectionFactory) ctx.lookup("java:openejb/Resource/MyJmsConnectionFactory");
        barQueue = (Queue) ctx.lookup("java:openejb/Resource/barQueue");
        fooTopic = (Topic) ctx.lookup("java:openejb/Resource/fooTopic");
             Connection connection = connectionFactory.createConnection();
             connection.start();
      
             // Create a Session
             Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      
             // Create a MessageProducer from the Session to the Topic or Queue
             MessageProducer producer = session.createProducer(barQueue);
             producer.setDeliveryMode(DeliveryMode.PERSISTENT);
      
             // Create a message
             TextMessage message = session.createTextMessage("Hello World!");
      
             // Tell the producer to send the message
             producer.send(message);
      
             //...
         } catch (Exception e) {
          e.printStackTrace();
         }
         }
         
         public static void main(String [] args) {
           // note: connectionFactory is initialized only in doGet, so running
           // this main standalone will throw a NullPointerException
           try {
               Connection connection = connectionFactory.createConnection();
               connection.start();
        
               // Create a Session
               Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        
               // Create a MessageProducer from the Session to the Topic or Queue
               MessageProducer producer = session.createProducer(fooTopic);
               producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
        
               // Create a message
               TextMessage message = session.createTextMessage("Hello World!");
        
               // Tell the producer to send the message
               producer.send(message);
        
               //...
           } catch (Exception e) {
            e.printStackTrace();
           }
         }
    }
    
    

    Create a client thus

    package com.apache.jms;
    
    import java.util.Properties;
    
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;
    import javax.naming.InitialContext;
    
    public class MyJmsClient {
        //@Resource(name = "fooTopic")
        static private Topic fooTopic;

        //@Resource(name = "barQueue")
        static private Queue barQueue;

        //@Resource (name="MyJmsConnectionFactory" , type = ConnectionFactory.class)
        static private ConnectionFactory connectionFactory;

        private static MessageConsumer consumer = null;

        public static void main(String[] args) {
            try {
                Properties props = new Properties();
                // For an in-JVM lookup, use the local factory instead:
                // props.put("java.naming.factory.initial", "org.apache.openejb.client.LocalInitialContextFactory");
                props.put("java.naming.factory.initial", "org.apache.openejb.client.RemoteInitialContextFactory");
                props.put("java.naming.provider.url", "http://127.0.0.1:8080/tomee/ejb");
                props.put("java.naming.security.principal", "tomee");
                props.put("java.naming.security.credentials", "tomee");

                InitialContext ctx = new InitialContext(props);
                connectionFactory = (ConnectionFactory) ctx.lookup("MyJmsConnectionFactory");
                barQueue = (Queue) ctx.lookup("barQueue");
                fooTopic = (Topic) ctx.lookup("fooTopic");

                Connection connection = connectionFactory.createConnection();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                consumer = session.createConsumer(barQueue);
                connection.start();

                // Poll the queue: receive(1) waits at most 1 ms and returns null
                // if nothing arrived in that window, so the loop keeps checking.
                while (true) {
                    Message m = consumer.receive(1);
                    if (m != null) {
                        if (m instanceof TextMessage) {
                            TextMessage message = (TextMessage) m;
                            System.out.println("Reading message: " + message.getText());
                        } else {
                            break;
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    
    

    Run the client's main method and make sure it is listening for messages. Pass the following system property to the JVM: -DResource/javax.jms.ConnectionFactory=connectionfactory:org.apache.activemq.ActiveMQConnectionFactory:tcp://localhost:61616

    For example:

    java -cp ./activemq-all-5.14.3.jar:openejb-client-7.0.2.jar:concurrent-1.3.3.jar:javax.ejb.jar:. -DResource/javax.jms.ConnectionFactory=connectionfactory:org.apache.activemq.ActiveMQConnectionFactory:tcp://172.20.0.218:61616 MyJmsClient topic
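    The -DResource/... flag is an ordinary JVM system property that the OpenEJB client library reads to build the ConnectionFactory. As a rough sketch (this is not OpenEJB's actual parsing code, just an illustration of the value's shape), the value is a colon-separated triple of resource kind, factory class, and broker URL, which plain JDK code can pull apart:

    ```java
    public class ResourcePropertySketch {
        public static void main(String[] args) {
            // Simulate what -DResource/javax.jms.ConnectionFactory=... sets on the command line
            System.setProperty("Resource/javax.jms.ConnectionFactory",
                    "connectionfactory:org.apache.activemq.ActiveMQConnectionFactory:tcp://localhost:61616");

            String value = System.getProperty("Resource/javax.jms.ConnectionFactory");
            // Split on the first two colons only, so the tcp:// URL stays intact
            String[] parts = value.split(":", 3);
            System.out.println("kind:    " + parts[0]); // connectionfactory
            System.out.println("factory: " + parts[1]); // org.apache.activemq.ActiveMQConnectionFactory
            System.out.println("url:     " + parts[2]); // tcp://localhost:61616
        }
    }
    ```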

    Deploy the servlet on TomEE and invoke it as http://localhost:8080/myurl

    You should see the following message

    Reading message: Hello World!

    Required Jars:

  • activemq-all-5.14.3.jar
  • openejb-client-7.0.2.jar
  • concurrent-1.3.3.jar

    Wednesday, February 15, 2017

    Active MQ Stand Alone

    Download apache-activemq-5.14.3.zip

    Unzip it

    Set JAVA_HOME, go to the bin directory, and run "activemq start"

    Create a Sender Thus

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Destination;
    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnection;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class Sender {

        private ConnectionFactory factory = null;
        private Connection connection = null;
        private Session session = null;
        private Destination destination = null;
        private MessageProducer producer = null;

        public void sendMessage() {
            try {
                factory = new ActiveMQConnectionFactory(ActiveMQConnection.DEFAULT_BROKER_URL);
                connection = factory.createConnection();
                connection.start();
                session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                destination = session.createQueue("SAMPLEQUEUE");
                producer = session.createProducer(destination);
                TextMessage message = session.createTextMessage();
                message.setText("Hello ...This is a sample message..sending from FirstClient");
                producer.send(message);
                System.out.println("Sent: " + message.getText());
            } catch (JMSException e) {
                e.printStackTrace();
            } finally {
                // Close the connection so the broker's threads do not keep the JVM alive
                try {
                    if (connection != null) {
                        connection.close();
                    }
                } catch (JMSException ignored) {
                }
            }
        }

        public static void main(String[] args) {
            Sender sender = new Sender();
            sender.sendMessage();
        }
    }
    

    Create a receiver thus

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Destination;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnection;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class Receiver {

        private ConnectionFactory factory = null;
        private Connection connection = null;
        private Session session = null;
        private Destination destination = null;
        private MessageConsumer consumer = null;

        public void receiveMessage() {
            try {
                factory = new ActiveMQConnectionFactory(ActiveMQConnection.DEFAULT_BROKER_URL);
                connection = factory.createConnection();
                connection.start();
                session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                destination = session.createQueue("SAMPLEQUEUE");
                consumer = session.createConsumer(destination);

                // receive() with no timeout blocks until a message arrives
                Message message = consumer.receive();
                if (message instanceof TextMessage) {
                    TextMessage text = (TextMessage) message;
                    System.out.println("Message is : " + text.getText());
                }
            } catch (JMSException e) {
                e.printStackTrace();
            } finally {
                // Close the connection so the JVM can exit cleanly
                try {
                    if (connection != null) {
                        connection.close();
                    }
                } catch (JMSException ignored) {
                }
            }
        }

        public static void main(String[] args) {
            Receiver receiver = new Receiver();
            receiver.receiveMessage();
        }
    }
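    Note the two consumption styles used above: the standalone Receiver calls receive() with no timeout, which blocks until a message arrives, while the earlier MyJmsClient polls in a loop with receive(1), which returns null after 1 ms if nothing is queued. The same two patterns can be sketched with a plain java.util.concurrent queue standing in for the JMS consumer (an analogy only, not real JMS):

    ```java
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class ReceiveStyles {
        public static void main(String[] args) throws InterruptedException {
            // Stand-in for a JMS destination (assumption: not a real broker)
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
            queue.put("Hello World!");

            // Polling style, like consumer.receive(1): null when nothing arrives in time
            String polled = queue.poll(1, TimeUnit.MILLISECONDS);
            System.out.println("polled: " + polled); // polled: Hello World!

            // Blocking style, like consumer.receive(): waits until a message is available
            queue.put("second");
            String taken = queue.take();
            System.out.println("taken: " + taken); // taken: second
        }
    }
    ```

    The polling style suits a loop that also does other work between checks; the blocking style is simpler when the thread has nothing else to do.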
    
    

    When compiling the files, use the following class path:

    C:\jms>"c:\Program Files\java\jdk1.8.0_101\bin\javac.exe" -cp apache-activemq-5.14.3\activemq-all-5.14.3.jar;. Sender.java
    

    When running, use the same class path.


    Windows invocation of the client

    C:\jms>"c:\Program Files\Java\jdk1.8.0_101\bin\java.exe" 
    -cp apache-activemq-5.14.3\activemq-all-5.14.3.jar;.;bak\openejb-client.jar;bak\javax.ejb.jar -DResource/javax.jms.ConnectionFactory=connectionfactory:org.apache.activemq.ActiveMQConnectionFactory:tcp://localhost:61616 MyJmsClient