Wednesday, April 11, 2018

Neural Nets

The common household thermostat functions by sampling the room temperature and comparing it with the set temperature. The difference is called the error. For example, if the set temperature is 70F and the sampled room temperature is 68F, then the error is 2F. A corrective action is invoked, which in this case is heating the house until the sampled temperature is close enough to the set temperature. In control theory this is called a feedback controller: the variable fed back is the error. If the error is positive the house is heated; if it is negative the house is cooled. The existence of a feedback controller suggests its counterpart, a feedforward controller. Unlike a feedback controller, which adjusts the output to minimize the error, a feedforward controller has no error signal; instead it has a model of the system, which it simulates to arrive at the desired output. Thus a feedforward controller is model-based, and a simulation is often required. Suppose we have a simple requirement that the house has to be cooled after 9am if it is sunny; otherwise it has to be heated. Determining "sunny" is the challenge for the model-based controller. Scale this up to a big office building, where we may have rules like: if it is Sunday and sunny outside, turn off the air-conditioning; if it is a working day and sunny outside, cool the office to 70F; if it is snowing outside, heat the office to 70F; and so on. Let us suppose we have modeled all of this successfully and the office air-conditioning is humming as desired. But we forgot about public holidays, not to mention rainy days, while modeling the controller. So we dutifully add more rules to handle the exceptions. And if the EPA (Environmental Protection Agency), which has temperature guidelines for commercial buildings, changes the desired winter temperature from 70F to 72F, we are faced with rewriting all of our rules.
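To make the feedback loop concrete, here is a minimal sketch in Python. The functions read_temperature, run_heater and run_cooler are hypothetical stand-ins for real sensor and actuator calls, not part of any actual thermostat API:

    # Minimal sketch of the feedback loop described above.
    SET_POINT = 70.0   # desired temperature, F
    TOLERANCE = 0.5    # "close enough" band, F

    def feedback_step(read_temperature, run_heater, run_cooler):
        error = SET_POINT - read_temperature()  # e.g. 70 - 68 = 2
        if error > TOLERANCE:
            run_heater()       # positive error: too cold, so heat
        elif error < -TOLERANCE:
            run_cooler()       # negative error: too warm, so cool
        # otherwise do nothing: temperature is close enough
        return error

Notice that no model of the house appears anywhere; the controller only ever reacts to the error, which is what distinguishes it from the feedforward case.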

As you can see, feedforward controllers are much more complicated. Actually, they aren't for a neural net, a biologically inspired controller that has the ability to learn and has no complicated model for us to maintain. At its simplest, a neural net samples the input, propagates the sampled input to the nearest neuron, which computes the error by matching its output with the desired output, and takes a corrective action. This is similar to the way neurons operate.

A neuron has an axon that passes the signal, in the form of a voltage, to the nearest neuron (it could be farther too, making the axon long) and dendrites to receive signals; scientists call the junctions between them synapses. The cell body of the neuron has ions that are polarized by the voltage of the input signal and carry out chemical reactions to generate the output voltage. Since multiple axons can terminate at a dendrite (a kind of junction in the roadways), each voltage is weighted, and if the weighted sum exceeds a threshold, a voltage is passed on to the cell. Suppose v = voltage and w = weight; then with 3 incoming connections, if v1w1 + v2w2 + v3w3 > threshold the neuron is fired. Otherwise nothing happens. That is not the end of the story: the weights can be positive (excitatory) or negative (inhibitory). It has been hypothesized that the brain is principally composed of about 10 billion such neurons, each connected to about 10,000 other neurons. So this is really huge!
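A rough sketch of that firing rule in Python (the voltages, weights and threshold below are made-up numbers for illustration):

    # Fire if the weighted sum of input voltages exceeds the threshold.
    # Weights may be positive (excitatory) or negative (inhibitory).
    def fires(voltages, weights, threshold):
        total = sum(v * w for v, w in zip(voltages, weights))
        return total > threshold

    # Three incoming axons: v1*w1 + v2*w2 + v3*w3 > threshold?
    print(fires([0.9, 0.2, 0.7], [0.5, -0.3, 0.8], threshold=0.6))  # True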

If a neuron fires repeatedly, the neighbor can ignore the signal. Suppose, for instance, you are facing a cheetah all by yourself and a fight-or-flight decision has to be made. The neurons will be firing, resulting in the decision that you pretend to be dead. It is the most pragmatic decision, because a cheetah has claws to injure you and it can out-run you, so neither fight nor flight is the best approach. The neurons firing in the fight mode and the ones firing in the flight mode could be sending the strongest signals, but the brain region processing the signals, say the prefrontal cortex, will decide what course of action to take. It need not be a life-threatening situation. The same thing happens in a sultry house when you must decide whether to turn on the air-conditioning, yet the decision taken is to open the windows instead, not only because it saves energy but also for reasons of health (fresh air).

A bit of history

Way back in 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts modeled a simple neural network using electrical circuits. The neural nets they described had thresholds and weights, but there were no hidden layers, and there was no training mechanism. What they showed was that such a net could master any function that a digital computer could. As early as the 50's, IBM researchers under the supervision of Nathaniel Rochester tried, rather unsuccessfully, to simulate a neural net. That didn't stop others from trying. With the growth of telephone communication, the early telephone engineers were faced with the daunting task of reducing or eliminating echoes. Echo-cancellation software based on a neural network (MADALINE), developed by Bernard Widrow and Marcian Hoff of Stanford University in 1959, seems to be the first practical application. In the 60's and 70's research teams such as Widrow & Hoff (1962) and Kohonen & Anderson (1972) tried different approaches, such as weighted connections and digital circuits. There are a number of publications, like the one by Yann LeCun, Leon Bottou, Genevieve Orr and Klaus-Robert Muller of AT&T Labs (1998), on topics ranging from how to initialize the weights to alternatives to sigmoid functions for modeling neuron activation. Their main theme seems to be faster applications with a low memory footprint, which is not of much concern these days. It is to be noted that the USA was not the only country developing neural nets. In 2010 Ganesan et al. applied neural nets to diagnosing cancer using demographic data from India. Their goal was to "check for lung cancer disease" -- primarily among tobacco smokers -- "quickly and painlessly and thus detecting the disease at an early stage". More recently, neural nets are being developed for monitoring credit card fraud, determining the credit-worthiness of loan applicants, a "chemical nose" that combines spectrometry with neural nets, and so on. It may be noted that credit-worthiness used to be determined by human adjudicators before AI came along and largely replaced the bias of humans with more objective data. However, neural nets don't explain how they decide; they merely give a faster response based on their training set. Where humans have bias, AI systems such as neural nets have inductive bias (more later).

The learning net

The difference between a feedforward controller and a neural net is in the ability of the latter to learn. This is particularly exciting because we can then train the net to learn any complex activity. Neural nets that propagate the error backwards (called backprop) from the output layer to the middle layer, then to the input layer, are the winners of the learning contest; scientists who tried without layers failed in their attempts. So a canonical neural net can be defined as one with at least one neuron in the input layer connected to another neuron in the middle layer, culminating in the neuron in the output layer. It took scientists, led by David Rumelhart, several decades after the conception of the Perceptron to come up with this canonical model. A Perceptron merely carried out boolean operations like AND, OR and NOT. The result of AND is 1 if all of its inputs are 1. An example: pass a resolution if all of the votes agree to it. The result of OR is 1 if at least one of the inputs is 1. An example: reject a resolution if there is at least one veto. The result of NOT is to flip the input from 1 to 0 and vice-versa. But the Perceptron could not handle XOR, which sets the output to 1 if its 2 inputs are not the same. Let us say your marketing team came up with a remarkable observation that older men and younger women prefer tea, with the restriction that younger men and older women don't. You could then model this with XOR, passing it the ages of the men and women as inputs. But what about middle-aged men and middle-aged women? We run into semantic issues, and the Perceptron breaks down with all the confusion. Researchers like Marvin Minsky were the first to report that Perceptrons are computationally weak. Something needed to be done, and the messenger was David Rumelhart with the unveiling of backprop. As many things go, Rumelhart was the last one to stumble onto this: Paul Werbos of Harvard had proposed a similar algorithm in his PhD thesis in 1974, and even before that Arthur Bryson and Yu-Chi Ho, a couple of control theorists, had the same idea in 1969.
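To see the weakness concretely, here is a small sketch of the classic unit-step perceptron trained with the perceptron learning rule. It reaches perfect accuracy on AND but never on XOR, because no single straight line separates XOR's positive examples from its negative ones:

    # Single-layer perceptron: masters AND/OR, fails on XOR.
    def train_perceptron(samples, epochs=100, lr=0.1):
        w, b = [0.0, 0.0], 0.0
        for _ in range(epochs):
            for (x1, x2), target in samples:
                out = 1 if w[0]*x1 + w[1]*x2 + b > 0 else 0
                err = target - out          # perceptron learning rule
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                b    += lr * err
        return w, b

    def accuracy(samples, w, b):
        hits = sum((1 if w[0]*x1 + w[1]*x2 + b > 0 else 0) == t
                   for (x1, x2), t in samples)
        return hits / len(samples)

    AND = [((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)]
    XOR = [((0,0),0), ((0,1),1), ((1,0),1), ((1,1),0)]

    for name, data in [("AND", AND), ("XOR", XOR)]:
        w, b = train_perceptron(data)
        print(name, accuracy(data, w, b))  # AND reaches 1.0; XOR never does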

We have come to expect that neural nets can learn. But what are they learning? In simple terms, they are learning the weights of the connections between neurons. If the network is initialized with random weights and the errors are back-propagated, resulting in adjustments to the weights, then it is learning the weights. For instance, suppose we have two output neurons, drink water and drink orange juice, and the input neurons are athlete and diabetic. We want the diabetic to drink water, and we don't care what the athlete drinks. The connection between diabetic and drink water could, via the middle layer, be hard-wired, and we might hope the connections between athlete and the two outputs can similarly be hard-wired. But the problem is that the athlete is recommended both water and juice, which is confusing (assuming juice can't be diluted with water). There is a need to learn more about the athlete. While we adjust the weights of the connections from athlete to the outputs, we may weaken the stronger reinforcement between diabetic and one of the outputs. Therefore, what the net finally learns is not: if you are a diabetic, drink water and avoid juice with complete certainty; if you are an athlete, drink water or juice. The former's strong association conflicts with the latter's weak association between inputs and outputs. This shows that neural nets are not amenable to traditional logic; what they learn is based on the inputs and outputs you present to them. If you gave more examples of diabetics than athletes, including diabetics who are athletes, then the net could be biased towards diabetics. There is a fine balance to strike. Suppose we trained the net to handle all these cases and now want it to handle an athlete who is pre-diabetic. The net will be stumped, because we haven't taken into account pre-diabetics, whose mapping to water or juice mirrors that of an athlete who is not a diabetic. But how is the net to know who is pre-diabetic? Even expert physicians are not sure unless they measure blood glucose levels. This is the reason scientists split the available data into two unequal sets, one for training and the other for testing. Any unexpected inputs require augmenting the training set and relearning.
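As a sketch of such an unequal split, assuming scikit-learn is available (X and y below are toy placeholders for a real dataset):

    # Hold back a test set to catch cases the net hasn't truly learned.
    from sklearn.model_selection import train_test_split

    X = [[1, 0], [0, 1], [1, 1], [0, 0]] * 25   # 100 toy samples
    y = [0, 1, 1, 0] * 25

    # 80% for training, 20% held back to test generalization
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)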

Neural nets also need to avoid over-fitting. For example, a net trained on blood glucose and heart rate, instead of on diabetic and athlete (assuming athletes have lower heart rates), could fall into the trap of concluding that lower blood glucose and a lower heart rate always mean an athlete; yet it is possible for a diabetic to have both conditions. Nor can the net's learning be extrapolated to diabetics who are training to be athletes. A more egregious over-fit is when we try to mix a training set for dogs with one for their owners, thinking a dog and its owner share many things, when they obviously are not biologically identical.

Hidden Layers and Traps

We talked about a middle layer, which is sometimes called a hidden layer because it is neither an input nor an output. There are two aspects to it: there could be multiple hidden layers, and multiple neurons within each of them. How do we know how many of each to start with? Researchers aren't sure what exactly the hidden layers do, except that they somehow handle the non-linearities that exist in the training set. A linear function can be represented mathematically as a straight line; in the net mentioned above, the relationship between diabetic and water is a linear one. A non-linear function can be a parabola or an exponential. This is what we encounter when the boundary between positive and negative examples is not a simple straight line.

Here is the classic nature-versus-nurture situation. All of us are born with a limited number of neurons, which cannot reproduce; that is the constraint created by nature. With constant conditioning of the brain, or nurture, we can make the most of them, even if we never employ 100% of the neurons that can never be replaced. Studies done on Alzheimer's patients show a battle between nature and nurture, with the plaques in the brain at issue: the plaques come about naturally, and it is hoped that with reinforcement learning the natural hurdles can be overcome by caregivers constantly nurturing the Alzheimer's patients.

How deep is Deep Learning?

In the 90's the cascade-correlation (CC) algorithm was developed to determine the hidden-layer topology. Imagine you have an emergency situation with fires to put out in a building. You are the fire chief, and you command a battalion. One strategy is to send the whole battalion into the building to put out the fire, but that carries a life-threatening risk: in the worst case you lose the whole battalion. So instead you send one firefighter at a time, carefully evaluating the outcome after each. CC is that battalion chief. CC starts by training a net without any hidden neurons. If the learning is incomplete, as determined by the errors, it adds a hidden neuron. If the learning is now complete, we are done; otherwise CC adds another neuron, and the process repeats until the desired accuracy is achieved. This is exactly what we need to handle non-linearities in our training set, but it is a slow way to learn.
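A high-level sketch of that loop, with train_output_layer and add_hidden_unit as hypothetical stand-ins for the real cascade-correlation procedures (training the output weights, then installing a new hidden unit trained to correlate with the residual error):

    # One "firefighter" (hidden unit) at a time, until the error is low.
    def cascade_correlation(net, data, target_error, max_units=50):
        error = train_output_layer(net, data)    # start with no hidden units
        while error > target_error and net.num_hidden < max_units:
            add_hidden_unit(net, data)           # send in one more unit
            error = train_output_layer(net, data)
        return net, error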

With deep learning, it is envisaged that multiple hidden layers can be added to the net. This is the next step in the evolution of CC: while CC determines the optimal number of hidden nodes, deep learning takes it to the next level by adding an optimal number of hidden layers. So why did it take so long to figure this out? One reason is the availability of cheap memory and processing power. Another is what is known as an autoencoder, which basically works like a hidden layer that learns incrementally. By stacking autoencoders one can create a deep learning net. For example, to recognize the proverbial elephant, one autoencoder can learn to recognize the trunk, another the legs, and so on, with the output emerging from the composition of all the partial results. With cloud computing being the rage, one can create servers on demand, called elastic computing, in the cloud without ever touching them. Elastic computing has been defined by Microsoft Azure as:

Elastic computing is the ability to quickly expand or decrease computer processing, memory, and storage resources to meet changing demands without worrying about capacity planning and engineering for peak usage. Typically controlled by system monitoring tools, elastic computing matches the amount of resources allocated to the amount of resources actually needed without disrupting operations.

Autoencoders are the elastic computers of backprop. We get the flexibility to model our net while keeping an eye on accuracy. If adding an autoencoder results in over-fit, then undoing it is a simple matter of removing it from the configuration. What about the weights? Unlike with CC, researchers are not clear whether the weights have to be reset to random values and the learning repeated, in which case the learning is impeded. If the trade-off is acceptable, it is a small price to pay for the final result: a network with optimal weights that won't over-fit.

Furthermore, an autoencoder's learning is unsupervised. To understand this, imagine a 100x100 pixel image that needs to be learnt. You can set each pixel value as an input as well as an output, because an autoencoder is not expected to classify the image as a tree or an animal. What is expected of an autoencoder is the compression of the image into a sufficiently small number of hidden nodes. Suppose you would like to verify whether an image is a tree: you plug the pixels into the inputs of an autoencoder trained on trees, and if the outputs match the inputs closely, then it is indeed a tree.
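A minimal sketch of such an autoencoder, assuming Keras (TensorFlow) is available; the 64-unit hidden layer and the random placeholder images are arbitrary choices for illustration:

    # Dense autoencoder: 100x100 image flattened to 10,000 inputs,
    # compressed through 64 hidden units, reconstructed at the output.
    import numpy as np
    from tensorflow.keras import layers, models

    autoencoder = models.Sequential([
        layers.Input(shape=(10000,)),
        layers.Dense(64, activation="sigmoid"),     # the compressed code
        layers.Dense(10000, activation="sigmoid"),  # the reconstruction
    ])
    autoencoder.compile(optimizer="adam", loss="mse")

    images = np.random.rand(32, 10000)  # placeholder for real pixel data
    autoencoder.fit(images, images, epochs=5, verbose=0)  # input == output

    # Low reconstruction error on a new image suggests it resembles
    # the training data (e.g. "it is indeed a tree").
    reconstruction_error = autoencoder.evaluate(images, images, verbose=0)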

One issue with deep neural nets is that the signal gets attenuated as it moves from layer to layer. A workaround has been to skip some layers: for instance, if the first layer identifies the input as a diabetic, then it can go directly to the output node drink water rather than juice. This is akin to hard-wiring, or injecting rules into, the network. Some researchers also use specialized hardware called Graphics Processing Units (GPUs), originally designed for video games, to simulate hundreds of layers. Such brute-force methods work for large corporations like Microsoft, Twitter and Google that can afford to throw expensive hardware, or even parallel processing, at the net. For hoi polloi the best hope is to invent better algorithms.

Can we predict the stock market?

Using recurrent networks we can, in theory, predict which way the stock market moves based on previous values. It is like extrapolation in numerical analysis. Recurrent networks are defined by Wikipedia as:

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
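As a sketch of this kind of extrapolation, assuming Keras is available, a toy sine wave below stands in for price history, and windows of 10 past values are used to predict the next one:

    # Next-value prediction with a recurrent layer.
    import numpy as np
    from tensorflow.keras import layers, models

    series = np.sin(np.linspace(0, 20, 500))        # toy "price" series
    window = 10
    X = np.array([series[i:i+window] for i in range(len(series)-window)])
    y = series[window:]
    X = X.reshape(-1, window, 1)                    # (samples, time, features)

    model = models.Sequential([
        layers.SimpleRNN(16, input_shape=(window, 1)),  # internal state = memory
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, verbose=0)
    next_value = model.predict(X[-1:], verbose=0)   # extrapolate one step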

So why aren't we all getting rich by predicting the stock market accurately? Here comes the credit assignment issue: can we say which layer or node is responsible for correctly mapping an input? In most symbolic computation, such as rules for medical diagnosis, we can say a diagnosis follows from the symptoms and lab tests. Credit is assigned fairly accurately to a rule that could read: if the fasting blood glucose level is high, then the patient is diabetic. We can't say the same about neural nets. A net is a holistic representation of the solution space that cannot be dissected. This is unlike the way the brain functions. Neurologists have shown that if a region of the brain has been impaired, because of, say, an accident, other regions take over its functions. It has also been shown that, in the case of speech-impaired people, sign language learning can take place in the same region that handles speech. In both cases neurologists have shown that the brain works much like the world-wide-web: if part of the web has been cut off because of a power outage, the rest of the web is still functional. This is not the case with neural nets, and they continue to be black boxes holding the secrets of learning.

Gradient Descent: is it up or down?

It can be hypothesized that learning the weights follows a path of least resistance. Imagine you are blind-folded and need to get to the lowest point of a field. You feel the ground, ascertain whether it slopes up or down, and step downhill. This is called gradient descent by the connectionists (the cognitive scientists who apply neural nets to their research). Why can't we just try different weights for each connection and test the error? Imagine a network with the topology 3 x 4 x 1, that is, 3 inputs, 4 hidden and 1 output. There are a total of 3 x 4 + 4 = 16 weights to be learnt. If we restrict the precision of each weight to 3 decimal places, then we need to generate and test 1000**16 combinations. That is 1 followed by 48 zeroes! Even for the fastest computers, this could take an eon. So until we can make faster hardware, we stick to gradient descent and find local optima, as the global optimum could be an over-fit. To understand why a global optimum could be an over-fit, consider a net for recognizing hand-written digits. We train it so that it is 100% accurate in recognizing X's handwriting, but it will fail to recognize the same digits in Y's handwriting. The net has found a global optimum that won't apply to Y. If all we want from the net is to recognize X's handwriting, then that is fine. However, if we would like our net to be used in a bank ATM to recognize the amounts on checks, it is better to find a local optimum.
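The contrast can be sketched in a few lines of Python; the one-dimensional error curve (w-3)**2 below is a stand-in for the real error surface:

    # Why exhaustive search is hopeless and gradient descent is not.
    n_weights = 3*4 + 4*1           # 16 weights in a 3 x 4 x 1 net
    candidates = 1000 ** n_weights  # 3-decimal-place grid per weight
    print(candidates)               # 10**48 settings to generate and test

    # One weight, one slope: feel the gradient and step downhill.
    def gd_step(w, grad_of_error, lr=0.1):
        return w - lr * grad_of_error(w)

    w = 5.0
    for _ in range(50):
        w = gd_step(w, lambda w: 2*(w - 3))  # error = (w-3)**2, minimum at 3
    print(round(w, 3))                       # approaches 3.0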

Earlier it was implied that the sum of weighted voltages has to exceed a threshold for a neuron to fire. This is called activation. In backprop a common activation function is the sigmoid, or S-curve. Early on, neurologists found that repeated firing of adjacent neurons has no particular impact; that is to say, the neuron temporarily ignores the inputs. Another time a neuron ignores its inputs is while the voltage is still building up. Putting these two together, scientists came up with the sigmoid or S-curve, a smoothed step function whose shape flattens out at low and high values, i.e. it remains nearly parallel to the x-axis. The reason biological neurons ignore a low voltage could be that it is just noise. What about the higher values? It is probably a feature of evolution: the voltage beyond a certain threshold cannot be arbitrarily increased, in consideration of the sensitivity of the organism to fluctuations in the underlying chemical processes.

How does Backprop adjust the weights in the network?

We will simulate backprop on 3 nodes arranged as follows:

Input Node (I) -> Hidden Node (J) -> Output Node (K)

Let us say the input is $x_{in}$ and the desired output is y.

$w_{ij}$ is the weight of the connection between nodes I and J

$w_{jk}$ is the weight of the connection between nodes J and K

The activation function for each node is the sigmoid $$ \sigma(\alpha) = {1 \over 1+e^{-\alpha}} $$ whose derivative is $\sigma(1-\sigma)$.

Writing $s_j = x_{in}w_{ij}$ for the input reaching node J, we can write

$y_j = \sigma(s_j)$, abbreviated below as $\sigma_j$

$Error = 0.5(y - y_j)^2$

To adjust a weight, we take the derivative of Error with respect to that weight:

${\partial(Error)\over\partial w_{ij}} = {\partial(0.5(y-y_j)^2)\over\partial w_{ij}} = (y-y_j){\partial(y-y_j)\over\partial w_{ij}}$

Since the desired output y is a constant that does not depend on $w_{ij}$, ${\partial(y)\over\partial w_{ij}}=0$, so

${\partial(Error)\over\partial w_{ij}} = -(y-y_j){\partial(y_j)\over\partial w_{ij}}$

By the chain rule, ${\partial(y_j)\over\partial w_{ij}} = \sigma_j(1-\sigma_j){\partial(s_j)\over\partial w_{ij}} = \sigma_j(1-\sigma_j)x_{in}$

Upon substitution we get

${\partial(Error)\over\partial w_{ij}} = -(y-y_j)\sigma_j(1-\sigma_j)x_{in}$

Now we want to minimize the error, so gradient descent moves the weight a small step against this gradient, with a learning rate $\eta$:

$w_{ij} \leftarrow w_{ij} + \eta(y-y_j)\sigma_j(1-\sigma_j)x_{in}$

At a minimum of Error versus weight, the tangent is parallel to the weight axis; that is, the gradient $(y-y_j)\sigma_j(1-\sigma_j)x_{in} = 0$ and the updates stop.
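The derivation translates directly into a few lines of Python; the input, target and learning rate below are arbitrary illustrative values:

    # One sigmoid node trained by gradient descent, per the derivation.
    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    x_in, y = 1.0, 0.8      # input and desired output
    w_ij = 0.1              # initial weight
    eta = 0.5               # learning rate

    for _ in range(1000):
        y_j = sigmoid(x_in * w_ij)                   # forward pass
        grad = -(y - y_j) * y_j * (1 - y_j) * x_in   # dError/dw
        w_ij -= eta * grad                           # descend the gradient

    print(round(sigmoid(x_in * w_ij), 3))            # ~0.8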

A general form of the backpropagation algorithm for adjusting weights can be given as follows:

• Feed the training set forward through the net and calculate, for each node, its input (s) and its output (z, after applying the activation function f).
• Calculate the error for each node:
    • For an output node, the error is computed simply as the difference between expected and actual output, multiplied by the derivative of the activation function.
    • For a hidden node, the error is computed as the weighted sum of the errors of the output nodes it feeds, multiplied by the derivative of its activation function (this follows from the chain rule).
• Calculate the change in each weight as the product of the learning rate, the downstream node's error, and the upstream node's activation output.
We have assumed that the error function is quadratic, i.e. 0.5(y-yj)**2, because we want backprop to do gradient descent to adjust the weights. Other methods exist, such as Newton's method and conjugate gradient. It is very likely that backprop will settle on a local optimum; if we don't want to over-fit the net, that should be fine. M. Forouzanfar, H. R. Dajani, V. Z. Groza, M. Bolic & S. Rajan compared the estimation errors of ten different training algorithms belonging to three classes: steepest descent (with variable learning rate; with variable learning rate and momentum; resilient backpropagation), quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno; one-step secant; Levenberg-Marquardt) and conjugate gradient (Fletcher-Reeves update; Polak-Ribière update; Powell-Beale restart; scaled conjugate gradient). Their objective was to estimate blood pressure (BP) from oscillometric measurements: it turns out all the off-the-shelf BP cuffs use oscillometry, and a neural net was tasked with improving the errors. As to the winner, they say it was resilient backpropagation. Go figure!

Some argue that backprop is computationally expensive and slow to converge to the global optimum. Obviously, computing the errors for each node and adjusting the weights could take a very long time, depending on the size of the training set. They prefer to use genetic algorithms (GA's) to specify the new weights with each iteration. It is possible to hierarchically specify parents and children in the net and use cross-over, i.e. swap a neuron's weights with the cross-over results of its parents; this could be done across layers as well. Mutation is also possible, by adjusting a weight by a small percentage, changing the sign of the weight (positive for excitation or negative for inhibition), etc. The goal in all of these is to find an optimal solution that will meet the accuracy requirement. A neural net that is supposed to diagnose cancer may do well to detect cancer with some accuracy rather than rule out cancer where there is one. Physicians call these false positives and false negatives. If the net says the patient has cancer when there is no cancer, that is a false positive. Or when an anti-virus program flags a program as a virus when it isn't, that is a false positive. A false negative is when the net rules out cancer in a patient whom the test results show really has it.
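A hedged sketch of this idea follows; net_error is a hypothetical stand-in for a function that evaluates a weight vector against the training set, and the population size, cross-over and mutation choices are arbitrary:

    # Evolving a net's weights with a GA instead of backprop.
    import random

    def evolve_weights(net_error, n_weights, pop_size=20, generations=100):
        pop = [[random.uniform(-1, 1) for _ in range(n_weights)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=net_error)                  # lower error = fitter
            parents = pop[:pop_size // 2]            # keep the fittest half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_weights)  # single-point cross-over
                child = a[:cut] + b[cut:]
                i = random.randrange(n_weights)       # mutate one weight
                child[i] += random.gauss(0, 0.1)
                children.append(child)
            pop = parents + children
        return min(pop, key=net_error)                # best weight vector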

Brute force or taming the complexity?

Neural nets are shown to be the most preferred algorithms for image and voice recognition. Why so, we may wonder? The simple answer is that these tasks are non-predictive. YouTube videos have a sound track; you can train a neural net to recognize certain words and label the video as U (universal) or R (restricted). Spoken words come from a dictionary of finite size (we hope the user is not making up words), so there is nothing to predict from the inputs. Similarly, you can train a net to classify email as spam or no-spam. Also, neural nets do well with noise: a word spoken over a noisy phone line can be captured by a net if the training set has taken noise into consideration (this rules out aliens trying to communicate with us from outer space).

With all these wonderful features, aren't we using brute force? The answer depends on the complexity of the learning. If memory is an issue, then neural nets make the best use of it. If lazy loading is required, where the net starts learning as the word is being spoken, then faster processing will be required. At the moment both memory and processors are cheaply available.

This leads to the conundrum of why the net can't find for itself the best topology: the number of hidden layers and the number of nodes per layer. In every AI solution there is always what we can call an inductive bias. With GA's it is the alleles; in symbolic learning it is the existence of logic; in Bayes' it is the eponymous rule. We don't seem to have a good handle on inductive bias; "when you can't beat them, join them" seems to be the mantra. In a sense that is inevitable, because inductive bias in humans is both inherited (genetic) and acquired (nurtured). We have all heard the saying: birds of a feather flock together.

Yet another aspect of neural nets is called logistic regression, which in short means mapping real-world variables to inputs. Going back to our diabetic-athlete net, we said we could have used blood glucose levels and heart rate as inputs. But these are continuous variables. A blood glucose level of 100 mg/dL could be considered somewhat high, whereas a reading of 200 mg/dL is definitely on the higher side (ignoring post-prandial and fasting requirements). Add to the mix the measurement units, where the glucose level could just as well be expressed in ounces per gallon. How can we feed these numbers to the net? The answer again seems to be in the form of the sigmoid. The significance of the sigmoid, or S-curve, is that it maps real numbers into a range between 0 and 1, unlike its step brother which simply turns on or off. S-curves are used in many ways outside the realm of neural nets: the expected income of a person over a lifetime, the performance of an athlete with each passing year, etc. can all be expressed as S-curves.
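A sketch of such squashing in Python; the centre (125 mg/dL) and scale (25) are made-up choices, where a real mapping would be fitted to the data:

    # Squash a raw clinical variable into the net's 0-1 input range.
    import math

    def squash(glucose_mg_dl, centre=125.0, scale=25.0):
        return 1.0 / (1.0 + math.exp(-(glucose_mg_dl - centre) / scale))

    print(round(squash(100), 2))  # ~0.27: "somewhat high" maps low
    print(round(squash(200), 2))  # ~0.95: definitely high maps near 1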

Can AI put an end to Guest Workers?

Every year the US invites nearly 100,000 guest workers, from both academic and commercial sectors, to assist American corporations with computer applications. Some of their tasks include writing code and testing software. Such tasks are considered routine by the AI community, which thinks they can be automated. Just as horse-carriages gave way to taxi-cabs, the guest workers would one day dwindle to the most competitive ones with AI smarts. This is not a forecast for the near future, when we will need as many good programmers as we can garner to implement the manifold AI algorithms ranging from GA's to deep learning nets. But when American workers are freed from the routine coding and testing tasks, they can put on their thinking caps and do what they do best: create new algorithms for old problems.

References

http://www.cs.cornell.edu/boom/2004sp/ProjectArch/AppofNeuralNetworkCrystallography/NeuralNetworkCascadeCorrelation.htm describes cascade-correlation networks

http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/ introduces autoencoders

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4 has a detailed example of how weights and errors are calculated in backprop

https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network discusses the effects of different search methods such as Newton, quasi-Newton, conjugate gradient, etc.

https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Biology/index.html gives a history of neural networks research

https://www.wired.com/2016/01/microsoft-neural-net-shows-deep-learning-can-get-way-deeper/ describes deep neural nets at Microsoft
