Friday, May 25, 2018

Block Chain

Have you ever wondered how to send a payment by email? Did you ever question the commission charged by banks and other financial agencies for transferring money? Do you want to vote electronically from your computer? If the answer to all these questions is yes, then you want to know about a block chain. A block chain is a network of computers (aka nodes), each of which maintains records of transactions, called blocks, to enable the transfer of digital currency and assets over the internet. Before we dive into the subject, let us first review the current banking or financial industry where moving money is routinely carried out. Financial institutions use a ledger to keep track of transactions and the current balance of accounts. Let us say Rob, with an initial balance of $100, sends $50 to Pete, who has an initial balance of $200. After the transaction is processed, the ledger will be updated to show Rob's current balance as $50 and Pete's as $250. However, if Rob, with an initial balance of $100, decides to send Pete $100 and Sally $50, the second transaction should be denied, which is ensured by processing transactions serially, i.e. one at a time. Attempting to spend the same money twice is called double spending.
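
As a rough, minimal sketch (in Python, with made-up account names), here is how a ledger that processes transactions serially would accept Rob's first transaction and deny the second, unfunded one:

    # Minimal sketch of a serial ledger that rejects overdrafts (double spending).
    class Ledger:
        def __init__(self, balances):
            self.balances = dict(balances)   # account -> current balance

        def transfer(self, sender, receiver, amount):
            # Process transactions serially: check the sender's balance first.
            if self.balances.get(sender, 0) < amount:
                return False                 # insufficient funds, transaction denied
            self.balances[sender] -= amount
            self.balances[receiver] = self.balances.get(receiver, 0) + amount
            return True

    ledger = Ledger({"Rob": 100, "Pete": 200, "Sally": 0})
    print(ledger.transfer("Rob", "Pete", 100))   # True  -> Rob: 0, Pete: 300
    print(ledger.transfer("Rob", "Sally", 50))   # False -> denied, Rob has nothing left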

The problem with financial institutions

By keeping the ledgers on a central server, financial institutions are vulnerable to attacks by hackers. If the ledger is corrupted, there is no easy way to fix it. Financial institutions also charge a commission on transactions to compensate for their enormous storage and processing needs. On a small transaction like $100, a 2% commission might look normal. But as such transactions happen millions of times, one can imagine the cost incurred by the initiators of the transactions; some estimate it to be about $6 billion a year. Besides, the transactions don't happen instantaneously and typically take several hours to process. Suppose one sends dollars from the USA to India by a bank transfer. The bank in the USA posts the transaction on a payment gateway in the USA. The US gateway contacts the gateway in India, which in turn contacts the bank in India. Then a currency conversion is done from dollars to rupees. If there is a way to pay the end user, the transaction goes through. Given the time zones and delays in transmission, this simple money transfer takes several hours.

centralizedbanking.png

Block chain efficiency

A block chain transfers money or any electronic asset almost in real time. Also, the commission or fee charged on a block chain is much smaller than that of financial institutions. One can add an optional fee to a transaction to incentivize the miners (discussed later). Another important aspect of a block chain is that the ledger is distributed: every node in the block chain has its own copy of it. This does away with the centralized ledger and the monopoly of the financial institutions. The advantage of a distributed ledger is that even if a node or a group of nodes is taken off-line, say by hacking, the rest of the network still has the complete ledger. It is next to impossible for all the nodes in a block chain to go off-line or lose the ledger. Thus a block chain ensures that the list of transactions, the ledger, is available for all to see and update. The key is that the ledger can only be appended to, never truncated or edited to modify an older transaction.

distributed ledger.png

Different types of computing

Centralized: a single server acts as the point of contact for information storage and retrieval. Examples are Facebook, Google, Amazon, etc. While this is easier to implement, the problem is that it won't scale: the number of transactions a single server can handle is limited, and so is its storage.

Distributed: a centralized system can be distributed. The tech companies mentioned above have geographically spread out their centralized systems in a distributed manner, so the server accessed from India is different from the server accessed in the USA. As long as the data on the two is kept in sync in real time, the services remain consistent.

Decentralized: the centralized and distributed models are vulnerable to denial-of-service attacks, whereas a decentralized system, like bitcoin's block chain, is spread over hundreds of thousands of computers and is virtually impossible to take down. In a decentralized system every node acts as a peer, which means every node has the same data as its peer nodes. So when a node is hacked, the data is not lost and can be retrieved from another healthy node.

Cryptocurrency

A block chain is based on the notion of a cryptocurrency, which is a made-up currency just like dollars, rupees and pounds; let's call it e-money. Each participant in a block chain starts with an electronic wallet holding some e-money, which they can purchase with physical currency, for instance. As they transact with each other, the balance in the wallet is recalculated. The bitcoin block chain, which went online in 2009, has bitcoin as its cryptocurrency. There is an upper limit on how many bitcoins can be issued: 21 million at this time (the account holders on the bitcoin network can vote to change this limit). It is called a cryptocurrency because it can only be created by an activity called mining, which is based on encryption technology. Miners are nodes in the block chain that compete to validate transactions by solving complex math puzzles based on cryptography. As an incentive for the validation effort, the first miner to successfully validate a block of transactions receives a reward, say 12.5 bitcoins. The winning miner also gets to add a block to the chain of blocks by finding a nonce (a number that is used only once in the block chain) and a hash, an alpha-numeric identifier that links the new block to the last block in the chain. Calculating the nonce and hash (especially if the first few digits of the hash have to be zeroes, as some block chains require) is a time-consuming process. Besides, the miner may have to walk back through the block chain, though not necessarily from the beginning, to compute, say, Rob's current balance before his transaction to pay Pete can be validated. This too involves computations that can span several minutes. Hence the miners need to be reimbursed for their efforts, and that reimbursement is paid in the cryptocurrency or e-money.
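
To get a feel for the mining puzzle, here is a minimal Python sketch using the standard hashlib module; the block contents, the string format and the difficulty of 4 leading zeroes are all made up for illustration, as real block chains hash binary block headers at a far higher difficulty:

    import hashlib

    def mine(previous_hash, transactions, difficulty=4):
        """Search for a nonce whose SHA-256 hash starts with `difficulty` zeroes."""
        nonce = 0
        while True:
            block = f"{previous_hash}|{transactions}|{nonce}".encode()
            digest = hashlib.sha256(block).hexdigest()
            if digest.startswith("0" * difficulty):
                return nonce, digest        # proof of work found
            nonce += 1

    nonce, block_hash = mine("0000abc", "Rob pays Pete $50")
    print(nonce, block_hash)

The only way to find a qualifying nonce is brute-force search, which is why mining consumes time and hardware, while any node can verify the winner's answer with a single hash.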

blockchainabc.png

Consensus

What happens when someone creates a block with fake transactions and adds it to the block chain? Since copies of the blocks are stored on every node, each node gets to vote on the legitimacy of the block. If a majority votes that a block is fake, it will be removed from the block chain and all the nodes will be updated. So the block chain is self-correcting by means of distributed computing. This way the central ledger of a financial institution, which is vulnerable to hacking, is replaced with a distributed digital ledger that is far harder to tamper with.

Land titles

A typical use case for a block chain is land titles. Let us say the earliest known record of a land title dates to 1810 and the land has changed owners several times. Until now, the records would be kept in physical repositories that are vulnerable to theft, fire and flood. If a land title transfer record from 1900 is lost, there is no way to replace it and it is not possible to validate the seamless transfer of land ownership. Also, someone could introduce a fake title in 2018 and gain possession of the land. Of course, by scanning the physical documents into digital format some of these issues can be mitigated. Still, a centralized store of the digital information is vulnerable to hacking and misuse. A block chain stores the title repository in a distributed computing model and makes this kind of fraud far more difficult.

Smart Contracts

One of the block chain implementations, called Ethereum, uses what are called smart contracts. A smart contract is a piece of computer code that lives on the block chain and executes automatically. Suppose you run a company and want to use a block chain to pay the salaries of your employees. You can write a smart contract to implement the following pseudo code:

If employee is a manager, then pay x% of the earnings plus $5000 
else if employee is a worker, then pay minimum wage plus y% of the earnings 
and so on.
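
Ethereum contracts are normally written in a contract language such as Solidity; purely as a language-neutral illustration, here is the same payroll rule sketched as a small Python function (the x and y percentages, the minimum wage figure and the sample earnings are assumed values, not part of any real contract):

    # Hypothetical payroll rule mirroring the pseudo code above.
    MINIMUM_WAGE = 15 * 160          # assumed monthly minimum wage, for illustration only

    def monthly_pay(role, earnings, x=0.02, y=0.01):
        if role == "manager":
            return x * earnings + 5000
        elif role == "worker":
            return MINIMUM_WAGE + y * earnings
        raise ValueError("unknown role")

    print(monthly_pay("manager", 1_000_000))   # 25000.0
    print(monthly_pay("worker", 1_000_000))    # 12400.0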

Once a smart contract is created it exists forever on the block chain and is immutable. Suppose you want to change the manager's commission from x% to z%; then a new smart contract has to be written. The old contract that is replaced is not deleted but made inactive. This way we have a record of the history of salary payments made by your company. A tax auditor, for example, might be interested in your company's salary structure over the years, and the block chain serves as the proof.

smart contract.png

Why can't we send money by email?

If you send some e-money by email to Pete, you can easily copy and paste the email to multiple recipients. The question is, do you have enough e-money to pay them all? Who is validating it? And how are they doing it? If you can generate e-money as you wish, then there will be inflation, due to the increased money supply, and a decrease in the purchasing power of your e-money. With this you now understand why a block chain with mining is needed. In a block chain the generation of e-money is controlled by the mining activity, which serves as an incentive for proof of work. This is like the real world, where you get paid by your employer in return for work done, except that the proof of work in a block chain is produced by algorithms running on computer hardware.

Public and Private Keys

Each electronic wallet is made of a public and a private key, based on a branch of math called cryptography. A public key is what you share with others when you want to receive e-money. It is an alpha-numeric string like 0x23yu7Ia, and it acts like a name and address that is unique to you. Associated with each public key is a private key that is like your password. When you want to send money, you use your private key to sign the transaction and broadcast the signed transaction to the block chain. This way only you can send money from your electronic wallet and no one else can.
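
A minimal sketch of this sign-and-verify flow, assuming the third-party Python ecdsa package is installed (the transaction text is made up): the private key produces the signature, and anyone holding the public key can check it.

    # Sketch using the third-party `ecdsa` package (pip install ecdsa).
    import ecdsa

    private_key = ecdsa.SigningKey.generate(curve=ecdsa.SECP256k1)  # keep secret
    public_key = private_key.get_verifying_key()                    # share freely

    transaction = b"Rob pays Pete 50"
    signature = private_key.sign(transaction)          # only the wallet owner can produce this

    # Any node can verify the signature with the sender's public key.
    print(public_key.verify(signature, transaction))   # True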

Different types of block chains

A block chain essentially is distributed storage and computing spanning the internet. Just as the web has revolutionized the way we store electronic files, the block chain could do the same for electronic payments, land titles, voting in an election, etc. One can envision multiple block chains for different purposes. Some call this Web 3.0, where different computers come together to participate in a digital economy based on block chains, whereas Web 2.0 is the conglomeration of web services provided by Amazon EC2, Microsoft Azure, etc.

One can also envision public, private and hybrid block chains where the participants can vary based on the type of block chain. In a public block chain anyone can join and begin transacting. Such is the case with Ethereum. A private block chain can be set up by a financial institution for its members only. A hybrid block chain will have a mix of public and private participants whose main interest is to transact digital assets electronically at the speed of the internet.

The memory crunch

If a new node is to be added to a block chain, it has to download all of the blocks that were ever created on that block chain, which is a time-consuming process. Also, the blocks are usually kept in physical memory, not in an off-line database, to make them available for transactions. This can be very expensive in terms of both money and computation. If you are a miner who is not interested in initiating transactions but in earning e-money by validating transactions, then it is worthwhile provided you have an efficient algorithm and powerful computer hardware to search for hashes in the blocks. So it is not unusual to find groups of miners pooling their resources to invest in the expensive computer hardware and other infrastructure needed for mining.

Smart contracts, being code, have to face the same scrutiny as any other program: they may fail to terminate or may consume enormous resources to carry on their computation. In many cases it is easier to run them on a subset of nodes than on every node in the entire network. For instance, a smart contract to handle voting for a candidate is better run on a single node where the tally of votes cast for each candidate can be kept in memory. Each voter can be assigned a unique hash and allowed to vote only once. What makes the voting process tamper proof, ultimately, is that the code in the smart contract is shared within the block chain and can be verified by all the interested parties. Contrast this with the electronic voting machines (EVMs) used in India, whose software is not open for auditing.

The Haves and Have Nots

Proof of work is the central piece of the block chain that is going to decide the winners and losers. But it is so hard and takes so much computing power that people are already suggesting other kinds of proof to substitute for it, such as proof of stake, proof of burn, etc. In proof of stake, a wealthy miner or a consortium of miners gets to add a new block based on the amount of cryptocurrency they hold and are willing to put at stake, which functions like a reputation. For instance, in a real-world scenario, if two banks are willing to finance a project at the same interest rate and you have to pick just one bank, which one do you pick? That's where the balance sheet comes into play: the bank with the best balance sheet would win, assuming a balance sheet reflects a bank's reputation. In proof of burn, the miners actually spend (burn) some of their cryptocurrency for the right to add a block. In the real world, this is like engineers building models of their designs at their own expense to attract investors. All of these alternatives to proof of work are going to make the block chain a highly contested topic in the future.

Online References

https://bitsonblocks.net/2016/02/01/a-gentle-introduction-to-smart-contracts/

https://www.youtube.com/watch?v=5Tr13l0O1Ws is a video presentation of proof of work coding and alternatives to proof of work

Wednesday, May 16, 2018

Cluster Analysis

As children we learn to identify similarity between objects based on color, size, etc. For example, dogs and cats are grouped as animals; Elmo and Big Bird are grouped as animated characters; and so on. What makes this possible? It is probably innate, and evolution made it possible in order to handle the diversity of creation. Hindus divided the ancient professions into 4 clusters: priests, rulers, traders and workers. Later on they changed it to mean grouping by birth, which led to the disarray we find now in India. However, the clustering or grouping of objects remains an important aspect of machine learners. Sometimes this is called classification, where a class is identified to represent a group of entities with common characteristics. In finance, the classes are called segments and the classification process is called segmentation. In biology it is the taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. On the web, search engines group pages using clustering algorithms that compute the similarity between the contents of pages with respect to the search string.

Where do we derive classes from?

In ancient India, Hindus identified 4 professions, called varnas, that everyone was occupied with. It was posited that the varnas were ordained by the divine; there was no reason to think otherwise. Over time the varnas transformed into the castes that are in existence to this day. Also, after attaining freedom from the British, Indian leaders created states based on the predominant language spoken in various regions, which can be called linguistic clusters. Such clustering makes it possible for the government to allocate funds equitably and manage a billion people.

Almost all ancient cultures gazed at the starry sky and could find clusters of stars. For example, four open clusters have been known from the earliest times: the Pleiades and Hyades in the constellation Taurus, Praesepe (the Beehive) in the constellation Cancer, and Coma Berenices. The appearance of the Coma Berenices cluster to the naked eye led to the naming of its constellation for the hair of Berenice, wife of Ptolemy Euergetes of Egypt (3rd century BCE).

Clusters are a common feature of all modern economies. According to Michael Porter of Harvard Business School, a geographic economic cluster is a group of firms, suppliers, support services, specialized infrastructure, producers of related products, and specialized institutions (e.g., training programs and business associations) that arise in particular fields. Viewed globally, economic clusters evolve into supply chains where parts and services are outsourced to various countries (usually the lowest bidder) and shipped for final assembly locally. In the US, clusters have traditionally evolved around waves of immigrants, though there are no studies to support this claim other than the observation that more recent immigrants tended to opt for IT careers in Silicon Valley, compared with earlier immigrants who established finance and manufacturing centers on the East Coast.

In economics, classes of consumers making similar buying decisions are highly sought after. For example, a franchise might want to target different advertising material to the rich, the upper middle class, the middle class, etc. The classes here are based on income and purchasing power. Similarly, national economies are classified based on GDP. Sudhamathy and Jothi Venkateswaran proposed a Fuzzy Temporal Clustering Approach that clusters web site visitors and web site pages based on the frequency of visits and time spent. They claim their study can provide intelligent recommendations for E-Commerce web sites that focus on specific product recommendations and behavioral targeting. There are two methods for achieving this: temporal web log clustering (i.e. analyzing the logs of e-commerce sites) and fuzzy clustering, which involves analyzing change in cluster compositions and change in cluster memberships. E-commerce sites would also like to know which ads should be directed to which users based on their previous buying decisions. For instance, a high roller who bought a top-end computer is highly likely to buy a computer accessory like a printer or a routing device. The difficult thing, though, is to predict whether the high roller is buying gifts for a friend or for oneself, as companies would like to target their ads for gift buyers differently from personal shoppers.

Another natural grouping can be found in postal codes, which are based on the proximity of a building to a center. Philip Jennings described cluster analysis for assessing the risk factors that go into insurance premium calculation, wherein the clusters were defined based on Zip Codes and data from Geographic Information Systems (GIS). The GIS data allows a finer analysis, as every building has a GIS code.

Finally, microarray technology has been widely applied in biological and clinical studies for the simultaneous monitoring of gene expression in thousands of genes. For a complete discussion of microarrays see https://www.genome.gov/10000533/dna-microarray-technology/. The idea behind a microarray is to take DNA from a patient possibly carrying a mutation, along with control DNA (without the mutation in the gene of interest), and pass the two over a microchip that carries synthetically made normal DNA and mutated DNA. The control DNA, as expected, will bind to the normal DNA on the chip. If the patient's DNA has the mutation, it will bind to the portions of the chip carrying the mutated DNA. The interesting thing is that this process can be automated, and a robot can be programmed to repeatedly run microarray experiments over samples of DNA.

But how does one know if a gene mutation can cause a particular disease? Gene clustering analysis is found useful for discovering groups of correlated genes potentially co-regulated or associated to the disease or conditions under investigation.

The unending quest to differentiate oneself and group participation

Life can be called a quest to differentiate oneself from others while at the same time belonging to a group. The various ethnic societies in the USA are groups of people who follow the same religion, speak the same language (which could be other than English), or emigrated from the same region, etc. They tend to derive their initial identity from a group while pining to differentiate themselves as unique Americans, in line with the legendary American exceptionalism.

Religious minorities within a larger milieu of democracy either get absorbed into the majority religion or tend to stay on the fringe. Pollsters track these trends in every election to identify potential independent voters and gauge the appeal of their political platform to them.

The suburban clusters around a city center tend to maintain their individual identities, and even have their own mayors and city governments, while at the same time being identified with the city center for all practical purposes (e.g. sharing an airport).

Cluster Analysis

Studying clusters of entities is necessary in fields like economics, statistics and biology. So computer scientists study algorithms to create clusters in data that are homogeneous within, based on a similarity metric, and necessarily heterogeneous with respect to each other. The foremost algorithm for cluster analysis is called K-means, where K is the number of clusters the data should be fitted to.

The algorithm starts by selecting K items from the observations to serve as the initial centroids for the clusters; the selection can be random or based on a domain-specific method. Note that if there are N observations, there are $N \choose K$ = ${{N!} \over {(N-K)!K!}}$ possible choices, where N >= K. Next, the algorithm computes the Euclidean distance from each of the remaining observations to each centroid and assigns every observation to its closest centroid (which could be the cluster's mean). This step is called "cluster assignment". After the assignment, the algorithm recalculates the mean value of each cluster; the term "centroid update" is used to designate this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster, and all the objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e. until convergence is achieved), that is, until the clusters formed in the current iteration are the same as those obtained in the previous iteration. In practice, if there is less than a 1% change, the algorithm is terminated. The convergence achieved could be a local minimum, especially when the initial K observations are randomly selected; there is no guarantee that a global minimum will be found.

K-means algorithm can be summarized as follows:

    Select K points as initial centroids.
    repeat
        Form K clusters by assigning each point to its closest centroid.
        Recompute the centroid of each cluster.
    until
        Centroids do not change.
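
A minimal NumPy sketch of the loop above, assuming random initial centroids, Euclidean distance, and termination when the assignments stop changing (function and variable names are mine, and empty clusters are not handled):

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]   # step 1: initial centroids
        labels = None
        for _ in range(max_iters):
            # Cluster assignment: distance from every point to every centroid.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = distances.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                          # assignments stopped changing
            labels = new_labels
            # Centroid update: mean of the points assigned to each cluster.
            centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        return labels, centroids

    # Two well-separated blobs of points as a toy data set.
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centroids = kmeans(X, K=2)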

The Euclidean distance between two points (x1, y1) and (x2, y2) can be given as

$\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$

The centroid is typically the mean of the observations in a cluster. The objective is to minimize the distance of each observation from its centroid, which is sometimes called the prototype of the cluster.

The algorithm terminates when all of the observations have been accounted for and the centroids don't change significantly with each iteration.

The problem of identifying K, that is, the number of clusters needed to account for all observations, is handled in a couple of ways. First, we can try different K values, say from 1 to 40, and compute the sum of squared errors (SSE) for each of them. When we plot K vs. SSE, if the plot looks like a hockey stick, i.e. increasing K no longer reduces the SSE appreciably, then we choose the K at the elbow. Other methods include the silhouette method and the gap statistic.
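
A sketch of the elbow method, assuming scikit-learn is available and reusing the data matrix X from the K-means sketch above; KMeans exposes the SSE as its inertia_ attribute:

    # Elbow method sketch using scikit-learn's KMeans (inertia_ is the SSE).
    from sklearn.cluster import KMeans

    sse = []
    for k in range(1, 41):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(model.inertia_)
    # Plot k against sse and pick K at the "elbow", where adding more clusters
    # no longer reduces the SSE appreciably.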

One could use probability instead of SSE. For instance, a person at a company can be a student intern and a full-time employee at the same time. In that case, the person's membership in each cluster could be expressed as a probability, where all the probabilities add up to 1. Fuzzy sets and other membership-based methods work similarly.

Online References

http://www.econ.upf.edu/~michael/stanford/maeb7.pdf for hierarchical clustering

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf clustering using k-means

https://uc-r.github.io/kmeans_clustering

http://www.hbs.edu/faculty/Publication%20Files/Clusters_and_Economic_Policy_White_Paper_8e844243-aa23-449d-a7c1-5ef76c74236f.pdf for clustering in economics

http://www.enggjournals.com/ijet/docs/IJET12-04-03-045.pdf fuzzy approach to e-commerce clustering

https://www.casact.org/pubs/dpp/dpp08/08dpp34.pdf zip code based clustering

https://www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/survey.pdf

https://www.genome.gov/10000533/dna-microarray-technology/

Friday, May 11, 2018

Reinforcement Learning

Learners based on neural nets, Bayes, logical inference and genetic algorithms have one thing in common: their learning is supervised. This is like holding the hand of a child all the way to adulthood. In the real world, the child is initially supervised with feedback until the child learns to maneuver itself based on feedback from the environment. For example, while learning to ride a bike the child might learn that no matter how it turns the handlebars left or right, when the bike is tilted at an angle, a fall is inevitable. This could be based on the number of times the child fell while tilted at an angle, causing pain. Or it could be the release of endorphins and dopamine in the brain as the pleasure of riding without falling. Such unfettered learning is the subject matter of reinforcement learning, where the learner tries to maximize a function of the feedback, called the reinforcement or reward, received from the environment without supervision. For long this has been the holy grail of AI. We will see whether it is achievable within the limited means of memory and computational speed.

RL is better explained with a robot that is left in a location for the first time and programmed to map the territory. The robot evaluates its current position and takes the step with the best evaluation. For example, let us say the robot can step left, right, forward and backward. The goal is to reach a charging station for an energy refill, whose GPS location has been made available to the robot. As the robot takes a step, it takes its GPS location and compares it with that of the charging station to calculate the distance; the objective is to minimize this distance. There could be hurdles on the way that the robot is unaware of, so the robot may have to step back at times. With reinforcement learning, the robot is guaranteed to reach the charging station, after some trial and error, if there is a path without hurdles. The interesting thing is that the robot, after attaining the goal, has learnt a complete path from the starting point to the charging station; the intermediate paths that it tried and backtracked from also contributed to the learning. In other words, the robot received negative reinforcement for some locations, resulting in backtracking, and positive reinforcement for others, leading to the realization of the goal. Do kids learn this way? If left to their own devices, this is as close as it gets, according to researchers studying child psychology.
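
A toy sketch of this idea using Q-learning (one of the algorithms mentioned in the conclusion): a robot in a six-cell corridor learns, by trial and error with a small negative reinforcement per step and a positive one at the charger, to walk straight to the charging station. The corridor, rewards and learning parameters below are all made up for illustration.

    import random

    N_STATES, GOAL = 6, 5                    # corridor of 6 cells, charger at the far end
    ACTIONS = [-1, +1]                       # step left, step right
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    for episode in range(200):
        s = 0
        while s != GOAL:
            # Explore occasionally, otherwise exploit the best known action.
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else max(ACTIONS, key=lambda a: Q[(s, a)])
            s_next = min(max(s + a, 0), N_STATES - 1)
            r = 10 if s_next == GOAL else -1          # negative reinforcement per step
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
            s = s_next

    # After learning, the greedy action in every cell points toward the charger (+1).
    print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])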

Biological Basis

While neural nets are based on the neuronal structure of the brain and genetic algorithms are inspired by evolution, RL is inspired by behavior, when behavior is viewed as a modification of a state based on the reward received from the environment. We know about the Pavlovian dog that responded to a stimulus. Behavioral modification in response to reward goes one step further: it is in human nature, especially among children, to modify behavior based on the reward or the lack of it.

Exploration vs. Exploitation

Exploration refers to the acquisition of knowledge, such as a robot trying to map a territory by wandering around and getting lost at times. Exploitation, on the other hand, is the application of previously acquired knowledge in a new situation, with some modification. Neither approach is optimal on its own, and we have to find a proper balance between them to get the maximum reward. This is said to be the exploration vs. exploitation dilemma of reinforcement learning.

The math behind reinforcement learning

Much of the math behind reinforcement learning is based on the Bellman equation, which goes like this:

$V^{*}(x_t) = r(x_t) + \gamma V^{*}(x_{t+1})$

where $V^*$ represents the ideal value function at state $x$, $t$ and $t+1$ are time steps, $\gamma$ is the discount factor in the range $[0,1]$ that causes immediate reinforcement to be weighted more heavily than future reinforcement, and $r(x_t)$ is the reinforcement from the environment. The difference between the actual value and the ideal value is represented as an error $e$ that can be shown to be

$e(x_t) = \gamma e(x_{t+1})$

Note that the error decays exponentially with time and is 0 at the goal state, by definition, when learning is considered complete. This is much simpler than what one encounters with neural nets. But one can use a neural net to compute an approximation $V(x, w)$ of $V^{*}(x)$, with the weight update

$\Delta w(t) = -\alpha \left[ \max_{u} \left( r(x_t, u) + \gamma V(x_{t+1}, w_t) \right) - V(x_t, w_t) \right] {{\partial {V(x_t, w_t)}} \over {\partial w_t}}$

where $\alpha$ is the learning rate.
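
As a concrete (if simplified) illustration of this kind of update, here is a sketch of one weight update for a linear approximator V(x, w) = w · φ(x), whose gradient with respect to w is just φ(x); the features, reward and learning rate are made up, and sign conventions vary across presentations:

    import numpy as np

    def td_update(w, phi_t, phi_next, r, alpha=0.1, gamma=0.9):
        """One weight update for a linear value function V(x, w) = w . phi(x)."""
        v_t, v_next = w @ phi_t, w @ phi_next
        td_error = (r + gamma * v_next) - v_t     # the bracketed term in the formula
        return w + alpha * td_error * phi_t       # gradient of V with respect to w is phi(x_t)

    w = np.zeros(3)
    w = td_update(w, phi_t=np.array([1.0, 0.0, 0.0]),
                  phi_next=np.array([0.0, 1.0, 0.0]), r=1.0)
    print(w)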

Terminology in Reinforcement Learning

Policy - a mapping from states to actions. For the robot above, the actions are move left, right, forward or backward.

Reinforcement - a scalar variable that communicates the change in the environment to the reinforcement learning system. For example, if an RL system is a controller for a missile, the reinforcement signal might be the distance between the missile and the target (In which case, the RL system would learn to minimize reinforcement).

Markov decision process - An MDP consists of a set of states X; a set of start states S that is a subset of X; a set of actions A; a reinforcement function R where R(x,a) is the expected immediate reinforcement for taking action a in state x; and an action model P where P(x'|x,a) gives the probability that executing action a in state x will lead to state x'. Note: It is a requirement that the choice of action be dependent solely on the current state observation x. If knowledge of prior actions or states affects the current choice of action then the decision process is not Markov.

Deterministic - In the context of states, there is a one-to-one mapping from state/action pairs to successor states. In other words, with probability one, the transition from state x after performing action a will always result in state x'.

Non-deterministic - In the context of states, there exists a probability distribution function P(x'|x,a) that gives the probability that executing action a in state x will lead to state x'.

State - The condition of a physical system as specified by a set of appropriate variables.

Unbiased estimate - The expected (mean) error in the estimate is zero.
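
The MDP ingredients listed above can be collected into a small data structure; this is only an illustrative Python sketch with names of my choosing, and the Markov requirement shows up in the fact that R and P depend only on the current state and action, never on the history:

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class MDP:
        states: List[str]                              # X
        start_states: List[str]                        # S, a subset of X
        actions: List[str]                             # A
        reinforcement: Callable[[str, str], float]     # R(x, a), expected immediate reinforcement
        transition: Dict[Tuple[str, str, str], float]  # P(x' | x, a)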

Applications

RL has been used for packet routing in dynamic networks, improving elevator performance, channel allocation in wireless networks, job-shop scheduling, missile guidance, etc. It is not clear if RL is applicable to higher cognitive functions like planning and diagnosis, where a reward and feedback from the environment are not readily available.

Conclusion

  • let us say our robot simulates the path to the charging station even before taking a step; in this case the robot needs to evaluate each and every state and compute the value function, which is so memory intensive and time consuming that the robot may never make a move.
  • assuming the robot successfully met the goal, how do we determine which move made it possible? To illustrate, let us say the path involves forks where the robot has to choose between right and left; how do we determine which of these forks was crucial? This is the credit assignment problem.
  • a number of alternatives to exhaustive search have been proposed, such as Q-learning and Advantage learning, which approximate the value functions based on simplifying assumptions.
  • it seems that reinforcement learning with policy gradients (i.e. using a neural net to compute the state values) has been successfully applied to ATARI games (https://en.wikipedia.org/wiki/Atari_Games), where the value of a state can be evaluated with respect to a win or loss; so the policy gradient leading up to a win or loss can be used in the gradient computation for the given set of inputs, with some supervised learning.

One can sing the following ditty:

Hinton set the baseline for neural networks
LeCun set the baseline for convolutional neural networks
ImageNet set the baseline for visual recognition competitions
DeepMind set the baseline for deep RL algorithms
OpenAI set the baseline for deep RL environments

Online References

http://old.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf for a basic RL tutorial

http://karpathy.github.io/2016/05/31/rl/ for ATARI GAMES

Saturday, May 5, 2018


Analogical Reasoning

An analogy is commonly understood as finding a similarity between two arguments or groups of arguments. Besides similarity, according to Aristotle, an example can be used to draw an analogy. For instance, we can say that the sun, like other stars, has a life cycle: we are saying the sun is a star and, like other stars, it will go through various life-cycle phases. A more straightforward analogy is: the inverse square law of gravitation is analogous to the inverse square law of the magnetic field. Here we are referring to the syntax of the equation that defines gravitational force vs. magnetic or electro-static force. It goes without saying that the target of an analogy is not a complete replica of the source. Suppose the source of an analogy is "The theory of thermodynamics predicts increased entropy in the universe" and its target is "The theory of evolutionary programming is based on the entropy of the search space." Here the source and target are as far apart as they can be until an analogy is created.

Material and Causal Similarities

Suppose we have these comparisons between the forces of gravitation and electro-magnetism

   Gravitation                                  Electro-magnetism
1. Applies to large objects                     Applies to charged particles
2. Predicts force as an inverse square law      Predicts force as an inverse square law
3. Uses mass as a measure                       Uses charge as a measure

1 & 3 imply a similarity between material properties, and 2 implies a possible causal similarity. However, neither kind of similarity is necessary or sufficient. For instance, in mathematics an analogy can be created between a square and a cube without any causal similarity.

Let us say the source has a set of attributes A1*, A2*, ..., An*. The target has a set of attributes B1*, B2*, ..., Bn* that can be mapped to their counterparts (i.e. Bn* -> An*). Then we can say a positive analogy between the source and target has been drawn. A negative analogy is when the A* fail to hold in the target and the B* fail to hold in the source. A single negative analogy, if it is strong enough, can rule out any number of positive analogies. This is the case with electrical conduction and fluid flow, which share many similarities, like the correspondence between Ohm's law and Poiseuille's law, but when it comes to conservation they don't match, resulting in the negation of the analogy.

Where is analogy used?

Analogy can be used to generate a hypothesis a priori for further testing. If we knew under what conditions the Dow is bearish, we can look for similar conditions in a different stock market like the Nifty. Suppose a presidential speech made the Dow bearish; we can offer a hypothesis like "When the premier of a country makes a negative speech about a group of companies traded in that country, then that country's stock market will go bearish". This is a convenient way to generate hypotheses for testing from the universe of possible hypotheses.

Plausibility vs. Probability

Plausibility implies that a strong analogy exists between a source and target based on a cursory examination. For example, you can explain the earnings of an average athlete using a bell curve to signify that there will be diminishing returns as age progresses. The plausibility is derived from the extant knowledge about bell curves, which show a statistical distribution of one variable with respect to another (income and age in this case). The distinction between plausibility and probability is that the former need not be conditional on a priori probabilities but has its own distinguishing criterion, which can be called default reasoning.

Hume's 'no free lunch'

Hume (1711–1776) pointed out that 'even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience'. This applies to induction, which takes analogical reasoning into the unknown territory of over-generalization. Suppose we study the spending behavior of consumers in a given economy and come up, prima facie, with an analogy to the rate of depression among clinical cases. The induced version of it is: the greater the rate of clinical depression, the greater (or lesser) the spending of consumers in that economy. The induced rule is not tenable until we have additional knowledge about consumers, clinical depression, and the relationship between the two. Even if, in the absence of a valid theory, we corroborate the two categories of people statistically, there is no way to tell that they share other properties (e.g. socio-economic class, height or weight).

Design Patterns and Analogy

Design patterns are a favorite theme of computer programmers across continents who stumble into the same lines of code, which can be explained as a structural analogy. When we write code to feed computers, regardless of the programming language used, we tend to follow a familiar pattern of logic. Suppose someone wrote a Java program to connect to a database, retrieve some data, process it and close the connection. If they were to write the same in another programming language, say C++, the logic is going to be the same. The important thing to note here is the release of the connection, if it is drawn from a pool of connections, without which the program will run out of connections after a few tries. In analogous terms, if we have a limited resource that we would like to share among multiple users, we should ensure that the resource is freed before someone else can use it. One can illustrate this with a hotel example where tenants possess a key to enter a room and must return the key to make the room available for another tenant's use.
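
A minimal Python sketch of the acquire-use-always-release pattern behind this analogy, with a made-up connection pool standing in for the hotel's room keys:

    from contextlib import contextmanager

    class Pool:
        """Hypothetical pool with a fixed number of connections (like hotel room keys)."""
        def __init__(self, size):
            self.free = list(range(size))
        def acquire(self):
            if not self.free:
                raise RuntimeError("pool exhausted")
            return self.free.pop()
        def release(self, conn):
            self.free.append(conn)

    @contextmanager
    def connection(pool):
        conn = pool.acquire()
        try:
            yield conn            # use the connection
        finally:
            pool.release(conn)    # always return the "key", even if an error occurred

    pool = Pool(size=2)
    with connection(pool) as conn:
        print("querying on connection", conn)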

Analogical Reasoning and Induction

Normally, induction is when you infer a general proposition from many particular observations. One can observe monkeys across the continents and induce the proposition that monkeys have prehensile tails. The drawback here is that even if there is one monkey species whose tail does not match our observations, we have to draw an exception. This is the problem we run into when inducing rules in AI: for every induced rule we have to write down where the exceptions are. With analogical reasoning we bring two disparate domains into focus and can avoid over-generalization. This is the theory behind case-based reasoning. When we call customer service about our broken dish-washer, chances are the company already knows about the problem. Based on our account of it, they can search their database and try to find a case that is analogous to the current one. In practice they may not find an exact match, in which case they tweak the matched case to explain all of the symptoms we are experiencing with our dish-washer. Suppose we notice that our dish-washer is making too much noise, and the best our search in the database yielded was a case about a dish-washer that was leaking. Putting two and two together, one can say the rubber lining of the door, which also acts as a sound barrier, is not tight enough. While this is not as grandiose as Priestley's analogy of electro-static force with gravitation to derive the inverse square law, it is nevertheless a common use case of analogical reasoning. When computers draw analogies they do so based on the structure or syntax of the similarities. What would be more powerful is if our case-based learner could retrieve a case about noise-proofing in windows and apply it to the dish-washer by using the attribute the two hold in common: a hermetic seal.

Analogy as precursor to induction

Consider the following syllogism

All mammals deliver children in situ
Human is a mammal
So humans deliver children in situ

Where does one get the premise "All mammals deliver children in situ"? Many suggest that it is through induction: we see elephants deliver baby elephants, horses deliver baby horses, and so on, give them a new class called "mammals", and place all mammals in a theoretical framework. Some argue instead that it is in the definition of mammals that we capture the essence. This has the advantage that an animal that delivers live young but does not feed milk to its newborns need not be counted as a mammal. This can be explained more easily with the syllogism "All birds fly; a crow is a bird; so crows fly". What happens when we find a bird like the kiwi that doesn't fly? It leads to two possibilities: either our inductive method needs rework or the definition of a bird is wrong. It is easier to fix the latter, as induction requires collecting a lot of evidence and then coming up with a theory that explains all of it.

Scientists frequently use induction, which is based on analogies, to create new theories. Then, to answer a specific question, they use deduction. Contrast this with case-based reasoning, where there is no inductive step but cases are tweaked to explain a set of observations. This relieves legal scholars and computer scientists alike, when they use case-based reasoning, from the time-consuming evidence gathering of the inductive step.

Analogies as Explanations

Aesop's fables are examples of allegories and metaphors used to draw analogies between humans and other domains. The Panchatantra tales are Indian allegories that serve a similar purpose. Used this way, analogies provide elucidation of profound concepts that are otherwise hard to explain. How can we explain the meaning of creativity to a child? Tell them a tale of a fox that talked its way out of a lion's den and they will understand, albeit within the scope of the tale. It is held by some scientists that every new theory should have some analogies to its predecessors. Viewed this way, scientific pursuit is a large continuum with one analogy piled over another. If Bohr's model of the atom is inspired by Kepler's description of planetary motion, then it makes sense. But where does the analogy end? Obviously the atomic nucleus doesn't behave like the sun, which is considered to be not only the center of the solar system but also in motion around the center of our galaxy, and so on.

Analogies as reminders and synergies

Often we hear people say, "this reminds me of ...". What is the trigger for this? Among many viable options, analogical reasoning is the most common. One can tell a long story of how Hindu children are initiated into wearing a sacred thread to a person who summarizes it as: "that reminds me of my grandson's bar mitzvah!" What is the link here? They are exchanging stories of coming of age. Different cultures use different means to mark coming of age. In some Amazonian tribes the coming of age involves getting bitten by poisonous ants; a Sateré-Mawé boy, for example, will require approximately 20 tocandira ant "inoculations" to ensure the highest level of protection against mosquitoes.

Analogies in Law

Lawyers use precedents to settle cases even before they see the daylight of a courtroom, the reason being that all of the available evidence is going to reinforce the verdict, or there is no evidence to begin with. When we have some evidence to the effect that a burglar broke a glass window and entered our house, there is likely more than one precedent where a burglary happened in a similar way at a different house. The only issue then is the cost of the material damage and its recovery. In cases where there is no hard evidence but circumstantial evidence is available, a precedent can be a powerful way to argue the case. Take, for instance, someone getting sick after eating a store-bought green vegetable. Even without hard evidence of the purchase at a particular store or of the presence of some bacteria, like E. coli, to cause the illness, the case can be settled out of court using a precedent, assuming there is no nation-wide epidemic of E. coli.

Machine Learners for Analogical Reasoning

Sowa and Majumdar describe a machine learner for analogical reasoning that came up with the following similarities between cats and cars:

Analogy of Cat to Car
Cat          Car
head         hood
eye          headlight
cornea       glass plate
mouth        fuel cap
stomach      fuel tank
bowel        combustion chamber
anus         exhaust pipe
skeleton     chassis
heart        engine
paw          wheel
fur          paint

Each concept, i.e. cat and car in this case, is represented within a conceptual graph. Represented this way, the concept cat belongs to a conceptual graph beginning with animal, to which a dog also belongs. This conceptual graph in turn can be represented as a sub-graph of living beings, and so on. They then use three methods to draw analogies:

  • match by labels: if cat and dog were to be compared, then it is straightforward, as they share many labels, viz. number of eyes, tail, etc. (see the sketch after this list)
  • matching subgraphs: the sub-graph of living beings can be plants and animals which can then be compared
  • matching transformations: involves relating sub-graphs of one graph to sub-graphs of the other. For example, the sub-graphs of animals can be compared to sub-graphs of automobiles
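
As a toy sketch of the first method, one can score two concepts by the overlap of their attribute labels; the labels below are made up, and a real system like Sowa and Majumdar's works over full conceptual graphs rather than flat label sets:

    def label_similarity(a, b):
        """Jaccard overlap of the attribute labels of two concepts."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    cat = {"head", "eyes", "tail", "legs", "fur"}
    dog = {"head", "eyes", "tail", "legs", "fur"}
    car = {"hood", "headlights", "wheels", "paint", "engine"}

    print(label_similarity(cat, dog))   # 1.0 - many shared labels, easy match
    print(label_similarity(cat, car))   # 0.0 - needs subgraph or transformation matching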

While the learner is based on the work of the American logician Charles Peirce, it depends on a collection of concepts called the WordNet knowledge base that contains over 100,000 nodes. The time it takes to search among these nodes depends on the algorithm used: while their algorithm of N log N complexity takes a few seconds, an algorithm of N^3 complexity, presumably making an exhaustive search, could take more than 30 years.

Peirce invented a geometric-topological notation called existential graphs. Like Venn diagrams for sets, existential graphs serve as an underlying implementation of logic; presumably this has a biological basis. Existential graphs allow one to chart logical reasoning in its finest detail, making visible every single step in the reasoning process (as opposed to notations aimed at quick, results-oriented heuristics). An existential graph for the assertion "Tom believes that Mary wants to marry a sailor" is shown in the reference below to illustrate the due diligence involved in representing the concepts.

Representing concepts in a graph notation this way is a time-exacting process. It could perhaps be accomplished better by a neural net; still, one has to come up with a training set comprising a list of concepts and their interconnections. But once that is done, an efficient algorithm can traverse the graph, with the additional advantage of being able to visually represent the inference process. On the other hand, some think analogy in a computer is based on structural representations of the concepts that are predisposed to analogies. Viewed this way, almost all of the analogical representations in computers share this defect: the representation is the key to a successful analogiser. So it is not clear if the computer is any better than a child using Lego sets to create complex objects. Just as one can build any complex structure from Lego bricks, the syntax of the computer representations can capture any similarity between the concepts.

Like Data to Machine Learner as History to Democracy

With the rise of social media there is a plethora of data to be mined or understood. A machine learner is perfectly suited for this task, where the data can be used, say, in training neural nets. If enough people share their data it is theoretically possible to cure rare diseases and explain social behaviors alike. What the machine learners, in effect, need is an open society where everything about the life of a person can be known. Thus each person, with their likes, dislikes, favorites, preferences, academic background, credit history and skills, serves as a datum for the learner. Democracies have come about because the history of failed nations is well known, where each failed nation provides a datum for creating a more robust democracy. However, unlike history, which is painstakingly reconstructed with archaeological studies and logic, the data harvested from social media is almost too trivial to be of use beyond generating sales of products. Yet it raises concerns over the privacy of individuals participating on social platforms.

Online References

http://www.jfsowa.com/pubs/analog.htm

https://plato.stanford.edu/entries/reasoning-analogy/

https://wiki.eecs.yorku.ca/course_archive/2014-15/W/6339/_media/conceptual_graph_examples.pdf