Wednesday, May 16, 2018

Cluster Analysis

As children we have learned to identify similarity between objects based on color, size, etc. For example, dogs and cats are grouped as animals; elmo and big-bird are grouped as animated characters; and so on. What makes this possible? It is probably innate and evolution made it possible to handle the diversity in the creation. Hindus had divided the ancient professions into 4 clusters: priests, rulers, traders and workers. Later on they changed it to mean by birth that led to the disarray we find now in India. However, the clustering or grouping of objects remains an important aspect of machine learners. Sometimes this is called classification where a class is identified to represent a group of entities with common characteristics. In finance, the classes are called segments and the classification process is called segmentation. In biology it is taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. In the web, the search engines group pages using clustering algorithms that compute the similarity between the contents of pages with respect to the search string.

Where do we derive classes from?

In ancient India, hindus identified 4 professions everyone was occupied with which were called varnas. It was posited that varnas were ordained by the divine. There was no reason to think otherwise. Over time the varnas transformed into castes that are in existence to this day. Also after attaining freedom from the British, Indian leaders created states based on the predominant language spoken in various regions that can be called linguistic clusters. So clustering makes it possible for the government to allocate funds equitably and manage a billion people.

Almost all ancient cultures gazed at the starry sky and could find clusters that are called constellations. For example, four open clusters have been known from earliest times: the Pleiades and Hyades in the constellation Taurus, Praesepe (the Beehive) in the constellation Cancer, and Coma Berenices. The appearance of the Coma Berenices cluster to the naked eye led to the naming of its constellation for the hair of Berenice, wife of Ptolemy Euergetes of Egypt (3rd century bce).

Clusters are a common feature of all modern economies. According to Michael Porter of Harvard Business School a geographic economic cluster is a group of firms, suppliers, support services, specialized infrastructure, producers of related products, and specialized institutions (e.g., training programs and business associations) that arise in particular fields. Viewed globally the economic clusters evolve into supply chains where parts and services are outsourced to various countries (usually the lowest bidder) and shipped for final assembly locally. In US clusters have traditionally evolved around waves of immigrants. But there are no studies to support this claim other than the observation that more recent immigrants tended to opt for IT careers in the silicon valley when compared to the earlier immigrants who established finance and manufacturing centers in the east coast.

In economics, the classes of consumers making similar buying decisions is highly sought after. For example, a franchise might want to target different advertising material for the rich, upper middle class, middle class, etc. The classes here are based on the income and purchasing power. Similarly national economies are classified based on GDP. Sudhamathy and Jothi Venkateswaran proposed Fuzzy Temporal Clustering Approach that performs clustering of the web site visitors and the web site pages based on the frequency of visits and time spent. They claim their study can provide intelligent recommendations for the E-Commerce web sites that focus on specific product recommendations and behavioral targeting. There are 2 methods to achieving this: temporal web logs clustering (i.e. analyze the logs of e-commerce sites) and fuzzy clustering which involves analyzing change in cluster compositions and change in cluster memberships. Also e-commerce sites would like to know which ads should be directed to which users based on their previous buying decisions. For instance, a high roller who bought a top-end computer is highly likely to buy a computer accessory like a printer or a routing device. The difficult thing though is to predict whether the high roller is buying gifts for a friend or for oneself as companies would like to target their ads for gift buyers differently from personal shoppers.

Another natural grouping can be found in the postal codes that are based on the proximity of a building to a center. Philip Jennings described cluster analysis for assessing the risk factors that go into insurance premium calculation where in the clusters were defined based on Zip Codes and data from Geographic Information Systems (GIS). The GIS data provides more finer analysis as every building has a GIS code.

Finally, microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of gene expression in thousands of genes. For a complete discussion of microarray you can see https://www.genome.gov/10000533/dna-microarray-technology/. The idea behind microarray is to mix DNA of a patient possibly having a mutation with control DNA (the one without mutation for the gene of interest) and pass the two over a microchip that consists of normal DNA and mutated DNA that are made synthetically. The control DNA as expected will bind to the normal DNA on the chip. If the patient's DNA has mutation it will bind to the portions of the chip having the mutated DNA. The interesting thing is this process can be automated and a robot can be programmed to repeatedly run the microarray experiments over samples of DNA.

But how does one know if a gene mutation can cause a particular disease? Gene clustering analysis is found useful for discovering groups of correlated genes potentially co-regulated or associated to the disease or conditions under investigation.

The unending quest to differentiate oneself and group participation

Life can be called a quest to differentiate oneself from others while at the same time belonging to a group. The various ethnic societies in USA are groups of people who follow the same religion or speak the same language that could be other than English or immigrants from the same region, etc. They tend to derive their initial identity from a group while pining to differentiate themselves as unique Americans in line with the legendary American exceptionalism.

The groups of religious minorities within a larger milieu of democracy either get absorbed into the majority religion or tend to stay in the fringe. Pollsters identify these trends in every election to identify potential independent voters and the appeal of their political platform to them.

The suburban clusters to a city center tend to maintain their individual identities and even have their own mayors and city governments while at the same time tending to be identified with the city center for all practical purposes (e.g. sharing an airport).

Cluster Analysis

Studying clusters of entities is necessary in occupations like economics, statistics and biology. So computer scientists study algorithms to create clusters in data that are homogeneous within based on a similarity metric. The clusters are necessarily heterogeneous with respect to each other. The foremost algorithm for cluster analysis is called K-means where K is the number of clusters that data should be fitted to.

The algorithm starts by selecting K items from the observations to serve as the initial centroids for the clusters which can be random or based on domain specific method. Note that if there are M observations, there are $N \choose K$ = ${{n!} \over {(n-k)!k!}}$ choices where N >= K. Next, the algorithm computes the Euclidean distance for each of the remaining observations and assigns to their closest centroids (which could be the cluster's mean). This step is called “cluster assignment”. After the assignment, the algorithm recalculates the mean value of each cluster. The term cluster “centroid update” is used to designate this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means. The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e. until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration. In practice though, if there is less than 1% change, the algorithm is terminated. The convergence achieved could be a local minimum, especially when the initial K observations are randomly selected and there is no guarantee that a global minimum could be found.

K-means algorithm can be summarized as follows:

    Select K points as initial centroids.
    repeat
     Form K clusters by assigning each point to its closest centroid.

     Recompute the centroid of each cluster.
   until
     Centroids do not change.

The Euclidean distance between 2 points (x1, y1) and (x2,y2) can given as

$\sqrt{ {(x1-x2)^2} + {(y1-y2)^2}}$

The centroid could be the average or a mean of the distances between the objects in a cluster. The objective is to minimize the distance from the centroid which some times is called the prototype to represent the cluster.

The algorithm terminates when all of the observations have been accounted for and the centroids don't change significantly with each iteration.

The problem of identifying the size of K—that is the number of clusters needed to account for all observations-- is handled in a couple of ways. First we can try different K values, say from 1 to 40, and compute the sum of squares error (SSE) for each of them. When we plot the K vs. SME, if the plot looks like a hockey stick or an increase in K no more reduces SSE, then we choose K at the elbow. Other methods include silhouette method and gap statistic.

One could use probability instead of SME. For instance, a person at a company can be a student intern and a full-time employee at the same time. In that case, a student's membership in the cluster could be based on probability where all the probabilities add up to be 1. Similarly for fuzzy sets and other membership-based methods.

Online References

http://www.econ.upf.edu/~michael/stanford/maeb7.pdf for hierarchical clustering

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf clustering using k-means

https://uc-r.github.io/kmeans_clustering

http://www.hbs.edu/faculty/Publication%20Files/Clusters_and_Economic_Policy_White_Paper_8e844243-aa23-449d-a7c1-5ef76c74236f.pdf for clustering in economics

http://www.enggjournals.com/ijet/docs/IJET12-04-03-045.pdf fuzzy approach to e-commerce clustering

https://www.casact.org/pubs/dpp/dpp08/08dpp34.pdf zip code based clustering

https://www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/survey.pdf

https://www.genome.gov/10000533/dna-microarray-technology/

No comments:

Post a Comment