In the previous article, we described the Bayesian framework for linear regression and how we can use latent variables to reduce model complexity. In this post, we will explain how latent variables can also be used to frame a clustering problem, namely with the Gaussian Mixture Model (GMM for short), which allows us to perform soft, probabilistic clustering. This model is classically trained by an optimization procedure named Expectation-Maximization (EM for short), which we will review thoroughly.
At the end of this article, we will also see why we do not use traditional optimization methods. This article contains a few mathematical notations and derivations; we are not trying to scare anybody, but we believe that once the intuition is given, it is important to dive into the math to understand things for real. This post was inspired by the excellent Coursera course Bayesian Methods for Machine Learning. If you are into machine learning, I definitely recommend it.

Gaussian Mixture Model
This model is a soft probabilistic clustering model that describes the membership of points to a set of clusters using a mixture of Gaussian densities. It is a soft classification, in contrast to a hard one, because it assigns probabilities of belonging to a specific class instead of a definitive choice: in essence, each observation belongs to every class, but with different probabilities. We will take the famous Iris classification problem as an example, with Iris flowers divided among 3 classes.
For each of them, we have the sepal length and width, the petal length and width, and the class. The Gaussian Mixture model tries to describe the data as if it originated from a mixture of Gaussian distributions. So first, if we only take one dimension, say the petal width, and try to fit 3 different Gaussians, the algorithm finds the mixture that is most likely to represent the data generation process, made up of three normal distributions: the setosa petal widths are much more concentrated, with a small mean, while the other two classes are comparably more spread out but with different locations. Note that the GMM is a pretty flexible model: it can be shown that, for a large enough number of mixture components and an appropriate choice of the involved parameters, one can approximate arbitrarily closely any continuous pdf, with the extra computational cost that this entails.
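As an illustration, the one-dimensional fit described above can be sketched with scikit-learn's GaussianMixture (a sketch, not the article's original code; the exact fitted means and spreads depend on the run):

```python
# Fit a 3-component Gaussian mixture to the iris petal widths (1-D),
# mirroring the single-dimension example described above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data[:, 3].reshape(-1, 1)  # petal width as a single feature

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(np.sort(gmm.means_.ravel()))  # one mean per fitted Gaussian
print(gmm.weights_)                 # mixing coefficients, summing to 1
```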
So how does the algorithm find the best set of parameters to describe the mixture? In the world of machine learning, we can distinguish two main areas: supervised and unsupervised learning. The main difference between the two lies in the nature of the data as well as the approaches used to deal with it. Clustering is an unsupervised learning problem in which we intend to find clusters of points in our dataset that share some common characteristics. Our job is to find sets of points that appear close together.
In this case, we can clearly identify two clusters of points, which we will colour blue and red, respectively. A popular clustering algorithm is K-means, which follows an iterative approach to update the parameters of each cluster. More specifically, it computes the means, or centroids, of each cluster, and then calculates their distance to each of the data points. The latter are then labeled as part of the cluster identified by their closest centroid. This process is repeated until some convergence criterion is met, for example when we see no further changes in the cluster assignments.
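The iterative loop just described can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation; it assumes k distinct starting points and non-empty clusters throughout):

```python
# Minimal k-means: assign each point to its nearest centroid, recompute
# the centroids, and stop when the assignments no longer change.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # distance of every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # convergence criterion: assignments stopped changing
        labels = new_labels
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# usage on two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(6, 0.3, (50, 2))])
labels, centroids = kmeans(X, 2)
```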
One important characteristic of K-means is that it is a hard clustering method, which means that it associates each point with one and only one cluster. A limitation of this approach is that there is no uncertainty measure or probability telling us how much a data point is associated with a specific cluster. So what about using soft clustering instead of hard clustering?
Each Gaussian k in the mixture is described by a mean μ_k, a covariance Σ_k, and a mixing coefficient π_k. Each Gaussian explains the data contained in one of the three clusters. The mixing coefficients are themselves probabilities and must satisfy π_k ≥ 0 and π_1 + π_2 + … + π_K = 1. Now how do we determine the optimal values for these parameters?
To achieve this we must ensure that each Gaussian fits the data points belonging to each cluster.
Gaussian Mixture Models Explained
This is exactly what maximum likelihood does. In general, the Gaussian density function is given by

N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

where x represents a data point and D is the number of dimensions of each data point. For later purposes, we will also find it useful to take the log of this density:

ln N(x | μ, Σ) = −(D/2) ln(2π) − (1/2) ln|Σ| − (1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)

If we differentiate this expression with respect to the mean and covariance and then equate it to zero, we can find the optimal values for these parameters, and the solutions correspond to the Maximum Likelihood Estimates (MLE) for this setting.
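For a single Gaussian, these derivatives have closed-form solutions: the MLE of the mean is the sample mean, and the MLE of the covariance is the (1/N-normalized) sample covariance. A quick numerical check on synthetic data:

```python
# Verify numerically that the sample mean and (1/N) sample covariance
# are the maximum likelihood estimates for a single Gaussian.
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)

N = len(X)
mu_mle = X.mean(axis=0)          # from d(log-likelihood)/d(mu) = 0
diff = X - mu_mle
sigma_mle = diff.T @ diff / N    # from d(log-likelihood)/d(Sigma) = 0

print(mu_mle)      # close to [1, -2]
print(sigma_mle)   # close to the true covariance
```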
I am learning about Gaussian mixture models (GMMs), but I am confused as to why anyone should ever use this algorithm. What is the metric to say that one data point is closer to another with a GMM? How can I make use of the final probability distribution that a GMM produces?
What can I do with it? I'll borrow the notation from [1], which describes GMMs quite nicely in my opinion. That is precisely how a GMM can be used to cluster your data. K-means can encounter problems when the choice of K is not well suited for the data or when the shapes of the subpopulations differ. The scikit-learn documentation contains an interesting illustration of such cases. The choice of the shape of the GMM's covariance matrices affects what shapes the components can take on; here again, the scikit-learn documentation provides an illustration.
More on this can be found in [1] Hastie, T., Tibshirani, R., & Friedman, J., The Elements of Statistical Learning, New York: Springer Series in Statistics, and in Bishop, C., Pattern Recognition and Machine Learning. K-means assumes roughly spherical, similarly sized clusters, and it may fail if these conditions are violated (although it may still work if the clusters are very widely separated). GMMs can fit clusters with a greater variety of shapes and sizes. GMMs also give a probabilistic assignment of points to clusters, which lets us quantify uncertainty: for example, if a point is near the 'border' between two clusters, it is often better to know that it has near-equal membership probabilities for these clusters, rather than blindly assigning it to the nearest one.
The probabilistic formulation of GMMs lets us incorporate prior knowledge, using Bayesian methods. For example, we might already know something about the shapes or locations of the clusters, or how many points they contain. The probabilistic formulation also gives us a way to handle missing data: we can still cluster a data point even if we haven't observed its value along some dimensions.
Gaussian Mixture Model
And we can infer what those missing values might have been. GMMs give a probability that each point belongs to each cluster (see below). These probabilities can be converted into 'hard assignments' using a decision rule; for example, the simplest choice is to assign each point to its most likely cluster. The expression you wrote is the distribution for the observed data. However, a GMM can be thought of as a latent variable model, in which each data point is associated with a latent variable that indicates which cluster it belongs to.
When fitting a GMM, we learn a distribution over these latent variables. This gives a probability that each data point is a member of each cluster.
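With scikit-learn, for instance, those per-point membership probabilities are exposed directly, and the hard assignment is just their argmax (a sketch on synthetic data):

```python
# Responsibilities (soft memberships) from a fitted GMM, and the argmax
# decision rule that converts them into hard cluster assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

resp = gmm.predict_proba(X)   # shape (n_points, n_components); rows sum to 1
hard = resp.argmax(axis=1)    # hard assignment: most likely cluster
```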
Why use a Gaussian mixture model?

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs. Formally, a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population.
However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.
Some ways of implementing mixture models involve steps that attribute postulated sub-population identities to individual observations, or weights towards such sub-populations, in which case these can be regarded as types of unsupervised learning or clustering procedures. However, not all inference procedures involve such steps. Mixture models should not be confused with models for compositional data, i.e., data whose components are constrained to sum to a constant value. However, compositional models can be thought of as mixture models where members of the population are sampled at random.
Conversely, mixture models can be thought of as compositional models where the total size of the population has been normalized to 1. A typical finite-dimensional mixture model is a hierarchical model consisting of mixture weights, component parameters, and component distributions. In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over them. In such a case, the weights are typically viewed as a K-dimensional random vector drawn from a Dirichlet distribution (the conjugate prior of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.
This characterization uses F and H to describe arbitrary distributions over observations and parameters, respectively. Typically, H will be the conjugate prior of F. The two most common choices of F are Gaussian (aka "normal") for real-valued observations and categorical for discrete observations.
Other distributions are also commonly used for the mixture components. In a typical non-Bayesian Gaussian mixture model, each latent assignment z_i is drawn from a categorical distribution with weights π_1, …, π_K, and the observation x_i is then drawn from the Gaussian with mean μ_{z_i} and covariance Σ_{z_i}. A Bayesian version additionally places prior distributions on the weights and on the component parameters.
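That generative story can be sketched directly: draw a component index from the mixing weights, then draw the observation from that component's Gaussian (the parameter values below are purely illustrative):

```python
# Sample from a 1-D Gaussian mixture: the latent z picks the component,
# then x is drawn from that component's normal distribution.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])   # mixing weights pi_k, summing to 1
means = np.array([-4.0, 0.0, 5.0])    # component means mu_k
stds = np.array([1.0, 0.5, 2.0])      # component standard deviations

z = rng.choice(3, size=10000, p=weights)  # latent component indicators
x = rng.normal(means[z], stds[z])         # observations given z
```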
A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e., one modelling a vector with N random variables), one may model a vector of parameters using a Gaussian mixture model prior distribution on the vector of estimates. Note that this formulation yields a closed-form solution to the complete posterior distribution.
Such distributions are useful for assuming patch-wise shapes of images and clusters, for example: one Gaussian distribution of the set is fit to each patch (usually of size 8x8 pixels) in the image.

Gaussian mixture models (GMMs) are often used for data clustering. You can use GMMs to perform either hard clustering or soft clustering on query data.
To perform hard clustering, the GMM assigns query data points to the multivariate normal components that maximize the component posterior probability, given the data. That is, given a fitted GMM, cluster assigns query data to the component yielding the highest posterior probability.
Hard clustering assigns a data point to exactly one cluster. For an example showing how to fit a GMM to data, cluster using the fitted model, and estimate component posterior probabilities, see Cluster Gaussian Mixture Data Using Hard Clustering. Additionally, you can use a GMM to perform a more flexible clustering on data, referred to as soft or fuzzy clustering.
Soft clustering methods assign a score to a data point for each cluster. The value of the score indicates the association strength of the data point to the cluster. As opposed to hard clustering methods, soft clustering methods are flexible because they can assign a data point to more than one cluster.
When you perform GMM clustering, the score is the posterior probability. GMM clustering can accommodate clusters that have different sizes and correlation structures within them. Therefore, in certain applications, GMM clustering can be more appropriate than methods such as k-means clustering. Like many clustering methods, GMM clustering requires you to specify the number of clusters before fitting the model. The number of clusters specifies the number of components in the GMM.
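One common way to pick that number (shown here with scikit-learn rather than the MATLAB functions this section discusses) is to fit GMMs with different component counts and compare an information criterion such as BIC, where lower is better:

```python
# Choose the number of GMM components by minimizing BIC, on synthetic
# data drawn from two well-separated Gaussian clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
best_k = min(bic, key=bic.get)
print(best_k)  # the lowest-BIC component count
```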
Consider the component covariance structure. You can specify diagonal or full covariance matrices, and whether all components have the same covariance matrix.
Specify initial conditions. As in the k-means clustering algorithm, EM is sensitive to initial conditions and might converge to a local optimum.
Implement regularization. For example, if you have more predictors than data points, then you can regularize for estimation stability. This example explores the effects of specifying different options for covariance structure and initial conditions when you perform GMM clustering.
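In scikit-learn terms (a sketch; the surrounding text discusses the equivalent MATLAB options), these three choices map onto constructor parameters of GaussianMixture:

```python
# The three option families above as GaussianMixture parameters:
# covariance structure, initial conditions, and regularization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

gmm = GaussianMixture(
    n_components=2,
    covariance_type="diag",  # also 'full', 'tied', or 'spherical'
    n_init=5,                # run EM from 5 initializations, keep the best
    reg_covar=1e-4,          # small value added to covariance diagonals
    random_state=0,
).fit(X)

print(gmm.covariances_.shape)  # (2, 2): per-component diagonal entries
```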
Load Fisher's iris data set. Consider clustering the sepal measurements, and visualize the data in 2-D. The number of components k in a GMM determines the number of subpopulations, or clusters. In this figure, it is difficult to determine if two, three, or perhaps more Gaussian components are appropriate.
A GMM increases in complexity as k increases. Specify Different Covariance Structure Options. Each Gaussian component has a covariance matrix.
Geometrically, the covariance structure determines the shape of a confidence ellipsoid drawn over a cluster. You can specify whether the covariance matrices for all components are diagonal or full, and whether all components have the same covariance matrix. Each combination of specifications determines the shape and orientation of the ellipsoids.
For reproducibility, set the random seed. Create a 2-D grid covering the plane composed of extremes of the measurements.
I am new to using GMMs, and I was not able to find appropriate help online. Could anyone please point me to the right resources on how to decide whether a GMM fits my problem? In my opinion, you can use a GMM when you know that the data points are mixtures of Gaussian distributions: basically, clusters with different means and standard deviations.
There's a nice diagram of GMM classification on the scikit-learn website. An approach is to find the clusters using soft clustering methods and then see if they are Gaussian.
If they are, then you can apply a GMM model to represent the whole dataset. GMMs are usually a good place to start if your goal is to (1) cluster observations, (2) specify a generative model, or (3) estimate densities. In fact, for clustering, GMMs are a superset of k-means.

When to use a Gaussian mixture model?

It's used when the data is a mixture of more than one normal distribution. We don't know the underlying process clearly, and hence we are trying to model it using machine learning methods.

The k-means clustering model explored in the previous section is simple and relatively easy to understand, but its simplicity leads to practical challenges in its application.
In particular, the non-probabilistic nature of k-means and its use of simple distance-from-cluster-center to assign cluster membership lead to poor performance in many real-world situations. In this section we will take a look at Gaussian mixture models (GMMs), which can be viewed as an extension of the ideas behind k-means, but can also be a powerful tool for estimation beyond simple clustering.
Let's take a look at some of the weaknesses of k-means and think about how we might improve the cluster model. As we saw in the previous section, given simple, well-separated data, k-means finds suitable clustering results. For example, if we have simple blobs of data, the k-means algorithm can quickly label those clusters in a way that closely matches what we might do by eye. From an intuitive standpoint, we might expect that the clustering assignment for some points is more certain than for others: for example, there appears to be a very slight overlap between the two middle clusters, such that we might not have complete confidence in the cluster assignment of points between them.
Unfortunately, the k-means model has no intrinsic measure of probability or uncertainty of cluster assignments (although it may be possible to use a bootstrap approach to estimate this uncertainty). For this, we must think about generalizing the model. One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster.
This radius acts as a hard cutoff for cluster assignment within the training set: any point outside this circle is not considered a member of the cluster. We can visualize this cluster model with a function that draws these circles over the data. An important observation for k-means is that these cluster models must be circular: k-means has no built-in way of accounting for oblong or elliptical clusters.
So, for example, if we take the same data and transform it, the cluster assignments end up becoming muddled. By eye, we recognize that these transformed clusters are non-circular, and thus circular clusters would be a poor fit. Nevertheless, k-means is not flexible enough to account for this, and tries to force-fit the data into four circular clusters.
This results in a mixing of cluster assignments where the resulting circles overlap: see especially the bottom-right of this plot.
In Depth: Gaussian Mixture Models
One might imagine addressing this particular situation by preprocessing the data with PCA (see In Depth: Principal Component Analysis), but in practice there is no guarantee that such a global operation will circularize the individual clusters.
These two disadvantages of k-means (its lack of flexibility in cluster shape and its lack of probabilistic cluster assignment) mean that for many datasets, especially low-dimensional datasets, it may not perform as well as you might hope. You might imagine addressing these weaknesses by generalizing the k-means model: for example, you could measure uncertainty in cluster assignment by comparing the distances of each point to all cluster centers, rather than focusing on just the closest.
You might also imagine allowing the cluster boundaries to be ellipses rather than circles, so as to account for non-circular clusters. It turns out these are two essential components of a different type of clustering model, Gaussian mixture models.
A Gaussian mixture model GMM attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset.
In the simplest case, GMMs can be used for finding clusters in the same manner as k-means. We can visualize this uncertainty by, for example, making the size of each point proportional to the certainty of its prediction; looking at the following figure, we can see that it is precisely the points at the boundaries between clusters that reflect this uncertainty of cluster assignment.
Under the hood, a Gaussian mixture model is very similar to k-means: it uses an expectation–maximization approach which, qualitatively, repeats two steps until convergence: an E-step that finds, for each point, weights encoding the probability of membership in each cluster, and an M-step that updates the location, normalization, and shape of each cluster based on all the points, making use of those weights. The result of this is that each cluster is associated not with a hard-edged sphere, but with a smooth Gaussian model. Just as in the k-means expectation–maximization approach, this algorithm can sometimes miss the globally optimal solution, and thus in practice multiple random initializations are used.
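A bare-bones version of that loop, for a two-component mixture in one dimension (an illustration, not the book's code):

```python
# Minimal EM for a two-component 1-D Gaussian mixture:
# the E-step computes responsibilities, the M-step re-estimates parameters.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])

def normal_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pi = np.array([0.5, 0.5])      # initial mixing weights
mu = np.array([-1.0, 1.0])     # initial means
sigma = np.array([1.0, 1.0])   # initial standard deviations

for _ in range(50):
    # E-step: responsibility r[i, k] proportional to pi_k * N(x_i | mu_k, sigma_k)
    r = pi * normal_pdf(x[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and spreads from the responsibilities
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
```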
Let's create a function that will help us visualize the locations and shapes of the GMM clusters by drawing ellipses based on the GMM output. With this in place, we can take a look at what the four-component GMM gives us for our initial data. Similarly, we can use the GMM approach to fit our stretched dataset; allowing for a full covariance, the model will fit even very oblong, stretched-out clusters.
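That stretched-data fit can be sketched as follows (synthetic data standing in for the book's plots):

```python
# Fit a full-covariance GMM to sheared (elliptical) clusters of the kind
# that confuse circular k-means, and check the recovered assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-3, 0], 0.5, (150, 2)),
               rng.normal([3, 0], 0.5, (150, 2))])
A = np.array([[3.0, 1.0], [0.0, 0.3]])   # linear stretch/shear transform
Xs = X @ A.T                             # the "stretched" dataset

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(Xs)
labels = gmm.predict(Xs)
```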
This makes clear that GMM addresses the two main practical issues with k-means encountered before. The covariance_type hyperparameter controls the degrees of freedom in the shape of each cluster; it is essential to set this carefully for any given problem. With covariance_type="spherical", the resulting clustering will have similar characteristics to that of k-means, though it is not entirely equivalent.
We can see a visual representation of these three choices for a single cluster in the following figure. Though GMM is often categorized as a clustering algorithm, fundamentally it is an algorithm for density estimation. That is to say, the result of a GMM fit to some data is technically not a clustering model, but a generative probabilistic model describing the distribution of the data.
If we try to fit this with a two-component GMM viewed as a clustering model, the results are not particularly useful. But if we instead use many more components and ignore the cluster labels, we find a fit that is much closer to the input data.