One of the most important tasks in visual intelligence is to be able to identify and classify objects in an image. One method by which this can be achieved is by using a maximum likelihood estimator to classify the individual pixels of an image. The process of doing this can be broken down into the following steps:
- Acquiring the data
- Creating a ground truth
- Defining the feature space
- Creating a model
- Evaluating the classifications
I'm going to give a little more detail on each of these and then apply them to an example.
Remote sensing is the usual method used to acquire data. There are two broad types of sensor that can be used: active sensors, which add energy to the environment, and passive sensors, which absorb energy already present in the environment without adding any additional energy. One or more sensors of different types can be combined to make up a dataset.
In the example, the dataset was compiled of data gathered through both types of sensor:
- Active sensing: light detection and ranging (LiDAR)
  - First echo (fe) and last echo (le)
- Passive sensing: camera sensors
  - Colour (RGB) sensor
  - Near infrared (NIR) sensor
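Once the sensors are co-registered, each channel can be treated as a 2-D raster over the same pixel grid. As a minimal sketch (the array names and random placeholder data are illustrative; in practice each raster would be loaded from the LiDAR and camera files):

```python
import numpy as np

# Hypothetical co-registered rasters, all on the same H x W pixel grid.
H, W = 4, 5
fe  = np.random.rand(H, W)   # LiDAR first echo
le  = np.random.rand(H, W)   # LiDAR last echo
r   = np.random.rand(H, W)   # red channel
g   = np.random.rand(H, W)   # green channel
b   = np.random.rand(H, W)   # blue channel
nir = np.random.rand(H, W)   # near infrared

# Stack into one (H, W, 6) array: a six-dimensional feature vector per pixel.
features = np.dstack([fe, le, r, g, b, nir])
print(features.shape)  # (4, 5, 6)
```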
The RGB image looks like this:
For this image, the task is to classify the buildings, vegetation, cars and ground into distinct classes.
After acquiring the data, a ground truth can be made. This is useful both for selecting training samples and for validating the output of your classification algorithm. The ground truth is a "hand-tagged" ideal view of the classification output (another algorithm that can already classify the image accurately could be used in place of human effort).
For the example case, the ground truth is:
Here the buildings are dark blue, the vegetation is light blue, the cars are yellow and the ground is red. These colours will also be used for the output of the maximum likelihood classifier.
Defining the feature space
Because the data consists of multiple images rather than a single image, there is a large amount of data for each pixel. The feature space defines which features are mapped to each pixel and how.
As the data was gathered using several different sensors, each feature can be represented as a multi-dimensional vector, with each dimension holding the data from a specific sensor. In the example, the feature vector for a pixel is x = (fe, le, R, G, B, NIR).
A number of samples from the feature space must be collected for each class to be classified. The number of samples needs to be large enough to be representative of the feature space for that class; here, that means collecting at least twice as many samples as there are feature dimensions. As our feature space has six dimensions, a minimum of 12 samples per class should be collected. The more samples used, the more accurate the classifier is likely to be, so a larger number is recommended.
The samples could be hand-selected from the image or, if there is a labelled ground truth, randomly selected from it. In the example, 50 samples per class are selected at random using the following method:

```
class_elements = all elements of the ground truth in the current class
random_samples = take 50 random items from class_elements
for sample in random_samples:
    sample_feature_vector = [fe(sample), le(sample), R(sample),
                             G(sample), B(sample), NIR(sample)]
```
Creating a model
A statistical model can be generated using Bayes' decision rule. Each class has a class-conditional probability density function (pdf) that is used to generate the probability of a given feature vector belonging to that class.
In the example we're following, each class is assumed to have a Gaussian distribution as its pdf, so the data can be fitted to a Gaussian model. A separate Gaussian model is defined for each class by the multivariate Gaussian density:

p(x | ωi) = (2π)^(−d/2) |Σi|^(−1/2) exp(−½ (x − μi)ᵀ Σi⁻¹ (x − μi))

where:
- x stands for the feature vector currently being classified
- ωi stands for class i
- Σi stands for the covariance matrix estimated from the samples of class i
- μi stands for the mean vector of the samples of class i
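Fitting such a model amounts to estimating μi and Σi from each class's samples and then evaluating the density above. A minimal NumPy sketch (the function names and the 2-D toy data are illustrative; the article's feature vectors are six-dimensional):

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate the mean vector and covariance matrix from class samples.

    samples: (n_samples, n_features) array.
    """
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density p(x | class)."""
    d = mu.size
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Toy example: 50 samples drawn around a known mean.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(50, 2))
mu, sigma = fit_gaussian(samples)
print(gaussian_pdf(np.array([1.0, 2.0]), mu, sigma))
```

The density is largest near the class mean and falls off with Mahalanobis distance, which is what produces the per-class "heat maps" described below.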
In the example, the Gaussian models generate the following "heat maps" of probability for each class:
The next step is to classify the pixels based on their feature vectors. The model for each class is evaluated at each pixel's feature vector, and the class whose model gives the largest value is accepted as the classification.
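This argmax step can be sketched as follows, assuming a `features` array of shape (H, W, d) and a list of per-class (mean, covariance) pairs fitted as above (all names are illustrative). Log densities are used here purely for numerical convenience; the argmax is unchanged:

```python
import numpy as np

def classify(features, models):
    """Assign each pixel the class whose Gaussian log-density is largest."""
    H, W, d = features.shape
    pixels = features.reshape(-1, d)                  # (H*W, d)
    scores = np.empty((pixels.shape[0], len(models)))
    for i, (mu, sigma) in enumerate(models):
        diff = pixels - mu
        inv = np.linalg.inv(sigma)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(sigma)))
        # Mahalanobis term per pixel: diff_i @ inv @ diff_i
        scores[:, i] = log_norm - 0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return scores.argmax(axis=1).reshape(H, W)        # class index per pixel

# Toy usage: two well-separated classes on a 2x2 "image".
models = [(np.array([0.0, 0.0]), np.eye(2)),
          (np.array([5.0, 5.0]), np.eye(2))]
img = np.zeros((2, 2, 2))
img[1, 1] = [5.0, 5.0]
labels = classify(img, models)
print(labels)
```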
In the example, the 50 samples taken for each class are used to generate the Gaussian models, and maximum likelihood estimation is then used to classify each pixel. The final output is:
Evaluating the classifications
The final step is to evaluate the classification accuracy. Here you compare the classified image to the ground truth to find out how well the classification algorithm performed.
One way to evaluate the performance is with a confusion matrix: a matrix that counts the correct and incorrect classifications for each class. The leading diagonal holds the correct classifications, and the total accuracy can be calculated as the sum of the leading diagonal divided by the sum of all elements of the matrix.
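A minimal sketch of building the matrix and computing that accuracy (the labels here are toy data, not the example's results):

```python
import numpy as np

def confusion_matrix(truth, predicted, n_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(truth.ravel(), predicted.ravel()):
        cm[t, p] += 1
    return cm

truth     = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 0, 1, 2, 2, 2])   # one class-1 pixel misclassified
cm = confusion_matrix(truth, predicted, 3)

# Total accuracy: leading diagonal over the sum of all elements.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)  # 5 correct out of 6
```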
The confusion matrix for the example:
As you can see, a maximum likelihood estimator is a very simple method for classifying the features of an image, yet the achieved accuracy is high. If required, this accuracy could be improved further by applying expert knowledge to the problem.