With the classification ingredients prepared, namely the descriptors to be used, we can move on to building the classification tools. First, we had to decide on the nature of the data and on whether to distinguish between the types of gestures to be recognized. The gestures handled in my Master’s Thesis application were only those that are not time-related, namely those characterized solely by the pose and not by how it changes over time. However, I have constructed a classification system that can recognize both types of gestures. The basic assumption concerns the relationship between dynamic and static gestures: static gestures are considered the building blocks of a dynamic gesture, i.e. a specific sequence of different static gestures can form a unique dynamic one. This approximation allows me to classify the two types of gestures independently and then use the resulting classification scores to produce a unified result through a further classification mechanism. In other words, I propose a cascaded system that tackles the classification problem in several stages. The main building blocks are linear-kernel Support Vector Machines and Random Decision Forests, chosen to keep classification time to a minimum without sacrificing much accuracy.

There are two ways to approach the classification problem of two distinct classification entities: either provide a dual classification output, returning a separate verdict for each entity, or output a single decision by nonlinearly joining the entities' spaces and hoping for the best. A hybrid of the two is to decide on a single output based on a more elaborate score threshold, so that the result satisfies, in one way or another, the outcomes of both approaches. This hybrid version was not presented in my Master’s Thesis, as I believe it to be too complicated: it involves fuzzy logic membership functions and rule-based analysis. For this reason I will focus only on the two basic approaches, and I may write another post in the future presenting the hybrid one.

For the rest of the post, I will explain the system with the help of figures.

Dual Output (In Sync system)

Let’s first look at the dual-output system:

Here we define three classifiers. $Cl_{stat}$ is an RDF of 29 decision trees and $Cl_{dyn}$ is an array of linear SVMs; the third, CDBIMM, is described in the next section. Each classifier receives a subset of the selected descriptors. 3DXYPCA was found to be more than enough for static gesture recognition, while GHOG and ZHOF proved to be a better pair as input to the $Cl_{dyn}$ classifier.

CDBIMM

The CDBIMM classifier is where the originality of the proposed system lies. Its name stands for “Combined Dynamic Bayesian Inference Mixture Model”. By testing the trained $Cl_{stat}$ classifier on the training data of the dynamic gestures, we construct a mapping of probabilities $P(p|a)$, where $p$ is the identified static gesture and $a$ is the corresponding ground-truth dynamic gesture. In essence, we measure how much of each static gesture is contained in a dynamic one, i.e. we identify the static building blocks of the dynamic gestures. We call this mapping the Coherence Matrix (CM). Here we make the simplifying, and strictly speaking incorrect, assumption that the mixing of static gestures is not time-dependent, which later hurts the classification of dynamic gestures. A better scheme would be to construct a second-level Coherence Matrix holding the probabilities $P(p(t),p(t-1)|a)$, so as to capture the time dependence. Nevertheless, we keep it simple and continue with the first-level CM. The desired output of the CDBIMM is a better prediction of the dynamic gestures. Working through the probability maths, we arrive at the following equation, which gives the prediction from the scores of both classifiers:

$\displaystyle \tilde{S_d} = S_d \times [C^T (S_{p}^-\times(CS_d))]$

where $C$ is the Coherence Matrix, $S_{p}$ is the static scores vector and $S_{d}$ is the dynamic scores vector. $x^-$ denotes the element-wise inverse of a vector $x$ and $\times$ is the Hadamard (element-wise) product.
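To make the Coherence Matrix and the equation above more concrete, here is a minimal NumPy sketch. It is only an illustration, not the thesis code: the function names, the `eps` guard against division by zero, and the layout of $C$ (rows indexed by static gesture $p$, columns by dynamic gesture $a$, so that $CS_d$ has one entry per static gesture) are my assumptions. The second function transcribes the equation literally as it is written above.

```python
import numpy as np


def coherence_matrix(static_pred, dynamic_truth, n_static, n_dynamic):
    """Estimate C[p, a] = P(static p | dynamic a) from per-frame labels.

    static_pred:   static gesture predicted by Cl_stat for each frame
    dynamic_truth: ground-truth dynamic gesture for the same frames
    (Assumed layout: rows are static gestures, columns are dynamic ones.)
    """
    counts = np.zeros((n_static, n_dynamic))
    # Count how often static gesture p shows up in frames of dynamic gesture a.
    np.add.at(counts, (static_pred, dynamic_truth), 1.0)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Normalize each column to sum to 1; columns without frames stay at zero.
    return np.divide(counts, col_sums,
                     out=np.zeros_like(counts), where=col_sums > 0)


def cdbimm_scores(C, s_p, s_d, eps=1e-12):
    """Evaluate S_d x [C^T (S_p^- x (C S_d))] as written in the text."""
    s_p_inv = 1.0 / np.maximum(s_p, eps)  # element-wise inverse (eps is my guard)
    inner = s_p_inv * (C @ s_d)           # Hadamard product, one entry per static gesture
    return s_d * (C.T @ inner)            # Hadamard product, one entry per dynamic gesture


# Toy data: six frames, two static and two dynamic gestures.
C = coherence_matrix(np.array([0, 0, 1, 1, 1, 0]),
                     np.array([0, 0, 0, 1, 1, 1]), n_static=2, n_dynamic=2)
refined = cdbimm_scores(C, s_p=np.array([0.8, 0.2]), s_d=np.array([0.5, 0.5]))
```

The refined scores are not normalized, so only their relative order is meaningful unless one rescales them afterwards.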

There is one problem with the dynamic scores above: they are produced by SVMs and are therefore not probabilities. However, Platt scaling alleviates this problem, so that the SVMs' ‘probabilities’ can be used in the equation above.
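For readers unfamiliar with Platt scaling, the idea is to fit a sigmoid $1/(1+e^{Af+B})$ that maps a raw SVM decision value $f$ to a probability. The sketch below is a simplified version that I wrote for illustration: it fits $A$ and $B$ with plain gradient descent and skips the target smoothing of Platt's original procedure. The function names, learning rate and toy data are all assumptions.

```python
import numpy as np


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p(y=1 | f) = 1 / (1 + exp(A*f + B)) by gradient descent on the log-loss."""
    f = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        # Gradients of the mean negative log-likelihood with respect to A and B.
        grad_A = np.mean((y - p) * f)
        grad_B = np.mean(y - p)
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B


def platt_probability(scores, A, B):
    """Map raw SVM decision values to calibrated probabilities."""
    f = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(A * f + B))


# Toy decision values from one binary SVM and the matching 0/1 labels.
decision_values = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
labels = np.array([0, 0, 1, 0, 1, 1])
A, B = fit_platt(decision_values, labels)
probs = platt_probability(decision_values, A, B)
```

For an array of SVMs, one plausible approach (again an assumption on my part) is to fit one such sigmoid per SVM and then rescale the resulting values so that they sum to one.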

Single Output (Combined system)

The single-output system is much the same as the dual one, except that the scores coming from the two classifiers (CDBIMM and $Cl_{stat}$) are concatenated and passed to a final classification stage, consisting of another RDF classifier, which produces a unified prediction.

From Scores to Predictions

One thing not mentioned so far is the mechanism by which the produced scores are transformed into a gesture prediction. Fanello et al. propose in their paper a technique that bases the choice of prediction on the standard deviation of the scores. When the standard deviation reaches a local minimum (within a temporal window of a few frames), the participating classifiers cannot give us a confident prediction. This is where we assume that one action has ended and another begins. By taking the argmax of the gesture appearance frequency (treating the max score of each frame as a vote) over every interval delimited by those “breakpoints”, we can infer the gesture being observed. In principle this is a rather robust way to convert the scores to a prediction; in practice, however, it is highly vulnerable to any noise in the scores. Nevertheless, this method, in conjunction with another one that keeps a prediction stable until the score of another class rises above a specific threshold, was used to provide a decision for the dynamic and the static gestures respectively.
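The two conversion schemes described above can be sketched as follows. This is my own reading of the description, not the exact algorithm of Fanello et al. or of the thesis: the function names, the window size, the strict-decrease test used to pick one frame per minimum, and the threshold value are all assumptions. `scores` is taken to be an array with one row per frame and one column per gesture class.

```python
import numpy as np


def find_breakpoints(scores, half_window=2):
    """Frames where the per-frame std of the scores hits a local minimum."""
    spread = scores.std(axis=1)
    breaks = []
    for t in range(1, len(spread) - 1):
        lo, hi = max(0, t - half_window), min(len(spread), t + half_window + 1)
        # Minimum of its temporal window and strictly below the previous frame,
        # so a flat stretch does not produce a run of breakpoints.
        if spread[t] <= spread[lo:hi].min() and spread[t] < spread[t - 1]:
            breaks.append(t)
    return breaks


def vote_per_segment(scores, breaks):
    """Majority vote of the per-frame argmax inside each breakpoint interval."""
    votes = scores.argmax(axis=1)
    edges = [0] + list(breaks) + [len(votes)]
    preds = []
    for start, end in zip(edges[:-1], edges[1:]):
        if end > start:
            preds.append(int(np.bincount(votes[start:end]).argmax()))
    return preds


def hold_prediction(scores, threshold=0.6):
    """Keep the current class until another class scores above the threshold."""
    current = int(scores[0].argmax())
    out = []
    for frame in scores:
        best = int(frame.argmax())
        if best != current and frame[best] > threshold:
            current = best
        out.append(current)
    return out


# Toy sequence: class 0 for four frames, an uncertain frame, then class 1.
scores = np.array([[0.9, 0.1]] * 4 + [[0.52, 0.48]] + [[0.1, 0.9]] * 4)
breaks = find_breakpoints(scores)
segment_preds = vote_per_segment(scores, breaks)
held = hold_prediction(scores)
```

On this toy sequence the uncertain middle frame is the only breakpoint, so the voting scheme yields one prediction per side of it, while the threshold scheme switches class only once the new class scores above 0.6.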