The prevailing approach to depth-specific classification problems is to reuse features designed for the RGB space, enhanced to exploit the depth information: SURF, HOG and other “passive” descriptors capture the textural nature and shape variation of the analyzed scene, while HOF, DMMs, Dense Trajectories and other “dynamic” descriptors supply the classification system with the intrinsic relative movement of the scene. Although highly efficient when handling RGB-D streams, these features lose much of their power if the RGB space is not utilized.
The ability to classify in the absence of the RGB space provides color and brightness invariance, a hugely helpful property for various applications. Depth is a singular property of a statically observed system, i.e., a system observed from a non-moving sensor, as it cannot be altered without internal movement or external physical intervention. The drawback of the depth channel is its inability to characterize actions and objects that rely on visible light to be classifiable. Fortunately, this is not an issue here, as the goal is to create a tool for drawing on any surface with one’s hands, regardless of the surface’s hue, saturation or brightness.
If the combined system did not have to perform in real time, state-of-the-art descriptors, such as ConvNet features (which require high computational time) and Local Binary Patterns of DMMs (which assume a fixed-size temporal action response, with few deviations that increase the extraction time polynomially), would achieve great results. Furthermore, I am somewhat wary of basing my work entirely on implementations that have officially been used only on specific existing datasets, as I have found in the past, by experimenting with them, that they tend to overfit to the supplied data. Therefore, three descriptors were chosen to encompass the depth information in a highly computationally efficient fashion; they are described in the following sections. Before extracting any of these features, I used the orientation of the hand relative to the other links and the frame edges, extracted from the last link of the detected skeleton, to negate its effect by rotating the hand mask in the opposite direction.
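The orientation-normalization step above can be sketched as a single rotation of the hand mask by the opposite of the estimated hand angle. This is a minimal sketch, assuming the angle has already been derived (in degrees) from the last skeleton link; the function name and signature are hypothetical.

```python
import numpy as np
from scipy.ndimage import rotate

def negate_hand_orientation(hand_mask, orientation_deg):
    """Rotate the hand depth mask by the opposite of the estimated
    hand orientation (hypothetical angle, computed elsewhere from the
    last skeleton link), so every sample is orientation-normalized."""
    # order=0 (nearest neighbour) keeps the depth values intact, and
    # the background stays zero via cval=0.
    return rotate(hand_mask, angle=-orientation_deg,
                  reshape=False, order=0, cval=0)
```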
I arrived at this descriptor after experimenting with the 3DHOF descriptor described by Fanello in a 2013 paper. The basic concept is simple: instead of creating a 2D histogram of the observed movement gradients, as is done with HOF descriptors, construct a 3D histogram, relying on the fact that the intensity of each pixel encodes the depth of a point in space and that the point can be reprojected into 3D space using the pinhole camera model. The movement prediction is performed with an Optical Flow algorithm, a choice I found unfortunate, and this is where I diverged from the initial idea. By analyzing the way Optical Flow is computed, I found it ill-suited to the nature of the depth stream. A difference in intensity over time, which the Optical Flow algorithm would interpret as a point dislocation on the XY plane, does not carry the same meaning in a depth image, except in regions of depth discontinuity, that is, the objects’ contour regions. While there are ways to bypass this, for example by applying a regional depth normalization beforehand, they would increase the computational time prohibitively. A fast, simplistic alternative was to remove the XY-plane movement prediction entirely and take into account only the dislocation caused by reprojecting each point into 3D space. This approach offered slightly better accuracy while greatly accelerating the whole system.
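The simplified scheme above can be sketched as follows: back-project both depth frames through the pinhole model, take the per-pixel 3D dislocation (no optical flow on the XY plane), and bin the dislocations into a normalized 3D histogram. This is a minimal sketch, assuming known camera intrinsics fx, fy, cx, cy; the function names and bin count are illustrative, not the author's exact implementation.

```python
import numpy as np

def reproject(depth, fx, fy, cx, cy):
    """Pinhole back-projection of a depth image to 3D points (H, W, 3)."""
    v, u = np.indices(depth.shape)
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack((x, y, z))

def depth_flow_histogram(prev_depth, cur_depth, fx, fy, cx, cy, bins=4):
    """Per-pixel 3D dislocation from the depth change alone, binned
    into a flattened bins**3 histogram (normalized to sum to 1)."""
    p0 = reproject(prev_depth, fx, fy, cx, cy)
    p1 = reproject(cur_depth, fx, fy, cx, cy)
    valid = (prev_depth > 0) & (cur_depth > 0)   # ignore missing depth
    disp = (p1 - p0)[valid]                      # (N, 3) dislocations
    hist, _ = np.histogramdd(disp, bins=bins)
    total = hist.sum()
    return (hist / total if total else hist).ravel()
```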
The 2013 paper of Fanello et al. proved to be a good source of information, in part because the authors went on to build a real-time working system, demonstrating its actual sustainability and robustness. It therefore seemed acceptable to borrow another descriptor of theirs, named GHOG. GHOG is a morphological descriptor, a simplified implementation of the HOG features: instead of computing local HOG features inside blocks, the HOG computation is performed over the whole Region of Interest. I enhanced the method by resizing the ROI to 30 x 30 * (imy/imx) before applying GHOG, maintaining the ratio of the X and Y dimensions. In this manner, I avoided capturing most of the noise, particularly that appearing at contour edges.
Having read in many papers that extracting features from slices in the temporal space, focusing on specific dimensions each time, can be beneficial for classification, I devised a morphological descriptor of this nature. It produces features by resizing the hand depth mask to 32×32 and calculating the first eigenvector of the PCA performed on each spatial plane of the result, using the XY-plane PCA to negate the mask orientation and the other planes (calculated afterwards) to obtain a concatenated vector of 64 elements. The downside of this descriptor is that it deforms the image during resizing, so it depends on the length-to-width ratio of the input depth mask. Apart from the initial resize, there is little overall loss of valuable information, as any variance in other directions can be modeled by a non-linear combination of the calculated variances with small error; therefore, this descriptor can be characterized as quite dependable.
Had we left the zeroed background as is, the produced PCA would depict only the distance of the hand on the XZ and YZ planes, which would render the method unusable. To avoid this, before calculating each PCA, we fill the zeroed pixels with the mean intensity along each plane, so as to “deactivate” their involvement.
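The mean-fill and per-plane PCA steps can be sketched together. This is one plausible reading of the descriptor, under the assumption that the 32×32 depth mask is treated as a data matrix whose row-wise PCA yields a 32-element “XZ” eigenvector and whose column-wise PCA a 32-element “YZ” one, concatenated into the 64-element feature; the XY-plane PCA used for orientation negation is omitted here, and all helper names are hypothetical.

```python
import numpy as np

def first_eigvec(data):
    """First principal axis of a data matrix (observations in rows)."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered
    w, v = np.linalg.eigh(cov)
    return v[:, -1]                      # eigenvector of the largest eigenvalue

def plane_pca_descriptor(mask32):
    """Mean-fill the zeroed background, then concatenate the first
    eigenvectors of the row-wise and column-wise PCAs (32 + 32 = 64)."""
    m = mask32.astype(np.float64).copy()
    nz = m != 0
    if nz.any():
        m[~nz] = m[nz].mean()            # "deactivate" background pixels
    xz = first_eigvec(m)                 # rows as observations -> 32 dims
    yz = first_eigvec(m.T)               # columns as observations -> 32 dims
    return np.concatenate([xz, yz])      # 64-element descriptor
```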