After collecting the elements required for the desired gesture classification, we are ready to train the system.
If the system is In Sync, as proposed in the previous post, we first have to train the dynamic gestures classifier and the passive gestures classifier, and then train the CDBIMM. If the system is Combined, an additional training stage is needed for the CRDF. The input to the dynamic gestures classifier contains temporal information, so a buffer (running window) of the concatenated features of 10 consecutive frames is passed to it. Since the preprocessing stage may fail to provide a hand mask for various reasons, the buffer is constructed carefully, taking the index of each frame into consideration. The passive gestures classifier, on the other hand, receives as input only the features coming from a single frame.
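The index-aware buffer construction described above can be sketched as follows; `build_buffers` is a hypothetical helper (not from the original code), assuming each frame's features are stored as a NumPy array alongside its original frame index:

```python
import numpy as np


def build_buffers(features, frame_indexes, window=10):
    """Collect running windows of `window` consecutive frames.

    Frames for which preprocessing produced no hand mask are simply
    absent from `frame_indexes`, so a window is emitted only when its
    frame indexes are truly contiguous.
    """
    buffers = []
    for start in range(len(features) - window + 1):
        idx = frame_indexes[start:start + window]
        # skip windows that straddle a gap left by a skipped frame
        if idx[-1] - idx[0] == window - 1:
            buffers.append(np.concatenate(features[start:start + window]))
    return np.array(buffers)
```

A window whose first and last indexes span more than `window - 1` frames must contain a gap, so it is discarded rather than mixing features from non-adjacent moments in time.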
After having constructed the feature arrays from the training data, we perform the necessary steps to train each classifier. The validation sets were used to evaluate the performance of different configurations, in order to select the best one, which is the configuration we proposed in the previous posts.
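The configuration selection on the validation set can be sketched as below. This is a minimal illustration with synthetic data, using scikit-learn's `RandomForestClassifier` purely as a stand-in for the actual classifiers; the arrays and the parameter grid are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in classifier

rng = np.random.default_rng(0)
# synthetic stand-ins for the real training and validation feature arrays
X_train, y_train = rng.normal(size=(200, 30)), rng.integers(0, 4, 200)
X_val, y_val = rng.normal(size=(60, 30)), rng.integers(0, 4, 60)

# candidate configurations (hypothetical grid)
configs = [{"n_estimators": n, "max_depth": d}
           for n in (50, 100) for d in (5, None)]

# train one model per configuration, keep the best validation score
best = max(configs,
           key=lambda c: RandomForestClassifier(random_state=0, **c)
                         .fit(X_train, y_train)
                         .score(X_val, y_val))
```

The key point is that model selection is driven by the held-out validation score, not the training score, so the chosen configuration generalizes rather than overfits.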
The metrics we used were of two types, per-gesture F-scores and mean accuracy, and were computed in two spaces, the Macro space and the Micro one. We define the Micro space as the one where individual frames are the entities under examination, whereas in the Macro space the gesture occurrences are examined. The Macro metrics are rather harsh: when a frame interval matching a gesture in the dataset contains no prevalent category, i.e. when the most frequent category covers less than 50% of the frames, the whole interval is counted as a negative sample for all the available classes. The comparison of the classification results with the ground truth is done as it would be in real time. This means that the shift caused by the buffering of frames for the dynamic gestures is not compensated for, causing a general drop in measured accuracy, but providing at the same time a result more representative of reality.
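The Macro labeling rule can be made concrete with a short sketch; `macro_label` is a hypothetical helper (not from the original code) that maps an interval of per-frame predictions to a single gesture label:

```python
from collections import Counter


def macro_label(frame_predictions, negative=-1):
    """Label a gesture interval by its prevalent per-frame prediction.

    If the most frequent class covers less than 50% of the interval's
    frames, the whole interval counts as a negative sample for every
    class, which is what makes the Macro metrics harsh.
    """
    label, count = Counter(frame_predictions).most_common(1)[0]
    return label if 2 * count >= len(frame_predictions) else negative
```

For example, an interval predicted as `[1, 1, 1, 2]` is labeled as class 1 (75% coverage), while `[1, 1, 2, 2, 3]` has no class at 50% or more and is therefore treated as a negative sample.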
By examining the validation sets, we obtained 0.69 mean micro and 0.59 mean macro accuracy for the classification of dynamic gestures by the dynamic gestures classifier, and 0.87 mean micro and 0.88 mean macro accuracy for the classification of passive gestures by the passive gestures classifier. We did not perform validation testing for the subsequent layers of the cascaded system, as this would increase the number of required experiments prohibitively without considerably affecting the produced results.