Today I am going to give you an overview of the project I built for my Master’s thesis. Torn between a generic title, roughly ‘Identifying Dynamic Gestures with a Depth Sensor in Real Time’, and a more specific one, which is the one I chose for this blog post, I ended up building both a generic system and a specialized application.
My initial estimate of how complex this problem would be was quite far from reality, mainly because of the following assumptions:
- Kinect includes a skeleton recognition algorithm that is quite fast and effective and could be used to extract the user’s moving hand from the video stream.
- There must be many publications that approach the problem in a similar way and provide a practical implementation.
Well, both proved to be wrong in one way or another, forcing me to dive into the deep waters of improvisation and general research, particularly where the preprocessing step is concerned.
Identifying The Problems
To write a program that lets a user draw on any surface just by moving a hand over it while the Kinect sensor records the action, I first had to come up with a simple flow diagram. The diagram had to comprise the main subsystems (sub-algorithms) needed to perform the task efficiently. These subsystems turned out to be:
- Early Preprocessing: The input depth stream is analyzed and a mostly noise-free output is produced.
- Moving Object Recognition: Identifies the foreground, which is the moving object. (Assumption 1: Drawing means moving an arm against a static background.)
- Skeletonization: The moving object is skeletonized in order to find and extract its final link. (Assumption 2: The biggest moving object is the arm, and the hand (the final link of the arm), along with its endpoint, is used for drawing. Assumption 3: The moving object enters the sensor’s field of view.)
- Feature Extraction: Highly invariant depth features are extracted from the final link mask.
- Gesture Recognition: The hand gesture is recognized.
- Drawing Action: The recognized gesture is mapped to a drawing action, which is performed on a virtual canvas using the moving object’s endpoint.
- Post Drawing Actions: The stroke drawn by the endpoint of the moving object is tracked over time, so that movement patterns (e.g. a circular movement) are recognized and additional drawing actions are performed.
As one can observe, I set some constraints while building the basic diagram for the sake of a lower computational cost, at the price of a slightly worse user experience. Assumption 2 implies that no part of the user’s body other than the arm may appear in the sensor’s field of view. Because of this, the suggested placement of the sensor is above the user, e.g. hanging from the ceiling or mounted on a tripod and looking down.
In the following sections I am going to write a bit more about each subsystem.
A sample collection of the frames I am using for my experiments can be observed below:
The input I receive is a rather noisy depth image, with strongly fluctuating depth levels and with surfaces where Kinect seems to measure depth incorrectly (they absorb IR radiation). This noise can significantly damage the recognition steps, so I start with a preprocessing strategy. The strategy involves a calibration stage, performed at the beginning of the algorithm’s execution. (Assumption 4: Calibration requires an initial time interval during which the sensor sees only the background, so the user’s hand must stay out of the image for a short time (around 1 sec) when the algorithm starts.) During this time, a background image is built from the valid pixels of each frame, namely those that carry information about the observed depth (they are not zero). In addition to this background model, noisy pixels, which are characterized by strongly fluctuating intensity throughout the calibration stage, are marked as untrusted, and the intensities they took during calibration are saved as a reference.
In the frames that follow, any invalid pixel receives the corresponding intensity from the background image. A pixel that was marked as untrusted during calibration, and whose current intensity belongs to its saved set of intensities, is also considered invalid, and its intensity is replaced by that of the background image.
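As a rough illustration, the calibration and clean-up steps described above could be sketched as follows. This is a minimal NumPy sketch, not the thesis code: the function names, the use of the per-pixel mean of valid samples as the background, and the `noise_range` threshold that decides when a pixel counts as "strongly fluctuating" are all assumptions made for the example.

```python
import numpy as np


def calibrate(frames, noise_range=10):
    """Build a background model from frames that show only the background.

    frames: sequence of 2D depth arrays of equal shape (0 means "no depth").
    noise_range: hypothetical threshold; a pixel whose valid depth values
        span more than this during calibration is marked as untrusted.
    Returns (background, untrusted, calib_stack).
    """
    stack = np.stack(frames)                      # shape (T, H, W)
    valid = stack > 0                             # pixels that carry depth
    counts = valid.sum(axis=0)

    # Assumed background model: mean of the valid samples of each pixel.
    sums = np.where(valid, stack, 0).sum(axis=0, dtype=np.float64)
    background = np.zeros(sums.shape, dtype=np.float64)
    np.divide(sums, counts, out=background, where=counts > 0)

    # Untrusted pixels: valid values fluctuate by more than noise_range.
    hi = np.where(valid, stack, -np.inf).max(axis=0)
    lo = np.where(valid, stack, np.inf).min(axis=0)
    untrusted = (counts > 0) & ((hi - lo) > noise_range)

    return np.rint(background).astype(stack.dtype), untrusted, stack


def clean_frame(frame, background, untrusted, calib_stack):
    """Replace invalid pixels of a new frame with the background value."""
    invalid = frame == 0
    # An untrusted pixel showing a value it already took during calibration
    # is treated as background noise as well.
    seen_before = np.any(calib_stack == frame[None, :, :], axis=0)
    invalid |= untrusted & seen_before
    return np.where(invalid, background, frame)
```

A call such as `clean_frame(new_frame, *calibrate(first_frames))` then yields the frame that is passed on to the next stage. Keeping the whole calibration stack makes the membership test trivial, at the cost of memory that grows with the length of the calibration interval.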
Moving Object Recognition
To extract the hand from the background, I applied to the preprocessed frames an existing moving object detection (background subtraction) algorithm combined with a shadow detection algorithm, both of which come packaged in the OpenCV library. I am referring to the MOG2 algorithm, originally conceived and implemented by Zoran Zivkovic, combined with the quite useful shadow detection algorithm by Prati et al. I chose this pair because it extracts the moving object effectively at a low computational cost. I added the shadow detection step because MOG2 on its own treats most changes in intensity as artifacts and removes them from the final result, as it was originally designed for non-depth video streams. The drawback of this approach, which stems from the nature of the Kinect depth sensor itself, is that the reflection of the moving object on IR-reflective surfaces can be falsely detected as a moving object. (One idea for tackling this problem and partially cancelling the effect of such artifacts was to segment the background image and compute, for each segment, the displacement of its center of mass of intensities, so that the algorithm only allows movement that comes from the image edges. However, I abandoned it due to its high computational cost. I may write a related post in the future.)
With this slight modification, followed by a morphological opening, we get a fairly clean image of the moving objects. Finally, after detecting contours and keeping the one with the largest area, we assume that we have found the moving arm.