Introduction
As part of my academic research endeavours, I am training myself to analyse research papers with a more methodical and critical eye.
The particular paper reviewed in this article is "A Real-Time Hand Posture Recognition System Using Deep Neural Networks" by Tang et al. and is part of a larger survey entitled "A survey of deep neural network architectures and their applications" by Liu, W. et al.
The approach I've used to structure my review process is outlined in Research Review Process.
Research question
How can hand postures be detected and classified in real time more effectively and efficiently, given the limitations of current methods?
Research aim
The paper describes an HPR (hand posture recognition) system that accurately and efficiently detects, segments, and classifies hand postures for sign language recognition in real time using a deep neural network.
Type of research
Empirical and quantitative research.
This research defines an HPR system for extracting high-quality hand-only images from arbitrary hand-signing scenes and classifying them into sign language representations.
Two well-defined, repeatable algorithms are developed to detect, segment and track hands from image scenes.
Quantitative testing of the quality of the hand-only images (produced by the HPR system) is done using a DNN.
The DNN’s results are used to indicate the quality of the hand-only images when used in a neural network as opposed to using manually selected features, i.e. to determine whether the DNN learns better as a result of using the generated hand-only data.
Mode of enquiry
This is scientific research.
The use of algorithms to define the hand segmentation and object recognition phase is repeatable and therefore automatable, which makes this approach systematic.
Multiple experiments are used to test the quality of the produced hand-only data, for example with a DBN, a CNN, and an SVM. Tests using manual features (e.g. HOG) in addition to automatically learnt features (DNN) are also carried out. Comparison of test results (with frequent use of tables during analysis) is used to determine the best parameters and best classification results.
Methodology
The research defines a repeatable design that uses algorithms to compose an HPR system that produces hand-only image data, which is later tested using a DNN to see how well it learns using this type of data.
For example, the HPR system contains a well-defined approach that is modelled mathematically, and algorithms are designed and implemented to detect and segment hands from scenes where signers are performing various hand postures. The process uses a robust approach that combines skin colour and depth information to extract only the hands from images. It also incorporates computer vision techniques (e.g. region growing, the ellipse hypothesis, Otsu’s method) to overcome problems such as occlusion, noise from other skin areas (e.g. the nose and head), and illumination.
The resulting hand-only image data is experimentally tested by passing it through a DNN to see how well it can be used by DNNs to classify/learn sign language labels. Comparative tests with SVM are also used.
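To make the role of Otsu’s method above concrete, here is a minimal, self-contained numpy sketch of the thresholding step. This is a generic textbook implementation, not the authors’ code; in the paper it is applied alongside skin-colour and depth cues rather than to a raw image on its own.

```python
import numpy as np

def otsu_threshold(grey):
    """Otsu's method: choose the grey level that maximises the
    between-class variance of the image histogram."""
    hist, _ = np.histogram(grey, bins=256, range=(0, 256))
    total = grey.size
    sum_all = np.dot(np.arange(256), hist)  # sum of all grey levels
    best_t, best_var = 0, 0.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]           # pixels in class 0 (<= t)
        if w0 == 0:
            continue
        w1 = total - w0         # pixels in class 1 (> t)
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

Because the threshold maximises between-class separation rather than being hand-picked, it copes with residual non-hand pixels left behind by the colour and depth masks.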
The deliverables of the research are:
- Propose a two-stage HPR system for sign language recognition
- Propose an effective algorithm to implement hand detection and tracking
- Apply deep neural networks to automatically learn features from hand posture
- Test through experimentation that the proposed system works quickly and accurately.
Research Methods
- Experiments using different types of classifiers (DBN, CNN, SVM)
- Mathematical modeling
- Neural network design
- Algorithm design
- Computer vision
The research uses primarily experimentation for quantitative/empirical measurements.
The system develops algorithms that utilise computer vision techniques to identify and segment the hand-only parts of images from signing scenes.
The paper primarily uses deep neural networks (e.g. a DBN with three hidden layers) to test the effectiveness and efficiency of the hand posture data generated by the two-stage HPR system (hand segmentation and hand tracking).
A DBN is primarily used; however, experiments on a CNN (LeNet-5) are also conducted for comparison. While non-linear approaches are tested (DBN and CNN), a linear classifier (SVM) is also used.
Research techniques
- Experiment using classifiers (linear and non-linear) such as DBN, CNN, SVM
- Contrastive divergence algorithm and pre-training algorithm (greedy layer-by-layer)
- Back propagation (fine-tuning)
- Hand detection algorithm: a probabilistic model of skin colour using a Gaussian distribution (estimated from the face) and a colour-thresholding algorithm described in the paper
- Region growing technique
- The hand tracking algorithm described in the paper uses Kalman tracking
- Occlusion algorithm using Object Hypothesis and Otsu’s method
- HOG feature extraction
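The contrastive-divergence pre-training listed above can be sketched for a single binary RBM layer of the DBN. This is a generic CD-1 update in numpy, assuming binary units and an illustrative learning rate; it is not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_h, b_v, lr=0.1):
    """One CD-1 update for a binary RBM (the greedy layer-by-layer
    pre-training step; shapes and learning rate are illustrative)."""
    # Positive phase: hidden activations given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Contrastive-divergence gradient approximation.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    return W, b_h, b_v
```

In a DBN, each trained layer’s hidden activations become the visible data for the next layer, and back-propagation then fine-tunes the whole stack.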
The research primarily uses a neural network to test the quality of the hand-only images generated by the algorithms within the HPR system; these images serve as training data to determine how well the network learns from them. Additionally, to aid robustness, the research randomly rotates the generated hand-only image data.
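The random-rotation augmentation could look like the following nearest-neighbour rotation about the image centre. The paper does not specify its exact rotation method, so the interpolation scheme here is an assumption.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2-D grey image about its centre using nearest-neighbour
    inverse mapping (a sketch; a random angle would be drawn per sample)."""
    theta = np.deg2rad(angle_deg)
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.indices((h, w))
    # Inverse-map each output pixel back into the source image.
    xs_src = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    ys_src = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xs_src = np.clip(np.rint(xs_src).astype(int), 0, w - 1)
    ys_src = np.clip(np.rint(ys_src).astype(int), 0, h - 1)
    return img[ys_src, xs_src]
```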
Experiments using both linear and non-linear models were used, such as SVM and DBN/CNN, respectively. Manually extracted features, such as HOG were also experimented with in addition to automatically learning the features.
Computer vision techniques such as thresholding and region growing underlie the two-stage HPR process.
Data
- Video of signers performing signing postures
- Samples of 32x32 hand-only grey-scale images (338,000 training and 169,000 testing) created by the HPR system using the video.
- 36 output classes of signing labels
Raw video of hand-signing scenes is sampled by the HPR system, which identifies and segments the corresponding images into processed hand-only 32x32 grey-scale images. This is used as training/testing data for the neural network for later classification into sign language symbols.
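A sketch of how a segmented hand crop might be reduced to the 32x32 grey-scale samples described above. The 32x32 size and grey-scale format come from the paper; the luminance weights and block-averaging resampler here are assumptions.

```python
import numpy as np

def to_training_sample(frame_rgb, out=32):
    """Convert an RGB hand crop to a normalised 32x32 grey image by
    luminance conversion and block averaging (resampler is illustrative)."""
    # Standard luminance weights for RGB -> grey.
    grey = frame_rgb @ np.array([0.299, 0.587, 0.114])
    h, w = grey.shape
    # Crop to a multiple of `out`, then average each block.
    grey = grey[: h - h % out, : w - w % out]
    bh, bw = grey.shape[0] // out, grey.shape[1] // out
    small = grey.reshape(out, bh, out, bw).mean(axis=(1, 3))
    return small / 255.0  # scale to [0, 1] for the network
```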
For comparison against the automatically learnt features (DNNs), manual features such as HOG were also generated and used with an SVM.
During testing, they report that the DBN, CNN, and HOG+SVM models yielded accuracies of 98.12%, 94.17%, and 87.58% respectively, indicating that the produced hand-only data improves the training/learning of the models, with the DBN's learning and inference performance being the best.
Information
The HPR system (hand identification, segmentation and tracking algorithms) produces high-quality hand-only image data. This is indicated by the improved inference performance of the DNN when using this data.
Knowledge
The HPR system described in this research, combined with the use of DNNs, recognises sign language in real-time with better results than current state-of-the-art approaches.
The combination of both colour and depth information is effective for the hand segmentation task (a crucial aspect of producing quality hand-only images).
It is possible to effectively solve the occlusion problem inherent in many other HPR techniques using the object hypothesis ellipse method and Otsu’s method.
Correlation vs Causation
Variables:
- Data generation process (hand segmentation and tracking)
- Data
- Models (DBN, CNN, SVM)
- Feature generation (automatically learnt vs manual/HOG)
Empirical tests on a DBN (three hidden layers, as described in Hinton, 2006), a CNN (LeNet-5), and HOG+SVM were conducted, all using the same generated HPR hand-only image data, across all signing posture classes, to see whether some postures are more biased in learning success than others.
Across all experiments and using the same data, it's shown that DBN results in the lowest error discrepancy.
While the results are positive in that the generated hand-only images yield good accuracy in all models, no third-party hand-signing data is used to compare model classification performance against non-HPR-generated data. That said, the generated data is likely to be better, given the empirical results showing that the hand segmentation and tracking algorithms produce excellent results.
As the classification accuracy of neural networks depends mainly on the data they learn from (and here the data itself is what was produced), the good classification scores across different models suggest that the improved data is what causes the learning to be good.
Literature review
Referenced papers
See Figure 3: distribution of publication years of the referenced papers.
This paper was published in 2015.
Citations
- IEEE reports 26 citations https://ieeexplore.ieee.org/document/8903136
- ACM reports 109 citations https://dl.acm.org/doi/10.1145/2735952
Reasoning process (deduction/induction)
Figure 4: Paper’s deductive process
Subjectivity/Objectivity
Construct Validity
- No obvious flaws
Internal Validity
Research Correctness
Objectivity
- The issues in hand posture recognition are well discussed and provide a convincing representation of the problems from the field/existing research (good literature review).
- A repeatable process derived from a description of an algorithmic approach is described using various computer vision techniques
- The process described is practical, i.e. from signing scenes, hands are segmented and classified into labels.
Subjectivity/Specificity
No obvious flaws
Research technique
Objectivity:
The techniques are systematic and mostly algorithmic, drawing on applications and novel combinations of well-known computer vision approaches. This makes the approach automatable and easier for third parties to validate, verifying its accuracy and whether its conclusions are reproducible and objective.
The use of computer vision techniques, which are already well-defined and are routinely used in the area of object recognition, makes their use in identifying and segmenting hand postures appropriate and objective.
Subjectivity/Specificity:
- The approach uses specific computer vision techniques, which may now be outdated.
- The approach has not been shown to work with subjects that do not wear long sleeves. It is limited to detecting hands with a specific style of clothing.
Research techniques vs research question
Objectivity:
Forming a systematic approach using computer vision algorithms to recognise hand postures (inherently a segmentation and object recognition task), combined with neural networks (inherently concerned with classification), is an appropriate means to classify postures.
Also, the comparison of manual feature descriptors (e.g. HOG) against models that learn the best features themselves is useful to determine which kind of model produces better classification using this type of hand-only image data. As neural networks can both learn features from the data and perform classification, their use in this research is appropriate.
Subjectivity/Specificity:
- These techniques only answer the question at the time of publication. Newer models and designs for CNNs and DBNs have been developed since this research was published.
- The results are biased to the specific techniques used in the research
Conclusion vs methods
Objectivity:
The comparison of classification results from different models shows how the improved quality of input images (produced by the HPR system) improves their classification results.
Equally, the use of different types of classification models helps to show that the DBN improves classification compared with both a non-neural-network approach (SVM) and another neural network (CNN). The conclusion that a DBN using hand-only data is the best method therefore supports the pursuit of the best method for HPR classification.
Subjectivity/Specificity:
Technically, while the solution works well (and is used with varying DNNs), there is no empirical indication of how good it is compared to other HPR approaches (DNN-based or otherwise).
External Validity
Objectivity
- The approach is systematic and repeatable, and uses a combination of known algorithms and learning systems (NNs).
- If the conditions the research was limited to are met (such as signers wearing long sleeves and a single signer in the scene), it's likely that other/external signer images outside of the data used in the research will fare similarly to what is reported.
Subjectivity
- Only specific types of clothing can be worn by the signer, which limits the applicability of the approach to those using this clothing configuration.
- While the approach might be resistant to rotation of frontal signer images, it may not cope if the signer's video is captured in a crowd, or if the signer is not facing straight at the camera. For example, if real-time recognition needs to occur while signers are moving and gesturing conversationally (such as recording natural signers in a park, outside controlled conditions), it's unlikely that the recognition system will work correctly, because the system is trained on front-facing, standing signers in ideal conditions. These cases would need to be accounted for if the approach is to generalise to real-world conditions.
Data Validity
Data subjectivity (specificity/narrowness)
Objectivity:
The HPR system generates the hand-only images from video footage of hand signers performing hand poses. This data is used for classification via the neural network.
The data is appropriate because a key aspect of this research is creating higher-quality hand-only data, which then serves as input to the neural network for hand pose classification.
Subjectivity/Specificity:
The signing video footage is not provided with the research; however, any signing footage can be used to re-create the type of quality hand-only images the HPR system is designed to produce, meaning a similar quality of images can be recreated and used for testing neural network performance. The data is therefore inherently biased, but as the procedure is systematic and automatable, it is considered objective and appropriate for this research.
- Limited field-of-view perspective of the captured video footage.
- The research does not mention how signers of different ethnicities, with varying skin tones, might be affected. For example, a signer with a dark complexion wearing a dark top might not have their gestures effectively distinguished. There are no experiments to suggest the approach is resilient to this.
- There is no evidence that the face detection and skin colour detection will work if the signer's cultural dress limits exposure of the face (e.g. a hijab). This might make the approach perform discriminately poorly depending on cultural clothing conventions.
Data vs Research Question
Objectivity:
As the research question aims to find the best way to identify and classify hand postures, creating newer, higher-quality hand-signing (posture) images from signing video footage for classification is appropriate and supports the research question.
Subjectivity/Specificity:
- The data used in the research is generated by the research team and used to measure their classification and tracking algorithms. This means the specific data used in the research is the only verifiable data behind the results they report.
- "Real-time" is subjective and limited to what the research considers real-time. For example, on constrained devices, real-time might mean a camera taking an image every second rather than 60 fps. This research is limited to its own measurement of real-time.
Summary of general risks to validity
- The paper is old (8 years), and the approaches rely heavily on the use of a specific device that was available at that time (Kinect)
- For example, the skeletal tracking and face detection aspects of the approach rely on Kinect providing this ability.
- Being this old, there are advances in computer vision techniques that might render these older techniques inefficient.
- The approach is only usable in Windows, making the approach platform-dependent.
- Kinect sensors are also old and have been discontinued by Microsoft
- The poor performance of conventional cameras is said to have been a problem at the time of the research, but newer, more performant hardware/devices are likely available now
- There is an assumption during hand tracking that the hand does not deform since the last frame
- The approach relies on the fact that signers wear long-sleeved garments, limiting the generalisability
- Only a single configuration of the CNN (LeNet-5) and the DBN (Hinton's 2006 example) was used. It may have been good to try others.
- It might no longer be necessary to use the segmentation algorithm nowadays to learn hand features. Approaches that use CNNs could be used on the entire recorded frame and learn the hand positions.
Credibility concerns
Objectivity:
No significant gaps in the distribution of papers referenced (see figure 3)
Subjectivity/Specificity:
- The research is 10 years old.
- Specific attention to hand segmentation may no longer be required, for example, if autoencoders (AEs) or newer DNNs can derive the most relevant features of a picture without needing to manually segment/choose the hand as the relevant feature.
Relevance, Contribution, Originality and Novelty
Implications & Contributions
Hand identification, segmentation, tracking and classification are accurate and effective, and the system works well in real time.
Overcomes many shortfalls of current methods, such as dealing with obstructions, non-skin noise (eyes, nose, etc.), light and colour variations, and subtle variations in finger positions.
Removes the need to manually specify features about hands and their positions.
Opinion
This research takes an objectivist approach, which aims to uncover hidden truths (the effectiveness of segmentation and tracking algorithms and their improvement for DNNs), favouring empirical testing, a systematic approach and a thorough investigation into cause and effect.