- Details
- Category: Blog
- By Stuart Mathews
- Hits: 944
Introduction
As part of my academic research endeavours, I'm undertaking to train myself to analyse research papers with a more methodical and critical eye.
The particular paper reviewed in this article is "A survey of deep neural network architectures and their applications" by Liu, W. et al.
The approach I've used to structure my review process is outlined in Research Review Process.
Table of Contents
- Research question
- Research aim
- Type of research
- Mode of enquiry
- Methodology
- Research Methods
- Research techniques & Data analysis
- Data
- Information
- Knowledge
- Correlation vs Causation
- Literature review
- Reasoning method (deduction/induction)
- Subjectivity/Objectivity & Threats to validity
- Relevance, Contribution, Originality and Novelty
Research question
What are the prevalent deep learning architectures, and how are they currently being used?
Research aim
This paper draws from research papers to determine what the aggregate applications and types of deep neural networks are. In this respect, it might be considered interpretive research; however, it also contains a large descriptive element because it describes what the 4 main types of DNNs are and how they work, in addition to describing the main applications of DNNs.
There is also an element of predictive research as the authors try to predict the future avenues of research for DNNs. However, on balance, this is likely to be classified as descriptive research or generally a descriptive survey.
Type of research
Qualitative.
The authors read papers and interpret the overall applications for various DNN architectures. There does not, however, appear to be an explicitly systematic approach. Indeed, there is no explicit methodology described in the paper to inform the method used.
Mode of enquiry
Non-Scientific.
There are no quantitative measurements taken, nor is any experimental research undertaken. There is no particular strategy/methodology that the authors seem to employ to pick the papers, and it's therefore possible that they could leave out some important papers or discount the importance of others. The conclusions are qualitative and interpretive, which makes them more subjective in nature and increases the risk of subjective opinions in the conclusions.
Methodology
Non-empirical, theoretical research based on gaining a contextual understanding of existing research papers in order to determine what the prevalent architectures and applications are for DNNs.
The main research deliverables are:
- Determine the different types of widely-used architectures for deep learning.
- Progress that each architecture has made.
- Determine the applications of deep learning.
Research Methods
- Survey
- Qualitative narrative review
The research surveys a variety of research papers and describes and identifies the common themes such as the main types and applications of DNNs.
Research techniques & Data analysis
- Thematic review
- Use of secondary material for review and analysis of research papers.
The authors use the data from multiple research papers to establish conclusions about what is required in the future based on what they have found, such as a determination of the main types of architectures and applications found in the papers, and to analyse them to inform their conclusions/predictions. They also use the papers to help explain how the architectures are set up and work, and provide a helpful overview of approaches.
Data
The research papers up until 2017. They appear to be from a wide variety of sources, concentrating on DNNs. See the literature review step for more context on this data.
Information
- RBMs, DBNs, CNNs and AE are the main DNN architectures.
- Speech recognition, pattern recognition and computer vision are the main applications for these DNN architectures.
- A key application of DNNs is processing unlabelled big data (unsupervised learning).
Future avenues of research include DNN optimisation, combining DNNs with reinforcement learning, and more thorough research into the use of DNNs in complex systems.
Resource consumption is a concern, especially in low-resource environments such as mobile.
The stability and consistency of DNNs are of concern for the future.
Knowledge
Current state of main DNN architectures, approaches and applications as of 2017
Correlation vs Causation
The aggregation of themes that appear in the surveyed papers suggests their commonality. It's possible however, that a bias in the selection of papers could affect the drawn conclusions, such as which specific architectures are used most often, particularly if a particular architecture was not covered because papers selected did not cover it, or it was biased to a particular field. This does, however, seem unlikely.
Literature review
Referenced papers
Figure 7: Distribution of years of research referenced papers
This paper was published in 2017. See Figure 7 for the chronology of its referenced papers.
Note: only papers from 2000 onwards are included.
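A distribution like the one in Figure 7 can be tallied from a reference list in a few lines. The sketch below is illustrative only: the function name `year_distribution`, the `from_year` cut-off parameter and the sample years are assumptions made for the example, not data taken from the survey.

```python
from collections import Counter


def year_distribution(years, from_year=2000):
    """Count references per publication year, ignoring years before from_year."""
    counts = Counter(y for y in years if y >= from_year)
    # Return the counts sorted by year so they can be plotted as a bar chart.
    return dict(sorted(counts.items()))


# Placeholder years for illustration only (not the survey's reference list).
sample_years = [1998, 2006, 2012, 2012, 2015, 2016, 2016, 2016]
print(year_distribution(sample_years))
```

Sorting by year makes gaps or clusters in the chronology easy to spot before any plotting is done.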
Citations
- 2744 citations from Science Direct: https://www.sciencedirect.com/science/article/abs/pii/S0925231216315533
- No representation in ACM or IEEE.
Reasoning method (deduction/induction)
Induction. The research papers are the data, and the outcomes are interpretations/conclusions about that data, such as common themes, observations and future avenues. Also, the papers allowed the authors to group topics and extract meaning from the data, see Figure 8.
Figure 8: Paper’s inductive process
Subjectivity/Objectivity & Threats to validity
Construct Validity
- No obvious flaws
Internal Validity
Research Correctness
Objectivity
- The paper does cite others' research to convince us that it draws not only from the authors' potentially subjective interpretations but that these are objectively based on other research.
Subjectivity/Specificity
- The papers reviewed aren't systematically selected by any criteria and so the conclusions are limited to the particular papers that the survey/authors decided to use.
- The authors don't explicitly mention a methodology they take for constructing the paper, suggesting that a poor/no strategy was taken to select papers.
Research technique
Objectivity:
- The quality of the papers is likely to be good/objective, but should be supported by the use of other corroborating sources to be more objective about the main DNN architectures and applications.
Subjectivity/Specificity
- The techniques are appropriate and involve reviewing papers, referring to them in explanations and using them to aggregate and extract common themes. The techniques are, however, qualitative and subjective by their nature.
Research techniques vs research question
Objectivity:
- Surveying research papers to determine common themes and patterns supports the aim to determine the main DNN architectures and their applications.
Subjectivity/Specificity:
- In order to determine what the prevalent deep learning architectures are and how they are currently being used, the authors have selected a subset of areas/disciplines. This means the prevalent architectures in areas they do not mention are discounted or missing.
Conclusion vs methods
Objectivity:
- By aggregating the common architectures in the papers the authors have reviewed, they have objectively determined from them that RBMs, DBNs, CNNs and AEs are the main DNN architectures. These are limited to the papers they have reviewed.
- Similarly, the papers they chose have led them to suggest that speech recognition, pattern recognition and computer vision are the main applications for these DNN architectures.
Subjectivity/Specificity:
- The conclusions are supported by the papers they reviewed; however, the paper selection criteria are not known.
- Predictions are subjective.
- It's not certain that the authors have reviewed enough papers to suggest all the main architectures and applications. Indeed, this would be difficult to do, but it's likely that the conclusions are based on the papers reviewed and the authors' opinions.
External Validity
Objectivity:
- It's likely that of the papers that were reviewed, the proposed main DNN architectures are aggregated/compiled correctly by reviewing the similarity and variances of the architectures they reviewed. This at least means these are the main architectures based on their aggregation results and research.
Subjectivity:
- The main architectures and applications identified are representative only of the papers reviewed.
Data Validity
Data subjectivity (specificity/narrowness)
Objectivity:
- There is a large variety of sources (papers) to derive their conclusions from.
Subjectivity/Specificity:
- The research data are research papers, but the methodology of the paper selection criteria is missing, making the selection possibly biased. The data is subject to the selected papers.
Data vs Research Question
Objectivity:
- The data (research papers) is appropriate and supports the aim to find the main DNN architectures and applications based on the survey of the papers selected in this research.
Subjectivity/Specificity:
- The main architectures and applications are limited to the papers reviewed. The predictions are subjective.
Summary of general risks to validity
Credibility concerns
Objectivity:
- A large concentration of recent research from the 10 years before the paper was published suggests that the methods are indeed state of the art, and therefore that this is likely a good survey. It remains to be checked whether any papers from 2017, i.e. the year the survey was done, are referenced, as none appear in the survey.
2744 citations from ScienceDirect suggest that this paper is popular and is likely well-received by researchers.
Subjectivity/Specificity:
- The research is 8 years old - applications and architectures are likely to have improved, and some may have become obsolete and deprecated in favour of newer architectures with applicability to more domains.
- No representation in ACM or IEEE
Relevance, Contribution, Originality and Novelty
Implications & Contributions
- Research provides a helpful summary and good balance of technical detail required to understand the main deep neural network architectures.
Very good for establishing a base understanding of DNNs.
Helps to indicate the gaps in current applications of DNNs.
Opinion
The research is subject to the papers reviewed; however, the scope of reviewed papers is large, encompassing mostly recent papers, suggesting that it is helpful in suggesting the state of the art near or around the time of publishing.
- By Stuart Mathews
Introduction
As part of my academic research endeavours, I'm undertaking to train myself to analyse research papers with a more methodical and critical eye.
The particular paper reviewed in this article is "Multimodal Deep Autoencoder for Human Pose Recovery" by Hong et al., and is part of a larger survey entitled "A survey of deep neural network architectures and their applications" by Liu, W. et al.
The approach I've used to structure my review process is outlined in Research Review Process.
Research question
How can data with multiple feature descriptors be combined and used as input to create richer and better data for DNNs to learn from and to more accurately classify 2D pose images into their 3D counterparts?
Research aim
The research describes a learning process called MDA (Multimodal Deep Autoencoder) that correlates 2D images to 3D images (image space to pose space) using mixed feature descriptors (modalities) and their inner representations/code using Autoencoders.
Type of research
This is quantitative research as the neural networks transform image data into numerical outputs, and the model aims to reduce the discrepancy in a quantitative error function.
Mode of enquiry
This is scientific research which describes a systematic and repeatable process for combining multiple 2D features into a low-rank representation using MHLRR.
An algorithm is designed that incorporates Autoencoders to extract the inner representation of the 2D and 3D data, and extensive experimentation is carried out, such as using different types of data formats to compare and test how the MHLRR affects learning as opposed to other types of data.
Methodology
The paper presents a procedure (MDA) for learning that consists of a well-defined process, which is modelled mathematically for combining multiple features into a unified format (LRR) and then using that format as input to a neural network for classification.
This approach describes using two Autoencoders to create and map the 2D hidden representations to their respective 3D hidden representations. It combines multiple 2D features (multimodal) into 1 uniform feature descriptor (LRR) and then uses a deep neural network to learn the association between that LRR data and the hidden representation of 3D image pose data, effectively mapping LRR data to 3D data and treating the 3D data as labels.
The research designs an approach and then evaluates its performance, making it clearly a piece of design science research.
Experiments are conducted to test the approach, making this primarily experimental in nature; for example, multiple experiments are conducted using MDA and other approaches such as RobustPose, DCNN, etc.
As part of this research, an algorithm is presented that indicates how the process and design can be automated/repeated.
The main research deliverables are:
- Propose a non-linear mapping using DNNs for pose recovery (2D poses to 3D pose classification)
- Develop an approach for creating an input representation that fuses multiple feature descriptors (multimodal)
- Demonstrate that using the multimodal input format/representation reduces classification error in DNNs
Research Methods
- Mathematical modeling
- Experiments using linear and non-linear classifiers
- System design (MDA) and algorithm design.
The research explains the mathematical process of how an autoencoder works, specifically how it generates the code (or inner representation) by learning to minimise the error between the input and the desired output and using backpropagation to update weights. That inner representation will then be used as data for the DNN. The authors also model the MHLRR (Multi-view Hypergraph LRR) using mathematical analysis.
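For readers new to autoencoders, the process described above is commonly written as follows. This is the standard textbook formulation rather than the paper's own notation; the symbols $W$, $b$, $W'$, $b'$, the activation $\sigma$ and the squared-error loss are assumptions made for illustration.

```latex
h = \sigma(W x + b), \qquad
\hat{x} = \sigma(W' h + b'), \qquad
L(x, \hat{x}) = \lVert x - \hat{x} \rVert^{2}
```

Here $x$ is the input, $h$ is the code (inner representation) and $\hat{x}$ is the reconstruction. Backpropagation adjusts $W$, $b$, $W'$ and $b'$ to reduce $L$, and it is $h$ that is later passed on as data for the DNN.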
The research uses a variety of experimental tests. For example, the research conducted 3 trials, performed diagnostic experiments using various types of features (HC, HOG, SC, MDA, NoAe) and models (MDA, DCNN, Linear regression, RobustPose, TGP, etc).
Research techniques
- Experiments using neural network (classification) and Autoencoders (feature extraction)
- Back propagation Neural Network
- MHLRR (MultiView Hypergraph Low rank representation)
- Denoising AutoEncoder for extracting inner representations
- HOG, Shape Context
- Mocap (Motion capture)
- Linear regression, Deep convolutional Neural network, RobustPose, Twin Gaussian Process, Latent Gaussian Mixture Regression testing
Feature combinations (AVG, CON, MSE, MODEX, MSDL, MDA - subject of this research)
The research uses a 3-step generic process of converting multiple ways of representing a 2D image (features) into a single, combined multimodal representation called a Multi-view HyperGraph Low Rank Representation (MHLRR). MHLRR uses weighted contributions from multiple features, which works better than existing, less flexible methods such as feature concatenation.
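To make the contrast concrete, the sketch below shows two simple fusion baselines of the kind listed above, next to a weighted combination. It assumes that CON means concatenating descriptors and AVG means averaging them element-wise, which this review does not spell out; the function names and sample vectors are hypothetical, and none of this is the MHLRR procedure itself.

```python
def fuse_concat(features):
    """CON-style fusion (assumed): join all descriptors into one long vector."""
    return [value for vector in features for value in vector]


def fuse_weighted(features, weights):
    """Weighted element-wise combination of equal-length descriptors."""
    if len(features) != len(weights):
        raise ValueError("one weight per descriptor is required")
    length = len(features[0])
    if any(len(vector) != length for vector in features):
        raise ValueError("descriptors must have the same length")
    total = sum(weights)
    return [
        sum(w * vector[i] for w, vector in zip(weights, features)) / total
        for i in range(length)
    ]


def fuse_average(features):
    """AVG-style fusion (assumed): every descriptor contributes equally."""
    return fuse_weighted(features, [1.0] * len(features))


# Two made-up descriptors of the same image, for illustration only.
hog_like = [0.2, 0.4, 0.6]
shape_like = [0.8, 0.0, 0.2]
print(fuse_concat([hog_like, shape_like]))
print(fuse_average([hog_like, shape_like]))
print(fuse_weighted([hog_like, shape_like], [3.0, 1.0]))
```

Concatenation and plain averaging fix each descriptor's contribution in advance, whereas a weighted scheme lets some descriptors count for more than others, which is the flexibility the review attributes to MHLRR.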
Autoencoders are used as a generic way to automatically extract features from their richer combined representation of features (MDA). The features are therefore represented by the configuration of the weights in the hidden layers. The research specifically uses denoising AEs, as adding noise helps to learn better features.
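The denoising idea can be sketched as a corruption step applied to the input before training. The example below uses masking noise (randomly zeroing entries), which is one common choice; this review does not state which kind of noise the paper uses, so the noise type, the function name `add_masking_noise` and the sample vector are assumptions.

```python
import random


def add_masking_noise(vector, noise_level, rng=None):
    """Return a copy of vector with roughly noise_level of its entries zeroed."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0 and 1")
    rng = rng or random.Random()
    return [0.0 if rng.random() < noise_level else value for value in vector]


clean = [0.9, 0.1, 0.5, 0.7]
corrupted = add_masking_noise(clean, noise_level=0.25, rng=random.Random(0))
# A denoising autoencoder is trained to reconstruct `clean` from `corrupted`.
print(clean, corrupted)
```

Because the network only sees the corrupted copy but is scored against the clean one, it cannot simply copy its input and has to learn features that capture the structure of the data.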
A neural network takes the output of the autoencoder and uses this as the input (2D features) and output (3D features) to learn a mapping between the two using a DNN. Back propagation is used to update the weights after the discrepancy in the error function is calculated.
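The weight update described above can be illustrated with the simplest possible case: a single linear layer trained by gradient descent on squared error. This is a deliberately reduced sketch, not the paper's deep network; the function name, learning rate, epoch count and toy vectors standing in for the 2D and 3D codes are all assumptions.

```python
def train_linear_map(inputs, targets, learning_rate=0.1, epochs=500):
    """Fit a weight matrix mapping inputs to targets by stochastic gradient descent."""
    n_in, n_out = len(inputs[0]), len(targets[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            prediction = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
            error = [p - t for p, t in zip(prediction, target)]
            # Gradient of 0.5 * error^2 w.r.t. weights[j][i] is error[j] * x[i].
            for j in range(n_out):
                for i in range(n_in):
                    weights[j][i] -= learning_rate * error[j] * x[i]
    return weights


# Toy stand-ins for 2D codes (inputs) and 3D codes (targets).
codes_2d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codes_3d = [[2.0], [-1.0], [1.0]]
print(train_linear_map(codes_2d, codes_3d))
```

A deep network repeats the same idea across several non-linear layers, with backpropagation passing the error backwards so that every layer's weights receive an update.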
Data
- Walking dataset - spiral walking motion
- Human3.6M dataset - 3.6 million 3D poses and related images
- HumanEva-I datasets - 5 motion types performed by 4 subjects
They extract multiple kinds of features from each and turn them into a unified LRR. The 3D poses are a collection of joint coordinates in 3D space.
Information
Results show that combining Autoencoders with a multimodal input allows a DNN to learn better than if using only a single feature. They report a 20-25% error reduction using the proposed approach compared to other approaches.
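As a reminder of what such a percentage means, the sketch below computes a relative error reduction. It assumes the reported figure is relative to a baseline's error, which this review does not confirm, and the numbers are hypothetical values chosen only to show the arithmetic.

```python
def relative_error_reduction(baseline_error, proposed_error):
    """Fraction by which the proposed method lowers the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error


# Hypothetical error values, not results from the paper.
print(f"{relative_error_reduction(40.0, 30.0):.0%}")  # prints 25%
```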
Knowledge
It is possible to combine multiple feature descriptors into a unified Low Rank Representation (LRR) using Laplacian HyperGraphs to good effect to improve the performance of neural networks in correlating/mapping data to labels.
Using a combined set of input features that are then correlated to the desired output (3D poses) using a deep neural network results in much better results than if a single feature were used.
It's possible to use inner representations (or code output from AE) as the basis of input into the network, and it learns better.
In the learning process, the expected output to be learnt can be non-labels in the traditional sense, but instead definitions of other higher representations, such as an Autoencoder’s code.
This process is generalizable enough that it could likely be used to create action sequences for animated characters in video games.
Correlation vs Causation
Variables:
- Learning approach/methods (baseline, others: BP-NN, DCNN, etc.)
- Layers in DNN design
- Noise level (for Autoencoders)
- Datasets
- Multimodal methods for feature combination (e.g. AVG, CON, MSE, MDA)
The use of varying approaches to combining multiple dimensions of data is evaluated, as are the models used for classification and feature extraction.
The model’s classification accuracy determines which combinations are the best, and shows that MDA is objectively better than the alternatives. This shows that MDA causes better classification results (least error discrepancy in the error function)
Also, for automatic feature extraction using Autoencoders, different noise levels are tested to see which produces the best features and, correspondingly, the best classification results when used as input. Multiple datasets reduce bias for a particular set of data.
Literature review
Referenced papers
Figure 5: Distribution of years of research referenced papers
The distribution of referenced papers does not have significant gaps that would suggest a particular bias; see Figure 5.
Citations
IEEE reports 504 citations in papers. https://ieeexplore.ieee.org/document/7293666/
ACM reports 218 citations. https://dl.acm.org/doi/10.1109/tip.2015.2487860
Generally, considering the above, on balance, this paper is moderately popular.
Reasoning method (Induction/Deduction)
Figure 6: Paper’s deductive process
Subjectivity/Objectivity & Threats to validity
Construct Validity
- No obvious flaws
Internal Validity
Research Correctness
Objectivity
- The data used is freely available and commonly used in the human pose recognition discipline.
- A very systematic and definable process is used, based on repeatable algorithms. These also draw on a combination of existing methods from computer vision, making the process less subjective and less specific to the authors.
Subjectivity/Specificity
- The approach is limited to combining feature descriptors from 2D images (pixel data).
Research technique
Objectivity:
- Using neural networks to map 2D to 3D images makes sense, as this is inherently what neural networks are designed to do. They also learn about the data provided to them; therefore, the research’s techniques of providing richer feature data using Autoencoders to aid this mapping (learning) make sense in the context of this research.
Subjectivity/Specificity:
- The specific feature descriptors used in the research limit the scope of LRR to only those descriptors. It is not clear if other 2D image descriptors will work the same way or yield the same favourable results.
- This research is limited to the specific use of the AutoEncoder techniques available at the time. These might have improved since then.
Research techniques vs research question
Objectivity:
- The use of neural networks to perform the classification of 2D images to 3D images supports the research question of finding an approach to 2D to 3D classification.
- The use of Autoencoders supports the aim to improve the classification by providing better (automatic) features for the neural network classification process.
Subjectivity/Specificity:
- The fusion of descriptors is only known to work on the descriptors used in the research, e.g. HOG. This limits the scope of proven applicability of combining them using MDA/LRR to those used in the research. For example, it's not clear which types of descriptors are better than others - could textual descriptions alone be used in the fusion?
- It's not clear that this fusion technique is still effective in 2025.
Conclusion vs methods
Objectivity:
- They use a variety of existing approaches to compare the performance of their approach on the same dataset. For example, they use CNNs, Linear regression, Latent Gaussian Mixture Regression (LGMR), etc. This helps support their specific conclusion as it is empirically evaluated to be better than the others.
- The methods yield good results, and simplifying and combining different representations using autoencoders is a good way to concentrate the data into a combination of only the most significant features that autoencoders can discover.
Subjectivity/Specificity:
- Combining features appears to work well using 2D and 3D data but conclusions (usefulness of combining features using MDA/LRR) are only valid with respect to that type of data.
External Validity
Objectivity
- The approach is systematic and it can be repeatable through the use of the algorithms used to define the process.
- The process/approach can be applied to data other than the data used in the research.
Subjectivity
- Only the specific data in the research can be used to prove/support the results/conclusions.
- There is no
- While the method of mapping 2D features to 3D features using MDA (using autoencoders and combining/fusing features) seems effective, a model that uses it would still need to be trained on a variety of different-sized human subjects, adults and smaller children to avoid being biased in the training data it uses to make pose inferences.
Data Validity
Data subjectivity (specificity/narrowness)
Objectivity:
- Using 2D posture and 3D posture image data from a variety of standard/common repositories is suitable for the task of mapping one to the other.
- Some datasets are already used in existing research (e.g. the Walking dataset), which makes using them more objective. Different datasets are used to reduce the bias of any one particular dataset.
Therefore, this data is appropriate for use in this research.
Subjectivity/Specificity:
- The research is subject to the nature and characteristics of the data that was used. If a new format of 3D posture data is used, other than the format provided in the test data, it's not certain that the approach will yield similar results. For example, the 3D format is roughly a set of vectors in 3D space. This is not the only possible format, and a different one might make this research/approach non-transferable to that kind of 3D data.
Data vs Research Question
Objectivity:
The research question looks to classify 2D data into its associated 3D representation; therefore, the 2D and 3D data support the determination of the research question.
Subjectivity/Specificity:
See external validity.
Summary of general risks to validity
Credibility concerns
Objectivity:
- No significant gaps in the literature review.
Subjectivity/Specificity:
- The research is 10 years old - limited to the era in which the research took place.
- The fusion techniques and methods could be outdated and have been subsequently improved or deprecated.
Relevance, Contribution, Originality and Novelty
Implications & Contributions
MDA outperforms other methods for 2D to 3D pose correlation.
The generation of a method to form a combination of multiple feature descriptors into a useful uniform descriptor format (MHLRR) appears to be widely applicable, meaning other features may be combined in other applications, too.
Legacy systems that use traditional features and methods can be reused/upgraded to be more accurate by combining multiple features.
Opinion
This research takes an objectivist approach, which aims to uncover hidden truths (the effectiveness of MDA for improving DNNs), favouring empirical testing, a systematic approach and a thorough investigation into cause and effect.
- By Stuart Mathews
Introduction
As part of my academic research endeavours, I'm undertaking to train myself to analyse research papers with a more methodical and critical eye.
The particular paper reviewed in this article is "A Real-Time Hand Posture Recognition System Using Deep Neural Networks" by Tang et al. and is part of a larger survey entitled "A survey of deep neural network architectures and their applications" by Liu, W. et al.
The approach I've used to structure my review process is outlined in Research Review Process.
Research question
How can hand positions be detected and classified in real time more effectively and efficiently, given the limitations of current methods?
Research aim
The research describes an HPR (hand posture recognition) system that accurately and efficiently detects, segments, and classifies hand positions for sign language recognition in real time using a deep neural network.
Type of research
Empirical and quantitative research.
This research defines an HPR system for extracting high-quality hand-only images from arbitrary hand-signing scenes and classifying them into sign language representations.
Two well-defined, repeatable algorithms are developed to detect, segment and track hands from image scenes.
Quantitative testing of the quality of the hand-only images (produced by the HPR system) is done using a DNN.
The DNN’s results are used to indicate how good the quality of the hand-only images is when used in a neural network as opposed to using manually selected features, i.e. to determine if the DNN learns better as a result of using the generated hand-only data.
Mode of enquiry
This is scientific research.
The use of algorithms to define the hand segmentation and object recognition phase is repeatable and therefore automatable, which makes this approach systematic.
Multiple experiments are used to test the quality of the produced hand-only data, for example, using a DBN, a CNN and an SVM. Tests using manual features, e.g. HOG, are also carried out in addition to learning automatic features (DNN) from the images. Comparison of test results (and frequent use of tables during analysis) is used to determine the best parameters and best classification results.
Methodology
The research defines a repeatable design that uses algorithms to compose an HPR system that produces hand-only image data, which is later tested using a DNN to see how well it learns using this type of data.
For example, the HPR system contains a well-defined approach that is modelled mathematically, and algorithms are designed and implemented to detect and segment hands from scenes where signers are performing various hand postures. The process uses a robust approach that incorporates skin colour and depth information to extract only the hands from images. This approach also incorporates computer vision techniques (e.g. region growing, ellipse hypothesis, Otsu’s method, etc.) that overcome problems such as occlusion, illumination and noise from other skin areas (e.g. nose, head, etc.).
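The combination of skin colour and depth can be pictured as two tests that a pixel must pass. The sketch below is a simplified illustration under several assumptions: a single colour value per pixel, a 1-D Gaussian fitted to face samples, and a `max_sigma` cut-off and `depth_range` (in arbitrary units) that are invented for the example. The paper's actual model and thresholds are not given in this review.

```python
from statistics import mean, stdev


def fit_skin_model(face_samples):
    """Estimate a 1-D Gaussian (mean, standard deviation) from face pixel values."""
    return mean(face_samples), stdev(face_samples)


def is_hand_candidate(colour, depth, model, max_sigma=2.5, depth_range=(400, 900)):
    """Keep a pixel if its colour is skin-like and its depth is within range."""
    mu, sigma = model
    skin_like = abs(colour - mu) <= max_sigma * sigma
    near_enough = depth_range[0] <= depth <= depth_range[1]
    return skin_like and near_enough


# Hypothetical face samples and pixels, for illustration only.
model = fit_skin_model([150, 152, 148, 151, 149])
print(is_hand_candidate(151, 600, model))   # skin-like and in range
print(is_hand_candidate(151, 2000, model))  # skin-like but too far away
```

Requiring both tests is what lets depth rule out other skin-coloured regions, such as the face, that sit further from the camera than the signing hands.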
The resulting hand-only image data is experimentally tested by passing it through a DNN to see how well it can be used by DNNs to classify/learn sign language labels. Comparative tests with SVM are also used.
The deliverables of the research are:
- Propose a two-stage HPR system for sign language recognition
- Propose an effective algorithm to implement hand detection and tracking
- Apply deep neural networks to automatically learn features from hand posture
- Test through experimentation that the proposed system works quickly and accurately.
Research Methods
- Experiments using different types of classifiers (DBN, CNN, SVM)
- Mathematical modeling
- Neural network design
- Algorithm design
- Computer vision
The research uses primarily experimentation for quantitative/empirical measurements.
This system develops algorithms to identify and segment hands from signing scenes. Algorithms are developed that utilise computer vision techniques to segment and identify hand-only parts of images.
The paper primarily uses deep neural networks (eg, 3 hidden layer DBN) to test the effectiveness and efficiency of the hand position data that is generated from their two-stage HPR system (hand segmentation and hand tracking).
A DBN is primarily used; however, experiments on a CNN (LeNet-5) are also conducted for comparison. While non-linear approaches are tested (DBN and CNN), a linear classifier (SVM) is also used.
Research techniques
- Experiment using classifiers (linear and non-linear) such as DBN, CNN, SVM
- Contrastive divergence algorithm and pre-training algorithm (greedy-layer-by-layer)
- Back propagation (fine-tuning)
- Hand detection algorithm - Probabilistic model using a Gaussian distribution, using the colour of the face and a colour thresholding algorithm described in the paper
- Region growing technique
- The hand tracking algorithm described in the paper uses Kalman tracking
- Occlusion algorithm using Object Hypothesis and Otsu’s method
- HOG feature extraction
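Of the techniques listed above, Otsu's method is compact enough to sketch in full. The version below is a generic pure-Python implementation over a flat list of 8-bit grey levels; how the paper applies it within its occlusion handling is not described in this review, and the function name and sample pixels are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises between-class variance (Otsu's method)."""
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_low = 0  # number of pixels at or below the candidate threshold
    sum_low = 0     # sum of their grey levels
    best_threshold, best_variance = 0, -1.0

    for level in range(levels):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        if weight_high == 0:
            break     # every pixel is already in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Made-up bimodal data: a dark cluster and a bright cluster.
sample = [10] * 50 + [200] * 50
threshold = otsu_threshold(sample)
print(threshold, [p > threshold for p in (10, 200)])
```

Pixels above the returned level fall in the brighter class, so the threshold is picked from the data itself rather than being hand-tuned for each scene.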
The research primarily uses a neural network to test the quality of the hand-only images that are generated by the algorithms within the HPR system. This data is used as training data to determine how well it learns from this data. Additionally, to aid robustness, the research also randomly rotates the generated hand-only image data.
Experiments using both linear and non-linear models were used, such as SVM and DBN/CNN, respectively. Manually extracted features, such as HOG were also experimented with in addition to automatically learning the features.
Computer vision techniques such as thresholding, region growing, etc. underlie the two-stage HPR process.
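Region growing can be sketched as a flood fill that starts at a seed pixel and absorbs neighbours with similar values. The example below is a generic illustration on a small grey-scale grid; the 4-connectivity, the fixed `tolerance` against the seed value and the sample image are assumptions and not details taken from the paper.

```python
from collections import deque


def region_grow(image, seed, tolerance=10):
    """Collect 4-connected pixels whose value is within tolerance of the seed's."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and (nr, nc) not in region
                and abs(image[nr][nc] - seed_value) <= tolerance
            ):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Tiny made-up image: a bright patch in the top-left corner.
image = [
    [100, 102, 10],
    [101, 12, 11],
    [9, 10, 100],
]
print(sorted(region_grow(image, (0, 0))))
```

The bright pixel in the bottom-right corner is not absorbed because it is not connected to the seed, which is how region growing separates one skin-coloured area from another.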
Data
- Video of signers performing signing postures
- Samples of 32x32 hand-only grey-scale training and testing images (338,000 and 169,000 training and testing, respectively) created by the HPR system using the video.
- 36 output classes of signing labels
Raw video of hand-signing scenes is sampled by the HPR system, which identifies and segments the corresponding images into processed hand-only 32x32 grey-scale images. This is used as training/testing data for the neural network for later classification into sign language symbols.
For comparison against automatically learnt features (DNNs), some manual features, such as HOG, were also generated and used with an SVM.
During testing, they report that experimental results for the DBN, CNN, and HOG+SVM models yielded accuracies of 98.12%, 94.17% and 87.58% respectively, indicating that the produced hand-only data improves the training/learning of the models, with DBN learning and inference performance being the best.
Information
The HPR system (hand identification, segmentation and tracking algorithms) yields high-quality hand-only image data. This is indicated by the improved inference performance of the DNN when using this data.
Knowledge
The HPR system described in this research, combined with the use of DNNs, recognises sign language in real-time with better results than current state-of-the-art approaches.
The combination of both colour and depth information is effective for the hand segmentation task (a crucial aspect of producing quality hand-only images).
It is possible to effectively solve the occlusion problem inherent in many other techniques in HPR using the object hypothesis eclipse method and Otsu’s method.
Correlation vs Causation
Variables:
- Data generation process (hand segmentation and tracking)
- Data
- Models (DBN, CNN, SVM)
- Feature generation (automatically learnt vs manual/HOG)
Empirical tests on DBN (3 hidden layers, as described in Hinton, 2006), CNN (LeNet-5) and HOG+SVM were conducted, all using the same generated HPR hand-only image data, across all the kinds of signing postures that exist, to see if some positions are more biased in learning success than others.
Across all experiments and using the same data, it's shown that DBN results in the lowest error discrepancy.
While the results suggest that the generated hand-only images lead to good accuracy in all models, no 3rd-party hand-signing data is used to show how classification performance compares when using non-HPR generated data. That said, the generated data is likely to be better, given the empirical results showing that the hand segmentation and tracking algorithms produce excellent results.
As the classification accuracy of neural networks is mainly due to learning from the data (and it's the data that has been produced), the good classification scores across different models suggest that the data is improved, which in turn improves learning.
Literature review
Referenced papers
See Figure 3: Distribution of years of research referenced papers
This paper was published in 2015.
Citations
- IEEE reports 26 citations https://ieeexplore.ieee.org/document/8903136
- ACM reports 109 citations https://dl.acm.org/doi/10.1145/2735952
Reasoning process (deduction/induction)
Figure 4: Paper’s deductive process
Subjectivity/Objectivity
Construct Validity
- No obvious flaws
Internal Validity
Research Correctness
Objectivity
- The issues in hand posture recognition are well discussed and provide a convincing representation of the problems from the field/existing research (good literature review).
- A repeatable process derived from a description of an algorithmic approach is described using various computer vision techniques
- The process described is practical, i.e. from signing scenes, hands are segmented and then classified into labels.
Subjectivity/Specificity
No obvious flaws
Research technique
Objectivity:
The techniques are systematic, mostly algorithmic, drawing on applications and novel combinations of well-known computer vision approaches. This makes the approach automatable and easier for third parties to validate, both for accuracy and for whether the conclusions are reproducible and objective.
The use of computer vision techniques, which are already well-defined and are routinely used in the area of object recognition, makes their use in identifying and segmenting hand postures appropriate and objective.
Subjectivity/Specificity:
- The approach uses specific computer vision techniques, which may now be outdated.
- The approach has not been shown to work with subjects that do not wear long sleeves. It's limited to detecting hands from a specific type of clothing style.
Research techniques vs research question
Objectivity:
The formation of a systematic approach using computer vision algorithms to recognise hand postures, which is inherently concerned with segmentation and object recognition tasks, and the use of neural networks, which are inherently concerned with classification, is appropriate as a means to classify postures using computer vision techniques.
Also, the comparison of manual feature descriptions (e.g. HOG) for hand-posture data against models that learn the best features is useful to determine which kind of model results in better classification using this type of hand-only image data. As neural networks can be used both to learn features from the data and for classification, their use in this research is appropriate.
Subjectivity/Specificity:
- These techniques only answer the question at the time of publication. Newer models and designs for CNNs and DBNs have been developed since this research was published.
- The results are biased towards the specific techniques used in the research.
Conclusion vs methods
Objectivity:
The comparison of classification results from different models shows how the improved quality of input images (produced by the HPR system) improves their classification results.
Equally, the use of different types of classification models helps to show that DBN neural networks improve classification compared to the other approaches tested, namely CNN and the non-neural-network HOG+SVM. The conclusion, therefore, that DBN using hand-only data is the best method supports the pursuit of the best method for HPR classification.
Subjectivity/Specificity:
Technically, while the solution works well (and it is used with varying DNNs), there is no empirical indication of how good it is compared to other HPR approaches (DNNs or otherwise).
External Validity
Objectivity
- The approach is systematic, repeatable and uses a combination of known algorithms and learning systems (NNs).
- If the conditions that the research was limited to are met (such as signers wearing long sleeves and a single signer in the scene), it's likely that other/external signer images outside of the data used in the research will fare similarly to what is reported in the research.
Subjectivity
- Only specific types of clothing can be worn by the signer, which limits the applicability of the approach to only those using this clothing configuration.
- While the approach might be resistant to rotation of frontal signer images, it might not cope well if the signer's video is captured when subjects are in a crowd or if the signers are not facing the camera straight, head-on. For example, if the real-time recognition needs to occur while signers are moving while gesturing conversationally, like being in a park while recording natural signers outside of controlled conditions, it's unlikely that the recognition system will work correctly because the system is trained using front-facing, standing signers in perfect conditions. This would need to be accounted for if the approach is to be generalisable to real-world conditions.
Data Validity
Data subjectivity (specificity/narrowness)
Objectivity:
The HPR system generates the hand-only images from video footage of hand signers performing hand poses. This data is used for classification via the neural network.
The data is appropriate because a key aspect of this research is creating higher-quality hand-only data over and above using the results as input to the neural network for hand pose classification.
Subjectivity/Specificity:
The signing video footage is not provided with the research; however, any signing footage can be used to re-create the types of quality hand-only images the HPR system is designed to create. This means that a similar quality of images can be recreated and used for testing neural network performance. The data is therefore inherently biased, but as the procedure is systematic and automatable, it is considered to be objective and appropriate for this research.
- Limited field-of-view perspective of the captured video footage.
- The research does not mention how different ethnicities which have variable skin tone might be affected by this. For example, a person wearing a dark top and who has a dark complexion who is signing, might not have their gestures effectively distinguished. There are no experiments to suggest the approach is resilient to this.
- There is no evidence that the face detection and skin colour detection will work if the signer is from a cultural background that limits exposure of the face (like hijab). This might result in the approach performing poorly in a discriminatory way based on cultural clothing conventions.
Data vs Research Question
Objectivity:
As the research question aims to find the best way to identify and classify hand posture, the use of creating newer, higher-quality hand signing (posture actions) images from signing video footage for classification is appropriate and supports the research question.
Subjectivity/Specificity:
- The data used in the research is generated by the research team and used to measure their classification and tracking algorithms. This means the specific data used in the research is the only verifiable data that causes the results they report.
- Real-time is subjective and is limited to what the research considers real-time. For example, on constrained devices, real-time might represent a camera taking an image every second, not 60 fps. So this research is limited to that measurement of real-time.
Summary of general risks to validity
- The paper is old (8 years), and the approaches rely heavily on the use of a specific device that was available at that time (Kinect)
- For example, the skeletal tracking and face detection aspects of the approach rely on Kinect providing this ability.
- Being this old, there are advances in computer vision techniques that might render these older techniques inefficient.
- The approach is only usable on Windows, making the approach platform-dependent.
- Kinect sensors are also old and have been discontinued by Microsoft
- It is said that the poor performance of conventional cameras was a problem at the time of research, but newer, more performant hardware is likely available now.
- There is an assumption during hand tracking that the hand does not deform during the last frame
- The approach relies on the fact that signers wear long-sleeved garments, limiting the generalisability
- Only a single configuration of the CNN (LeNet-5) and DBN (Hinton's 2006 example) was used. It may have been good to try others.
- It might no longer be necessary to use the segmentation algorithm nowadays to learn hand features. Approaches that use CNNs could be used on the entire recorded frame and learn the hand positions.
Credibility concerns
Objectivity:
No significant gaps in the distribution of papers referenced (see figure 3)
Subjectivity/Specificity:
- The research is old (published in 2015).
- Specific attention to hand segmentation may no longer be required. For example, AE or newer DNNs could be used to derive the most relevant features of a picture without needing to isolate them by manually segmenting/choosing the hand as the relevant feature.
Relevance, Contribution, Originality and Novelty
Implications & Contributions
Hand identification, segmentation, tracking and classification are accurate and effective, and the system works well in real-time.
Overcomes many shortfalls in current methods, such as being able to deal with obstructions, non-skin noise (eyes, nose, etc), light, colour variations, subtle variations in finger positions, etc.
Removes the need to manually specify features about hands and their positions.
Opinion
This research takes an objectivist approach, which aims to uncover hidden truths (the effectiveness of segmentation and tracking algorithms and their improvement for DNNs), favouring empirical testing, a systematic approach and a thorough investigation into cause and effect.