Introduction
As part of my academic research endeavours, I'm training myself to analyse research papers with a more methodical and critical eye.
The particular paper reviewed in this article is "Multimodal Deep Autoencoder for Human Pose Recovery" by Hong et al., which is among the works covered in the larger survey "A survey of deep neural network architectures and their applications" by Liu, W. et al.
The approach I've used to structure my review process is outlined in Research Review Process.
Table of Contents
Research question
How can data with multiple feature descriptors be combined and used as input to create richer and better data for DNNs to learn from and to more accurately classify 2D pose images into their 3D counterparts?
Research aim
The research describes a learning process called MDA (multimodal deep autoencoder) that correlates 2D images to 3D poses (image space to pose space) using mixed feature descriptors (modalities) and the inner representations (codes) produced by autoencoders.
Type of research
This is quantitative research as the neural networks transform image data into numerical outputs, and the model aims to reduce the discrepancy in a quantitative error function.
Mode of enquiry
This is scientific research which describes a systematic and repeatable process for combining multiple 2D features into a lower-dimensional representation using MHLRR.
An algorithm is designed that incorporates autoencoders to extract the inner representations of the 2D and 3D data, and extensive experimentation is conducted, for example comparing different types of input data to test how MHLRR affects learning relative to other representations.
Methodology
The paper presents a learning procedure (MDA) consisting of a well-defined, mathematically modelled process for combining multiple features into a unified format (LRR) and then using that format as input to a neural network for classification.
Two autoencoders are used to create the 2D and 3D hidden representations and to map one to the other. The approach combines multiple 2D features (multimodal) into one uniform feature descriptor (LRR), then uses a deep neural network to learn the association between that LRR data and the hidden representation of the 3D pose data, effectively mapping LRR data to 3D data and treating the 3D data as labels.
The research designs an approach and then evaluates its performance, making it clearly a piece of design science research.
Experiments are conducted to test the approach, making this primarily experimental in nature; for example, multiple experiments compare MDA against other approaches such as RobustPose, DCNN, etc.
As part of this research, an algorithm is presented that indicates how the process and design can be automated/repeated.
The main research deliverables are:
- Propose a non-linear mapping using DNNs for pose recovery (2D poses to 3D pose classification)
- Develop an approach for creating an input representation that fuses multiple feature descriptors (multimodal)
- Demonstrate that using the multimodal input format/representation reduces classification error in DNNs
Research Methods
- Mathematical modeling
- Experiments using linear and non-linear classifiers
- System design (MDA) and algorithm design.
The research explains the mathematical process of how an autoencoder works, specifically how it generates the code (or inner representation) by learning to minimise the error between the input and the desired output, using backpropagation to update the weights. That inner representation is then used as data for the DNN. The authors also model the MHLRR (multi-view hypergraph LRR) using mathematical analysis.
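To make the autoencoder mechanics concrete, here is a minimal sketch of a single-hidden-layer autoencoder trained with backpropagation to minimise the reconstruction error. This is illustrative only, not the paper's actual architecture: the dimensions, data, and sigmoid/linear layer choices are all assumptions. The hidden activations are the "code" that the DNN later consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_code = 8, 3                        # input size, code (inner representation) size
W1 = rng.normal(0, 0.1, (n_in, n_code))    # encoder weights
W2 = rng.normal(0, 0.1, (n_code, n_in))    # decoder weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((100, n_in))                # toy data standing in for 2D image features

initial_mse = np.mean((sigmoid(X @ W1) @ W2 - X) ** 2)

lr = 0.5
for _ in range(2000):
    code = sigmoid(X @ W1)                 # inner representation
    recon = code @ W2                      # reconstruction of the input
    err = recon - X                        # discrepancy between output and target (the input)
    # Backpropagation: gradients of the reconstruction error w.r.t. the weights
    grad_W2 = code.T @ err / len(X)
    grad_code = err @ W2.T * code * (1 - code)   # sigmoid derivative
    grad_W1 = X.T @ grad_code / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

codes = sigmoid(X @ W1)                    # these codes are what the DNN later consumes
mse = np.mean((codes @ W2 - X) ** 2)
```

After training, `mse` is lower than `initial_mse`, and each row of `codes` is a compressed inner representation of one input.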
The research uses a variety of experimental tests. For example, it conducted 3 trials and performed diagnostic experiments using various types of features (HC, HOG, SC, MDA, NoAe) and models (MDA, DCNN, linear regression, RobustPose, TGP, etc.).
Research techniques
- Experiments using neural networks (classification) and autoencoders (feature extraction)
- Backpropagation neural network
- MHLRR (multi-view hypergraph low-rank representation)
- Denoising autoencoder for extracting inner representations
- HOG, Shape Context
- Mocap (motion capture)
- Linear regression, deep convolutional neural network, RobustPose, Twin Gaussian Process, Latent Gaussian Mixture Regression testing
- Feature combination methods (AVG, CON, MSE, MODEX, MSDL, and MDA, the subject of this research)
The research uses a 3-step generic process to convert multiple ways of representing a 2D image (features) into a single combined multimodal representation called a multi-view hypergraph low-rank representation (MHLRR). MHLRR weights the contribution of each feature, making it more flexible than existing methods such as simple feature concatenation.
Autoencoders are used as a generic way to automatically extract features from the richer combined representation. The learned features are therefore represented by the configuration of the weights in the hidden layers. The research specifically uses denoising autoencoders, as adding noise helps the network learn better features.
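As a sketch of the denoising idea (masking noise is one common corruption scheme; the paper may use a different one): the encoder sees a corrupted input while the reconstruction target remains the clean input, which forces the code to capture robust structure rather than simply copy its input.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.random((50, 8))               # clean toy features
mask = rng.random(X.shape) > 0.3      # randomly zero out roughly 30% of entries
X_noisy = X * mask                    # corrupted input fed to the encoder

# During training, the loss would compare reconstruct(X_noisy) against the
# clean X, so the hidden code must infer the missing structure.
corruption = np.mean(X_noisy != X)    # fraction of entries actually corrupted
```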
A deep neural network then takes the autoencoder outputs, using the 2D feature codes as input and the 3D pose codes as the target output, and learns a mapping between the two. Backpropagation updates the weights after the discrepancy in the error function is calculated.
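The code-to-code mapping step can be sketched as a small fully connected regressor trained with backpropagation. The dimensions and the synthetic "codes" below are illustrative stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

d2, d3, h = 6, 4, 16                          # 2D-code size, 3D-code size, hidden units
X = rng.normal(size=(200, d2))                # stand-in 2D feature codes
Y = np.tanh(X @ rng.normal(size=(d2, d3)))    # stand-in 3D pose codes (the "labels")

W1 = rng.normal(0, 0.1, (d2, h))
W2 = rng.normal(0, 0.1, (h, d3))

def forward(X):
    H = np.tanh(X @ W1)
    return H, H @ W2

_, pred0 = forward(X)
initial_loss = np.mean((pred0 - Y) ** 2)      # discrepancy in the error function

lr = 0.1
for _ in range(3000):
    H, pred = forward(X)
    err = pred - Y
    # Backpropagation updates the weights from the error discrepancy
    gW2 = H.T @ err / len(X)
    gW1 = X.T @ ((err @ W2.T) * (1 - H ** 2)) / len(X)   # tanh derivative
    W2 -= lr * gW2
    W1 -= lr * gW1

_, pred = forward(X)
final_loss = np.mean((pred - Y) ** 2)
```

Training drives `final_loss` below `initial_loss`, i.e. the network has learned an approximate mapping from the 2D codes to the 3D codes.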
Data
- Walking dataset - spiral walking motion
- Human3.6M dataset - 3.6 million 3D poses and related images
- HumanEva-I datasets - 5 motion types performed by 4 subjects
They extract multiple kinds of features from each dataset and fuse them into a unified LRR; the 3D poses are collections of joint coordinates in 3D space.
Information
Results show that combining autoencoders with a multimodal input allows a DNN to learn better than using only a single feature. The authors report a 20-25% error reduction using the proposed approach compared with alternatives.
Knowledge
It is possible to combine multiple feature descriptors into a unified low-rank representation (LRR) using Laplacian hypergraphs, improving the performance of neural networks in correlating/mapping data to labels.
Using a combined set of input features, correlated to the desired output (3D poses) by a deep neural network, yields much better results than using a single feature.
Inner representations (the code output of an autoencoder) can serve as the network's input, and the network learns better from them.
In the learning process, the expected output need not be labels in the traditional sense; it can instead be another higher-level representation, such as an autoencoder's code.
This process is generalizable enough that it could likely be used to create action sequences for animated characters in video games.
Correlation vs Causation
Variables:
- Learning approach/methods (baseline, others: BP-NN, DCNN, etc.)
- Layers in DNN design
- Noise level (for Autoencoders)
- Datasets
- Multimodal methods for feature combination (e.g., AVG, CON, MSE, MDA)
The use of varying approaches to combining multiple dimensions of data is evaluated, as are the models used for classification and feature extraction.
The model’s classification accuracy determines which combinations are best, and shows that MDA outperforms the alternatives. This supports the claim that MDA causes better classification results (the lowest discrepancy in the error function).
Also, for automatic feature extraction using autoencoders, different noise levels are tested to see which produces the best features and, correspondingly, the best classification results when used as input. Multiple datasets reduce bias toward any one particular set of data.
Literature review
Referenced papers
Figure 5: Distribution of publication years of the referenced papers
The distribution of referenced papers does not show significant gaps that would suggest a particular bias; see the figure.
Citations
IEEE reports 504 citations. https://ieeexplore.ieee.org/document/7293666/
ACM reports 218 citations. https://dl.acm.org/doi/10.1109/tip.2015.2487860
On balance, this paper is moderately popular.
Reasoning method (Induction/Deduction)
Figure 6: Paper’s deductive process
Subjectivity/Objectivity & Threats to validity
Construct Validity
- No obvious flaws
Internal Validity
Research Correctness
Objectivity
- The data used is freely available and commonly used in the human pose recognition discipline.
- A very systematic, well-defined process is used, built from repeatable algorithms. These also draw on a combination of existing computer-vision methods, making the process less specific and subjective to the authors.
Subjectivity/Specificity
- The approach is limited to combining feature descriptors from 2D images (pixel data).
Research technique
Objectivity:
- Using neural networks to map 2D to 3D images makes sense, as this is inherently what neural networks are designed to do. They also learn about the data provided to them; therefore, the research’s techniques of providing richer feature data using Autoencoders to aid this mapping (learning) make sense in the context of this research.
Subjectivity/Specificity:
- The specific feature descriptors used in the research limit the scope of the LRR to only those descriptors. It is not clear whether other 2D image descriptors would work the same way or yield the same favourable results.
- This research is limited to the specific use of the AutoEncoder techniques available at the time. These might have improved since then.
Research techniques vs research question
Objectivity:
- The use of neural networks to perform the classification of 2D images to 3D images supports the research question of finding an approach to 2D to 3D classification.
- The use of Autoencoders supports the aim to improve the classification by providing better (automatic) features for the neural network classification process.
Subjectivity/Specificity:
- The fusion of descriptors is only known to work on the descriptors used in the research, e.g., HOG. This limits the proven applicability of combining features via MDA/LRR to those descriptors. For example, it's not clear which types of descriptors are better than others - could textual descriptions alone be used in the fusion?
- It's not clear that this fusion technique is still effective in 2025.
Conclusion vs methods
Objectivity:
- They use a variety of existing approaches to compare the performance of their approach on the same dataset. For example, they use CNNs, Linear regression, Latent Gaussian Mixture Regression (LGMR), etc. This helps support their specific conclusion as it is empirically evaluated to be better than the others.
- The methods yield good results, and simplifying and combining different representations using autoencoders is a sound idea: it concentrates the data into a combination of the most significant features that autoencoders can discover.
Subjectivity/Specificity:
- Combining features appears to work well on the 2D and 3D data used, but the conclusions (the usefulness of combining features via MDA/LRR) are only valid with respect to that type of data.
External Validity
Objectivity
- The approach is systematic and repeatable through the algorithms that define the process.
- The process/approach can be applied to data other than the data used in the research.
Subjectivity
- Only the specific data in the research can be used to prove/support the results/conclusions.
- There is no
- While the method of mapping 2D features to 3D features using MDA (autoencoders plus feature fusion) seems effective, a model using it would still need to be trained on subjects of varying sizes, both adults and smaller children, to avoid bias in the training data used to make pose inferences.
Data Validity
Data subjectivity (specificity/narrowness)
Objectivity:
- Using 2D posture and 3D posture image data from a variety of standard/common repositories is suitable for the task of mapping one to the other.
- Some datasets are used in existing research (e.g., the Walking dataset), which makes using them more objective. Different datasets are used to reduce bias toward any one dataset.
Therefore, this data is appropriate for use in this research.
Subjectivity/Specificity:
- The research is subject to the nature and characteristics of the data that was used. If a new format of 3D posture data is used, other than the format provided in the test data, it is not certain the approach will yield similar results. For example, the 3D format here is roughly a set of vectors in 3D space; this is not the only possible format, and the approach may not transfer to other kinds of 3D data.
Data vs Research Question
Objectivity:
The research question looks to classify 2D data with its associated 3D representation; therefore, the 2D and 3D data support answering the research question.
Subjectivity/Specificity:
See external validity.
Summary of general risks to validity
Credibility concerns
Objectivity:
- No significant gaps in the literature review.
Subjectivity/Specificity:
- The research is 10 years old - limited to the era in which the research took place.
- The fusion techniques and methods could be outdated and have been subsequently improved or deprecated.
Relevance, Contribution, Originality and Novelty
Implications & Contributions
MDA outperforms other methods for 2D to 3D pose correlation.
The method of combining multiple feature descriptors into a useful uniform descriptor format (MHLRR) appears widely applicable, meaning other features may be combined in other applications too.
Legacy systems that use traditional features and methods could be upgraded to be more accurate by combining multiple features.
Opinion
This research takes an objectivist approach, which aims to uncover hidden truths (the effectiveness of MDA for improving DNNs), favouring empirical testing, a systematic approach, and a thorough investigation of cause and effect.