This section presents a comprehensive evaluation of the proposed model’s performance across multiple tasks using two key datasets: the AIST++31 and ImperialDance8 datasets. The datasets provide diverse dance motion and music data, essential for testing multimodal learning models. The experiments focus on assessing the model’s capacity for dance vocabulary classification, dance quality estimation, and dance-music synchronization. The datasets allow for the analysis of both spatial-temporal motion features and audio-visual synchronization, enabling a holistic evaluation of dance performance. By leveraging a combination of self-supervised learning and contrastive InfoNCE loss, the model demonstrates its effectiveness in capturing complex patterns within dance motions and music alignment.
Dataset
AIST++ dataset
The AIST++ dataset31 is a large-scale dataset aimed at advancing research in dance motion generation32, choreography analysis33, and motion-music alignment. It builds upon the original AIST Dance Video Dataset, extending it with high-quality 3D motion capture data. AIST++ includes over 1,400 3D motion sequences synchronized with 10 different genres of music, such as house, hip-hop, jazz, and waacking, among others. Each dance sequence is paired with high-resolution audio, ensuring a rich multimodal dataset that covers a wide range of dance styles. One of the distinguishing features of AIST++ is its provision of high-fidelity 3D skeletal motion data, captured using motion capture systems, which allows for detailed spatial and temporal analysis of dance movements. The dataset’s combination of motion sequences, music, and metadata facilitates tasks such as motion prediction, dance choreography synthesis, and motion-to-music alignment. The dataset is split into training, validation, and testing sets, providing a standardized benchmark for evaluating dance motion models. AIST++ is particularly valuable for cross-genre motion generation research, as its diverse dance styles enable the training of models that generalize well across different dance forms. In addition, the inclusion of real-world 3D motion data makes AIST++ an essential resource for applications requiring accurate motion representation, such as virtual dance training, choreography generation, and human-robot interaction in dance.
ImperialDance dataset
The ImperialDance dataset8 is specifically designed to support research in multi-task dance performance assessment, motion-music analysis, and skill progression monitoring. It contains 69,300 seconds of recorded dance motions, spanning five distinct genres, 20 choreographies, and 20 music pieces, with performances captured from dancers of three different expertise levels (beginner, intermediate, and expert). One of the key contributions of the ImperialDance dataset is the comprehensive recording of expertise levels, which enables detailed analysis of the progression of skill across dancers. Each choreography is repeated 100 times per class, ensuring a significant number of samples are available for each combination of genre, choreography, and expertise level. This high level of repetition allows the dataset to be particularly useful for fine-grained feature learning and for tracking performance improvements over time. A unique characteristic of the dataset is its segmentation of dance sequences into primitive motions, using the Eight-Beats Segmentation (EBS) method. This approach captures rhythmic structure in the data, enhancing the ability to extract meaningful temporal and spatial features from both dance motion and music. Additionally, the dataset’s multimodal design ensures that music features, such as rhythm and melody, are consistently aligned with the dance motion features, providing a robust foundation for motion-music synthesis and dance performance evaluation tasks. These features make the ImperialDance dataset especially suited for real-world dance training and performance assessment applications.
Experimental setup
For pre-training, we use data from 10 different dance choreographies spanning five distinct genres. Each choreography is represented by 100 repeated samples, capturing variations across different expertise levels and motion primitives. To ensure uniformity in data processing, all dance sequences are standardized to a fixed duration of 10 s. These sequences are further segmented into three distinct segments, each lasting 3 s, following the Eight-Beats segmentation method. This segmentation captures the rhythmic structure of the music, aligning each motion primitive with the corresponding musical beats, thereby facilitating more granular feature extraction and enhancing the model’s ability to learn complex dance-music interactions. This process results in 90,000 training dance pieces (calculated as \(90{,}000 = 10 \times 100 \times 3 \times 3\)). For downstream tasks, we use data from five additional choreographies, each belonging to a different genre, resulting in 4,500 testing pieces (calculated as \(4{,}500 = 5 \times 100 \times 3 \times 3\)). We use three metrics to evaluate performance across the three assessment tasks: (i) classification accuracy, (ii) negative log-likelihood (NLL), and (iii) mean squared error (MSE) loss.
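To make the data layout concrete, the following sketch splits a fixed-length motion clip into three consecutive 3-s primitives as described above. It is a simplified stand-in: the 60 fps frame rate and 17-joint skeleton are assumptions, and the real EBS boundaries are aligned to musical beats rather than fixed offsets.

```python
import numpy as np

def segment_sequence(motion, fps=60, seg_len_s=3.0, n_segments=3):
    """Split a fixed-length motion clip into consecutive primitives.

    motion: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    Fixed-window slicing is used here as a simplified stand-in for the
    beat-aligned Eight-Beats segmentation described in the text.
    """
    seg_frames = int(seg_len_s * fps)
    return [motion[i * seg_frames:(i + 1) * seg_frames] for i in range(n_segments)]

# Example: a 10 s clip at an assumed 60 fps with 17 joints.
clip = np.random.randn(600, 17, 3)
primitives = segment_sequence(clip)
print([p.shape for p in primitives])  # three (180, 17, 3) segments
```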
The experiments were conducted on a system equipped with an NVIDIA Tesla V100 GPU with 32 GB of VRAM, an Intel Xeon Gold 6226R processor, and 128 GB of RAM. The model was implemented using PyTorch v1.12.1 and trained using the Adam optimizer with a learning rate of \(10^{-4}\), a batch size of 32, and a cosine annealing scheduler to adaptively reduce the learning rate during training. Dropout layers with a rate of 0.3 were used to prevent overfitting, and early stopping was employed based on validation loss. The InfoNCE loss function was applied with a temperature parameter \(\tau = 0.07\), optimizing the alignment between motion and music features during training. Dance motion data, represented as 3D skeleton sequences, were preprocessed using the OpenPose library to extract joint coordinates and remove outliers. The accompanying audio signals were preprocessed with the Librosa library to extract features such as Mel-frequency cepstral coefficients (MFCC), MFCC-deltas, constant-Q chromagrams, and tempograms. These features were normalized to ensure consistent scaling across samples. Data augmentation techniques were applied, including the addition of random noise to motion sequences and the application of time stretching and pitch shifting to audio features, simulating real-world variations in dance performances.
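As an illustration of this audio pipeline, the snippet below extracts the named descriptors with Librosa and z-normalizes each feature over time. The sampling rate, hop length, and number of MFCCs are assumptions rather than the authors' reported settings.

```python
import librosa
import numpy as np

def extract_music_features(audio_path, sr=22050, hop_length=512, n_mfcc=20):
    """Extract the audio descriptors named in the text (MFCC, MFCC-deltas,
    constant-Q chromagram, tempogram). Parameter choices are illustrative."""
    y, sr = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    mfcc_delta = librosa.feature.delta(mfcc)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr,
                                          hop_length=hop_length)
    feats = np.concatenate([mfcc, mfcc_delta, chroma, tempogram], axis=0)
    # Per-feature z-normalization over time so scales are comparable across clips.
    feats = (feats - feats.mean(axis=1, keepdims=True)) / \
            (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats  # shape: (n_features, n_frames)
```

For the augmentations mentioned above, librosa.effects.time_stretch and librosa.effects.pitch_shift could be applied to the waveform before feature extraction, with small Gaussian noise added to the skeleton coordinates.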
To further evaluate the model’s generalization capacity, we conducted additional testing using five choreographies held out from the training set, each representing different dance genres. These sequences were not exposed to the model during either the pre-training or prompt tuning stages. Despite the lack of exposure, the model exhibited high performance across all three downstream tasks when evaluated on these unseen sequences, with only marginal declines in accuracy, NLL, and MSE. This confirms the framework’s capability to adapt to novel dance patterns and musical structures, emphasizing its effectiveness in real-world deployment scenarios where encountering new styles is common.
Figure 5. Motion-music synchrony rates across expertise levels on the ImperialDance dataset. This figure presents the distribution of motion-music synchrony rates for dancers of varying expertise (beginner, intermediate, expert) across different attempts. Subfigures (a–c) show the synchrony rates for the first 10 attempts compared to the last 10 attempts for beginner, intermediate, and expert dancers, respectively. The progression in synchrony is evident for beginners and intermediates, with experts maintaining consistently high rates. Subfigure (d) compares the synchrony rates across all expertise levels, illustrating that as the expertise level increases, dancers demonstrate higher alignment rates and lower variability in synchronization.
Evaluation metrics
The Motion-Music Synchrony Rate (MSR)30 is defined as a quantitative measure of how well a dancer’s movements are synchronized with the beats of the accompanying music. The MSR ranges from 0 to 1, where a value of 1 indicates perfect alignment between the dance motions and the musical beats, and a value closer to 0 indicates poor alignment. By adjusting the threshold \(\delta\), the strictness of the alignment criterion can be modulated. A smaller \(\delta\) requires closer synchronization for the motion to be considered aligned with the music, whereas a larger \(\delta\) allows for more flexible synchronization. In all our experiments, we consistently set the threshold \(\delta = 0.4\) seconds, following prior literature30. This value balances the sensitivity and robustness of synchronization detection, accounting for minor variations in motion execution and beat perception.
For a given dance sequence, the MSR can be mathematically formulated as:
$$\begin{aligned} \text{MSR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \left| t_{\text{motion}, i} - t_{\text{beat}, i} \right| \le \delta \right) \end{aligned}$$
(5)
where \(N\) is the total number of beats in the music, \(t_{\text{motion}, i}\) is the timestamp at which the \(i\)-th motion event (or key frame of motion) occurs, \(t_{\text{beat}, i}\) is the timestamp of the \(i\)-th musical beat, \(\delta\) is a threshold defining the maximum allowable deviation between the motion and beat timestamps for them to be considered aligned, and \(\mathbb{1}(\cdot)\) is the indicator function, which outputs 1 if the difference between \(t_{\text{motion}, i}\) and \(t_{\text{beat}, i}\) is within the allowed range, and 0 otherwise.
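A minimal implementation of Eq. (5) is sketched below. Because the number of detected motion events rarely equals the number of beats exactly, each beat is matched to its nearest motion event here; that matching rule is an assumption, not something specified in the text.

```python
import numpy as np

def motion_music_synchrony_rate(t_motion, t_beat, delta=0.4):
    """Eq. (5): fraction of beats whose paired motion event lies within
    +/- delta seconds. Each beat is matched to its nearest motion event,
    a common convention when the two lists differ in length."""
    t_motion = np.asarray(t_motion, dtype=float)
    t_beat = np.asarray(t_beat, dtype=float)
    # nearest motion event for every beat
    diffs = np.abs(t_beat[:, None] - t_motion[None, :]).min(axis=1)
    return float(np.mean(diffs <= delta))

# Example: four beats, slightly jittered motion peaks, delta = 0.4 s as in the text.
print(motion_music_synchrony_rate([0.52, 1.08, 1.45, 2.1], [0.5, 1.0, 1.5, 2.0]))  # 1.0
```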
Ablation study
This section presents an ablation study evaluating the performance of the proposed model across three key tasks: motion-music synchrony progression, multilabel dance classification, and dance quality estimation. Multiple figures are referenced throughout to support and visualize the analysis. Figure 5 illustrates the motion-music synchrony rates across different expertise levels and training iterations, providing insights into synchronization improvements among beginner, intermediate, and expert dancers. Figure 6 compares model architectures for multilabel dance classification accuracy across genres, choreographies, motion primitives, and expertise levels. Figure 7 presents model-wise performance for dance quality estimation based on negative log-likelihood loss, highlighting the capability of different architectures to handle aleatoric uncertainty. Lastly, Fig. 8 provides a comparative analysis of dance-music synchronization performance using MSE, assessing rhythmic alignment between motion and music. These visual results are discussed in detail in the following subsections.
We also conduct a modality ablation to understand the isolated and joint effects of motion and music features on the model’s performance. This helps quantify the benefit of multi-modal fusion compared to unimodal inputs.

Figure 6. Performance comparison for Task 1 (multilabel dance classification) on the ImperialDance dataset. The chart illustrates the performance of various model architectures on the task of multilabel dance classification, measured by classification accuracy across 180 unique classes (genres, choreographies, expertise levels, and motion primitives).
Motion-music synchrony evaluation
Our analysis tracks the progression of motion-music synchrony across different expertise levels (Beginner, Intermediate, Expert) by examining the alignment rates during two key periods: the first 10 attempts and the last 10 attempts of practice. The evaluation criterion used is the Motion-Music Synchrony Rate (MSR), which quantifies how well the dancers synchronize their movements with the music beats. We compute the average MSR for both the initial and final attempts for each expertise level, as shown in Fig. 5. Additionally, we aggregate the MSR for all 100 attempts across the three expertise levels (Fig. 5d).
Progression of synchrony for beginners and intermediates: In Fig. 5a,b, the mean and median synchrony rates for the last 10 attempts are higher than those for the first 10 attempts, both for beginner and intermediate dancers. This indicates that these dancers improve their ability to synchronize with the music after repeated practice sessions. The greater consistency in the later attempts, especially for intermediates, suggests an enhanced ability to follow the rhythm after practice. By comparing Fig. 5a,b, it is clear that intermediate dancers make more substantial progress than beginners. While beginners show improvement, intermediates exhibit a larger increase in synchrony, as reflected in their tighter distributions and higher median values. This outcome is expected, as intermediate dancers are typically more adept at quickly learning to align their movements with the music, leading to greater progression in their MSR over time.
Figure 5c reveals that expert dancers maintain a high and consistent MSR across both the first and last 10 attempts, with little to no progression identified. This is to be expected, as expert dancers already demonstrate near-optimal synchronization with the music from the beginning, indicating that additional practice sessions do not significantly impact their performance. As illustrated in Fig. 5d, there is a clear increase in the mean and median synchrony rates as the expertise level rises from beginner to expert. Expert dancers exhibit the highest MSR, while beginners show the lowest rates and the greatest variability. The results highlight that, as dancers advance in skill level, their ability to synchronize with music improves, with experts maintaining high alignment consistently across all attempts. These findings confirm the ability of our model to capture the progression of dance performance, particularly for beginner and intermediate dancers, and underscore the stable performance of experts in maintaining high synchronization rates throughout the evaluation.
Task 1: multilabel dance classification
In Task 1, the objective of multilabel dance classification focuses on identifying a variety of dance elements such as genres, choreographies, expertise levels, and motion primitives within a sequence. This task is modeled as a multi-label classification problem where the dataset is composed of 180 unique classes, derived from 20 choreographies, 3 expertise levels, and 3 primitive motions. The classification process is optimized during network training using the cross-entropy loss function, which aims to accurately categorize each label across the diverse set of dance sequences. Figure 6 presents the performance results for different combinations of motion-music architectures and loss functions, where higher scores (indicated by the \(\uparrow\) symbol) signify better performance in Task 1. The best-performing model is Ours (STGCN-LSTM-InfoNCE), which achieves a score of 75.20, the highest among all models. This score represents the classification accuracy achieved by the model across 180 unique classes, including genres, choreographies, expertise levels, and motion primitives. A higher accuracy score indicates better performance in correctly identifying the diverse attributes of the dance sequences.
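As a concrete illustration of this formulation, the sketch below pairs a small MLP head over a fused motion-music embedding with the cross-entropy objective for the 180 classes; the embedding size and hidden width are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical lightweight head for Task 1: 180 classes
# (20 choreographies x 3 expertise levels x 3 motion primitives).
# The 512-d fused embedding and 256-d hidden layer are assumptions.
class DanceClassificationHead(nn.Module):
    def __init__(self, embed_dim=512, num_classes=180):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),          # dropout rate from the experimental setup
            nn.Linear(256, num_classes),
        )

    def forward(self, fused_embedding):
        return self.mlp(fused_embedding)  # raw logits

head = DanceClassificationHead()
criterion = nn.CrossEntropyLoss()
logits = head(torch.randn(32, 512))                  # batch size 32, as in training
loss = criterion(logits, torch.randint(0, 180, (32,)))
```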

Figure 7. Performance comparison of motion-music architectures for Task 2 (dance quality estimation) on the ImperialDance dataset. The chart compares the performance of various motion-music architectures on Task 2: dance quality estimation, measured by the negative log-likelihood loss. Lower loss values indicate better accuracy in predicting the performance score distribution with aleatoric uncertainty.
This suggests that the model effectively captures the intricate patterns within dance vocabulary through its unique combination of STGCN for spatiotemporal motion analysis, LSTM networks for sequence modeling, and the InfoNCE loss for robust contrastive learning. The Res18-LSTM-InfoNCE model, achieving a score of 61.62, represents the baseline performance. While this model uses ResNet-18 for feature extraction and LSTM for temporal analysis, its lower accuracy indicates that it struggles to fully capture the complex dance motions. Compared to this, Ours (STGCN-LSTM-InfoNCE) demonstrates a significant 22.02% improvement in performance, underscoring the value of the STGCN architecture in learning spatial-temporal dynamics of human motion, which are essential for dance recognition. Res50-LSTM-InfoNCE performs slightly better, with a score of 63.41, indicating a 2.91% improvement over Res18-LSTM-InfoNCE. This enhancement is likely due to the deeper feature extraction capabilities of ResNet-50. However, Ours still outperforms Res50-LSTM-InfoNCE by 18.57%, highlighting that the STGCN architecture paired with LSTM provides a better representation of dance movements, leading to higher classification accuracy.
The models incorporating the STGCN (STGCN-Res18-InfoNCE and STGCN-Res50-InfoNCE) show marked improvements in performance. STGCN-Res18-InfoNCE, with a score of 67.35, achieves a 9.86% improvement over Res18-LSTM-InfoNCE, while STGCN-Res50-InfoNCE further increases the score to 70.24. This demonstrates the effectiveness of STGCN in capturing the intricate spatial-temporal relationships inherent in dance movements. Nevertheless, Ours (STGCN-LSTM-InfoNCE) still outperforms these models, showing a 7.07% improvement over STGCN-Res50-InfoNCE, which suggests that the integration of LSTM for sequence modeling, alongside InfoNCE loss, further refines the model’s ability to classify complex dance patterns. Lastly, STGCN-LSTM-SupCon34 achieves a score of 72.59, reflecting the advantages of using supervised contrastive learning (SupCon) to better distinguish between various dance labels. Despite this improvement, Ours (STGCN-LSTM-InfoNCE) still outperforms it by 3.59%, indicating that the combination of STGCN, LSTM, and InfoNCE is better suited for the nuanced task of multilabel dance classification. Overall, Ours (STGCN-LSTM-InfoNCE) demonstrates the best performance across all models, with an improvement range of 3.59% to 22.02% over other architectures. The results clearly show that this model, through its effective use of spatiotemporal graph networks, sequence modeling, and contrastive learning, is highly capable of recognizing various dance vocabulary elements. As in Fig. 6, the \(\uparrow\) symbol indicates that higher values reflect higher accuracy.
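Because the gains above are attributed largely to the contrastive objective, a minimal sketch of a symmetric motion-music InfoNCE loss is given below, using the temperature \(\tau = 0.07\) reported in the experimental setup. It illustrates the technique rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(motion_emb, music_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired motion/music embeddings.
    Matching pairs sit on the diagonal of the similarity matrix; all other
    entries in the batch act as negatives. tau follows the reported setting."""
    motion_emb = F.normalize(motion_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)
    logits = motion_emb @ music_emb.t() / tau          # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce_loss(torch.randn(32, 512), torch.randn(32, 512))
```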
Task 2: dance quality estimation
In Task 2, the goal is to evaluate Dance Quality Estimation, where each dance sequence is annotated with a performance score by professional dancers. This task is designed to predict a score distribution rather than exact scores. The model learns to map dance features to a score distribution that incorporates aleatoric uncertainty, a key component that accounts for the inherent subjectivity in the labeling process. By modeling this uncertainty, the approach seeks to minimize bias during the labeling process. The scoring mechanism is framed using a Gaussian distribution, and the model is optimized to minimize the negative log-likelihood of predicting the target score distribution. In this context, a lower value (\(\downarrow\) symbol) indicates better performance, reflecting the model’s ability to reduce errors in score prediction. Figure 7 presents the performance of different models in Task 2 based on their ability to minimize the negative log-likelihood loss. Ours (STGCN-LSTM-InfoNCE) achieves the lowest loss of 0.000775, outperforming all other models in this task. This loss value represents the negative log-likelihood loss, which quantifies the error in predicting the distribution of dance quality scores. A lower loss indicates that the model is better at estimating the score distribution while accounting for aleatoric uncertainty, which is essential for handling the subjectivity and variability in professional evaluations.
This demonstrates its high accuracy in mapping the complex dance features to a performance score distribution. The comparison with other models highlights substantial performance improvements with the STGCN-LSTM-InfoNCE. Res18-LSTM-InfoNCE achieves a loss of 0.003471, which is significantly higher than Ours. This indicates that while this model benefits from combining ResNet-18 with LSTM, it struggles to accurately model the uncertainty in the score distribution. The STGCN-LSTM-InfoNCE achieves a 77.68% improvement over Res18-LSTM-InfoNCE, showing that the incorporation of STGCN, which captures spatial-temporal motion features, provides a significant advantage. Similarly, Res50-LSTM-InfoNCE shows a loss of 0.004425, performing worse than Res18-LSTM-InfoNCE. Despite the deeper architecture of ResNet-50, the model’s ability to handle aleatoric uncertainty in performance scoring remains limited. Ours demonstrates an 82.48% improvement over Res50-LSTM-InfoNCE, emphasizing that the integration of STGCN with LSTM offers a more refined modeling of temporal sequences in the context of performance scoring.
Moving to the STGCN based models, STGCN-Res18-InfoNCE and STGCN-Res50-InfoNCE show losses of 0.005041 and 0.062043 respectively. While the inclusion of STGCN allows these models to capture spatial-temporal dynamics, they still fall short in handling the score distribution effectively. In comparison, Ours (STGCN-LSTM-InfoNCE) outperforms these models by 84.62% and 98.75%, respectively. The results demonstrate the importance of incorporating LSTM for sequence modeling, which helps in effectively reducing errors in the predicted score distribution. Lastly, STGCN-LSTM-SupCon achieves a loss of 0.008160, which, while an improvement over some other models, does not match the performance of Ours. The STGCN-LSTM-SupCon model uses supervised contrastive learning (SupCon) to distinguish between different features, but it is less effective than the InfoNCE loss when it comes to capturing the nuanced differences required for performance scoring. The 90.50% improvement by Ours over this model demonstrates the efficacy of InfoNCE in enhancing feature representation and handling the complexities of aleatoric uncertainty. Ours (STGCN-LSTM-InfoNCE) achieves the best performance with a loss of 0.000775, outperforming all other models by significant margins, with improvements ranging from 77.68 to 98.75%.
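To make the Task 2 objective concrete, the sketch below shows a hypothetical head that predicts a mean and a variance per sequence and is trained with the Gaussian negative log-likelihood, so that the predicted variance absorbs aleatoric (labeling) uncertainty; layer sizes and score ranges are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical Task 2 head: predicts a mean score and a variance so that the
# Gaussian negative log-likelihood captures aleatoric (labeling) uncertainty.
class QualityScoreHead(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.mean = nn.Linear(embed_dim, 1)
        self.log_var = nn.Linear(embed_dim, 1)   # log-variance for numerical stability

    def forward(self, x):
        return self.mean(x), torch.exp(self.log_var(x))

head = QualityScoreHead()
nll = nn.GaussianNLLLoss()                        # minimizes the Gaussian NLL
features = torch.randn(32, 512)                   # frozen motion-music features
target_scores = torch.rand(32, 1)                 # expert-annotated scores (illustrative)
mu, var = head(features)
loss = nll(mu, target_scores, var)
```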

Figure 8. Comparison of MSE loss values for dance-music synchronization using various motion-music-loss model combinations on the ImperialDance dataset. This chart highlights the performance of different models in aligning dance motions with musical rhythm, where a lower mean MSE indicates superior synchronization.
Task 3: dance-music synchronization
In Task 3, the objective is to assess Dance-Music Synchronization, focusing on how well the dancer’s movements synchronize with the musical rhythm. This synchronization is quantified using the MSR, which measures how closely the intensity of a dancer’s motion matches the intensity of the accompanying music. The correlation between motion and music is considered successful when the peak difference between them does not exceed 0.4 s, consistent with the threshold adopted for MSR computation. The MSE loss is employed as the evaluation metric, with lower values indicating better alignment between motion and music. Results for the various models in Task 3 are shown in Fig. 8, where our model, Ours (ST-GCN-LSTM-InfoNCE), has the lowest MSE of 2.52, indicating the highest level of synchronization between motion and music. This score reflects the MSE between the motion intensity peaks and musical beats. A lower MSE indicates better temporal alignment, meaning the dancer’s movements are more closely synchronized with the rhythm of the accompanying music.
This demonstrates the superior ability of the ST-GCN-LSTM architecture combined with the InfoNCE loss function to model the temporal and rhythmic alignment required for dance performance. In comparison, the Res18-LSTM-InfoNCE model has an MSE of 3.11; Ours improves on this by 18.96%, suggesting that the addition of ST-GCN significantly improves performance in rhythmic tasks.
Similarly, Res50-LSTM-InfoNCE performs with an MSE of 3.50, and Ours shows a 28.00% improvement over this model. Despite the deeper ResNet-50 architecture, the absence of ST-GCN results in less effective rhythm modeling. On the other hand, ST-GCN-Res18-InfoNCE and ST-GCN-Res50-InfoNCE perform similarly with MSE values of 3.77 and 3.72, respectively. Ours outperforms both models, improving their performance by 33.16% and 32.26%, further highlighting the importance of the LSTM component in capturing long-term dependencies in the dance sequences. The ST-GCN-LSTM-SupCon model demonstrates the highest error with an MSE of 5.28, showing that the supervised contrastive loss (SupCon) is less effective in this task. In comparison, Ours achieves a 52.27% improvement over ST-GCN-LSTM-SupCon, which underscores the advantage of using InfoNCE loss for feature representation in tasks requiring precise synchronization between motion and music. Ours (ST-GCN-LSTM-InfoNCE) significantly outperforms all other models in Task 3, with performance improvements ranging from 18.96 to 52.27%.
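For illustration, the snippet below computes one plausible version of this synchronization error: the MSE (in squared seconds) between each musical beat and the nearest peak of a motion-intensity curve. Peak picking with scipy.signal.find_peaks and the frame rate are assumptions, so the numbers it produces are not directly comparable to the values reported in Fig. 8.

```python
import numpy as np
from scipy.signal import find_peaks

def sync_mse(motion_intensity, beat_times, fps=60):
    """Illustrative Task 3 metric: MSE between each musical beat and the
    nearest peak of the motion-intensity curve. The peak-picking procedure
    is an assumption, not the paper's exact pipeline."""
    peak_idx, _ = find_peaks(motion_intensity)
    peak_times = peak_idx / fps
    errors = [np.min(np.abs(peak_times - b)) for b in beat_times]
    return float(np.mean(np.square(errors)))

# Example: synthetic intensity curve with peaks near 1 s, 2 s, and 3 s.
t = np.arange(240) / 60
intensity = np.exp(-((t[:, None] - np.array([1.0, 2.0, 3.0])) ** 2) / 0.01).sum(axis=1)
print(sync_mse(intensity, beat_times=[1.0, 2.0, 3.0], fps=60))  # ~0 for aligned peaks
```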
The proposed model achieves a steady processing rate of approximately 29 FPS when evaluating complex dance sequences with high temporal and spatial resolutions. This performance is achieved through optimized GPU utilization, mixed-precision computations, and efficient data processing pipelines. At this rate, the framework provides near-real-time feedback, making it suitable for professional training and live performance monitoring scenarios. While not instantaneous, the 29 FPS rate ensures that the system can process inputs with minimal latency, enabling practical applications such as rehearsal evaluation and interactive coaching systems. Additionally, this performance reflects a balance between computational demands and output quality, addressing the challenges posed by large-scale motion-music data analysis.
Zero-shot generalization across dance genres
To evaluate the generalization ability of the proposed model in zero-shot settings, we conducted a genre-based ablation experiment where the model was tested on entirely unseen dance genres. Specifically, we divided the ImperialDance dataset into two non-overlapping genre sets: five genres were used exclusively for training and prompt tuning, while the remaining five were held out for zero-shot evaluation. This ensures that the model was never exposed to any choreography, motion pattern, or music sequence from the test genres during the training phase.
During evaluation, the model received prompt-based textual descriptions related to genre, choreography, and expertise level, but no parameter updates were performed. We assessed the performance of the model across all three downstream tasks: (i) Multilabel Dance Classification, (ii) Dance Quality Estimation, and (iii) Dance-Music Synchronization. Evaluation metrics included classification accuracy for Task 1, negative log-likelihood (NLL) for Task 2, and mean squared error (MSE) for Task 3. The results, summarized in Table 1, demonstrate that the model maintains strong performance even on unseen genres. The classification accuracy on unseen genres decreased only marginally compared to the in-domain setting, and the degradation in NLL and MSE was minimal. These findings confirm that the integration of contrastive self-supervised learning and transformer-based prompt tuning facilitates effective generalization to novel dance styles without requiring additional fine-tuning.
Ablation study on input modalities
To assess the contribution of each input modality, we conducted an ablation study comparing three model variants: motion-only, music-only, and the proposed multi-modal (motion+music) configuration. In the motion-only model, only the STGCN encoder was active, processing the skeletal motion features, while the LSTM music encoder was disabled. Conversely, the music-only model retained the LSTM music encoder and excluded the motion stream. The multi-modal model used both encoders with contrastive learning to jointly embed motion and music primitives. Table 2 presents the results across the three downstream tasks. As expected, the multi-modal model achieved the best performance across all tasks. Notably, the motion-only model performed relatively well on Task 1 (classification) and Task 2 (quality estimation), indicating that motion features alone contain rich information about choreography and expertise. However, its performance in Task 3 (synchronization) significantly degraded, highlighting its inability to capture alignment with musical beats.
In contrast, the music-only model yielded the lowest performance in Task 1, reflecting insufficient information for genre or choreography recognition when motion is excluded. Its performance on Task 3 (synchronization) was also weaker than the multi-modal setup, although marginally better than the motion-only model due to access to rhythmic cues. These results confirm that while each modality contributes uniquely to certain tasks, the integration of both is essential for achieving robust, generalizable performance across all evaluation dimensions. Hence, the multi-modal framework is not only superior in aggregate performance but also necessary for rhythm-sensitive evaluations such as synchronization.
Ablation study on prompt tuning effectiveness
To evaluate the specific contribution of prompt tuning in our framework, we conducted a controlled ablation study comparing two model variants: (i) the full model with prompt tuning applied during downstream evaluation, and (ii) a baseline variant where no prompting was used and all downstream predictions relied solely on fixed motion and music encoders. Both variants were trained and evaluated under identical conditions on the ImperialDance dataset.
Table 3 reports the results across the three downstream tasks. The model with prompt tuning demonstrates a consistent and notable performance advantage. For Task 1 (Multilabel Dance Classification), prompt tuning improves accuracy from 70.33 to 75.20, yielding a relative gain of 6.93%. In Task 2 (Dance Quality Estimation), the negative log-likelihood (NLL) loss is reduced from 0.001311 to 0.000775, reflecting a 40.89% improvement in modeling aleatoric uncertainty in performance scoring. Task 3 (Dance-Music Synchronization) also benefits, with the mean squared error (MSE) decreasing from 3.61 to 2.52, a gain of 30.19%. These results highlight the role of prompt tuning in enhancing the adaptability and generalization of the model across diverse downstream evaluation tasks. By providing task-specific textual cues, the model effectively aligns multimodal representations to produce more accurate and semantically meaningful predictions. This is especially impactful in scenarios involving diverse choreography styles and unseen input conditions, where traditional fine-tuning methods may struggle.
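A minimal sketch of the prompt-tuning pattern evaluated here is shown below, under the assumption that the backbone encoders are frozen and only a small set of learnable prompt vectors plus the task head is optimized; the token dimensions, prompt count, and readout strategy are illustrative, not the authors' exact module.

```python
import torch
import torch.nn as nn

# Minimal prompt-tuning sketch: frozen encoders supply fused feature tokens,
# and only the learnable prompts, a small transformer, and the task head train.
class PromptTunedHead(nn.Module):
    def __init__(self, embed_dim=512, n_prompts=8, num_classes=180):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, fused_tokens):               # (B, T, embed_dim) frozen features
        b = fused_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, fused_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])                  # read out from the first prompt token

logits = PromptTunedHead()(torch.randn(32, 9, 512))  # e.g., 9 fused tokens per clip
```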
Ablation study on genre-specific robustness
To evaluate the model’s robustness across diverse music genres, we conducted an ablation study analyzing performance separately on five distinct musical genres from the AIST++ dataset: hip-hop, jazz, house, waacking, and break. For each genre, we assessed the performance of the full model (STGCN-LSTM-InfoNCE with prompt tuning) across the three downstream tasks: (i) multilabel dance classification, (ii) dance quality estimation, and (iii) dance-music synchronization. Table 4 summarizes the results. Notably, although slight performance fluctuations are observed due to rhythmic complexity (e.g., in jazz and break), the model consistently delivers strong results without retraining or genre-specific tuning. It validates the model’s genre-agnostic capability, attributable to (1) the contrastive learning strategy that enforces rhythm-aware but genre-invariant embeddings, and (2) the EBS method that normalizes rhythm structure across different musical contexts. The small variation in performance confirms the model’s generalization capacity across a wide range of musical styles, making it suitable for real-world applications involving heterogeneous dance-music inputs.
Real-time inference and deployment feasibility
Although the proposed framework integrates several computationally intensive modules, namely STGCN for spatial-temporal skeletal modeling, LSTM for sequential music encoding, and a transformer-based prompt tuning system, it is designed for efficiency during inference. In practical deployment, all encoders are frozen and used as feature extractors, while only lightweight MLP heads and prompt-encoded representations are actively involved in downstream evaluation. This design choice significantly reduces the computational overhead compared to traditional end-to-end training paradigms. To empirically validate the framework’s suitability for real-time or near-real-time applications, we measured the average inference speed across the three downstream tasks using the ImperialDance dataset on an NVIDIA Tesla V100 GPU (32 GB VRAM), with a batch size of 32 and mixed-precision (fp16) acceleration enabled. As shown in Table 5, the model achieves an overall average throughput of 28.94 FPS, with task-specific frame rates ranging from 28.07 to 30.12 FPS.
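The deployment pattern and the throughput measurement can be sketched as follows, with stand-in modules in place of the frozen STGCN, LSTM, and prompt encoders. A CUDA device is assumed, and the samples-per-second estimate is only a rough analogue of the reported FPS figures.

```python
import time
import torch
import torch.nn as nn

# Stand-in modules; in the real pipeline these would be the frozen STGCN,
# LSTM, and prompt encoders. Shapes and layer sizes are assumptions.
motion_encoder = nn.Sequential(nn.Flatten(), nn.Linear(180 * 17 * 3, 512)).cuda().eval()
task_head = nn.Linear(512, 180).cuda().eval()
for p in motion_encoder.parameters():
    p.requires_grad_(False)                      # frozen feature extractor

batch = torch.randn(32, 180, 17, 3).cuda()       # 3 s segments at an assumed 60 fps
n_iters = 50

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        logits = task_head(motion_encoder(batch))
    torch.cuda.synchronize()
    elapsed = time.time() - start

throughput = n_iters * batch.size(0) / elapsed   # samples processed per second
print(f"throughput: {throughput:.1f} samples/s")
```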
Further latency profiling indicates that over 91% of the inference time is consumed within GPU-forward passes through the frozen STGCN and LSTM encoders, as well as the transformer-based prompt module. This confirms that the majority of the model’s operations are GPU-optimized and do not involve expensive backpropagation or fine-tuning during evaluation. Moreover, the modular design facilitates pipeline parallelism and delayed batch processing, which can be used to further reduce perceived latency in interactive systems.

Figure 9. Qualitative visualization of 3D skeleton-based motion trajectories for (a) beginner, (b) intermediate, and (c) expert dancers. Color encodes joint velocity magnitude over time. Expert dancers demonstrate smoother, rhythmically aligned, and spatially consistent trajectories, while beginners exhibit erratic and less synchronized movements.
Qualitative visualization of expertise-level distinctions
To complement the quantitative evaluation, we provide qualitative visualizations that highlight how the proposed model distinguishes between dancers of varying expertise levels. Figure 9 presents representative examples of dance motion trajectories for the three expertise categories (beginner, intermediate, and expert), based on 3D skeletal joint sequences. Each example is color-coded to indicate the velocity magnitude of motion across time, where warmer colors represent higher dynamic intensity.
As shown in Fig. 9, beginner dancers exhibit more erratic and spatially dispersed trajectories with inconsistent intensity, reflecting lower control and synchronization. Intermediate dancers show moderate regularity with improved beat alignment and smoother transitions. Expert dancers display highly consistent motion paths, fluid transitions, and peak alignment with musical beats. These visual trends confirm that the model’s feature extraction pipeline (STGCN + LSTM + InfoNCE) captures the fine-grained spatial-temporal patterns that distinguish skill levels. Furthermore, attention maps from the transformer-based prompting module indicate higher activation around temporally coherent motion primitives in expert sequences, demonstrating the model’s ability to semantically align expertise with motion quality and rhythm fidelity.
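The velocity-magnitude encoding used in Fig. 9 can be reproduced schematically as follows; the joint index, frame rate, and colormap are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def joint_speed(motion, fps=60):
    """Per-frame speed (velocity magnitude) for every joint.
    motion: (T, J, 3) array of 3D joint positions; fps is an assumption."""
    vel = np.diff(motion, axis=0) * fps          # (T-1, J, 3) finite differences
    return np.linalg.norm(vel, axis=-1)          # (T-1, J)

# Illustrative plot of one joint's horizontal trajectory, colored by speed
# (warmer colors = higher dynamic intensity), mirroring the Fig. 9 encoding.
motion = np.cumsum(np.random.randn(300, 17, 3) * 0.01, axis=0)
speed = joint_speed(motion)
wrist = 9                                        # hypothetical wrist joint index
sc = plt.scatter(motion[1:, wrist, 0], motion[1:, wrist, 1],
                 c=speed[:, wrist], cmap="plasma", s=8)
plt.colorbar(sc, label="joint speed (a.u.)")
plt.xlabel("x"); plt.ylabel("y")
plt.show()
```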
Performance comparison with SOTA models
We compare the performance of various SOTA methods with our proposed model (STGCN-LSTM-InfoNCE) across three different tasks and the results are shown in Table 6. Task 1 evaluates multilabel dance classification, Task 2 measures dance quality estimation, and Task 3 assesses dance-music synchronization. The upward arrow (\(\uparrow\)) for Task 1 indicates that higher values represent better performance, while the downward arrow (\(\downarrow\)) for Tasks 2 and 3 indicates that lower values signify better performance. Our model, STGCN-LSTM-InfoNCE, achieves a performance of 75.20 in Task 1, outperforming all other methods listed. When compared to ContrastiveDance, which is the second-best performing method in this task with a score of 68.21, our model shows an improvement of approximately 10.25%. Compared to SupCon34, which yields a score of 32.05, our model demonstrates a significantly higher performance with a relative improvement of about 134.7%. Methods like SimCLR22 (36.29) and MoCo35 (14.37) show considerably lower performance, indicating the superiority of our approach in handling multi-label classification for dance genres, choreographies, and expertise levels.
For Task 2, where lower values are preferred due to the negative log-likelihood optimization, our model achieves an exceptionally low score of 0.000775, demonstrating its ability to effectively capture aleatoric uncertainty in dance quality estimation. Compared to ContrastiveDance, which scores 0.0098, our model improves the performance by 92.09%. The STGCN model10, scoring \(2.72 \times 10^{-3}\), also demonstrates a weaker performance in comparison to our model, further supporting the robustness of STGCN-LSTM-InfoNCE in addressing this task. Other methods like SupCon (0.199) and SimCLR (0.071) exhibit even poorer performance, indicating that these methods struggle with capturing the variability in performance scoring. In Task 3, which measures the alignment between motion and musical rhythm, our model achieves a score of 2.52, significantly lower than all other models, including ContrastiveDance8 (4.91) and SimCLR (5.43). This indicates that our model excels in aligning dance motions with musical beats. The improvement over ContrastiveDance is approximately 48.67%, demonstrating the superior ability of our model to evaluate rhythm synchrony. The other methods like MoCo (13.75) and STGCN (11.04) display even higher losses, indicating their inability to effectively model this task.
Discussion
Computational cost and resource requirements
To evaluate the feasibility and deployment potential of the proposed framework, we report the computational cost in terms of training time, inference speed, and hardware requirements. Training time: the complete training process spanned 100 epochs over approximately 90,000 training sequences. With a batch size of 32 and an initial learning rate of \(10^{-4}\), using cosine annealing for dynamic learning rate scheduling, the model required approximately 16 min per epoch. This resulted in a total pre-training time of roughly 26–27 h. An additional 4–5 h were required for prompt tuning across the downstream tasks. Inference speed: the model achieves an average inference speed of 28.94 frames per second (FPS) across the three downstream tasks: multilabel dance classification (30.12 FPS), dance quality estimation (28.07 FPS), and dance-music synchronization (28.63 FPS). This performance enables near-real-time feedback, making the framework suitable for live dance coaching and rehearsal evaluation applications.
Hardware requirements: all training and evaluation were conducted on a high-performance machine equipped with an NVIDIA Tesla V100 GPU (32 GB VRAM), Intel Xeon Gold 6226R CPU, and 128 GB of RAM. The implementation was based on PyTorch v1.12.1 with mixed-precision (fp16) enabled to optimize GPU memory usage and computational throughput. Resource optimization: during inference, the STGCN and LSTM encoders, along with the transformer-based text prompt module, operate in evaluation mode with frozen parameters. Only the task-specific MLP heads are actively updated or evaluated, resulting in significantly reduced computational overhead. Latency profiling indicates that over 91% of inference time is attributed to GPU-forward passes through the frozen encoders and transformer blocks, confirming the pipeline’s efficiency under GPU acceleration. Scalability: the modular nature of the framework allows for deployment in resource-constrained environments by replacing heavy encoders with lightweight alternatives or using encoder pruning. This extensibility supports practical applications in mobile AR/VR systems, interactive choreography tools, and real-time dance performance monitoring.
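The optimizer and scheduling settings reported above map onto a standard PyTorch configuration, sketched below with a placeholder model and stand-in batches; only the Adam optimizer, the \(10^{-4}\) learning rate, the batch size of 32, the 100-epoch budget, and the cosine annealing schedule reflect values stated in the text.

```python
import torch
import torch.nn as nn

# Placeholder model and data; only the optimizer, learning rate, epoch count,
# batch size, and cosine-annealing schedule mirror the reported settings.
model = nn.Linear(512, 180)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                                        # 100 epochs
    x, y = torch.randn(32, 512), torch.randint(0, 180, (32,))   # stand-in batch of 32
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # anneal the learning rate each epoch
```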
Limitations
While the proposed framework demonstrates significant advancements in multi-modal dance performance evaluation, several aspects warrant further exploration to enhance its applicability and effectiveness. A primary limitation lies in the generalizability of the model to more diverse datasets that encompass unconventional or experimental dance styles. Current validation has been performed on datasets with predefined genres and choreographies, which may not fully represent the complexities and variabilities found in real-world scenarios. Future work could focus on training and testing the model on datasets with greater diversity to ensure robustness across a wider range of applications.
Another limitation pertains to the model’s ability to handle complex synchronization scenarios, such as performances involving irregular or polyrhythmic music. While the model excels in rhythm-based synchronization tasks, assessing and adapting to these more intricate temporal structures remains a challenge. Future research could explore advanced techniques for handling such musical complexities to further refine the model’s synchronization capabilities.
Although the model demonstrates promising generalization to unseen choreographies from held-out genres, its performance may degrade on highly unconventional or experimental dance styles that diverge significantly from the training data distribution. Future work may include domain adaptation or meta-learning approaches to further enhance generalization.
In terms of real-world deployment, particularly within immersive augmented and virtual reality (AR/VR) environments, several practical constraints must be addressed. For instance, AR/VR systems require real-time processing of high-fidelity 3D motion data, which presents substantial computational and latency challenges. The current framework, while capable of near-real-time inference on 2D skeleton and audio inputs, would require significant architectural modifications to handle continuous 3D skeletal streams, depth-aware context, and interaction-based evaluation. Integrating real-time motion capture from AR/VR sensors introduces potential noise, occlusions, and incomplete data frames that the current model is not explicitly designed to handle. Moreover, maintaining synchronization between the rendered virtual environment and live motion-music analysis adds further temporal constraints that exceed the current system’s latency tolerance. Future extensions could investigate lightweight transformer variants, real-time 3D mesh encoders, and multi-threaded streaming pipelines to support efficient deployment in AR/VR coaching or choreography tools.
Additionally, real-world dance performances are often affected by occlusion or missing motion data due to overlapping dancers or suboptimal camera angles. The current framework does not explicitly address these challenges, which could impact performance evaluation in practical scenarios. Incorporating robust motion completion strategies and sensor fusion techniques may help improve the system’s resilience in such deployment settings.