TEMSET-24K: Densely Annotated Dataset for Indexing Multipart Endoscopic Videos using Surgical Timeline Segmentation

Annotation Assessment

To ensure the consistency of labelling in the dataset, we designed an annotation process involving a team of colorectal cancer surgery specialists, all holding fellowship status with the Royal College of Surgeons (RCS, UK). The process began with one surgeon annotating one full video in a shared setting to demonstrate the annotation procedure for the multipart ESV files. Following this, another surgeon logged into the LS server using their credentials, navigated to the project they intended to annotate, and accessed the individual video clips for annotation. The LS user interface provided a comma-separated list of phases, tasks, and actions for annotating the timeline of each video clip. Annotations were initially performed by one surgeon and subsequently validated by at least two other surgeons for cross-checking. In cases of conflicting boundaries between the start and end of the labelling triplets, discussions were held until annotations were agreed by all surgeons. We employed a multifaceted strategy, combining the proposed dense taxonomy, collaborative annotation of one full surgery in a shared setting, and iterative discussions to resolve conflicts, to achieve consistent annotations of the complex workflow scenes based on all surgeons' inputs. The final annotations consisted of labels comprising five phases, 12 tasks, and 21 actions, as defined by the proposed taxonomy. These annotations were then programmatically exported from LS in JSON format, along with the corresponding ESV files.

Deep Learning Model Training

Data Pre-Processing

To improve the field of view, irrelevant areas comprising black border regions were cropped from the ESV images. Each input image was first converted to grayscale, and a binary threshold was applied to isolate the circular surgical region from the background. This step enhanced the visibility of the surgical scene. The largest contour in the thresholded image was then identified and its minimum enclosing bounding box computed. A mask corresponding to this circular region was created and applied to the original image to extract the surgical area while discarding the background. The bounding box of the surgical region was cropped, and the cropped image was resized to the original frame size using bilinear interpolation. This method ensures that only the relevant surgical view is retained and standardised, facilitating improved visualisation and analysis of the surgical scene.
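A minimal sketch of this cropping step is shown below, assuming OpenCV-style operations; the threshold value and the fallback behaviour are illustrative assumptions rather than the authors' exact settings.

```python
import cv2
import numpy as np


def crop_surgical_view(frame: np.ndarray, thresh_val: int = 10) -> np.ndarray:
    """Isolate the circular surgical region and crop away the black border.

    `thresh_val` is an assumed threshold; the exact value is not stated in the text.
    """
    h, w = frame.shape[:2]

    # 1. Grayscale conversion and binary thresholding to separate the bright
    #    surgical scene from the black background.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh_val, 255, cv2.THRESH_BINARY)

    # 2. The largest contour is assumed to be the circular surgical view.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return frame  # fallback: keep the original frame if no contour is found
    largest = max(contours, key=cv2.contourArea)

    # 3. Mask the original image so only the surgical region is retained.
    mask = np.zeros_like(gray)
    cv2.drawContours(mask, [largest], -1, 255, thickness=cv2.FILLED)
    masked = cv2.bitwise_and(frame, frame, mask=mask)

    # 4. Crop the bounding box of the region and resize back to the original
    #    frame size using bilinear interpolation.
    x, y, bw, bh = cv2.boundingRect(largest)
    cropped = masked[y:y + bh, x:x + bw]
    return cv2.resize(cropped, (w, h), interpolation=cv2.INTER_LINEAR)
```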

Problem Formulation

A key objective of this study was to learn an unknown function F that maps high-dimensional TEMS endoscopic surgical videos \(\mathbf{X}\in \mathbb{R}^{T\times H\times W\times 3}\) to a multitarget label triplet \(\mathbf{Y}=\{\text{Phase},\text{Task},\text{Action}\}\), where T, H, and W denote the sequence length (number of frames in the video), height, and width of the frames, respectively. To achieve this, the study proposes a Spatiotemporal Adaptive LSTM Network (STALNet) that learns the desired mapping. As shown in Fig. 5, STALNet integrates a TimeDistributed video encoder \(\mathbf{E}^{T}\), followed by an adaptive long short-term memory (LSTM) module with an attention layer at its output, \(\mathbf{M}_{\text{AA-LSTM}}\), to capture the spatial and temporal dependencies in the ESV data. Let \(\phi\) be the feature extraction function of the backbone. The output of the encoder is given by:

$$\mathbf{F}=\mathbf{E}^{T}(\phi(\mathbf{X})),\qquad \mathbf{X}\in \mathbb{R}^{B\times T\times C\times H\times W},\tag{1}$$

where B is the batch size, T is the sequence length, C is the number of channels, and H and W are the height and width of the frames, respectively. We experimented with various encoders, including ConvNeXt (convnext_small_in22k)40, SWIN V2 (swinv2_base_window12_192-22k)41, and ViT (vit_small_patch16_224)42,43. These encoders were chosen for their proven ability to capture detailed spatial features across different scales, which is crucial for accurately interpreting surgical video frames. The extracted features are fed into an adaptive LSTM module. This module consists of multiple LSTM layers, where the number of LSTMs depends on the input sequence length T. Each LSTM processes the sequence of features and produces hidden states. Let \(\mathbf{h}_{t}\) represent the hidden state at time step t. The hidden states are computed as:

$$\mathbf{H}_{t}=\mathbf{M}_{\text{AA-LSTM}}(\mathbf{F}_{t},\mathbf{h}_{t-1}),$$

where \(\mathbf{H}_{t}\in \mathbb{R}^{B\times D}\). Multiple LSTM layers were applied to capture temporal dependencies across the sequence. Incorporating LSTMs into the proposed solution in an adaptive manner significantly improved the model's capacity for surgical scene understanding, as this approach leverages and preserves the temporal coherence in the videos, improving the stability and accuracy of the timeline predictions. The final hidden states from each LSTM layer are collected as \(\mathbf{H}=[\mathbf{H}_{1},\mathbf{H}_{2},\ldots ,\mathbf{H}_{T}]\in \mathbb{R}^{T\times B\times D}\) and their information across the sequence is aggregated using an attention mechanism. The attention weights are computed by applying a linear layer to the hidden states:

$$\mathbf{A}_{t}=\operatorname{softmax}(\mathbf{W}_{a}\mathbf{H}_{t}),$$

where \(\mathbf{W}_{a}\in \mathbb{R}^{D\times 1}\) is the attention weight matrix. The attention-weighted output is computed as a weighted sum of the hidden states:

$$\mathbf{O}=\sum_{t=1}^{T}\mathbf{A}_{t}\mathbf{H}_{t}\in \mathbb{R}^{B\times D}.$$

The final output is obtained by passing the attention-weighted output through a fully connected layer followed by batch normalisation:

$$\mathbf{Y}=\operatorname{BatchNorm}(\mathbf{W}_{h}\mathbf{O}),$$

where \(\mathbf{W}_{h}\in \mathbb{R}^{D\times (P+T+A)}\), with P, T, and A representing the number of phases, tasks, and actions, respectively. Mean ensembling was then employed to create more robust learners for each model, followed by heuristic-based prediction correction to address sporadic predictions.
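The following PyTorch sketch illustrates the forward pass described above (TimeDistributed encoder, adaptive LSTM, attention pooling, and a batch-normalised projection split into phase, task, and action heads). It is a simplified reading of the formulation, not the authors' implementation: the hidden size, the single LSTM layer, and the use of a timm backbone are our assumptions.

```python
import torch
import torch.nn as nn
import timm


class STALNet(nn.Module):
    """Sketch of STALNet: TimeDistributed encoder -> LSTM -> attention -> triplet heads."""

    def __init__(self, backbone="convnext_small_in22k", hidden=512,
                 n_phases=5, n_tasks=12, n_actions=21):
        super().__init__()
        # Frame-wise spatial feature extractor phi (applied in a TimeDistributed manner).
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        feat_dim = self.encoder.num_features

        # LSTM capturing temporal dependencies across the T frames of a clip.
        # (hidden=512 is an assumed size; the paper does not state it.)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

        # Attention layer producing one weight per time step (W_a).
        self.attn = nn.Linear(hidden, 1)

        # Projection W_h followed by batch normalisation; logits split into the triplet.
        self.fc = nn.Linear(hidden, n_phases + n_tasks + n_actions)
        self.bn = nn.BatchNorm1d(n_phases + n_tasks + n_actions)
        self.splits = (n_phases, n_tasks, n_actions)

    def forward(self, x):                        # x: (B, T, C, H, W)
        B, T = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1))    # (B*T, D) frame features
        feats = feats.view(B, T, -1)             # (B, T, D)

        h, _ = self.lstm(feats)                  # (B, T, hidden) hidden states H_t
        a = torch.softmax(self.attn(h), dim=1)   # (B, T, 1) attention weights A_t
        o = (a * h).sum(dim=1)                   # (B, hidden) attention-weighted output O

        logits = self.bn(self.fc(o))             # (B, P + T + A)
        return torch.split(logits, self.splits, dim=1)  # (phase, task, action) logits
```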

Fig. 5

Proposed SpatioTemporal Adaptive LSTM Network (STALNet) for Surgical Timeline Segmentation. The network diagram shows how ESV clips are analysed by the encoders and the adaptive LSTM attention module to produce reliable timeline segment labels.

The model is trained using a custom loss function that combines the losses for phase, task, and action predictions. The total loss is given by:

$$\mathcal{L}=\alpha \mathcal{L}_{p}+\beta \mathcal{L}_{t}+\gamma \mathcal{L}_{a},$$

where \(\mathcal{L}_{p}\), \(\mathcal{L}_{t}\), and \(\mathcal{L}_{a}\) are the individual losses for phase, task, and action predictions, and α, β, and γ are their respective weights. Each loss is computed using the CrossEntropyLossFlat function applied to the corresponding component of the output triplet.
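A sketch of this combined loss is given below, assuming fastai's CrossEntropyLossFlat for each head and equal weights α = β = γ = 1, since the actual weight values are not reported.

```python
from fastai.losses import CrossEntropyLossFlat


class TripletLoss:
    """L = alpha*L_phase + beta*L_task + gamma*L_action (equal weights assumed)."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.ce = CrossEntropyLossFlat()

    def __call__(self, preds, phase_t, task_t, action_t):
        # `preds` is the (phase, task, action) logits tuple returned by the model.
        phase_p, task_p, action_p = preds
        return (self.alpha * self.ce(phase_p, phase_t)
                + self.beta * self.ce(task_p, task_t)
                + self.gamma * self.ce(action_p, action_t))
```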

DL Model Implementation

The model described in this paper was implemented using the fastai44 library. A server with 4 NVIDIA LS40 GPUs was used for training and validation. To enhance model convergence, the default ReLU activation function was replaced with the Mish activation function, which demonstrated superior performance in our experiments. Additionally, we substituted the default Adam optimiser with Ranger, which combines RectifiedAdam with the Lookahead optimisation technique, providing more stable and efficient training dynamics. To further optimise the training process, the to_fp16() method was employed to reduce the precision of floating-point operations, thereby enabling half-precision training and improving computational efficiency. The lr_find method was used to determine a suitable learning rate, and a learning rate slicing technique was applied that assigned higher learning rates to the layers closer to the model head and lower learning rates to the initial layers, facilitating more effective training. For benchmarking, we initially evaluated several network architectures, including a basic image classifier, to establish a trivial baseline. This simple approach, however, produced a significant number of sporadic predictions due to the absence of sequence modelling, highlighting the necessity for a more sophisticated model.
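The configuration described above might be wired together roughly as follows. This is a hedged sketch: `dls` (a fastai DataLoaders yielding clips with phase/task/action targets), the number of epochs, and the learning-rate slice are placeholders, and `STALNet` and `TripletLoss` refer to the sketches given earlier.

```python
import torch.nn as nn
from fastai.vision.all import DataLoaders, Learner, ranger


def swap_relu_for_mish(module: nn.Module) -> None:
    """Recursively replace ReLU activations with Mish, as described in the text."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Mish(inplace=True))
        else:
            swap_relu_for_mish(child)


dls: DataLoaders = ...     # placeholder: clips with (phase, task, action) targets
model = STALNet()          # architecture sketched earlier
swap_relu_for_mish(model)

learn = Learner(
    dls, model,
    loss_func=TripletLoss(),   # combined triplet loss sketched earlier
    opt_func=ranger,           # RectifiedAdam + Lookahead
).to_fp16()                    # half-precision training

suggested = learn.lr_find()    # locate a suitable learning rate
# Learning-rate slicing: smaller rates for early layers, larger near the head.
learn.fit_one_cycle(10, lr_max=slice(suggested.valley / 100, suggested.valley))
```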

Model Validation

The model described in this paper was validated against the human-annotated ground truth using the server with NVIDIA LS40 GPUs. We compared the proposed STALNet architecture with various encoder backbones, including ConvNeXt, SWIN V2, and ViT. The results were analysed against the baseline to compare performance metrics and to assess how well each configuration captured the spatiotemporal dependencies that are crucial for the surgical timeline segmentation task.

Statistical Analysis

For our model evaluation, we utilised standard metrics including accuracy, F1 score, and ROC (Receiver Operating Characteristic) curves. To illustrate model variability, standard deviation is reported for accuracy and F1 scores. The following equations define these metrics:

$$\begin{array}{rcl}\text{Accuracy} &=& \dfrac{TP+TN}{TP+TN+FP+FN}\times 100\%,\\ \text{Precision} &=& \dfrac{TP}{TP+FP},\\ \text{Recall} &=& \dfrac{TP}{TP+FN},\\ \text{F1 Score} &=& 2\cdot \dfrac{\text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}.\end{array}\tag{2}$$

We computed these statistics at two levels: 1) Overall Model Performance: We reported the overall accuracy and F1 score on the entire validation set. 2) Class-Specific Performance: These metrics were computed for each taxonomy triplet class (phase, task, and action) to identify which classes the model struggles with the most. Additionally, ROC curves were used to visually investigate model performance. True positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) were derived from the predictions, which were then used to compute precision and recall, leading to the construction of ROC curves plotted using Scikit-learn. To enhance our analysis, we implemented custom visualisations showing video clips, target labels, and model predictions. We employed color coding (red for incorrect and green for correct predictions) for easy interpretation. All data and model results were visualised and analysed using Matplotlib, NumPy, and Scikit-learn.
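A brief sketch of this two-level evaluation using scikit-learn is shown below; the macro averaging and one-vs-rest ROC construction are our assumptions about how the reported metrics were aggregated.

```python
import numpy as np
from sklearn.metrics import accuracy_score, auc, f1_score, roc_curve
from sklearn.preprocessing import label_binarize


def evaluate_head(y_true, y_pred, y_score, class_names):
    """Overall and class-specific metrics for one taxonomy head (phase, task, or action).

    y_true/y_pred are integer class ids; y_score is an (N, n_classes) probability matrix.
    """
    n_classes = len(class_names)

    # 1) Overall model performance on the validation set.
    overall = {
        "accuracy": accuracy_score(y_true, y_pred) * 100,
        "f1": f1_score(y_true, y_pred, average="macro") * 100,  # macro averaging assumed
    }

    # 2) Class-specific F1 highlights which classes the model struggles with.
    per_class_f1 = dict(zip(class_names,
                            f1_score(y_true, y_pred, average=None,
                                     labels=np.arange(n_classes))))

    # One-vs-rest ROC curves, plotted with scikit-learn as in the paper.
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))
    roc = {}
    for i, name in enumerate(class_names):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
        roc[name] = (fpr, tpr, auc(fpr, tpr))

    return overall, per_class_f1, roc
```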

Model Performance Evaluation

Table 1 presents the accuracy and F1 scores for each model across the three encoder architectures. The baseline image classification learner, which predicts timeline labels based solely on individual images, achieved an F1 score of 72.99% with the ConvNeXt encoder, 66.7% with the SWIN V2 encoder, and 60.87% with the ViT encoder. These results indicate the fundamental capability of deep learning models for surgical timeline segmentation but also highlight the limitations of relying solely on spatial information. In contrast, our proposed STALNet demonstrated significant performance improvements over the baseline model. On average, STALNet achieved an F1 score of 82.78% and an accuracy of 91.69%, reflecting an average performance gain of 9.79% in F1 score and 11.38% in accuracy compared to the baseline model. These improvements underscore the importance of incorporating spatiotemporal information for surgical timeline segmentation. Furthermore, performance varied between the encoders used in the time-distributed layer for feature extraction. Among the evaluated encoders, ConvNeXt achieved the highest accuracy at 91.69%, slightly better than the SWIN V2 encoder at 91.41%. However, the highest F1 score, which is an important metric for evaluating timeline segmentation, was achieved by the SWIN V2 encoder at 86.02%, approximately 3.24% higher than the ConvNeXt encoder's F1 score of 82.78%. This demonstrates that while ConvNeXt offers marginally better accuracy, SWIN V2 excels in terms of F1 score, highlighting its superior performance in capturing relevant features for timeline segmentation. Despite its higher F1 score, SWIN V2 required substantially more computation during both training and deployment. ConvNeXt, on the other hand, not only delivered competitive performance but also offered a more computationally efficient solution, making it a practical choice for real-world applications. Overall, the STALNet model, particularly with the ConvNeXt encoder, demonstrated superior performance in segmenting surgical timelines. This highlights the efficacy of integrating spatiotemporal features and selecting robust encoder architectures to balance performance and computational efficiency.

Table 1 Comparison of Surgical Timeline Segmentation Models.

The STALNet model was also evaluated on each component of the taxonomy triplet (phase, task, action), as shown in Tables 2, 3, and 4, respectively. The evaluation of phase segmentation reveals that the model performs exceptionally well across all phases, with only minor fluctuations in performance between encoders. The ROC curves show its efficacy across these triplet behaviours (see Fig. 6). For example, the "Dissection" phase achieved an F1 score of 99.0% with a standard deviation of 0% and an accuracy of 99.0% with a standard deviation of 11.0% with the SWIN V2 encoder. Similarly, the "Setup" phase showed high performance with an F1 score of 98.0% and an accuracy of 99.0%, both exhibiting low standard deviations (1% and 9%, respectively, with the ConvNeXt and SWIN V2 encoders). Even the "Closure" phase, despite being one of the more challenging phases due to its fewer instances, maintained an F1 score and accuracy of 100%, with standard deviations of 0% and 5%, respectively, with the SWIN V2 encoder. These results indicate that the model effectively and consistently segments the different phases across the three distinct encoders. In task segmentation, the model showed strong and consistent performance across most tasks. For instance, tasks such as "Longitudinal Muscle Dissection" and "Suturing" achieved high F1 scores of 99% each, with accuracies of 100% and 99%, and low standard deviations (1% and 0%, and 7% and 8%, respectively) with the ConvNeXt encoder. This consistency reflects the model's robust ability to segment tasks accurately. Conversely, the "Site" task had a significantly lower F1 score of 67% with a high standard deviation of 33% with the ConvNeXt encoder, indicating that the model struggles more with tasks that are less frequently represented in the dataset. For action segmentation, the model demonstrated high performance on frequently occurring actions such as "Scope Insertion" and "Stitching", achieving F1 scores of 99% and 95%, and accuracies of 100% and 98%, respectively, with the ConvNeXt encoder. The standard deviations for "Scope Insertion" were 1% for the F1 score and 3% for accuracy, while "Stitching" had deviations of 4% and 15%, indicating stable and reliable performance. However, actions such as "Debris Wash" and "Haemostasis", which had lower F1 scores of 50% each, also exhibited higher standard deviations of 50% each with the ConvNeXt encoder. These findings suggest that the model's performance is consistent for well-represented actions but that it struggles with less frequent actions.

Table 2 Performance of the STALNet model on Surgical Phases across different encoders.
Table 3 Performance of the STALNet model on Surgical Tasks across different encoders.
Table 4 Performance of the STALNet model on Surgical Actions across different encoders.
Fig. 6

STALNet Performance Review using ROC Curves for Taxonomy Triplets. The top row of ROC curves shows the performance of ConvNeXt, ViT and SWIN V2 encoders on labelling high level TEMS surgical “Phases”. The next two rows show the performance of STALNet encoders on labelling TEMS surgical “Tasks” (intermediate level) and “Actions” (the fine level).

In summary, our technical validation is deliberately structured to demonstrate the effectiveness of STALNet's multi-target modelling strategy, which offers superior performance and semantic consistency compared to flat single-label approaches. In early experiments, we trained STALNet as a single-label classifier across all 84 triplet combinations. This unitarget formulation consistently plateaued at approximately 72% accuracy and struggled to model the underlying dependencies between triplet components. While it did not produce invalid triplets, since each output class was predefined, it lacked interpretability and failed to generalise well to complex surgical workflows.

We also explored a multi-head architecture without tailored loss weighting. This improved expressiveness but still resulted in clinically implausible combinations, as the model lacked guided supervision to respect the hierarchical structure between phases, tasks, and actions. Our final multi-target approach, with three prediction heads and tailored loss functions for each triplet component, enabled the model to learn semantic relationships across components. This design achieved up to 91.7% accuracy and 86.0% F1 score on individual elements (see Tables 1 to 4), while effectively avoiding unrealistic triplet outputs by learning their internal structure. Although the results are shown in separate tables for interpretability, they originate from a single, unified model trained jointly with a triplet-aware loss.

The results confirm that the STALNet model with the ConvNeXt encoder performs well and consistently across phases, tasks, and actions when sufficient training data are available, as evidenced by the low variance in well-represented classes. However, as the number of classes increases, from five phases to 12 tasks and 21 actions, the modelling task becomes more challenging, leading to higher variance and lower performance for less frequent classes. This trend underscores the complexity of handling a larger number of classes and highlights the need to address class imbalance. Techniques such as weighted dataloaders and customised loss functions can mitigate these issues, improving the model's robustness and performance across all categories.
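As one illustration of such mitigation, the sketch below derives inverse-frequency class weights and passes them to a per-head cross-entropy loss; the weighting scheme and the `train_labels` variable are assumptions, not the authors' reported configuration.

```python
import torch
from fastai.losses import CrossEntropyLossFlat


def inverse_frequency_weights(labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Weight each class inversely to its frequency so rare classes contribute more to the loss."""
    counts = torch.bincount(labels, minlength=n_classes).float().clamp(min=1)
    return counts.sum() / (n_classes * counts)


# `train_labels` is an assumed array of integer action ids for the training split.
action_labels = torch.as_tensor(train_labels, dtype=torch.long)
action_weights = inverse_frequency_weights(action_labels, n_classes=21)

# A per-head loss that up-weights rare actions such as "Debris Wash" or "Haemostasis".
weighted_action_loss = CrossEntropyLossFlat(weight=action_weights)
```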

The results also illustrate the model's superior capability in capturing the nuances of surgical workflows. The ROC curves highlight that the SWIN V2 encoder outperforms the other encoders in terms of accuracy and F1 score. The model's output is visually depicted in an infographic in Fig. 7, which shows input video clips from a batch with predicted and actual taxonomy triplet labels. This visualisation clearly demonstrates the trends discussed in the performance tables and ROC curves, providing a comprehensive understanding of the model's efficacy in real-world scenarios.

Fig. 7

STALNet: Batch of results for visual inspection. This figure illustrates the output of the STALNet model compared to human annotations—the ground truth (GT). Each tile displays the first, middle, and last frames of a video clip, along with predictions and GT for each taxonomy triplet (Phase, Task, Action) at the top. Green font indicates agreement with the GT, while red font indicates disagreement. In this example, there is widespread agreement except for one microclip where the model predicted the action "retraction" instead of "dissection" as labelled by the human annotators.

The focus of this study was to provide a high-fidelity resource that enables the development of AI models for accurate surgical video indexing, such as our proposed STALNet architecture. While the objective is not to directly evaluate models for upstream tasks like surgical skill assessment, which require deeper reasoning and semantic understanding, this foundational work is essential for enabling scalable retrospective video analysis and supporting future clinical applications. To support this, the structured phase-task-action triplet taxonomy was co-designed with a panel of expert colorectal surgeons, aiming not only to capture workflow granularity but also to embed clinically meaningful signals that could potentially serve as proxies for surgical competence. For example, metrics derived from factors such as the frequency of intraoperative adverse events (e.g., bleeding), the length of inactive periods ("no action"), or the volatility of phase transitions could, in future studies, be investigated as indicators of procedural fluency or surgeon expertise. These hypotheses are particularly relevant for distinguishing between experienced and novice operators, as variability in temporal workflow progression may reflect differences in training or technical confidence.
