Cyberattack event and arguments extraction based on feature interaction and few-shot learning

In this section, we evaluate the event extraction performance of the proposed CAFIIE method against baseline methods on CTI texts, conducting comprehensive tests on both event type extraction accuracy and multiple event attribute extractions using F1 scores.

Experiment setup

Datasets

Table 2 presents the public and private datasets used in our experiments.

Table 2 Named entity statistics of the four datasets.

CASIE 1 Cybersecurity professionals selected 1000 articles from a pool of 5000 cybersecurity news articles to build the CASIE dataset. The dataset covers five event types and more than 20 types of event arguments.

DNRTI 2 It is a cybersecurity dataset for recognizing named entities with more than 6500 annotated sentences and 36400 annotated entities. Entities are categorized into 13 types, including Hacker Organization (HackOrg), Attack (OffAct), Sample File (SamFile), etc.

MalwareTextDB 39 It is a malicious code library comprising 6819 tagged sentences and 10983 tagged entities from 39 APT reports. Token annotations are categorized into Action, Entity, and Modifier.

Private Dataset We selected 1000 cyber threat event news articles from an open-source CTI website40 in 2023 and built a private CTI dataset containing 7346 named entities. Following CASIE’s scheme, we categorized attack events into five types and threat entities into 23 types. Table 3 presents the distribution of specific entity types in the CASIE and private datasets.

Table 3 The Statistics of Entity Types in the CASIE and Private Dataset.

Evaluation metrics

We adopted commonly used evaluation metrics in the comparative experiments: precision (P), recall (R), F1 score (F1), and accuracy (Accu). The formulas for these metrics are given below.

$$\begin{aligned} P=\frac{N_{TP}}{N_{TP}+N_{FP}}, \quad R=\frac{N_{TP}}{N_{TP}+N_{FN}}, \quad F1=2\times \frac{P\times R}{P+R}, \quad Accu=\frac{N_{TP}+N_{TN}}{N_{TP}+N_{FP}+N_{FN}+N_{TN}}. \end{aligned}$$

(8)

Here, the subscripts TP, FP, TN, and FN denote the true positive, false positive, true negative, and false negative cells of the confusion matrix, and N denotes the number of samples in the corresponding cell.
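As a minimal sketch of how these metrics follow from confusion-matrix counts, the Python function below applies the standard definitions of precision, recall, F1, and accuracy. The function name `prf_accuracy` and the counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not results from the experiments.

```python
def prf_accuracy(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute P, R, F1, and Accu from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"P": precision, "R": recall, "F1": f1, "Accu": accuracy}


# Hypothetical counts (assumption, not taken from the paper's tables).
scores = prf_accuracy(tp=80, fp=20, fn=10, tn=90)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, precision is 80/100 and accuracy is 170/200, while F1 combines precision and recall as their harmonic mean.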

Comparison models

We compared the proposed algorithm with the following high-performing information extraction algorithms in terms of precision and recall: CRF41, Naivebayes-CRF42, BiLSTM-CRF43, IDCNN-CRF44, CNN-BiLSTM-CRF2, LSTM-BiLSTM-CRF2, Base (multi-embedding), Base-BERT1, BERT-HSA5, BERT-SSA5, and Base-FI-BiLSTM-ANN-CRF (ours). For a fair comparison and better performance, we added a CRF as the final network layer of every algorithm.
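To illustrate what a CRF output layer contributes, the sketch below implements Viterbi decoding for a linear-chain CRF in plain Python. It is a generic illustration, not the authors' implementation: the tag set, the emission scores, and the transition scores are all hypothetical values chosen so that the per-token best tags would form an invalid sequence (an "I-" tag directly after "O").

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index path for a linear-chain CRF.

    emissions[t][j]   : score of tag j at token t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backptr = []                # backptr[t][j] = best previous tag for tag j
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical tag set and scores (assumptions for illustration only).
TAGS = ["O", "B-Phishing", "I-Phishing"]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-Phishing is heavily penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [2.0, 0.0, 0.0],
    [0.5, 1.0, 1.2],  # taken alone, this token would prefer I-Phishing
    [0.0, 0.0, 1.0],
]
decoded = [TAGS[i] for i in viterbi(emissions, transitions)]
print(decoded)
```

Picking the best tag per token independently would yield O, I-Phishing, I-Phishing here; the transition penalty makes the decoder choose a sequence that opens the span with a B- tag instead.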

Hyperparameter configuration of CAFIIE

CAFIIE’s network consists of the following layers: Domain-Word2Vec/BERT Embedding, Dense Features, BiLSTM, Attention, double FC, and CRF. To keep the experimental records comparable, we retain the initial word embedding dimension of 100, the same setting as in the CASIE experiment. The remaining parameters were tuned through extensive experiments and debugging. Table 4 shows CAFIIE’s overall hyperparameter configuration.

Table 4 Hyperparameter configuration in CAFIIE architecture.

Experimental results

Event type detection

Using the CTI event-type annotations in the CASIE dataset, this section compares the detection performance of our proposed CAFIIE method with that of the CASIE detection process. Table 5 reports the results of the CAFIIE algorithm, in which the detection rate of the “I-Phishing” event type is the highest, reaching 87%. This is 2 percentage points higher than the highest detection rate (85%) reported for CASIE1.

Notably, the “PatchVulnerability” event type usually consists of a single word, so only the “B-PatchVulnerability” label represents this type. Because no word carries the “I-PatchVulnerability” label, we use the symbol “-” to mark the empty “I-PatchVulnerability” entry in Table 5. Overall, this experiment indicates that the proposed CAFIIE method improves cybersecurity event type detection on the CASIE dataset.
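The B-/I- labeling convention discussed above can be sketched in a few lines of Python. The helper `bio_tags`, the example sentence, and the span positions are hypothetical and are not drawn from the CASIE data; the sketch only shows why a single-word event mention produces a "B-" label and no "I-" label.

```python
def bio_tags(tokens, spans):
    """Assign BIO tags to tokens.

    spans is a list of (start, end_exclusive, label) tuples.
    The first token of a span gets "B-<label>", later tokens "I-<label>",
    and every token outside a span gets "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical sentence and annotations (assumption for illustration).
tokens = ["Microsoft", "patched", "the", "flaw", "after", "a", "phishing", "campaign"]
spans = [
    (1, 2, "PatchVulnerability"),  # single-word mention: B- only
    (6, 8, "Phishing"),            # two-word mention: B- then I-
]
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this toy example, "patched" receives B-PatchVulnerability and nothing receives I-PatchVulnerability, whereas the two-word phishing mention yields both a B- and an I- label.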

Table 5 Event types detection on CASIE dataset.

Event argument detection

Following the comparative methods in CASIE, we conducted event argument extraction experiments on three publicly available datasets and one private dataset. Table 6 summarizes our method’s accuracy in extracting the top entity types for each dataset, listing the top five entities by accuracy.

Table 6 The CTI event arguments extraction accuracy ranking by proposed method (CAFIIE) in four datasets.

Since MalwareTextDB comprises only three entity types, Table 6 presents results for these types. Compared to the listed baselines, our method exhibits enhanced accuracy in extracting the main entity types on the three public datasets (CASIE, DNRTI, and MalwareTextDB). Table 7 shows the experimental results of our CAFIIE method and the baseline methods. The best value of each metric on each dataset is in bold, and the second-best value is underlined.

1. Our method is implemented in three variants: Base+FI+BiLSTM+ANN+CRF, Base+BERT+FI+BiLSTM+ANN+CRF, and BERT+FI+BiLSTM+ANN+CRF. The experimental results show that our method achieves higher extraction accuracy on the four datasets. Among the three CAFIIE variants, the Base+BERT+FI+BiLSTM+ANN+CRF model achieved the best precision on the four datasets, and its recall and F1 score outperformed most methods.

2. The differences in the metrics across the three proposed variants show why comparing them is necessary. According to the results, the text information obtained through the context-independent multi-dimensional embedding method achieves higher performance after processing by the subsequent, more complex model. This shows the effectiveness of the interactive feature mining in our network structure.

3. Only our method achieves over 66% precision on the private dataset. This indicates that our method is effective for information extraction from CTI.

Although the proposed method performs best in most situations, it holds only a slight advantage on the public datasets. The reason is that the proposed method is better suited to complex CTI in the cybersecurity domain. We aim to create a professional cybersecurity knowledge graph from our dataset, so more detailed and comprehensive cybersecurity vocabulary categories were defined during text labeling, which makes the labeled data more specialized. Extracting such data requires more professional knowledge and features. Therefore, although other methods generally perform well on the private dataset, our method, which introduces interactive feature mining as an additional knowledge feature, is better suited to professional domains and performs better.

Table 7 The extraction effects of all methods.

To visualize how accuracy changes over training epochs, we selected three new baseline methods for comparison with the proposed approach, as shown in Fig. 4.

Fig. 4

The comparison performance of four methods on different datasets.

The few-shot scenario in cybersecurity

Few-shot learning targets classification tasks in which each category has only a few labeled samples. Because of this sample constraint, features are sparse. We therefore chose the CASIE and private datasets as the foundational datasets. In the cybersecurity domain, we categorized few-shot scenarios into three classes (3-way): communication data distorting attacks, data communication service disrupting attacks, and asset data destruction attacks. Because the features of CTI are sparse, extraction performance in one-shot learning is poor. Hence, we simulated two groups of few-shot scenarios whose support sets contain five and ten samples per class (5-shot and 10-shot) and omitted the single-sample (1-shot) setting. The experimental algorithms are the proposed CAFIIE model and the comparative model CASIE. Table 8 displays our experiments on the CASIE and private datasets, which support the efficacy of our FI-based algorithm for extracting CTI in few-shot scenarios.
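The 3-way 5-shot and 10-shot settings can be made concrete with a small episode sampler. The following is a minimal sketch under stated assumptions: the function `sample_episode`, the query-set size, the fixed seed, and the placeholder texts are all hypothetical, and the paper does not describe how its support and query sets were actually drawn.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, seed=0):
    """Draw one N-way K-shot episode from (text, label) pairs.

    Returns (support, query): k_shot support and n_query query
    examples per sampled class, with no overlap between the two sets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append(text)
    eligible = sorted(c for c, xs in by_class.items() if len(xs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient examples")
    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(t, c) for t in picked[:k_shot]]
        query += [(t, c) for t in picked[k_shot:]]
    return support, query


# The three class names follow the text; the sentences are placeholders.
CLASSES = [
    "communication data distorting",
    "data communication service disrupting",
    "asset data destruction",
]
examples = [(f"{c} report {i}", c) for c in CLASSES for i in range(12)]

support_5, query_5 = sample_episode(examples, n_way=3, k_shot=5, n_query=2)
support_10, query_10 = sample_episode(examples, n_way=3, k_shot=10, n_query=2)
print(len(support_5), len(support_10))
```

A 3-way 5-shot episode therefore has 15 support examples and a 3-way 10-shot episode has 30, which is the only difference between the two simulated settings in this sketch.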

Table 8 The extraction precision of CAFIIE in the few-shot scenarios.

Ablation study

To comprehensively evaluate the influence of different components on the CAFIIE method, ablation studies have been conducted. We focus on evaluating the impact of the FI, BiLSTM, and ANN on the performance of the CAFIIE. The results are presented in Table 9.

Table 9 Ablation study on different components of CAFIIE.

Compared to the complete CAFIIE network, the experimental group lacking FI exhibited the poorest performance across the datasets, followed by the group without BiLSTM. Among the ablated groups, the one without ANN performed best. This confirms the critical role of FI in the CAFIIE network, as it facilitates deeper contextual understanding for both the BiLSTM and ANN networks.
