P3GNN: A Privacy-Preserving Provenance Graph-Based Model for APT Detection in Software Defined Networking (2024)

Hedyeh Nazari¹, Abbas Yazdinejad¹, Ali Dehghantanha¹, Fattane Zarrinkalam², Gautam Srivastava³˒⁴

¹ Cyber Science Lab, Canada Cyber Foundry, University of Guelph, Canada
hnazari@uoguelph.ca, ayazdine@uoguelph.ca, adehghan@uoguelph.ca
² Dept. of Engineering, University of Guelph, Canada
fzarrink@uoguelph.ca
³ Dept. of Math and Computer Science, Brandon University, Manitoba, Canada
⁴ Research Centre for Interneural Computing, China Medical University, Taichung, Taiwan
srivastavag@brandonu.ca

Abstract

Software Defined Networking (SDN) has brought significant advancements in network management and programmability. However, this evolution has also heightened vulnerability to Advanced Persistent Threats (APTs), sophisticated and stealthy cyberattacks that traditional detection methods often fail to counter, especially in the face of zero-day exploits. A prevalent issue is the inadequacy of existing strategies to detect novel threats while addressing data privacy concerns in collaborative learning scenarios. This paper presents P3GNN (privacy-preserving provenance graph-based graph neural network (GNN) model), a novel model that synergizes Federated Learning (FL) with Graph Convolutional Networks (GCN) for effective APT detection in SDN environments. P3GNN utilizes unsupervised learning to analyze operational patterns within provenance graphs, identifying deviations indicative of security breaches. Its core feature is the integration of FL with homomorphic encryption, which fortifies data confidentiality and gradient integrity during collaborative learning. This approach addresses the critical challenge of data privacy in shared learning contexts. Key innovations of P3GNN include its ability to detect anomalies at the node level within provenance graphs, offering a detailed view of attack trajectories and enhancing security analysis. Furthermore, the model’s unsupervised learning capability enables it to identify zero-day attacks by learning standard operational patterns. Empirical evaluation using the DARPA TCE3 dataset demonstrates P3GNN’s exceptional performance, achieving an accuracy of 0.93 and a low false positive rate of 0.06.

Keywords:

Advanced Persistent Threat, SDN, Graph Neural Network, Federated Learning.

1 Introduction

SDN represents a novel approach in network architecture that aims to address prevailing challenges by separating the control plane from the data plane and using protocols such as OpenFlow [12, 34] for communication between these planes [35]. This approach turns network switches into simple packet handlers, with decision-making centralized in a software controller. The controller offers a comprehensive overview of the network and programming abstractions, enabling a centralized management system. This system grants operators programmatic and real-time control over networks and devices, simplifying network management and introducing flexibility that was previously lacking [13, 4]. However, the adoption of SDN introduces new security concerns, such as APTs [26, 36]. APTs pose a formidable threat to cyberspace security. Characterized by their stealth and slow infiltration, APTs are typically launched by well-resourced groups, like state actors, and employ advanced tactics, techniques, and procedures (TTPs) for long-term network presence, targeting espionage or sabotage [2]. As APTs become more sophisticated, cybersecurity solutions must become more dynamic and adaptable.

Diverse strategies utilizing log files to counter APTs have evolved due to the limitations of traditional signature-based detection systems, particularly against zero-day exploits and advanced polymorphic malware [17]. Among these, Host-based Intrusion Detection Systems (HIDS) stand out for their efficacy. Further, integrating HIDS into SDN networks appears promising for enhanced security [37]. Building on HIDS, the Provenance-based Intrusion Detection System (PIDS), especially its provenance graph method, utilizes provenance data’s contextual information to pinpoint APT attacks [31]. In single-host environments, a system provenance graph, structured as a directed acyclic graph (DAG), categorizes log file elements like processes, files, and sockets as vertices and their interactions as edges. This method effectively maps system processes, illustrating control and data flow between entities [16, 22]. Previous research on log provenance graphs for APT detection faces limitations, particularly with misuse-based methods’ inability to detect zero-day attacks. These methods, which compare provenance graphs to known attack graphs, struggle to identify new threats not captured in existing cyber threat reports [20]. In response, unsupervised anomaly-based methods offer a promising alternative by focusing on deviations from normal behavior within provenance graphs, essential for detecting new and emerging threats, including zero-day attacks [24].

Previous research on detecting APTs using provenance graphs highlights significant challenges due to the complexity of these dense networks that represent system interactions. GNNs, in contrast to standard Deep Neural Networks (DNNs), are better suited for these tasks due to their ability to interpret graph-structured data, enabling them to not only detect but also visualize malicious activities within network contexts [19, 15]. However, the effectiveness of GNNs is often limited by the availability of extensive log datasets necessary for training. Despite the potential of centralized training to mitigate data scarcity by pooling resources from multiple sources, practical implementation is hindered by privacy and security concerns, preventing organizations from sharing sensitive log data [7, 23].

Considering these challenges, in this paper, we introduce P3GNN, an explainable privacy-preserving model specifically designed for APT detection. P3GNN innovatively combines the principles of FL with GCN, using unsupervised learning to examine operational patterns in provenance graphs for identifying deviations that signal security breaches [9]. A key aspect of P3GNN is its strong focus on privacy and security. By integrating FL with homomorphic encryption, specifically the Paillier homomorphic cryptosystem (PHC), it ensures the confidentiality and integrity of data and gradients during collaborative learning, offering dual-layer protection for sensitive information [11]. This integration is vital because, while FL allows for collaborative model training across different organizations without direct data sharing, it inherently provides limited privacy protection and doesn’t fully resolve security issues, especially regarding data shared with third-party servers [27]. The concern of an untrustworthy server exploiting uploaded parameters to infer feature values from private training datasets is effectively addressed by the addition of partially homomorphic encryption in P3GNN. Additionally, P3GNN can detect APTs at the node level within provenance graphs, identifying anomalous nodes and tracing the full attack trajectory, which is crucial for security analysts. This functionality offers a comprehensive view of attack paths and enhances the analysis of APT incidents. The main contributions of this paper can be summarized as follows:

  • We propose a novel interpretable, privacy-preserving provenance graph-based FL model P3GNN for detecting APTs. Our model leverages GCN to adeptly navigate the intricate interconnections present in log provenance graphs.

  • Our model operates in an unsupervised manner, effectively learning the standard operational patterns within provenance graphs and thereby identifying any deviations from the system’s typical behavior. This capability enables our system to detect anomalies that could signify zero-day attacks, which are particularly challenging to identify due to their novel and previously unrecorded nature.

  • The integration of FL with homomorphic encryption in P3GNN maintains the confidentiality of data and gradients in collaborative learning, addressing data sharing challenges and protecting log owners’ privacy. This allows training on diverse datasets, including attack samples from multiple sources, without transferring sensitive log data. Training is conducted locally, leveraging distributed data while ensuring privacy and security.

  • P3GNN can pinpoint anomalous nodes within provenance graphs at the node level. This feature significantly enhances the model’s capability to trace the entire trajectory of APT attacks, providing valuable insights for security analysis. This attribute of P3GNN addresses a common issue in the field of artificial intelligence regarding interpretability.

The following is the specification of the threat model we consider in this paper.

  • High Volume of Log Data with Subtle Anomalies: The main challenge in detecting APTs within SDNs lies in the sheer volume of log data, where malicious activity can easily masquerade as normal system behavior. This makes it challenging to identify threats without examining the interconnections among various system events. Furthermore, the subtlety of such activities necessitates deep analysis. Advanced ML techniques are crucial for effectively identifying these hidden threats.

  • Privacy Risk in Data Sharing for GNN Training: In a decentralized infrastructure for GNN training, sharing sensitive log data with external entities poses a significant privacy risk, potentially exposing critical operational details.

  • Confidentiality of Gradients in FL setup: The confidentiality of gradients is a concern, as an honest but curious server could potentially access sensitive user information through shared gradients and model parameters. An adversary can gain access to the private logs of an organization through gradient inversion attacks [14].

By achieving our contributions, we address the design goals that follow from this threat model: analyzing high volumes of log data for subtle anomalies, enabling collaborative GNN training without sharing raw logs, and protecting the confidentiality of gradients against an honest-but-curious server.

The rest of this paper is organized as follows. Section 2 reviews the related literature. Section 3 introduces the structure of the P3GNN model. Section 4 evaluates the proposed model, compares the results to related works, and provides interpretability and complexity analysis for the model. Lastly, we conclude the paper in Section 5.

2 Related Work

APT detection has been the subject of numerous studies. In provenance-based methods, Milajerdi et al. [21] developed Holmes, a real-time APT detection system that utilizes provenance graphs for correlating suspicious information flows. It employs expert knowledge of TTPs to match defined exploits within these graphs. This integration allows Holmes to effectively summarize attackers’ actions through robust detection signals and high-level graphs, offering precise detection and low false alarm rates in real-world APT scenarios. Also, in [33], Xie et al. proposed a model named Pangoda, designed for large data environments. It analyzes provenance graphs to detect anomalies, balancing the focus between individual paths and the overall graph structure. This enables quick identification of intrusions and improves detection accuracy. In [30], the authors introduce ProvDetector, a malware detection tool that utilizes provenance graphs. ProvDetector operates by embedding paths within a provenance graph and applying the Local Outlier Factor method to identify malware effectively. Poirot, as proposed in [20], is another provenance-based system designed for threat hunting. It focuses on correlating indicators identified by other systems and constructs attack graphs using expert knowledge from existing cyber threat reports. The system matches the provenance graphs with these attack graphs to detect threats, but it faces challenges in identifying unknown threats not already included in the TTPs and cyber threat reports. DeepTaskAPT introduced in [18] is an LSTM-based deep learning model for detecting APTs by analyzing event sequences in the provenance graph. This model identifies unusual activities by being trained solely on benign sequences, which are created using a tree-based method for task generation. In [8], a system named UNICORN is presented for detecting APT anomalies using sketch-based, time-weighted provenance encoding and graph analysis. 
It focuses on supply-chain attacks, with multi-hop exploration and evolutionary modeling enhancing accuracy and efficiency.

In the context of APT detection in SDNs, the authors in [25] introduce a Hidden Markov Model-based APT detection method that effectively detects and analyzes multi-stage APTs in SDN, offering accurate threat identification with minimal overhead. In [26], the authors propose a method that employs an attack tree-based model for SDN, enabling the detection of APTs by correlating multiple attack behaviors to establish an attack path. However, these methods fail to take into account the privacy of log owners. Huynh Thai Thi et al. present an FL approach for enhancing cyber threat detection in SDN. This method leverages collaborative intelligence across multiple parties while preserving data privacy. The technique focuses on proactive APT attack detection, utilizing FL to improve the performance of detection systems against unknown threats. For the IDS model, the authors employed three local models, including Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) [28]. In [29], the XFedHunter model uses the FedAvg algorithm for decentralized APT detection in SDNs, enhancing privacy. It integrates CNN, GRU, and E-GraphSAGE for intrusion detection and employs the SHAP framework for model explainability. While this approach focuses on privacy by design, it does not explicitly protect model parameters like gradients and weights during training.

3 Proposed Model

In this section, we present an in-depth view of our model, P3GNN, specifically designed for detecting APTs in SDNs. The P3GNN model integrates three key technological paradigms: the APT detection procedure, the SDN infrastructure, and the collaborative and privacy-preserving aspects of FL. Each of these components plays a pivotal role in creating a robust, efficient, and privacy-preserving APT detection system. Figure 1 shows the architecture of the different components in P3GNN. The following subsections will explain each component’s contributions as well as how they work together to enhance our model’s effectiveness and security.

[Figure 1: Architecture of the components of P3GNN]

3.1 SDN Infrastructure

In the design of our system, we view the SDN network as a potential zone for detecting malicious activities, which are likely to appear within the network’s flow data. SDN encompasses three planes: the Control Plane, responsible for traffic-direction decisions, often running on a network operating system; the Data Plane, comprising physical or virtual routers and switches that forward packets per Control Plane instructions; and the Management Plane, overseeing network administration, monitoring, and policy enforcement. SDN utilizes northbound APIs for communication between the control plane and higher-level applications, enabling network programming, and southbound APIs (like OpenFlow) for connecting the control plane to data plane devices for management. To effectively monitor and scrutinize this data for any signs of such activities, it is essential to first set up the SDN network switches. A fundamental component of this process involves the data plane functionality of SDN switching devices, which are pivotal in packet forwarding. The data plane in an SDN switch operates distinctly compared to traditional networking devices. While legacy networks primarily rely on forwarding packets based on IP or MAC addresses, SDN switches introduce a more versatile approach [32]. These switches are tasked with forwarding network flow data for APT detection analysis via the OpenFlow protocol.

3.2 APT Detection Procedure in P3GNN

The APT detection procedure of the P3GNN model as shown in Figure 2 is structured into four phases: data extraction, contextual data representation, GNN training, and finally abnormal node identification and analysis. The subsequent sections will provide an in-depth exploration of each phase, detailing their functionalities and contributions to the model’s capability in APT detection.

[Figure 2: The APT detection procedure of the P3GNN model]

3.2.1 Data extraction:

In the initial phase of our model, we focus on the extraction and initial preprocessing of log data. The primary source of our data comprises system logs, encompassing a wide range of information such as timestamps, process names, and execution paths. Utilizing regular expression techniques, we parse these logs to transform the raw data into a structured format. We define $\Sigma$ as the set of all unique symbols present in the log data. Using regular expressions, we create definitions for specific patterns found in log entries. For example, to identify a UUID, we use a regular expression denoted as $R_{\text{uuid}}$, defined as $\texttt{"uuid":"} \cdot (\Sigma^{*}) \cdot \texttt{"}$, where $\Sigma^{*}$ represents any sequence of characters from our defined symbol set.

The process of transforming log entries into structured data is encapsulated by the parsing function ParseLog. Let $L$ represent the collection of all log entries. Each entry $l \in L$ is a tuple containing elements like timestamp, process name, and execution path. The function ParseLog applies the defined regular expressions to each log entry $l$, converting it into a structured record $d$ in the set $D$. This transformation can be formally expressed as:

$$\text{ParseLog}(l) : L \rightarrow D \qquad (1)$$

This methodical approach allows us to efficiently convert unstructured log data into a format that is more suitable for analysis and further processing.
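As a minimal illustration of the ParseLog step, the sketch below extracts a few fields with Python regular expressions. The specific patterns, field names, and log format are assumptions for demonstration, not the expressions used in our implementation.

```python
import re

# Hypothetical patterns for a JSON-like log line; R_uuid below plays the
# role of the UUID expression described in the text.
UUID_RE = re.compile(r'"uuid":"([0-9a-fA-F-]+)"')
TS_RE = re.compile(r'"timestamp":(\d+)')
NAME_RE = re.compile(r'"name":"([^"]*)"')

def parse_log(entry: str) -> dict:
    """ParseLog: map one raw log entry l in L to a structured record d in D."""
    record = {}
    for field, pattern in (("uuid", UUID_RE), ("timestamp", TS_RE),
                           ("name", NAME_RE)):
        m = pattern.search(entry)
        if m:
            record[field] = m.group(1)
    return record

sample = '{"uuid":"9f1b-22","timestamp":1712, "name":"sshd"}'
print(parse_log(sample))
# {'uuid': '9f1b-22', 'timestamp': '1712', 'name': 'sshd'}
```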

3.2.2 Contextual data representation:

Our approach to detecting anomalous network nodes thrives in environments without predefined attack patterns, moving away from traditional supervised learning that relies on clear labels, which are often unavailable due to the lack of specific attack knowledge during training. We utilize a GCN designed to differentiate and comprehend the subtle behaviors of benign nodes. The core idea is that nodes involved in malicious activities exhibit distinct behavioral anomalies compared to benign ones, allowing our model to learn and identify unique patterns of relationships among node types. The feature extraction process translates each node’s interactions and connections into feature vectors. The contextual data representation phase thus proceeds as follows. Our primary objective is to construct a provenance graph and extract features from the preprocessed logs. The provenance graph, denoted as $G = (V, E)$, is built from the relationships identified in the logs. Here, $V$ represents the set of vertices (entities such as files, processes, and sockets), and $E$ represents the set of edges (interactions such as read, write, and fork).

We define $N_v$ as the total number of unique node types and $N_e$ as the total number of unique edge types present in $G$. To map node and edge types to unique integers, we introduce two mapping functions: $M_v : \text{Types} \rightarrow \{0, \ldots, N_v - 1\}$ for nodes, and $M_e : \text{Types} \rightarrow \{0, \ldots, N_e - 1\}$ for edges.

Each node $v \in V$ is assigned a type label using $M_v$, and its feature vector is determined by a function $F : V \rightarrow \mathbb{N}^{2N_e}$. The feature vector $F(v)$ for a node $v$ comprises the counts of incoming and outgoing edges of each type. Specifically, for $i \in \{0, \ldots, N_e - 1\}$, the feature vector includes:

$$a_i = |\{e \in E_{\text{in}}(v) : M_e(\text{type}(e)) = i\}| \qquad (2)$$
$$a_{i+N_e} = |\{e \in E_{\text{out}}(v) : M_e(\text{type}(e)) = i\}| \qquad (3)$$

where $a_i$ represents the count of incoming edges of type $i$, and $a_{i+N_e}$ denotes the count of outgoing edges of the same type. The final feature vector $F(v)$ is given in Equation 4:

$$F(v) = [a_0, a_1, \ldots, a_{N_e-1}, a_{N_e}, a_{N_e+1}, \ldots, a_{2N_e-1}] \qquad (4)$$
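The feature construction of Equations 2–4 can be sketched as follows. The toy edge list, entity names, and edge-type names are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

def feature_vectors(edges, edge_types):
    """Build F(v) per Equations 2-4: counts of incoming and outgoing
    edges of each type, for edges given as (src, dst, type) triples."""
    Ne = len(edge_types)
    Me = {t: i for i, t in enumerate(sorted(edge_types))}  # M_e mapping
    F = defaultdict(lambda: [0] * (2 * Ne))
    for src, dst, etype in edges:
        i = Me[etype]
        F[dst][i] += 1          # a_i: incoming edge of type i at dst
        F[src][i + Ne] += 1     # a_{i+Ne}: outgoing edge of type i at src
    return dict(F)

# Hypothetical provenance interactions: a process writes two files,
# and another process reads one of them.
edges = [("proc1", "file1", "write"), ("proc1", "file2", "write"),
         ("file1", "proc2", "read")]
F = feature_vectors(edges, {"read", "write"})
print(F["proc1"])  # [0, 0, 0, 2]: two outgoing 'write' edges
```

With sorted types, `read` maps to index 0 and `write` to index 1, so the last slot of `F["proc1"]` counts its outgoing writes.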

3.2.3 GNN training:

Our GCN model is engineered to predict complex node relationships within a graph, using node types and provenance graph structures as inputs. Its primary goal is to accurately predict feature vectors for each node by leveraging their connections and positions within the graph. The architecture features three graph convolutional layers, which aggregate and transform node features according to their connectivity. Each convolution operation is defined as:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (5)$$

In this equation, $H^{(l)}$ represents the node feature matrix at layer $l$, $\tilde{A}$ is the adjacency matrix of the graph, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{(l)}$ is the weight matrix for layer $l$, and $\sigma$ denotes the activation function, ReLU (Rectified Linear Unit). This convolutional operation effectively captures the local graph topology and feature information. Following the first two convolutional layers, batch normalization and ReLU activation functions are applied, with dropout used for regularization. The model’s predicted feature vector is given in Equation 6:

$$\hat{F}(v) = H^{(\text{final})} W^{(\text{final})} \qquad (6)$$

Then, the Mean Squared Error (MSE) is used to assess the difference between the model’s prediction and the actual feature vector $F(v)$ for each node in the graph. Formally, the MSE loss is defined in Equation 7:

$$\text{MSE} = \frac{1}{n} \sum_{v \in V} |F(v) - \hat{F}(v)|^{2} \qquad (7)$$

Here, $n$ represents the total number of nodes in the graph $G$, $F(v)$ denotes the actual feature vector of node $v$, and $\hat{F}(v)$ is the feature vector predicted by the model for node $v$.
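The propagation rule of Equation 5 can be sketched directly in NumPy. The two-node toy graph and identity weight matrix are assumptions for illustration, as is the addition of self-loops to form $\tilde{A}$ (standard GCN practice, which the paper does not spell out).

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution per Equation 5:
    ReLU(D~^{-1/2} A~ D~^{-1/2} H W), with A~ = A + I (self-loops assumed)."""
    A_tilde = A + np.eye(A.shape[0])            # assumed self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

A = np.array([[0.0, 1.0], [1.0, 0.0]])          # two connected nodes
H = np.eye(2)                                   # one-hot initial features
W = np.eye(2)                                   # identity weights for clarity
H1 = gcn_layer(A, H, W)
print(H1)  # each node's features become the normalized neighborhood average
```

With identity weights, each output row is the symmetrically normalized mix of the node's own and its neighbor's features (here, 0.5 everywhere).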

3.2.4 Abnormal node identification:

In the final phase of our model, we focus on identifying abnormal nodes by analyzing deviations from expected behavior. The model determines whether a node is anomalous by calculating an error value based on the absolute difference between the predicted feature vector $\hat{F}(v)$ and the actual feature vector $F(v)$. The error score for node $v$ is computed as follows:

$$\text{error}(v) = |\hat{F}(v) - F(v)| \qquad (8)$$

By setting an appropriate threshold, we distinguish between normal node behavior and significant deviations indicative of anomalies. This threshold is determined using a data-driven approach, where we analyze the distribution of error values in the validation set to select a value that optimally balances accuracy with the minimization of false positives and negatives in anomaly detection.
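A minimal sketch of this thresholding step, assuming a 95th-percentile cutoff on validation-set errors as one possible data-driven choice (the paper does not fix a specific percentile):

```python
import numpy as np

def flag_anomalies(F_true, F_pred, val_errors, pct=95):
    """Flag nodes whose error(v) = |F_pred - F_true| (summed over features)
    exceeds a percentile of the validation-set error distribution."""
    errors = np.abs(F_pred - F_true).sum(axis=1)     # error(v) per node
    threshold = np.percentile(val_errors, pct)       # assumed cutoff rule
    return errors > threshold, errors

# Toy actual and predicted feature vectors for three nodes; the third node
# deviates strongly and should be flagged.
F_true = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 4.0]])
F_pred = np.array([[2.1, 0.0], [1.0, 0.9], [3.0, 0.5]])
val_errors = np.array([0.05, 0.1, 0.2, 0.15, 0.08])  # benign validation errors
flags, errs = flag_anomalies(F_true, F_pred, val_errors)
print(flags)  # [False False  True]
```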

3.3 Secure Model Training through FL and PHC

Our model employs a dual-key mechanism, entailing a public key for encryption and a private key for decryption, specifically used by the PHC. A specialized entity, referred to as the Key Generation Authority (KGA), undertakes the task of generating these keys. During the setup phase, the KGA chooses two distinct large prime numbers $p$ and $q$, ensuring $p \neq q$ and that the greatest common divisor of $pq$ and $(p-1)(q-1)$ is 1. Subsequently, the KGA defines $N = pq$ and computes Carmichael’s function $\lambda(N) = \text{lcm}(p-1, q-1)$. A random integer $g$ is then chosen from $\mathbb{Z}_{N^2}^{*}$ such that $\gcd(L(g^{\lambda} \bmod N^2), N) = 1$, where the function $L$ is defined as $L(x) = \frac{x-1}{N}$. The encryption public key is thus represented as $(N, g)$, and the decryption private key is $(\lambda, \mu)$, where $\mu$ is the integer satisfying $\mu = L(g^{\lambda} \bmod N^2)^{-1} \bmod N$. These keys are disseminated among the organizations while being kept hidden from the server.
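The KGA setup above can be sketched with toy parameters; the tiny primes below are for illustration only and provide no real security, and the choice $g = N + 1$ is one common valid option rather than a requirement of the scheme.

```python
from math import gcd, lcm

def keygen(p=17, q=19):
    """Toy Paillier key generation mirroring the KGA setup."""
    assert p != q and gcd(p * q, (p - 1) * (q - 1)) == 1
    N = p * q
    lam = lcm(p - 1, q - 1)                     # Carmichael's lambda(N)
    g = N + 1                                   # a common valid choice of g
    L = lambda x: (x - 1) // N                  # L(x) = (x - 1) / N
    mu = pow(L(pow(g, lam, N * N)), -1, N)      # mu = L(g^lam mod N^2)^(-1) mod N
    return (N, g), (lam, mu)

pub, priv = keygen()
print(pub)  # (323, 324)
```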

As each federated communication cycle commences, organizations retrieve the global aggregated updates of the preceding round from the server and update their local model parameters according to Equation 9:

$$W_{t+1}^{\text{old}} = W_{t}^{\text{old}} + U_{t}^{\text{global}} \qquad (9)$$

In this equation, $W_{t+1}^{\text{old}}$ denotes the model at the start of round $t+1$, with $W_{t}^{\text{old}}$ and $U_{t}^{\text{global}}$ representing the previous round’s final organization model and aggregated updates, respectively. The organizations then engage in local training on their datasets, developing the GNN model and computing the new parameter vector $W_{t+1}^{\text{new}}$. The update vector of the local model, $U_{t+1}$, intended for subsequent transmission to the server, is derived as $U_{t+1} = W_{t+1}^{\text{new}} - W_{t+1}^{\text{old}}$.
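The per-round bookkeeping described above can be sketched as follows, with toy placeholder values and a trivial stand-in for local training (the `local_train` callable is an assumption representing the GNN training step).

```python
import numpy as np

def client_round(W_old, U_global, local_train):
    """One organization's round: apply Equation 9, train locally, and
    derive the update vector U_{t+1} = W_new - W_old for the server."""
    W_start = W_old + U_global          # W_{t+1}^{old}, Equation 9
    W_new = local_train(W_start)        # stand-in for local GNN training
    U_next = W_new - W_start            # U_{t+1}, later encrypted and sent
    return W_new, U_next

W_old = np.zeros(3)                     # toy flattened parameter vector
U_global = np.array([0.1, -0.2, 0.3])   # toy aggregated global update
W_new, U = client_round(W_old, U_global, lambda w: w + 0.05)
print(U)  # [0.05 0.05 0.05]
```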

Following the reception of keys from the KGA, each organization $k$ executes the PHC algorithm. This involves sampling a random number $r_{t+1}^{k} \in \mathbb{Z}^{*}_{n^{2}}$ with $0 < r_{t+1}^{k} < n$, and converting the organization’s update vector into the ciphertext $C_{t+1}^{k} = g^{U_{t+1}^{k}} \cdot (r_{t+1}^{k})^{n} \bmod n^{2}$.

Post-encryption, each organization sends its ciphertext $C_{t+1}^{k}$ to the server for aggregation and generation of the global model update. Under the PHC, the product of two ciphertexts decrypts to the sum of their underlying plaintexts. The server, upon collecting the ciphertexts from all organizations, applies the Federated Averaging (FedAvg) algorithm together with the PHC to aggregate the updates. In the final phase, the server distributes the aggregated global update back to the organizations. Each organization then uses its private key to run the decryption algorithm, obtaining the final result and updating the global model for training round $t+2$, as shown:
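The additive-homomorphic property the server relies on (multiplying ciphertexts yields an encryption of the sum of the plaintexts) can be illustrated with a self-contained textbook Paillier sketch. The toy primes and integer-quantized updates below are illustrative only; the actual implementation uses the python-paillier library with a 1024-bit key.

```python
import math
import random

# Toy textbook Paillier with the g = n + 1 variant; real deployments
# use keys of at least 1024 bits.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)                  # lambda

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)           # mu = L(g^lambda mod n^2)^(-1) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2   # C = g^m * r^n mod n^2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n      # m = L(C^lambda mod n^2) * mu mod n

# Two organizations' integer-quantized updates:
u1, u2 = 17, 25
agg = (encrypt(u1) * encrypt(u2)) % n2        # server multiplies ciphertexts...
assert decrypt(agg) == u1 + u2                # ...which decrypts to their sum
```

The final assertion is exactly the property exploited during aggregation: the server never sees $u_1$ or $u_2$, only their encrypted combination.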

$U_{t+1}^{global} = L\left(\left(C_{t+1}^{global}\right)^{\lambda} \bmod n^{2}\right) \cdot \mu \bmod n$    (10)
$W_{t+2}^{old} = W_{t+1}^{old} + U_{t+1}^{global}$    (11)

Here, $W_{t+2}^{old}$ is the updated global model at the start of round $t+2$. This iterative process continues until the global model converges.
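Setting encryption aside, the update rule of Eqs. (9) and (11) amounts to FedAvg over the organizations’ update vectors. A minimal sketch in plain Python, with hypothetical two-parameter models:

```python
def federated_round(w_old, local_new):
    """One FL round in the clear: each organization k reports
    U_k = W_new^k - W_old; the server averages the U_k (FedAvg), and the
    next global model is W_old + U_global, as in Eqs. (9) and (11)."""
    updates = [[wn - wo for wn, wo in zip(w, w_old)] for w in local_new]
    u_global = [sum(col) / len(updates) for col in zip(*updates)]
    return [wo + u for wo, u in zip(w_old, u_global)]

# Two organizations, each holding a (hypothetical) 2-parameter model:
w_next = federated_round([1.0, 1.0], [[2.0, 2.0], [4.0, 4.0]])
assert w_next == [3.0, 3.0]   # updates are [1,1] and [3,3]; their mean is [2,2]
```

In the full protocol the `updates` list never exists in plaintext on the server; each entry arrives Paillier-encrypted and the averaging is carried out homomorphically.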

4 Performance Evaluation

4.1 Experimental Setup

We used PyTorch 2.0.1 and CUDA 11.8 with an Intel i7-13700F CPU and NVIDIA GeForce RTX 4070 GPU. Our dataset’s training set (60%) contains only normal data, while the validation and test sets (20% each) include both normal and anomalous data, facilitating effective model evaluation and threshold tuning. The model, configured with a 0.5 learning rate and an Adam optimizer, underwent training across three rounds, each involving 100 epochs per organization. We tested scalability by varying the number of participating organizations from 2 to 10, observing stable performance with minimal fluctuations. Data encryption was managed using the python-paillier library with a 1024-bit key [5]. The architecture of our GNN model is detailed in Table 1.

Table 1: Architecture of the GNN model.

Layer | Input Dimension | Output Dimension
GCNConv1 | 9 | 256
BatchNorm1 | - | 256
GCNConv2 | 256 | 128
BatchNorm2 | - | 128
GCNConv3 (Output Layer) | 128 | 23
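The layer dimensions of Table 1 can be sketched in plain PyTorch. This is an approximation that implements graph convolution as $\hat{A} H W$ on a dense normalized adjacency, not the paper’s exact code (the `GCNConv` naming suggests a library such as PyTorch Geometric):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = A_hat @ H @ W (dense normalized adjacency)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x, a_hat):
        return a_hat @ self.lin(x)

class P3GNNEncoder(nn.Module):
    """Layer dimensions follow Table 1: 9 -> 256 -> 128 -> 23."""
    def __init__(self):
        super().__init__()
        self.conv1, self.bn1 = GCNLayer(9, 256), nn.BatchNorm1d(256)
        self.conv2, self.bn2 = GCNLayer(256, 128), nn.BatchNorm1d(128)
        self.conv3 = GCNLayer(128, 23)

    def forward(self, x, a_hat):
        x = torch.relu(self.bn1(self.conv1(x, a_hat)))
        x = torch.relu(self.bn2(self.conv2(x, a_hat)))
        return self.conv3(x, a_hat)

# Toy 4-node provenance graph: symmetric adjacency with self-loops,
# normalized as D^{-1/2} (A + I) D^{-1/2}.
A = torch.tensor([[0., 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]]) + torch.eye(4)
d = A.sum(dim=1)
a_hat = A / torch.sqrt(d[:, None] * d[None, :])

out = P3GNNEncoder()(torch.randn(4, 9), a_hat)
assert tuple(out.shape) == (4, 23)   # one 23-dimensional embedding per node
```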

4.2 Dataset

The dataset for this study comes from the DARPA TCE3 program and was collected during red-team versus blue-team adversarial engagements in April 2018. These engagements, designed to simulate diverse cyberattack scenarios, involved the red team conducting attacks while the blue team defended. The setup included multiple servers, such as web, SSH, email, and SMB, replicating a real-world network. Realism was further enhanced by simulating extensive normal user activity, including routine system administration, web browsing, and the installation of various tools. The threat descriptions and ground-truth reports from DARPA provided key insights into the red team’s tactics and strategies [10]. Despite the dataset’s large volume of events, malicious events represent only a small fraction, making anomaly and threat detection challenging due to the significant imbalance between benign and malicious activities [3].

4.3 Evaluation Metrics

To evaluate our model, we use several metrics based on the definitions of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Accuracy is the proportion of correct predictions, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$; Precision is the ratio of correct attack identifications to all instances labeled as attacks, $\text{Precision} = \frac{TP}{TP + FP}$; and Recall is the proportion of actual attacks correctly identified, $\text{Recall} = \frac{TP}{TP + FN}$. The F1-score balances precision and recall, $\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, while the False Positive Rate measures the fraction of benign instances incorrectly flagged as attacks, $\text{FPR} = \frac{FP}{FP + TN}$.
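These definitions translate directly into code. The confusion-matrix counts below are hypothetical, chosen only to exercise the formulas:

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    fpr = fp / (fp + tn)
    return {"acc": acc, "prec": prec, "rec": rec, "f1": f1, "fpr": fpr}

# Hypothetical counts: 90 detected attacks, 10 missed attacks,
# 60 false alarms, 890 correctly classified benign instances.
m = detection_metrics(tp=90, tn=890, fp=60, fn=10)
assert abs(m["prec"] - 0.6) < 1e-9 and abs(m["rec"] - 0.9) < 1e-9
```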

4.4 Detection Evaluation

The results of our proposed model and a performance comparison with other anomaly detection models are presented in Table 2. Our approach stands out for its unsupervised learning methodology, which enables the detection of APT attacks without prior exposure to malicious data. The model achieves a high accuracy of 0.93 while maintaining a low false positive rate of 0.06. This performance is particularly noteworthy because it is achieved without the supervised learning techniques commonly employed in existing models. Due to the lack of existing research applying unsupervised learning to this dataset, this study undertakes a comparative analysis against supervised techniques. Our model was benchmarked against several state-of-the-art models, including DeepTaskAPT [18], DeepLog [6], LogShield, and BERT [1], as well as conventional baselines (Random Forest and Linear Regression) [18]. While some of these methods demonstrate superior metrics in certain respects, they rely predominantly on supervised learning. This reliance is a significant limitation, especially for zero-day attacks, where the model has no prior knowledge of attack patterns. Moreover, these existing models overlook the critical aspect of privacy in log data. In contrast, our model integrates FL and homomorphic encryption, addressing privacy concerns more effectively.

Table 2: Performance comparison with other anomaly detection models.

Model | Method | Acc | Prec | Rec | F1-score | FPR
P3GNN | Unsupervised | 0.93 | 0.73 | 0.91 | 0.81 | 0.06
DeepTaskAPT | Supervised | 0.985 | N/A | N/A | N/A | 0.011
DeepLog | Supervised | 0.83 | N/A | N/A | N/A | 0.16
Random Forest | Supervised | 0.80 | N/A | N/A | N/A | 0.18
Linear Regression | Supervised | 0.08 | N/A | N/A | N/A | 0.93
LogShield | Supervised | 0.98 | 0.99 | 0.99 | 0.984 | N/A
BERT | Supervised | 0.89 | 0.91 | 0.95 | 0.90 | N/A

4.5 Computational Complexity Analysis of P3GNN

The complexity of the P3GNN model arises from data extraction, graph construction, GNN training, and anomaly detection. Incorporating FL and PHC adds to this complexity by introducing encryption and decryption overheads, which are crucial for ensuring data privacy in distributed networks. The base complexity of the P3GNN model, without FL and PHC, depends on the log data volume, the graph’s size and structure (vertices $V$ and edges $E$), and the depth of computational requirements for GNN training. These factors contribute to a largely linear complexity, scaling with the size of the data and the graph, and the model in this configuration benefits from direct, unencrypted operations on data and parameters. In contrast, when FL and PHC are integrated into the model, each operation involving model parameters (especially during the GNN training phase) incurs additional complexity due to encryption and decryption. This complexity is polynomial, specifically $O(n_{\text{params}} \cdot k^{3})$, where $n_{\text{params}}$ is the number of model parameters and $k$ is the encryption key size in bits. This added complexity is a direct consequence of the cryptographic processes required to secure the model’s distributed learning environment.
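The cubic dependence on the key size follows from the cost of modular exponentiation under schoolbook arithmetic, a standard estimate (the paper does not derive it explicitly):

```latex
\underbrace{O(k)}_{\text{modular multiplications per exponentiation}}
\times
\underbrace{O(k^{2})}_{\text{bit operations per $k$-bit multiplication}}
= O(k^{3}),
```

so encrypting each of the $n_{\text{params}}$ gradient entries independently yields the stated $O(n_{\text{params}} \cdot k^{3})$.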

Table 3: Computational complexity of P3GNN with and without FL and PHC.

Component | Without FL and PHC | With FL and PHC
Data Extraction | O(n_logs · c) | O(n_logs · c)
Contextual Data Rep. | O(n_vertices + n_edges) | O(n_vertices + n_edges) + O(n_params · k^3)
GNN Training | O(|V| + |E|) per layer | O(|V| + |E|) per layer + O(n_params · k^3)
Abnormal Node Ident. | O(n_vertices) | O(n_vertices) + O(n_params · k^3)

Table 3 summarizes the computational complexity of P3GNN with and without privacy preservation techniques. The added computational complexity from integrating FL and PHC into the P3GNN model is a necessary trade-off for ensuring robust privacy and security in distributed ML environments. This complexity enables collaborative training across multiple entities without exposing sensitive data, aligning with stringent data protection regulations and safeguarding against unauthorized access, making it indispensable for securely handling sensitive or confidential information.

4.6 Interpretability of P3GNN

This section delves into the interpretability of our model, particularly focusing on its capability to detect attacks at the node level and trace the entirety of an anomalous neighborhood involved in an attack. By leveraging GNNs, P3GNN goes beyond mere identification, enabling the dissection of attack patterns through node-level analysis.

4.6.1 Visualization and Node-Level Analysis:

P3GNN stands out for its ability to precisely identify anomalous nodes within a network. Figure 3a emphasizes the paths and sequences formed by anomalous nodes. The flow of malicious activities (red) amidst the benign (blue) nodes reveals the model’s detection of attack trajectories, which is critical for understanding the sequential progression and strategies of the attackers. This visualization, enabled by the capabilities of GNNs, enhances interpretability and is particularly useful for path analysis: it allows security analysts to track potential attack trajectories and comprehend how malicious activities may propagate, making it an essential tool for formulating effective defensive strategies.

Additionally, P3GNN highlights clusters of malicious nodes that can indicate initial breach points or critical attack vectors. Figure 3b illustrates a subgraph extracted from the overall network, highlighting prominent clusters of malicious and benign nodes. In our network visualizations, such clusters emerge organically from the inherent structure and connectivity patterns of the network data. They are marked by dense interconnectivity, a hallmark of coordinated or related malicious activities, and therefore demand urgent investigative action. In this way, security analysts are empowered to conduct focused investigations and implement targeted security measures.

Figure 3: (a) Attack trajectories traced through anomalous (red) and benign (blue) nodes; (b) a subgraph highlighting clusters of malicious and benign nodes.

4.6.2 Case Study: Nginx Backdoor w/ Drakon In-Memory:

To further illustrate the model’s interpretability, we analyze a specific attack incident that occurred within the DARPA network infrastructure. In this incident, the attacker targeted the CADETS host, initially exploiting a vulnerability in the Nginx server through a malformed HTTP request. This led to the execution of a Drakon implant in the Nginx server’s memory, which established a shell connection to the attacker’s console over HTTP. The attacker’s subsequent actions involved multiple attempts to download and elevate privileges for a micro APT implant. Despite repeated failures at privilege elevation, the attacker eventually executed the Drakon implant with root privileges and later conducted a port scan of multiple network targets to assess vulnerabilities, leaving the console connection open throughout. This attack sequence highlights several critical steps:

  • Initial exploitation of Nginx and Drakon implant execution.

  • Repeated attempts to download and elevate the micro APT implant.

  • Successful execution of the Drakon implant with elevated privileges.

  • Network reconnaissance through port scanning.

The interpretability of our model is rooted in its ability to not just identify the occurrence of an attack but to unravel the sequence of events constituting it. When an anomalous activity is identified, the model raises alarms for these specific nodes, signaling potential security breaches. To comprehensively map out the attack’s progression, we employ a depth-first search algorithm on the neighborhood of the identified anomalous nodes. In the DARPA attack case, the model successfully pinpointed each critical step, providing a comprehensive view of the attack’s progression. Figure 4 presents a graphical representation of the attack steps as identified by our model.
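The neighborhood expansion described above can be sketched as a depth-first traversal over provenance edges. The node names below are hypothetical stand-ins for the UUIDs in the actual graph:

```python
from collections import defaultdict

def trace_attack(edges, anomalous):
    """Depth-first expansion of the neighborhoods of flagged nodes."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    seen, order = set(), []

    def dfs(u):
        if u in seen:
            return
        seen.add(u)
        order.append(u)
        for v in adj[u]:
            dfs(v)

    for node in anomalous:
        dfs(node)
    return order

# Hypothetical provenance edges; the unrelated cron -> logrotate edge
# is never reached from the flagged node.
edges = [("nginx", "drakon_shell"),
         ("drakon_shell", "micro_apt_dl"),
         ("drakon_shell", "port_scan"),
         ("cron", "logrotate")]
print(trace_attack(edges, ["nginx"]))
# -> ['nginx', 'drakon_shell', 'micro_apt_dl', 'port_scan']
```

Starting only from the nodes the GNN flags keeps the traversal focused on the attack’s footprint rather than the full graph.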

Figure 4: Graphical representation of the attack steps as identified by our model.

Our model’s detection ability is further demonstrated by correlating discrete log entries with the corresponding graph interactions. By examining the flow of UUIDs within the network and mapping them to recorded events, as shown in Table 4, our system constructs a narrative that aligns with known attack patterns. For instance, during the initial compromise phase, our model detected the events EVENT_READ and EVENT_EXECUTE in conjunction with specific UUID interactions, implying unauthorized access and potential credential compromise. As the attack transitioned to establishing a foothold, process modifications and principal changes were captured, highlighting the attacker’s efforts to gain a stronger grip on the system.

Subsequently, an escalation of privileges was inferred from principal changes logged, suggesting that the attacker was attempting to gain higher-level access. This was further corroborated by the event ‘EVENT_CHANGE_PRINCIPAL‘ linked to UUIDs accessing sensitive system files. Our model’s depth-first search illuminated these steps in the context of the larger attack framework, showcasing the depth of our analysis. In the final phases, internal reconnaissance and efforts to maintain presence were identified through repeated ‘EVENT_READ‘ and ‘EVENT_EXECUTE‘ actions. These steps were pivotal in the attacker’s strategy to gather intelligence and ensure long-term access to the compromised system. By integrating log analysis with graph theory, our model transcends traditional detection mechanisms, offering a nuanced and dynamic approach to cybersecurity. This case study serves as a testament to the model’s efficacy in recognizing complex attack patterns and the associated network behaviors.

Table 4: Correlation of attack phases with log entries and graph interactions.

Attack Phase | Log Entry Details | Graph Interaction (UUIDs)
Initial Compromise | EVENT_READ, EVENT_EXECUTE on /etc/login.conf, /etc/group | CA352F0F-3E59-11E8-A5CB-3FA3753A265A → 9A717117-65ED-675C-AD65-38102C67C832; CA357EAC-3E59-11E8-A5CB-3FA3753A265A → 308E149F-F225-9859-A5F2-8C9BB9983D48
Establish Foothold | EVENT_MODIFY_PROCESS, EVENT_CHANGE_PRINCIPAL | CA353E2C-3E59-11E8-A5CB-3FA3753A265A → 24782D99-5286-465D-8652-63817D46C1BD; CA353E2C-3E59-11E8-A5CB-3FA3753A265A → 76A739B1-A13E-FD50-BEA1-B218A0FD0400
Escalate Privileges | EVENT_CHANGE_PRINCIPAL indicating privilege escalation | CA353E2C-3E59-11E8-A5CB-3FA3753A265A → 308E149F-F225-9859-A5F2-8C9BB9983D48
Internal Reconnaissance | EVENT_READ on various system files | CA357EAC-3E59-11E8-A5CB-3FA3753A265A → 9A717117-65ED-675C-AD65-38102C67C832
Maintain Presence | Repeated EVENT_READ, EVENT_EXECUTE | CA352F0F-3E59-11E8-A5CB-3FA3753A265A → various UUIDs

5 Conclusion

In conclusion, this paper introduced P3GNN, a provenance graph-based model for APT detection in SDNs that integrates GCN with FL and homomorphic encryption for privacy preservation. The model effectively detects APTs without prior attack knowledge while maintaining data privacy and integrity. Our tests on the DARPA TC dataset show robust performance, achieving an accuracy of 0.93 and a false positive rate of 0.06. This indicates a significant improvement over traditional supervised learning methods, which often struggle with zero-day attacks and privacy concerns. Furthermore, the model’s capability to pinpoint anomalous nodes and trace the entire attack trajectory offers invaluable insights for security analysis, addressing a critical need for interpretability in AI. In future work, we plan to broaden our study by exploring the approach’s applicability across different domains and scenarios, with a focus on empirical validation in these contexts. A key extension will involve adapting our method to defend against APT attacks across multiple hosts by shifting from host-level to network-level monitoring, enabling the detection of lateral movement in APT scenarios. We also plan to incorporate real-time data processing into the P3GNN model, enhancing its effectiveness in dynamic network environments and enabling faster responses to emerging threats. Finally, we plan to automate the generation of Cyber Threat Intelligence reports using P3GNN to streamline the documentation of attack details and mitigation strategies. This automation will utilize provenance data to improve the efficiency and consistency of threat reporting and analysis in cybersecurity.

References

  • [1]Afnan, S., Sadia, M., Iqbal, S., Iqbal, A.: Logshield: A transformer-based apt detection system leveraging self-attention (2023)
  • [2]Alshamrani, A., Myneni, S., Chowdhary, A., Huang, D.: A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities. IEEE Communications Surveys & Tutorials 21(2), 1851–1877 (2019). https://doi.org/10.1109/COMST.2019.2891891
  • [3]Anjum, M.M., Iqbal, S., Hamelin, B.: Analyzing the usefulness of the darpa optc dataset in cyber threat detection research. In: Proceedings of the 26th ACM Symposium on Access Control Models and Technologies. pp. 27–32. SACMAT ’21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3450569.3463573
  • [4]Dabbagh, M., Hamdaoui, B., Guizani, M., Rayes, A.: Software-defined networking security: pros and cons. IEEE Communications Magazine 53(6), 73–79 (2015). https://doi.org/10.1109/MCOM.2015.7120048
  • [5]Data61, C.: Python paillier library. https://github.com/data61/python-paillier (2013)
  • [6]Du, M., Li, F., Zheng, G., Srikumar, V.: Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. pp. 1285–1298 (2017)
  • [7]Ghiasvand, S., Ciorba, F.M.: Anonymization of system logs for privacy and storage benefits. CoRR abs/1706.04337 (2017), http://arxiv.org/abs/1706.04337
  • [8]Han, X., Pasquier, T., Bates, A., Mickens, J., Seltzer, M.: Unicorn: Runtime provenance-based detector for advanced persistent threats. arXiv preprint arXiv:2001.01525 (2020)
  • [9]Jagarlamudi, G.K., Yazdinejad, A., Parizi, R.M., Pouriyeh, S.: Exploring privacy measurement in federated learning. The Journal of Supercomputing pp. 1–41 (2023)
  • [10]Keromytis, A.D.: Transparent computing engagement 3 data release. README-E3. md (2018)
  • [11]Konečný, J., McMahan, H.B., Yu, F.X., Richtárik, P., Suresh, A.T., Bacon, D.: Federated learning: Strategies for improving communication efficiency (2017)
  • [12]Lara, A., Kolasani, A., Ramamurthy, B.: Network innovation using openflow: A survey. IEEE Communications Surveys & Tutorials 16(1), 493–512 (2014). https://doi.org/10.1109/SURV.2013.081313.00105
  • [13]Latif, Z., Sharif, K., Li, F., Karim, M.M., Biswas, S., Wang, Y.: A comprehensive survey of interface protocols for software defined networks. Journal of Network and Computer Applications 156, 102563 (2020). https://doi.org/10.1016/j.jnca.2020.102563
  • [14]Liang, H., Li, Y., Zhang, C., Liu, X., Zhu, L.: Egia: An external gradient inversion attack in federated learning. IEEE Transactions on Information Forensics and Security (2023)
  • [15]Liang, Y., Li, S., Yan, C., Li, M., Jiang, C.: Explaining the black-box model: A survey of local interpretation methods for deep neural networks. Neurocomputing 419, 168–182 (2021). https://doi.org/10.1016/j.neucom.2020.08.011
  • [16]Lv, Y., Qin, S., Zhu, Z., Yu, Z., Li, S., Han, W.: A review of provenance graph based apt attack detection:applications and developments. In: 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC). pp. 498–505 (2022). https://doi.org/10.1109/DSC55868.2022.00075
  • [17]Maggi, F., Matteucci, M., Zanero, S.: Detecting intrusions through system call sequence and argument analysis. IEEE Transactions on Dependable and Secure Computing 7(4), 381–395 (2010). https://doi.org/10.1109/TDSC.2008.69
  • [18]Mamun, M., Shi, K.: Deeptaskapt: Insider apt detection using task-tree based deep learning. In: 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). pp. 693–700 (2021). https://doi.org/10.1109/TrustCom53373.2021.00102
  • [19]Manzoor, E., Milajerdi, S.M., Akoglu, L.: Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1035–1044 (2016)
  • [20]Milajerdi, S.M., Eshete, B., Gjomemo, R., Venkatakrishnan, V.: Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. pp. 1795–1812. CCS ’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3319535.3363217
  • [21]Milajerdi, S.M., Gjomemo, R., Eshete, B., Sekar, R., Venkatakrishnan, V.: Holmes: Real-time apt detection through correlation of suspicious information flows. In: 2019 IEEE Symposium on Security and Privacy (SP). pp. 1137–1152 (2019). https://doi.org/10.1109/SP.2019.00026
  • [22]Pasquier, T., Han, X., Goldstein, M., Moyer, T., Eyers, D., Seltzer, M., Bacon, J.: Practical whole-system provenance capture. In: Proceedings of the 2017 Symposium on Cloud Computing. pp. 405–418 (2017)
  • [23]Rabieinejad, E., Yazdinejad, A., Dehghantanha, A., Srivastava, G.: Two-level privacy-preserving framework: Federated learning for attack detection in the consumer internet of things. IEEE Transactions on Consumer Electronics (2024)
  • [24]Sakhnini, J., Karimipour, H., Dehghantanha, A., Yazdinejad, A., Gadekallu, T.R., Victor, N., Islam, A.: A generalizable deep neural network method for detecting attacks in industrial cyber-physical systems. IEEE Systems Journal (2023)
  • [25]Shan-Shan, J., Ya-Bin, X.: The apt detection method in sdn. In: 2017 3rd IEEE International Conference on Computer and Communications (ICCC). pp. 1240–1245. IEEE (2017)
  • [26]Shan-Shan, J., Ya-Bin, X.: The apt detection method based on attack tree for sdn. In: Proceedings of the 2nd International Conference on Cryptography, Security and Privacy. pp. 116–121. ICCSP 2018, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3199478.3199481
  • [27]Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. pp. 1310–1321 (2015)
  • [28]Thi, H.T., HoangSon, N.D., Duy, P.T., Pham, V.H.: Federated learning-based cyber threat hunting for apt attack detection in sdn-enabled networks. In: 2022 21st International Symposium on Communications and Information Technologies (ISCIT). pp.1–6 (2022). https://doi.org/10.1109/ISCIT55906.2022.9931222
  • [29]Thi, H.T., Son, N.D.H., Duy, P.T., Khoa, N.H., Ngo-Khanh, K., Pham, V.H.: Xfedhunter: An explainable federated learning framework for advanced persistent threat detection in sdn (2023)
  • [30]Wang, Q., Hassan, W.U., Li, D., Jee, K., Yu, X., Zou, K., Rhee, J., Chen, Z., Cheng, W., Gunter, C.A., et al.: You are what you do: Hunting stealthy malware via data provenance analysis. In: NDSS (2020)
  • [31]Wu, Y., Xie, Y., Liao, X., Zhou, P., Feng, D., Wu, L., Li, X., Wildani, A., Long, D.: Paradise: real-time, generalized, and distributed provenance-based intrusion detection. IEEE Transactions on Dependable and Secure Computing 20(2), 1624–1640 (2022)
  • [32]Xia, W., Wen, Y., Foh, C.H., Niyato, D., Xie, H.: A survey on software-defined networking. IEEE Communications Surveys & Tutorials 17(1), 27–51 (2014)
  • [33]Xie, Y., Feng, D., Hu, Y., Li, Y., Sample, S., Long, D.: Pagoda: A hybrid approach to enable efficient real-time provenance based intrusion detection in big data environments. IEEE Transactions on Dependable and Secure Computing 17(6), 1283–1296 (2020). https://doi.org/10.1109/TDSC.2018.2867595
  • [34]Yazdinejad, A., Bohlooli, A., Jamshidi, K.: Efficient design and hardware implementation of the openflow v1. 3 switch on the virtex-6 fpga ml605. The Journal of Supercomputing 74, 1299–1320 (2018)
  • [35]Yazdinejad, A., Dehghantanha, A., Karimipour, H., Srivastava, G., Parizi, R.M.: An efficient packet parser architecture for software-defined 5g networks. Physical Communication 53, 101677 (2022)
  • [36]Yazdinejad, A., Parizi, R.M., Dehghantanha, A., Zhang, Q., Choo, K.K.R.: An energy-efficient sdn controller architecture for iot networks with blockchain-based security. IEEE Transactions on Services Computing 13(4), 625–638 (2020)
  • [37]Yazdinejadna, A., Parizi, R.M., Dehghantanha, A., Khan, M.S.: A kangaroo-based intrusion detection system on software-defined networks. Computer Networks 184, 107688 (2021). https://doi.org/10.1016/j.comnet.2020.107688