Cyber Analytics Lab

Automated Test-Case Generation for REST APIs Using Model Inference Search Heuristic
Clinton Cao, Annibale Panichella, Sicco Verwer
International Conference on Automation of Software Test (AST), 2025.
The rising popularity of the microservice architectural style has led to a growing demand for automated testing approaches tailored to these systems. EvoMaster is a state-of-the-art tool that uses Evolutionary Algorithms (EAs) to automatically generate test cases for microservices’ REST APIs. One limitation of these EAs is the use of unit-level search heuristics, such as branch distances, which focus on fine-grained code coverage and may not effectively capture the complex, interconnected behaviors characteristic of system-level testing. To address this limitation, we propose a new search heuristic (MISH) that uses real-time automaton learning to guide the test case generation process. We...
On Bidirectional Deterministic Finite Automata
Simon Dieck, Sicco Verwer
Interational Conference on Implementation and Application of Automata (CIAA), 2024.
Bidirectional deterministic finite automata (biDFA) are a recent innovation with many potential applications. In this paper, we present novel theoretical results for bidirectional automata. We show that there exist regular languages, where minimal biDFA models are exponentially smaller than minimal DFA models. We show this for a language that has a structure common to software logs. This makes biDFA especially interesting when inferring models from such data. However, we also prove that the problem of biDFA minimization is NP-hard. As our key contribution, we provide a Myhill-Nerode style congruence-based characterization for the languages they can recognize. Since most algorithms for...
CATMA: Conformance Analysis Tool For Microservice Applications
Clinton Cao, Simon Schneider, Nicolás E. Diaz Ferreyra, Annibale Panichella, Sicco Verwer, Riccardo Scandariato
International Conference on Software Engineering (ICSE), 2024.
The microservice architecture allows developers to divide the core functionality of their software system into multiple smaller services. However, this architectural style also makes it harder for them to debug and assess whether the system’s deployment conforms to its implementation. We present CATMA, an automated tool that detects non-conformances between the system’s deployment and implementation. It automatically visualizes and generates potential interpretations for the detected discrepancies. Our evaluation of CATMA shows promising results in terms of performance and providing useful insights. CATMA is available at https://cyber-analytics.nl/catma.github.io/, and a demonstration video is available at https://youtu.be/WKP1hG-TDKc.
PDFA Distillation with Error Bound Guarantees
Robert Baumgartner, Sicco Verwer
Interational Conference on Implementation and Application of Automata (CIAA), 2024.
Active learning algorithms to infer probabilistic finite automata (PFA) have gained interest recently, due to their ability to provide surrogate models for some types of neural networks. However, recent approaches either cannot guarantee determinism, which makes the automaton harder to understand and compute, or they rely on techniques that bound errors on individual transitions. In this work we propose a derivative of the recent algorithm to learn deterministic PFA (PDFA) from systems returning a distribution over a set of tokens given an input string. Along with determinism, we can give error bounds on probabilities assigned to whole strings with an...
Detecting Changes in Loop Behavior for Active Learning
Bram Verboom, Simon Dieck, Sicco Verwer
International Conference on Grammatical Inference (ICGI), 2023.
Active automaton learning is a popular approach for building models of software systems. The approach forms a hypothesis model from observations and then performs a heuristic equivalence query to check if the learned model is equal to the model under test. The current methods for equivalence queries, however often fail to find counterexamples when encountering loops, one of the most common control structures in software. We introduce two novel equivalence checkers that better handle loops. One extends the well-known W-Method, and the other uses symbolic execution. Both methods are tested on RERS challenge problems. We show that our approaches find...
Learning Syntactic Monoids from Samples by extending known Algorithms for learning State Machines
Simon Dieck, Sicco Verwer
International Conference on Grammatical Inference (ICGI), 2023.
For the inference of regular languages, most current methods learn a version of deterministic finite automata. Syntactic monoids are an alternative representation of regular languages, which have some advantages over automata. For example, traces can be parsed starting from any index and the star-freeness of the language they represent can be checked in polynomial time. But, to date, there existed no passive learning algorithm for syntactic monoids. In this paper, we prove that known state-merging algorithms for learning deterministic finite automata can be instrumented to learn syntactic monoids instead, by using as the input a special structure proposed in this...
Learning state machines from data streams: A generic strategy and an improved heuristic
Robert Baumgartner, Sicco Verwer
International Conference on Grammatical Inference (ICGI), 2023.
State machines models are models that simulate the behavior of discrete event systems, capable of representing systems such as software systems, network interactions, and control systems, and have been researched extensively. The nature of most learning algorithms however is the assumption that all data be available at the begining of the algorithm, and little research has been done in learning state machines from streaming data. In this paper, we want to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete...
Optimal Decision Tree Policies for Markov Decision Processes
Daniël Vos, Sicco Verwer
International Joint Conference on Artificial Intelligence (IJCAI), 2023.
Interpretability of reinforcement learning policies is essential for many real-world tasks but learning such interpretable policies is a hard problem. Particularly, rule-based policies such as decision trees and rules lists are difficult to optimize due to their non-differentiability. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a policy that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MPDs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and MDP formulation, OMDT directly maximizes the expected discounted return for the...
SoK: Explainable Machine Learning for Computer Security Applications
Azqa Nadeem, Daniël Vos, Clinton Cao, Luca Pajola, Simon Dieck, Robert Baumgartner, Sicco Verwer
IEEE European Symposium on Security and Privacy (Euro S&P), 2023.
Explainable Artificial Intelligence (XAI) is a promising solution to improve the transparency of machine learning (ML) pipelines. We systematize the increasingly growing (but fragmented) microcosm of studies that develop and utilize XAI methods for defensive and offensive cybersecurity tasks. We identify 3 cybersecurity stakeholders, i.e., model users, designers, and adversaries, that utilize XAI for 5 different objectives within an ML pipeline, namely 1) XAI-enabled decision support, 2) applied XAI for security tasks, 3) model verification via XAI, 4) explanation verification & robustness, and 5) offensive use of explanations. We further classify the literature w.r.t. the targeted security domain. Our analysis...
Learning state machines via efficient hashing of future traces
Robert Baumgartner, Sicco Verwer
Learning and Automata workshop (LearnAut), 2022.
State machines are popular models to model and visualize discrete systems such as software systems, and to represent regular grammars. Most algorithms that passively learn state machines from data assume all the data to be available from the beginning and they load this data into memory. This makes it hard to apply them to continuously streaming data and results in large memory requirements when dealing with large datasets. In this paper we propose a method to learn state machines from data streams using the count-min-sketch data structure to reduce memory requirements. We apply state merging using the well-known red-blue-framework to...
Learning State Machines to Monitor and Detect Anomalies on a Kubernetes Cluster
Clinton Cao, Sicco Verwer, Agathe Blaise, Filippo Rebecchi
International Workshop on Continuous Software Evaluation and Certification (IWCSEC) @ ARES, 2022.
These days more companies are shifting towards using cloud environments to provide their services to their client. While it is easy to set up a cloud environment, it is equally important to monitor the system’s runtime behaviour and identify anomalous behaviours that occur during its operation. In recent years, the utilisation of Recurrent Neural Networks (RNNs) and Deep Neural Networks (DNNs) to detect anomalies that might occur during runtime has been a trending approach. However, it is unclear how to explain the decisions made by these networks and how these networks should be interpreted to understand the runtime behaviour that...
Robust Attack Graph Generation
Dennis Mouwen, Sicco Verwer, Azqa Nadeem
Learning and Automata workshop (LearnAut), 2022.
We present a method to learn automaton models that are more robust to input modifications. It iteratively aligns sequences to a learned model, modifies the sequences to their aligned versions, and re-learns the model. Automaton learning algorithms are typically very good at modeling the frequent behavior of a software system. Our solution can be used to also learn the behavior present in infrequent sequences, as these will be aligned to the frequent ones represented by the model. We apply our method to the SAGE tool for modeling attacker behavior from intrusion alerts. In experiments, we demonstrate that our algorithm learns...
Adversarially Robust Decision Tree Relabeling
Daniël Vos, Sicco Verwer
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2022.
Decision trees are popular models for their interpretation properties and their success in ensemble models for structured data. However, common decision tree learning algorithms produce models that suffer from adversarial examples. Recent work on robust decision tree learning mitigates this issue by taking adversarial perturbations into account during training. While these methods generate robust shallow trees, their relative quality reduces when training deeper trees due the methods being greedy. In this work we propose robust relabeling, a post-learning procedure that optimally changes the prediction labels of decision tree leaves to maximize adversarial robustness. We show this can be achieved in...
Robust Optimal Classification Trees against Adversarial Examples
Daniël Vos, Sicco Verwer
AAAI Conference on Artificial Intelligence, 2022.
Decision trees are a popular choice of explainable model, but just like neural networks, they suffer from adversarial examples. Existing algorithms for fitting decision trees robust against adversarial examples are greedy heuristics and lack approximation guarantees. In this paper we propose ROCT, a collection of methods to train decision trees that are optimally robust against user-specified attack models. We show that the min-max optimization problem that arises in adversarial learning can be solved using a single minimization formulation for decision trees with 0-1 loss. We propose such formulations in Mixed-Integer Linear Programming and Maximum Satisfiability, which widely available solvers can...
SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting
Azqa Nadeem, Sicco Verwer
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2022.
Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable highquality clusters, and ii) it...
Efficient training of robust decision trees against adversarial examples
Daniël Vos, Sicco Verwer
International Conference on Machine Learning (ICML), 2021.
Current state-of-the-art algorithms for training robust decision trees have high runtime costs and require hours to run. We present GROOT, an efficient algorithm for training robust decision trees and random forests that runs in a matter of seconds to minutes. Where before the worst-case Gini impurity was computed iteratively, we find that we can solve this function analytically to improve time complexity from O (n) to O (1) in terms of n samples. Our results on both single trees and ensembles on 14 structured datasets as well as on MNIST and Fashion-MNIST demonstrate that GROOT runs several orders of magnitude...
SAGE Intrusion Alert-driven Attack Graph Extractor
Azqa Nadeem, Sicco Verwer, Shanchieh Jay Yang
Symposium on Visualization for Cyber Security (VizSec), 2021.
Attack graphs (AG) are used to assess pathways availed by cyber adversaries to penetrate a network. State-of-the-art approaches for AG generation focus mostly on deriving dependencies between system vulnerabilities based on network scans and expert knowledge. In real-world operations however, it is costly and ineffective to rely on constant vulnerability scanning and expert-crafted AGs. We propose to automatically learn AGs based on actions observed through intrusion alerts, without prior expert knowledge. Specifically, we develop an unsupervised sequence learning system, SAGE, that leverages the temporal and probabilistic dependence between alerts in a suffix-based probabilistic deterministic finite automaton(S-PDFA) – a model that...
Enabling Visual Analytics via Alert-driven Attack Graphs
Azqa Nadeem, Sicco Verwer, Stephen Moskal, Shanchieh Jay Yang
ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021.
Attack graphs (AG) are a popular area of research that display all the paths an attacker can exploit to penetrate a network. Existing techniques for AG generation rely heavily on expert input regarding vulnerabilities and network topology. In this work, we advocate the use of AGs that are built directly using the actions observed through intrusion alerts, without prior expert input.We have developed an unsupervised visual analytics system, called SAGE, to learn alert-driven attack graphs.We show how these AGs (i) enable forensic analysis of prior attacks, and (ii) enable proactive defense by providing relevant threat intelligence regarding attacker strategies. We...
Alert-driven Attack Graph Generation using S-PDFA
Azqa Nadeem, Sicco Verwer, Stephen Moskal, Shanchieh Jay Yang
IEEE Transcations on Dependable and Secure Computing, 2021.
Ideal cyber threat intelligence (CTI) includes insights into attacker strategies that are specific to a network under observation. Such CTI currently requires extensive expert input for obtaining, assessing, and correlating system vulnerabilities into a graphical representation, often referred to as an attack graph (AG). Instead of deriving AGs based on system vulnerabilities, this work advocates the direct use of intrusion alerts. We propose SAGE, an explainable sequence learning pipeline that automatically constructs AGs from intrusion alerts without a priori expert knowledge. SAGE exploits the temporal and probabilistic dependence between alerts in a suffix-based probabilistic deterministic finite automaton (S-PDFA)-a model that...
Black-box combinatorial optimization using models with integer-valued minima
Laurens Bliek, Sicco Verwer, Mathijs de Weerdt
Annals of Mathematics and Artificial Intelligence, 2020.
When a black-box optimization objective can only be evaluated with costly or noisy measurements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models. These models are usually continuous and smooth, which is beneficial for continuous optimization problems, but not necessarily for combinatorial problems. However, by choosing the basis functions of the surrogate model in a certain way, we show that it can be guaranteed that the optimal solution of the surrogate model is integer. This approach outperforms random search, simulated annealing and a Bayesian optimization...
The Robust Malware Detection Challenge and Greedy Random Accelerated Multi-Bit Search
Sicco Verwer, Azqa Nadeem, Christian Hammerschmidt, Laurens Bliek, Abdullah Al-Dujaili, Una-May O'Reilly
Artificial Intelligence and Security (AISec) @ CCS, 2020.
Training classifiers that are robust against adversarially modified examples is becoming increasingly important in practice. In the field of malware detection, adversaries modify malicious binary files to seem benign while preserving their malicious behavior. We report on the results of a recently held robust malware detection challenge. There were two tracks in which teams could participate: the attack track asked for adversarially modified malware samples and the defense track asked for trained neural network classifiers that are robust to such modifications. The teams were unaware of the attacks/defenses they had to detect/evade. Although only 9 teams participated, this unique setting...
Beyond Labeling Using Clustering to Build Network Behavioral Profiles of Malware Families
Azqa Nadeem, Christian Hammerschmidt, Carlos H. Ganan, Sicco Verwer
Malware Analysis using Artificial Intelligence and Deep Learning, Springer, 2020.
Malware family labels are known to be inconsistent. They are also black box since they do not represent the capabilities of malware. The current state-of the-art in malware capability assessment include mostly manual approaches, which are infeasible due to the ever-increasing volume of discovered malware samples. We propose a novel unsupervised machine learning-based method called MalPaCA, which automates capability assessment by clustering the temporal behavior in malware’s network traces. MalPaCA provides meaningful behavioral clusters using only 20 packet headers. Behavioral profiles are generated based on the cluster membership of malware’s network traces. A Directed Acyclic Graph shows the relationship between...
Hybrid Connection and Host Clustering for Community Detection in Spatial-temporal Network Data
Mark Patrick Roeling, Azqa Nadeem, Sicco Verwer
Machine Learning for Cybersecurity (MLCS) @ ECML-PKDD, 2020.
Network data clustering and sequential data mining are large fields of research, but how to combine them to analyze spatial-temporal network data remains a technical challenge. This study investigates a novel combination of two sequential similarity methods (Dynamic Time Warping and N-grams with Cosine distances), with two state-of-the-art unsupervised network clustering algorithms (Hierarchical Density-based Clustering and Stochastic Block Models). A popular way to combine such methods is to first cluster the sequential network data, resulting in connection types. The hosts in the network can then be clustered conditioned on these types. In contrast, our approach clusters nodes and edges in...
Learning Optimal Classification Trees Using a Binary Linear Program Formulation
Sicco Verwer, Yingqian Zhang
Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
We provide a new formulation for the problem of learning the optimal classification tree of a given depth as a binary linear program. A limitation of previously proposed Mathematical Optimization formulations is that they create constraints and variables for every row in the training data. As a result, the running time of the existing Integer Linear programming (ILP) formulations increases dramatically with the size of data. In our new binary formulation, we aim to circumvent this problem by making the formulation size largely independent from the training data size. We show experimentally that our formulation achieves better performance than existing...
Solving bin-packing problems under privacy preservation: Possibilities and trade-offs
Rowan Hoogervorst, Yingqian Zhang, Gamze Tillem, Zekeriya Erkin, Sicco Verwer
Information Sciences, 2019.
We investigate the trade-off between privacy and solution quality that occurs when a k-anonymized database is used as input to the bin-packing optimization problem. To investigate the impact of the chosen anonymization method on this trade-off, we consider two recoding methods for k-anonymity: full-domain generalization and partition-based single-dimensional recoding. To deal with the uncertainty created by anonymization in the bin-packing problem, we utilize stochastic programming and robust optimization methods. Our computational results show that the trade-off is strongly dependent on both the anonymization and optimization method. On the anonymization side, we see that using single dimensional recoding leads to significantly...
Using Datasets from Industrial Control Systems for Cyber Security Research and Education
Qin Lin, Sicco Verwer, Robert Kooij, Aditya Mathur
International Conference on Critical Information Infrastructures Security, 2019.
The availability of high-quality benchmark datasets is an important prerequisite for research and education in the cyber security domain. Datasets from realistic systems offer a platform for researchers to develop and test novel models and algorithms. Such datasets also offer students opportunities for active and project-centric learning. In this paper, we describe six publicly available datasets from the domain of Industrial Control Systems (ICS). Five of these datasets are obtained through experiments conducted in the context of operational ICS while the sixth is obtained from a widely used simulation tool, namely EPANET, for large scale water distribution networks. This paper...
Learning fuzzy decision trees using integer programming
Jason Rhuggenaath, Yingqian Zhang, Alp Akcay, Uzay Kaymak, Sicco Verwer
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2018.
A popular method in machine learning for supervised classification is a decision tree. In this work we propose a new framework to learn fuzzy decision trees using mathematical programming. More specifically, we encode the problem of constructing fuzzy decision trees using a Mixed Integer Linear Programming (MIP) model, which can be solved by any optimization solver. We compare the performance of our method with the performance of off-the-shelf decision tree algorithm CART and Fuzzy Inference Systems (FIS) using benchmark data-sets. Our initial results are promising and show the advantages of using non-crisp boundaries for improving classification accuracy on testing data.
...

Automated Test-Case Generation for REST APIs Using Model Inference Search Heuristic

On Bidirectional Deterministic Finite Automata

CATMA: Conformance Analysis Tool For Microservice Applications

PDFA Distillation with Error Bound Guarantees

Detecting Changes in Loop Behavior for Active Learning

Learning Syntactic Monoids from Samples by extending known Algorithms for learning State Machines

Learning state machines from data streams: A generic strategy and an improved heuristic

Optimal Decision Tree Policies for Markov Decision Processes

SoK: Explainable Machine Learning for Computer Security Applications

Learning state machines via efficient hashing of future traces

Learning State Machines to Monitor and Detect Anomalies on a Kubernetes Cluster

Robust Attack Graph Generation

Adversarially Robust Decision Tree Relabeling

Robust Optimal Classification Trees against Adversarial Examples

SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting

Efficient training of robust decision trees against adversarial examples

SAGE Intrusion Alert-driven Attack Graph Extractor

Enabling Visual Analytics via Alert-driven Attack Graphs

Alert-driven Attack Graph Generation using S-PDFA

Black-box combinatorial optimization using models with integer-valued minima

The Robust Malware Detection Challenge and Greedy Random Accelerated Multi-Bit Search

Beyond Labeling Using Clustering to Build Network Behavioral Profiles of Malware Families

Hybrid Connection and Host Clustering for Community Detection in Spatial-temporal Network Data

Learning Optimal Classification Trees Using a Binary Linear Program Formulation

Solving bin-packing problems under privacy preservation: Possibilities and trade-offs

Using Datasets from Industrial Control Systems for Cyber Security Research and Education

Learning fuzzy decision trees using integer programming