-
Black-box combinatorial optimization using models with integer-valued minima
Laurens Bliek, Sicco Verwer, Mathijs de Weerdt
Annals of Mathematics and Artificial Intelligence, 2020.When a black-box optimization objective can only be evaluated with costly or noisy measurements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models. These models are usually continuous and smooth, which is beneficial for continuous optimization problems, but not necessarily for combinatorial problems. However, by choosing the basis functions of the surrogate model in a certain way, we show that it can be guaranteed that the optimal solution of the surrogate model is integer. This approach outperforms random search, simulated annealing and a Bayesian optimization...
-
The Robust Malware Detection Challenge and Greedy Random Accelerated Multi-Bit Search
Sicco Verwer, Azqa Nadeem, Christian Hammerschmidt, Laurens Bliek, Abdullah Al-Dujaili, Una-May O'Reilly
Artificial Intelligence and Security (AISec), 2020.Training classifiers that are robust against adversarially modified examples is becoming increasingly important in practice. In the field of malware detection, adversaries modify malicious binary files to seem benign while preserving their malicious behavior. We report on the results of a recently held robust malware detection challenge. There were two tracks in which teams could participate: the attack track asked for adversarially modified malware samples and the defense track asked for trained neural network classifiers that are robust to such modifications. The teams were unaware of the attacks/defenses they had to detect/evade. Although only 9 teams participated, this unique setting...
-
Beyond Labeling Using Clustering to Build Network Behavioral Profiles of Malware Families
Azqa Nadeem, Christian Hammerschmidt, Carlos H. Ganan, Sicco Verwer
Malware Analysis using Artificial Intelligence and Deep Learning, Springer, 2020.Malware family labels are known to be inconsistent. They are also black box since they do not represent the capabilities of malware. The current state-of the-art in malware capability assessment include mostly manual approaches, which are infeasible due to the ever-increasing volume of discovered malware samples. We propose a novel unsupervised machine learning-based method called MalPaCA, which automates capability assessment by clustering the temporal behavior in malware’s network traces. MalPaCA provides meaningful behavioral clusters using only 20 packet headers. Behavioral profiles are generated based on the cluster membership of malware’s network traces. A Directed Acyclic Graph shows the relationship between...
-
Hybrid Connection and Host Clustering for Community Detection in Spatial-temporal Network Data
Mark Patrick Roeling, Azqa Nadeem, Sicco Verwer
Machine Learning for Cybersecurity (MLCS), 2020.Network data clustering and sequential data mining are large fields of research, but how to combine them to analyze spatial-temporal network data remains a technical challenge. This study investigates a novel combination of two sequential similarity methods (Dynamic Time Warping and N-grams with Cosine distances), with two state-of-the-art unsupervised network clustering algorithms (Hierarchical Density-based Clustering and Stochastic Block Models). A popular way to combine such methods is to first cluster the sequential network data, resulting in connection types. The hosts in the network can then be clustered conditioned on these types. In contrast, our approach clusters nodes and edges in...
-
Learning Optimal Classification Trees Using a Binary Linear Program Formulation
Sicco Verwer, Yingqian Zhang
Proceedings of the AAAI Conference on Artificial Intelligence, 2019.We provide a new formulation for the problem of learning the optimal classification tree of a given depth as a binary linear program. A limitation of previously proposed Mathematical Optimization formulations is that they create constraints and variables for every row in the training data. As a result, the running time of the existing Integer Linear programming (ILP) formulations increases dramatically with the size of data. In our new binary formulation, we aim to circumvent this problem by making the formulation size largely independent from the training data size. We show experimentally that our formulation achieves better performance than existing...
-
Solving bin-packing problems under privacy preservation: Possibilities and trade-offs
Rowan Hoogervorst, Yingqian Zhang, Gamze Tillem, Zekeriya Erkin, Sicco Verwer
Information Sciences, 2019.We investigate the trade-off between privacy and solution quality that occurs when a k-anonymized database is used as input to the bin-packing optimization problem. To investigate the impact of the chosen anonymization method on this trade-off, we consider two recoding methods for k-anonymity: full-domain generalization and partition-based single-dimensional recoding. To deal with the uncertainty created by anonymization in the bin-packing problem, we utilize stochastic programming and robust optimization methods. Our computational results show that the trade-off is strongly dependent on both the anonymization and optimization method. On the anonymization side, we see that using single dimensional recoding leads to significantly...
-
Using Datasets from Industrial Control Systems for Cyber Security Research and Education
Qin Lin, Sicco Verwer, Robert Kooij, Aditya Mathur
International Conference on Critical Information Infrastructures Security, 2019.The availability of high-quality benchmark datasets is an important prerequisite for research and education in the cyber security domain. Datasets from realistic systems offer a platform for researchers to develop and test novel models and algorithms. Such datasets also offer students opportunities for active and project-centric learning. In this paper, we describe six publicly available datasets from the domain of Industrial Control Systems (ICS). Five of these datasets are obtained through experiments conducted in the context of operational ICS while the sixth is obtained from a widely used simulation tool, namely EPANET, for large scale water distribution networks. This paper...
-
Learning fuzzy decision trees using integer programming
Jason Rhuggenaath, Yingqian Zhang, Alp Akcay, Uzay Kaymak, Sicco Verwer
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2018.A popular method in machine learning for supervised classification is a decision tree. In this work we propose a new framework to learn fuzzy decision trees using mathematical programming. More specifically, we encode the problem of constructing fuzzy decision trees using a Mixed Integer Linear Programming (MIP) model, which can be solved by any optimization solver. We compare the performance of our method with the performance of off-the-shelf decision tree algorithm CART and Fuzzy Inference Systems (FIS) using benchmark data-sets. Our initial results are promising and show the advantages of using non-crisp boundaries for improving classification accuracy on testing data.
...