K-means Clustering and it’s use in Security

6 min readJul 19, 2021

Often we are in situations where we’re trying to group things into a certain source or type to make understanding it much easier. This could be aggregating different shades of colors to a primary shade, arranging clothes into their correct collections, or in a business standpoint — arranging customers into certain clusters based on their interests and habits.

What are clusters in ML ?

A cluster can be broadly defined as a group or collection of items. In the world of Data Science and Machine Learning these grouped items which are similar to each other, can be used to classify items. The method of identifying similar groups of data in a dataset is called clustering.

There are three types of machine learning,

Supervised Learning — algorithm is trained using labeled data.
Unsupervised Learning — algorithm makes use of unlabeled data.
Reinforcement Learning — algorithm that improves upon itself and learns from new situations using a trial-and-error method.

In real-world use cases, many unsupervised learning problems are solved using the clustering technique.

Clustering is a technique which groups unlabeled data based on their similarities or differences. Clustering algorithms are used to process raw, unclassified data objects into groups represented by structures or patterns in the information.

K-means Clustering

One of the simplest and popular unsupervised ML algorithms, the objective of k-means is to group similar data points together and discover underlying patterns. To achieve this objective, k-means looks for a fixed number (k) of clusters in a dataset.

The k-means algorithm identifies k number of centroids (geometric center of a plane figure) and then allocates every data point in the nearest cluster, while keeping the centroids as small as possible. The means in K-means refers to the averaging of the data which are basically the centroids. The algorithm starts with a first group of randomly selected centroids, which are considered the beginning points for every cluster, and then performs iterative calculations to get the updated positions in an optimized manner. The creations and optimization stops when either:

The centroids have stabilized — there is no change in their values because the clustering has been successful
The defined number of iterations have been achieved

K-means in Security

Being able to classify data records into groups according to their features attributes, or similarities makes this significant in many fields related to data analysis, such as pattern recognition, image processing, information retrieval, geography, and marketing. Also considering that this is the information era, it has been a challenge on storage as well as performing computation on such massive data. This has all been dealt with by the wave of advancements in Cloud technology.

With the migration from on-premises to cloud, the need for security upgrades has become even more apparent. Preserving data privacy during out-sourced analysis is something that has been developing improvement but seemingly short of perfection through the various iterations of protocols and algorithms implemented. This holds specially true when it comes to performing clustering techniques. Due to sheer volume of inputs that are often involved in data mining problems or ML, generic multiparty computation (MPC) protocols become infeasible in terms of communication cost. This has led to constructions of function-specific multiparty protocols that attempt to handle a specific functionality in an efficient manner, while still providing privacy to the parties.

Secure two-party k-means Clustering

The solution to the above problem was proposed by Paul Bunn and Rafail Ostroversky in their research paper in which they designed a protocol that takes as a template the single-database protocol, and extends it to the two-party setting. They utilized numerous sub-protocols which themselves preserve privacy against an honest-but-curious adversary. They also utilize standard cryptographic tools to maintain privacy in the two-party k-means clustering algorithm.

Use in Intrusion Detection Systems

Intrusion Detection Systems are mainly used to distinguish normal behavior and abnormal behavior and then make corresponding measures. The application of unsupervised clustering algorithms in the field of abnormal detection can improve detection efficiency of an IDS and makes the practical application value higher. k-means can serve to be the most commonly used and most practical way of implementation.

In system applications, if you can’t use tagged data, you can’t clearly determine the normal or abnormal condition of the connection record, and then make the clustering tag. Typically, a threshold is used to keep a record of the connection above the threshold for the normal clustering, whereas the other is exception clustering. According to this paper published by Chunfen Bu, the experimental results show an average detection rate of 89.24% and the false alarm rate (False Positive) of 0.77%.

An improved K-Means algorithm flow as proposed in this paper consists of data preprocessing being performed on the collected data or the original dataset. Data normalization uses the Min-Max Linear function normalization to map data into intervals of 0 to 1. Feature extraction uses the PCA algorithm to perform feature dimension reduction on the entire dataset. Then the outlier detection analysis of the whole dataset will affect the removal of outliers of k-means clusters and cluster center points (centroids), and improve the accuracy of the clustering algorithm. This improved PCA based k-means algorithm reached 99.02% accuracy with a false positive rate of only 1.144% (for various intrusion types).

Cyber Security Analytics on Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data preprocessing in fields like Big Data and Machine Learning. It has seen rapid adoption by enterprises across a wide range of industries.

It is one of the commonly used big data frameworks, which is scalable, in-memory persistent, fault tolerant, and supports programs that can be executed in parallel.

Cyber security analytics is an alternative solution to the traditional security systems. It exploits the techniques and methods used in big data analytics to solve security related problems. With the help of big data frameworks like Hadoop, Spark, it can handle large volume of data in real time and can provide important insights into the security incidents with the help of data. Cyber security analytics can also detect attacks that are hidden inside enormous number of security events by filtering out irrelevant events from the relevant ones. This in turn speeds up the process of security analysis.

Elkan’s k-means clustering using triangle inequality (k-meansTI) is one of the ways to improve performance of the original k-means algorithm. k-meansTI avoids data points-cluster centers distance computations for the points that are far away from the cluster centers. The main contribution of k-meansTI is the possibility to reduce the time complexity of the standard k-means from O(kne) to approximately O(n) in practice. This research paper can be referred for detailed information regarding this approach.

The results showed that the parallel implementation of k-meansTI on Spark can achieve better performance than the Spark ML k-means when the dataset is very large. However the performance was degraded for small datasets. Clustering Web attacks shows that good clustering results can organize and reduce the data for further analysis and can be used to gain important insight into the properties of the attacks. The knowledge obtained from the clustering results can also be used to quickly classify the new data.

Hope this information was relevant to you the reader. Please do refer the original research papers linked to get much more detailed information regarding the implementation and in-depth working of the above mentioned use cases.

Thank you :)