Machine Learning in Cyber Security — Malicious Software Installation

20 July 2024

3

Introduction

Monitoring of user activities performed by local administrators is always a challenge for SOC analysts and security professionals. Most of the security framework will recommend the implementation of a whitelist mechanism.

However, the real world is often not ideal. You will always have different developers or users having local administrator rights to bypass controls specified. Is there a way to monitor the local administrator activities?

Let’s talk about the data source

machine learning cyber security — An example of how the dataset looks like — the 3 entries listed above are referring to the same software

We have a regular batch job to retrieve the software installed on each of the workstations which are located in different regions. Most of the software installed is displayed in their local languages. (Yes, you name it — it could be Japanese, French, Dutch …..) So you will meet a situation that the software installed is displayed as 7 different names while it is referring to the same software in the whitelist. Not to mention, we have thousands of devices.

Attributes of the dataset

Hostname — The hostname of the devices
Publisher Name — the software publisher
Software Name — Software Name in Local Language and different versions number

Is there a way we could identify non-standard installation?

My idea is that legit software used in the company — should have more than 1 installation and the software name should be different. In such a case, I believe it will be effective to use machine learning to help a user classify the software and highlight any outlier.

Char processing using Term Frequency — Inverse Document Frequency (TF-IDF)

Natural Language Processing (NLP) is a sub-field of artificial intelligence that deals with understanding and processing human language. In light of new advancements in machine learning, many organizations have begun applying natural language processing for translation, chatbots, and candidate filtering.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF is usually used for word extraction. However, I was thinking about whether it could also be applicable for char extraction. The idea is to explore how well we could apply TF-IDF to extract the features related to each character in the software name by exporting the importance of each character to the software name.
An example of the script below is going through how I apply the TF-IDF to the software name field in my data set.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer# Import the dataset 
df=pd.read_csv("your dataset") # Extract the Manufacturer into List 
field_extracted = df['softwarename']# initialize the TF-IDF 
vectorizer = TfidfVectorizer(analyzer='char')
vectors = vectorizer.fit_transform(field_extracted)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
result = pd.DataFrame(denselist, columns=feature_names)

A snippet of the result:

In the above diagram, you could see that a calculation is performed to evaluate how “important” each char is on the software name. This could also be interpreted as how “many” of the char specified is available on each of the software names. In this way, you could present statistically on the characteristic of each “software name” and we could put these features into the machine learning model of your choice.

Other features I extracted and believe it will also be meaning to the models:

The entropy of the Software Name

import math
from collections import Counter# Function of calculating Entropy 
def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }if len(data) <= 1:
        return 0counts = Counter()for d in data:
        counts[d] += 1ent = 0probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])return ententropy  = [eta(x) for x in field_extracted]

Space Ratio — How many spaces the software name has
Vowel Ratio — How many vowels (aeiou) the software name has

At last, I have these features listed above with labels to run against randomtreeforest classifier. You could select any classifier of your choice as long as it could give you a satisfactory result.

Thanks for reading!

About the Author

Elaine Hung

Elaine is a machine learning enthusiast, digital forensic and incident response consultant. Interested in applying ML and NLP on cyber security topics.

Machine Learning in Cyber Security — Malicious Software Installation

Introduction

Let’s talk about the data source

Attributes of the dataset

Is there a way we could identify non-standard installation?

About the Author

Related

Run Local AWS Cloud Stack using LocalStack on Linux

Learn Terraform Automation in 3 days using Video Courses

How To Expose Ansible AWX Service using Nginx Ingress

LEAVE A REPLY Cancel reply

Most Popular

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

Interview With Willem Dewulf – CEO of ProBackup by Shauli Zacks

Recent Comments

EDITOR PICKS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR POSTS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR CATEGORY

ABOUT US

FOLLOW US