Today we’re very excited to announce the integration of our machine learning technology into VirusTotal. VirusTotal, a subsidiary of Google, is a free service that analyzes suspicious files and URLs and facilitates the quick detection of viruses, worms, trojans, and other kinds of malicious content.
Trapmine’s ThreatScore machine learning engine was developed to identify known and never-before-seen malware. This engine is part of the TRAPMINE Endpoint Detection & Protection Platform, which combines behavior monitoring, exploit prevention, machine learning and endpoint deception techniques to provide fool-proof defense against malware, exploit attempts, file-less malware, ransomware and other forms of targeted attacks.
With this integration, Windows executable files submitted to VirusTotal will be analyzed by Trapmine’s AI engine, and the verdicts will be displayed to VirusTotal users. All new scanners joining Google’s VirusTotal community need to provide a certification and/or independent reviews from security testers, according to the best practices of the Anti-Malware Testing Standards Organization (AMTSO). MRG Effitas, an independent organization and AMTSO member, certifies the ability of Trapmine’s Machine Learning Engine to detect malware.
Most of the antivirus engines on VirusTotal look for signatures, i.e., specific bytes, patterns and strings in the file. Instead of signatures, Trapmine extracts “features”, i.e., characteristics of files, and makes a prediction with a model pre-trained by our data scientists. This approach lets Trapmine detect never-before-seen malware with high efficacy.
Now let’s look at how Trapmine’s ThreatScore was developed and how it works. Machine learning is based on algorithms that can learn from data, and the diagram below shows the life cycle of ThreatScore development.
As the diagram shows, collecting and processing data is the first and most important step. Trapmine has meticulously scanned millions of malicious and clean PE files, both from our own sources and received from others, and expanded that collection. Because the model we wanted to train was meant to be embedded in a real product, our main criterion at this point is to produce the correct variety and distribution of malicious and benign PE files.
For instance, consider the VirusTotal database. Let’s assume that 90% of the files in it are malicious. In this scenario, even though 9 out of 10 files in the VirusTotal world are harmful, in the real world this ratio would be much lower, below 1%. A training set that simply mirrors such a collection would not match what a product sees in the field.
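To make the effect of the class ratio concrete, here is a small worked sketch. Only the 90% and 1% ratios come from the example above; the detection rate and false positive rate are illustrative assumptions, not Trapmine measurements.

```python
def precision(malicious_ratio, tpr, fpr):
    """Fraction of flagged files that are truly malicious (Bayes' rule)."""
    true_alerts = malicious_ratio * tpr
    false_alerts = (1.0 - malicious_ratio) * fpr
    return true_alerts / (true_alerts + false_alerts)

# Hypothetical detector: catches 99% of malware, flags 1% of clean files.
TPR, FPR = 0.99, 0.01

repo_like = precision(0.90, TPR, FPR)   # 90% of files are malicious
real_world = precision(0.01, TPR, FPR)  # 1% of files are malicious

print(f"precision at 90% malicious: {repo_like:.4f}")
print(f"precision at 1% malicious:  {real_world:.4f}")
```

Under these assumed rates, the same detector looks almost perfect on the 90% set, while at a 1% ratio only about half of its alerts point at real malware. This is one way to see why the distribution of the data matters as much as its volume.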
The next stage is what we call the “feature extraction” process, where the files are analyzed and turned into meaningful vectors of features. Features are a numeric representation of the raw data that machine learning models can use, and a feature vector is a vector of floats generated from a static Portable Executable (PE) file.
Not only as data scientists but also as Trapmine engineers, we discussed the feature extraction part at length to decide on a correct and efficient feature vector for distinguishing between benign and malicious files. The features consist of more than a thousand entries, such as import, section, section entropy, resource and export information parsed from the PE file, searches for special known strings, and the amount of randomness of strings in a chunk of code. With our data ready, the next step is deciding which algorithm to use.
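As a rough illustration of what a few such features can look like, here is a minimal sketch using only the Python standard library. It works on raw bytes rather than parsed PE structures (import, section and resource features would need a PE parser), and `toy_feature_vector`, its six entries and the `known_strings` defaults are hypothetical, not ThreatScore's actual feature set.

```python
import math
import re
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def printable_strings(data: bytes, min_len: int = 4):
    """Runs of printable ASCII of at least min_len characters."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

def toy_feature_vector(data: bytes, known_strings=(b"kernel32.dll", b"http://")):
    """Hypothetical feature vector: size, entropies, string count, string hits."""
    strings = printable_strings(data)
    features = [
        float(len(data)),                    # file (or section) size
        shannon_entropy(data),               # randomness of the raw bytes
        float(len(strings)),                 # number of printable strings
        shannon_entropy(b"".join(strings)),  # randomness of the strings
    ]
    lowered = data.lower()
    # One 0/1 entry per known string (case-insensitive search).
    features.extend(1.0 if s in lowered else 0.0 for s in known_strings)
    return features

sample = b"MZ\x90\x00" + b"KERNEL32.dll\x00" + bytes(range(256))
print(toy_feature_vector(sample))
```

High entropy is often associated with packed or encrypted content, which is one reason entropy-style measurements are a common ingredient in static features. A real vector would combine many more measurements per file.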
After a careful evaluation, our team identified four algorithms to test in the next ‘comparison’ stage. These four algorithms are as follows:
- Logistic Regression
- Deep Learning
- Random Forest
- GBDT (Gradient-Boosted Decision Tree)
Our comparison criteria are as follows:
- AUC (Area Under the Curve)
- FPR (False Positive Rate)
- Model Size
Another important element is FPR (False Positive Rate). It is one of the most heavily invested-in problems, if not the most, in cyber security solutions. If we secure a system by blocking or quarantining every process, the result is a highly secure but useless system. What is crucial is evaluating and detecting the malicious processes and taking action against them. As seen in the graph below, what is of paramount importance to us is to minimize the intersection of the blue and red curves.
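For readers who want to see how these metrics are computed, here is a minimal sketch in plain Python. The labels and scores are made-up toy data (0 = clean, 1 = malicious), and the helper names are illustrative rather than part of ThreatScore.

```python
def confusion_rates(labels, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is flagged malicious."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    return fp / labels.count(0), tp / labels.count(1)

def auc(labels, scores):
    """Chance that a random malicious file outscores a random clean one.

    Ties count as half a win. This pairwise form equals the area under
    the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: four clean files followed by four malicious ones.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.60, 0.40, 0.70, 0.80, 0.95]

fpr, tpr = confusion_rates(labels, scores, threshold=0.5)
print(f"FPR={fpr}, TPR={tpr}, AUC={auc(labels, scores)}")
```

Raising the threshold lowers the FPR but also lowers the TPR, which is the trade-off behind the "block everything" example: a threshold of zero catches every malicious file and flags every clean one too.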
Furthermore, model size and speed are another important criterion for us. Compared with others, Trapmine offers its customers a lightweight, stable and fast product. That is why the size and speed of the model used for ThreatScore are vitally important. Having split the available data into two sets, one for learning and one for testing, we ran the four algorithms mentioned above on them. A purely objective comparison is impossible, as these four algorithms are inherently different in their technical details. However, we developed models based on these algorithms in as standardized a way as possible and compared the results.
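A split into learning and test sets can be sketched as follows. Stratifying by class, so that both sets keep the same malicious-to-clean ratio, is an assumption made for this example; the split described above may have been done differently. `stratified_split` and its parameters are hypothetical names.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share in both sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy data set: 90 clean files (0) and 10 malicious files (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Feeding every candidate algorithm the same index lists is one simple way to keep a comparison as standardized as possible.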
After a careful process, we obtained the results described in the table below. The values in the table are means over different learning and test data sets. As a team, we decided to use the GBDT algorithm, which provides a high AUC, good model size and speed, and a low FPR value.
After all these steps, the algorithm and feature vector that we would use were finally decided. What was left was training the best model and adding it to the product. To reach an FPR value that is almost zero and a high TPR (True Positive Rate), we updated the feature vector multiple times and modified the algorithm's source code to minimize model size.
Consequently, the Trapmine ML engine has been in use for more than six months with an average FPR value of 0.02. Every ThreatScore result has one of four outcomes: “clean”, “suspicious.low.ml.score”, “malicious.moderate.ml.score” and “malicious.high.ml.score”. Thanks to these multiple outcomes, Trapmine gives its customers the option of selecting the degree of protection they would like.
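One way to picture how several outcomes translate into selectable protection levels is the sketch below. The four verdict strings are the ones listed above; everything else is an assumption made for illustration: that the engine yields a score between 0 and 1, the threshold values, and the names of the protection levels.

```python
# (minimum score, verdict), checked from highest to lowest.
# Thresholds are made up for illustration only.
VERDICTS = [
    (0.90, "malicious.high.ml.score"),
    (0.70, "malicious.moderate.ml.score"),
    (0.50, "suspicious.low.ml.score"),
]

def verdict(score: float) -> str:
    """Map a hypothetical 0..1 score to one of the four outcomes."""
    for minimum, label in VERDICTS:
        if score >= minimum:
            return label
    return "clean"

# Hypothetical protection levels: which verdicts are blocked at each one.
BLOCKED_AT = {
    "relaxed": {"malicious.high.ml.score"},
    "balanced": {"malicious.high.ml.score", "malicious.moderate.ml.score"},
    "strict": {
        "malicious.high.ml.score",
        "malicious.moderate.ml.score",
        "suspicious.low.ml.score",
    },
}

def should_block(score: float, level: str = "balanced") -> bool:
    """Decide whether a file is blocked at the chosen protection level."""
    return verdict(score) in BLOCKED_AT[level]

for s in (0.10, 0.55, 0.75, 0.95):
    print(s, verdict(s), should_block(s, "strict"))
```

A stricter level blocks more files at the cost of more false positives, so exposing the level as a setting lets each customer choose where to sit on that trade-off.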
We hope the technology we have developed here at Trapmine excites you as much as it excites us. As a technology development company, we always strive for better, and we need your enthusiasm to achieve this!
Ulascan Aytolun – R&D Engineer