For the last year, I have been interning at CyCraft AI Lab, the leading cybersecurity firm in Taiwan. My team had to process numerous cybersecurity articles and turn them into actionable intelligence. Because responding to cyber attacks is time-critical, we built an NLP system to tackle this problem.
In the project, I constructed the whole data processing pipeline that turns the articles into intelligence:
1. Crawling the largest Chinese-language security website.
2. Classifying articles so the security team can quickly identify those related to their daily missions.
3. Building a recommendation system for prevalent attack methods to help security analysts quickly grasp an article's theme.
4. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques.
Our system has achieved 95% accuracy on our collected dataset. I will share our solutions to the problems we met along the way, and how I balanced the internship with high school life.
My project can be divided into the following sections:
**Data crawling/processing**
* Instead of using a crawling framework, I implemented the crawler from scratch so that I could learn the underlying mechanisms.
* The talk will cover the difficulties we encountered, including anti-crawling mechanisms and network problems.
* Crawling tool: requests-html (a minimal sketch follows this list)
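Below is a minimal sketch of what such a crawler looks like with requests-html. The URL, CSS selectors, and throttling delay are illustrative placeholders, not the actual site's structure.

```python
# Minimal crawler sketch using requests-html; the listing URL and the
# CSS selectors below are hypothetical placeholders for illustration.
import time
from requests_html import HTMLSession

session = HTMLSession()
HEADERS = {"User-Agent": "Mozilla/5.0"}  # a browser-like UA helps with basic anti-crawling

def fetch_article_links(list_url):
    """Fetch a listing page and return absolute links to articles."""
    resp = session.get(list_url, headers=HEADERS)
    # absolute_links resolves relative hrefs against the page URL
    return resp.html.absolute_links

def fetch_article(url):
    """Download one article and return its title and body text."""
    resp = session.get(url, headers=HEADERS)
    title = resp.html.find("h1", first=True)
    body = resp.html.find("div.article-content", first=True)  # hypothetical selector
    return {
        "url": url,
        "title": title.text if title else "",
        "body": body.text if body else "",
    }

if __name__ == "__main__":
    for link in list(fetch_article_links("https://example.com/security-news"))[:5]:
        print(fetch_article(link)["title"])
        time.sleep(1)  # throttle requests to stay polite and avoid bans
```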
**NLP**
The security portal site I crawled is written in Simplified Chinese. Hence, Jieba is used to tokenize the Chinese articles for further analysis; a short example follows below.
* Tokenizing tool: Jieba
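A minimal example of Jieba's exact-mode tokenization (the sample sentence is made up for illustration):

```python
# Tokenize a Simplified Chinese sentence with Jieba (exact mode).
import jieba

text = "攻击者利用钓鱼邮件投递恶意软件"  # "Attackers deliver malware via phishing emails"
tokens = jieba.lcut(text)               # lcut returns a list of tokens
print(tokens)
```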
In order to classify cybersecurity articles into different types, it is necessary to calculate each term's importance across article types. Therefore, I used the following tools (both are sketched after this list):
* Count Vectorizer for calculating word frequency: scikit-learn
* TF-IDF for calculating term importance: scikit-learn, plus a self-implemented version
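The sketch below shows how the scikit-learn vectorizers can be applied to pre-tokenized (space-joined) articles. The toy corpus and the whitespace token pattern are assumptions for illustration, not the project's exact configuration.

```python
# Term weighting over pre-tokenized articles. The articles are joined with
# spaces so the vectorizers split on whitespace rather than using their
# default English tokenization; the toy corpus is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "钓鱼 邮件 恶意 软件 投递",
    "勒索 软件 加密 文件",
    "钓鱼 网站 窃取 凭证",
]

count_vec = CountVectorizer(token_pattern=r"(?u)\S+")
counts = count_vec.fit_transform(docs)            # raw term frequencies

tfidf_vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
tfidf = tfidf_vec.fit_transform(docs)             # TF-IDF weights

print(count_vec.get_feature_names_out())
print(tfidf.toarray().round(2))
```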
Dividing the main cyber attack types into smaller sub-topics helps the recommendation system recommend attack methods more accurately. Therefore, SVD is used (see the sketch after this list).
* Singular Value Decomposition: scikit-learn
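A sketch of sub-topic decomposition using scikit-learn's TruncatedSVD on a TF-IDF matrix; the toy corpus and the number of components are illustrative only.

```python
# Decompose a TF-IDF matrix into latent sub-topics with TruncatedSVD (LSA).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "钓鱼 邮件 恶意 软件 投递",
    "勒索 软件 加密 文件",
    "钓鱼 网站 窃取 凭证",
]
tfidf = TfidfVectorizer(token_pattern=r"(?u)\S+").fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = svd.fit_transform(tfidf)    # each row: one article in sub-topic space
print(topic_vectors.shape)                  # (n_articles, n_components)
```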
Since the cybersecurity articles contain both Simplified Chinese and English, the text processing needs some modifications.
* Handling the bilingual article issue (one possible approach is sketched below)
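One possible way to handle the bilingual issue is to let Jieba segment the Chinese parts while keeping English terms (for example tool or malware names) as whole, lowercased tokens. This is an assumed approach for illustration, not necessarily the exact rules used in the project.

```python
# Tokenize mixed Simplified Chinese / English text: Jieba segments the
# Chinese parts and passes English words through; we lowercase English
# tokens and drop whitespace/punctuation. (Illustrative rules only.)
import re
import jieba

def tokenize_bilingual(text):
    tokens = []
    for tok in jieba.lcut(text):
        tok = tok.strip()
        if not tok:
            continue                      # drop whitespace tokens produced by Jieba
        if re.fullmatch(r"[A-Za-z][A-Za-z0-9+.\-]*", tok):
            tokens.append(tok.lower())    # normalize English terms
        elif re.search(r"\w", tok):
            tokens.append(tok)            # keep Chinese tokens, drop bare punctuation
    return tokens

print(tokenize_bilingual("攻击者使用 Cobalt Strike 进行横向移动"))
```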
**Classifying & Recommendation**
* Multinomial Naive Bayes: scikit-learn
* Stochastic Gradient Descent: scikit-learn (both classifiers are sketched below)
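A minimal sketch of both classifiers on TF-IDF features; the pre-tokenized articles and labels are a toy stand-in for the real dataset.

```python
# Train Multinomial Naive Bayes and an SGD-based linear classifier on
# TF-IDF features; the toy corpus and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "钓鱼 邮件 恶意 软件 投递",
    "勒索 软件 加密 文件",
    "钓鱼 网站 窃取 凭证",
    "勒索 软件 要求 比特币 赎金",
]
labels = ["phishing", "ransomware", "phishing", "ransomware"]

nb = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"), MultinomialNB())
sgd = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"),
                    SGDClassifier(random_state=0))

for model in (nb, sgd):
    model.fit(docs, labels)
    print(model.predict(["钓鱼 邮件 窃取 凭证"]))
```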
**Analyzing the result**
* The evaluation is conducted through a train-analyze-feedback process (one possible version is sketched below).
* The full process will be presented during the talk.
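One possible instantiation of that loop, assuming a scikit-learn train/test split and per-class metrics; the actual process presented in the talk may differ.

```python
# Train -> analyze -> feed back: hold out a test split, inspect per-class
# metrics, then improve labels/features and retrain. The docs/labels are
# expected to come from the collected corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate(docs, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        docs, labels, test_size=0.2, stratify=labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"), MultinomialNB())
    model.fit(X_train, y_train)
    # Per-class precision/recall shows which article types need more labeled
    # data or better features before the next training round.
    print(classification_report(y_test, model.predict(X_test)))
```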
**Tool reference:**
requests-html: https://requests.readthedocs.io/projects/requests-html/en/latest/
Jieba: https://github.com/fxsjy/jieba
CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Singular Value Decomposition: https://ccjou.wordpress.com/2009/09/01/奇異值分解-svd/
Multinomial Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
Stochastic Gradient Descent: https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31