Language Modeling for Anomalous Network Activity Detection
Research Mentor(s)
Hutchinson, Brian
Description
Security analysts often employ automated detection systems to reduce the cognitive burden imposed by manual inspection of computer and network activity logs. These systems are designed to flag potential events of interest so that analysts may locate and triage security risks quickly. Traditionally, these systems have relied upon signature-based approaches, which try to match logs to known attack patterns. While these tend to be highly accurate in identifying known attacks, they often fail to detect novel attacks. An alternative is to use anomaly detection techniques, which aim to learn normal behavior and flag patterns that fall outside of it. The dominant anomaly detection approach requires the system to aggregate various user behavior statistics over windows of time (e.g., one day), and then look for improbable combinations of values. This requires expertise in designing the particular statistics to be computed, leaves the system brittle to changes in the environment, and leaves it vulnerable to blind spots not captured by the statistics. In this work we describe a recurrent neural network (RNN) language model designed to learn directly from streams of network activity log data. This algorithm treats network activity logs as "sentences," allowing it to learn both the "syntax" and "semantics" of normal network activity logs. It identifies anomalous log lines as those that do not conform to its learned model of normal network behavior. Not only does this bypass the need to hand-craft statistics, but we demonstrate that our approach also improves performance. Using the Los Alamos National Laboratory Cybersecurity Dataset, we obtain an area under the receiver operating characteristic curve of 0.98, indicating very high true positive rates with minimal false positives. Additionally, we introduce a mechanism for interpreting the model's predictions in order to provide richer context to analysts who might use our system.
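As a rough illustration of the scoring idea described above, the sketch below treats each log line as a "sentence" of tokens, fits a language model to normal lines only, and uses the average negative log-likelihood of a line as its anomaly score. It is not the RNN from this work: to stay dependency-free, an add-one smoothed bigram model is swapped in as a stand-in. The comma-separated log format, the example field values, and the names `tokenize`, `BigramLogModel`, and `roc_auc` are all assumptions made for illustration. The `roc_auc` helper shows how an area under the ROC curve can be computed from such scores when labels are available.

```python
import math
from collections import Counter, defaultdict


def tokenize(line):
    # Split a comma-separated log line into tokens and add boundary markers.
    # The comma-separated format is an assumption for this sketch.
    return ["<sos>"] + line.strip().split(",") + ["<eos>"]


class BigramLogModel:
    """Add-one smoothed bigram language model (a stand-in for an RNN)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)
        self.vocab = set()

    def fit(self, lines):
        # Count token-to-token transitions in (assumed normal) log lines.
        for line in lines:
            toks = tokenize(line)
            self.vocab.update(toks)
            for prev, cur in zip(toks, toks[1:]):
                self.bigrams[prev][cur] += 1
        return self

    def anomaly_score(self, line):
        # Mean negative log-likelihood per token; higher means less expected.
        toks = tokenize(line)
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        nll = 0.0
        for prev, cur in zip(toks, toks[1:]):
            counts = self.bigrams.get(prev, Counter())
            p = (counts[cur] + 1) / (sum(counts.values()) + v)
            nll -= math.log(p)
        return nll / (len(toks) - 1)


def roc_auc(scores, labels):
    # Probability that a random anomalous line (label 1) scores higher than
    # a random normal line (label 0); ties count as one half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical log lines: user, source host, destination host, auth type,
# event, outcome.
normal_lines = (["U1,C1,C2,Kerberos,LogOn,Success"] * 5
                + ["U2,C3,C2,NTLM,LogOn,Success"] * 5)
model = BigramLogModel().fit(normal_lines)

test_lines = [
    "U1,C1,C2,Kerberos,LogOn,Success",  # resembles training data
    "U2,C3,C2,NTLM,LogOn,Success",      # resembles training data
    "U1,C9,C7,NTLM,LogOn,Fail",         # unseen hosts and outcome
]
test_labels = [0, 0, 1]
test_scores = [model.anomaly_score(line) for line in test_lines]
auc = roc_auc(test_scores, test_labels)
```

Because the score is accumulated token by token, the per-token terms could also be inspected individually to see which fields of a flagged line were least expected. Whether that resembles the interpretation mechanism mentioned above is not something this sketch can confirm.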
Document Type
Event
Start Date
17-5-2018 12:00 AM
End Date
17-5-2018 12:00 AM
Department
Computer Science
Genre/Form
student projects, posters
Subjects – Topical (LCSH)
Computer algorithms; Computer networks--Management; Computational intelligence
Type
Image
Rights
Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.
Language
English
Format
application/pdf