Event Title

Language Modeling for Anomalous Network Activity Detection

Research Mentor(s)

Brian Hutchinson

Description

Security analysts often employ automated detection systems to reduce the cognitive burden imposed by manual inspection of computer and network activity logs. These systems are designed to flag potential events of interest so that analysts may locate and triage security risks quickly. Traditionally, these systems have relied upon signature-based approaches, which try to match logs to known attack patterns. While these tend to be highly accurate in identifying known attacks, they often fail to detect novel attacks. An alternative is to use anomaly detection techniques, which aim to learn normal behavior and flag patterns that fall outside of it. The dominant anomaly detection approach requires the system to aggregate various user behavior statistics over windows of time (e.g., one day), and then look for improbable combinations of values. This requires expertise in designing the particular statistics to be computed, leaves the system brittle to changes in the environment, and leaves it vulnerable to blind spots not captured by the statistics. In this work we describe a recurrent neural network (RNN) language model designed to learn directly from streams of network activity log data. This algorithm treats network activity logs as "sentences," allowing it to learn both the "syntax" and "semantics" of normal network activity logs. It identifies anomalous log lines as those which do not conform to its learned model of normal network behavior. Not only does this bypass the need to hand-craft statistics, but we demonstrate that our approach also improves performance. Using the Los Alamos National Laboratory Cybersecurity Dataset, we obtain an area under the receiver operating characteristic curve of 0.98, indicating very high true positive rates with minimal false positives. Additionally, we introduce a mechanism for interpreting the model's predictions in order to provide richer context to analysts who might use our system.
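The scoring idea described above can be illustrated with a minimal sketch. It is not the RNN from this work: a bigram language model with add-one smoothing stands in for the recurrent network so that the example runs with the Python standard library alone. The comma-separated log format, the field values, and the names `BigramModel`, `tokenize`, and `roc_auc` are all hypothetical, chosen only for illustration. The sketch fits the model on "normal" log lines, scores each new line by its average negative log-likelihood per token, and computes the area under the ROC curve from those scores with the rank-based (Mann-Whitney) formulation.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def tokenize(line):
    # Assumption: each log line is a comma-separated list of fields.
    return [BOS] + line.split(",") + [EOS]


class BigramModel:
    """Bigram language model with add-one smoothing (a stand-in for an RNN)."""

    def __init__(self):
        self.bigrams = Counter()   # counts of (previous token, token)
        self.contexts = Counter()  # counts of previous token
        self.vocab = set()

    def fit(self, lines):
        for line in lines:
            toks = tokenize(line)
            self.vocab.update(toks)
            for prev, cur in zip(toks, toks[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def score(self, line):
        """Average negative log-likelihood per token; higher = more anomalous."""
        toks = tokenize(line)
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        nll = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            nll -= math.log(p)
        return nll / (len(toks) - 1)


def roc_auc(scores, labels):
    """Probability that a random anomaly outscores a random normal line."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical "normal" activity: user, source, destination, action, outcome.
normal_logs = [
    "alice,pc1,pc1,logon,success",
    "alice,pc1,srv1,logon,success",
    "bob,pc2,pc2,logon,success",
    "bob,pc2,srv1,logon,success",
]
model = BigramModel()
model.fit(normal_logs)

# One familiar line (label 0) and one unfamiliar line (label 1).
test_logs = ["alice,pc1,srv1,logon,success", "bob,pc2,srv9,logon,fail"]
test_labels = [0, 1]
test_scores = [model.score(line) for line in test_logs]
print(test_scores, roc_auc(test_scores, test_labels))
```

In this toy setting the line with an unseen destination and outcome receives the higher score. An RNN would replace the bigram probability with a prediction conditioned on the whole preceding token sequence, but the scoring and evaluation steps would follow the same pattern.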

Document Type

Event

Start Date

May 2018

End Date

May 2018

Location

Computer Sciences

Rights

Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.

Language

English

Format

application/pdf

This document is currently not available here.

 
May 17th, 12:00 PM May 17th, 3:00 PM
