Abstract
Security log analysis is critical for detecting threats and anomalies in modern systems. This work investigates how Large Language Models of differing architectures can be applied to the analysis of application and system log files for security purposes.
Key Contributions
- LLM4Sec Pipeline: We propose and implement LLM4Sec, a new experimentation pipeline for applying LLMs to log analysis and for evaluating and comparing the resulting models.
- Comprehensive Benchmarking: We deploy and benchmark 60 fine-tuned language models across six datasets from web application and system log sources.
- State-of-the-Art Results: Our best-performing fine-tuned model (DistilRoBERTa) achieves an average F1-Score of 0.998, outperforming previous approaches.
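The F1-Score cited above is the harmonic mean of precision and recall over per-line anomaly labels. A minimal, self-contained sketch of the metric (the labels below are illustrative, not from the paper's datasets):

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical per-line labels: 1 = anomalous log line, 0 = benign.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 1, 0, 0, 0, 1, 0]
print(round(f1_score(truth, preds), 3))  # → 0.857
```

An average F1 of 0.998 therefore means the models make almost no false-positive or false-negative calls across the six benchmark datasets.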
Models Evaluated
- BERT
- RoBERTa
- DistilRoBERTa
- GPT-2
- GPT-Neo
Key Findings
The results demonstrate that LLMs can perform log analysis effectively, and that fine-tuning is particularly important for adapting a model to a specific log type. The transformer-based architectures show a strong capability for capturing the semantic structure of log messages.
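One property that makes log lines amenable to language models is their templated structure: a fixed message skeleton with variable parameters (IPs, ports, counters). A toy illustration of that structure, using hypothetical log lines and a simple regex normalization of the kind often applied before or alongside model input (not the paper's preprocessing):

```python
import re

# Masks common variable fields so lines sharing a template collapse together.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4 addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),           # hex identifiers
    (re.compile(r"\b\d+\b"), "<NUM>"),                      # remaining numbers
]

def normalize(line: str) -> str:
    for pattern, token in PATTERNS:
        line = pattern.sub(token, line)
    return line

# Hypothetical SSH-style log lines: same template, different parameters.
lines = [
    "Accepted password for root from 10.0.0.5 port 52412",
    "Accepted password for root from 192.168.1.9 port 40133",
]
templates = {normalize(l) for l in lines}
print(templates)  # both lines reduce to a single template
```

Because many distinct lines share a few underlying templates, a fine-tuned transformer can learn which templates (and which parameter patterns) are characteristic of benign versus anomalous behavior.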
Impact
This work gives security and ML practitioners deeper insight into selecting features and algorithms for log analysis tasks, and establishes benchmarks for future research in this domain.