Exploring Semantic vs. Syntactic Features for Unsupervised Learning on Application Log Files

Abstract

Application log files contain rich information about system behavior, but extracting meaningful features for anomaly detection remains challenging. This work compares semantic and syntactic approaches to feature extraction for unsupervised learning on log data.

Key Contributions

Feature Extraction Comparison: We systematically compare semantic (transformer-based) and syntactic (pattern-based) feature extraction methods.

Unsupervised Evaluation: We evaluate both approaches in unsupervised settings across multiple log datasets.

Practical Recommendations: We provide guidance on when to use each approach based on log characteristics.

Methods Compared

Semantic Features: Transformer-based embeddings that capture the meaning of log messages
Syntactic Features: Pattern-based extraction using log templates and message structure
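
To make the syntactic side concrete, the following is a minimal sketch of template-style feature extraction. It is not the pipeline used in the paper: the masking rules, the placeholder names (<IP>, <HEX>, <NUM>), the helper functions to_template and template_counts, and the sample log lines are all assumptions chosen for illustration. Variable-looking tokens are replaced with placeholders so that lines sharing the same structure collapse into one template, and template counts then serve as a simple feature vector.

```python
import re
from collections import Counter

# Hypothetical masking rules (illustrative only); production log parsers
# use richer heuristics than these three patterns.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4-like tokens
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),           # hex literals
    (re.compile(r"\b\d+\b"), "<NUM>"),                      # remaining integers
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens with placeholders, in rule order."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def template_counts(lines):
    """Count how often each template occurs in a batch of log lines."""
    return Counter(to_template(line) for line in lines)


# Made-up sample lines, not taken from the paper's datasets.
logs = [
    "GET /index.html 200 from 10.0.0.1",
    "GET /index.html 200 from 10.0.0.2",
    "GET /index.html 500 from 10.0.0.3",
    "worker 17 crashed at 0x7ffde0",
]
counts = template_counts(logs)
```

Each key in counts is a human-readable template, which is one reason pattern-based features tend to be easy to interpret. Note that this crude masking also merges the 200 and 500 status codes into a single template, so the choice of masking rules directly affects what the features can distinguish.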

Key Findings

Semantic features excel at capturing complex behavioral patterns, while syntactic features provide more interpretable results. The optimal choice depends on the specific use case and requirements for explainability.

Datasets

Experiments were conducted on application logs from various sources, including web servers and system services.

Cite This Work

@inproceedings{Karlsen2023SemanticSyntactic,
  author    = {Karlsen, Egil and Copstein, Rafael and Luo, Xiao and Schwartzentruber, Jeff and Niblett, Bradley and Johnston, Andrew and Heywood, Malcolm I. and Zincir-Heywood, Nur},
  title     = {Exploring Semantic vs. Syntactic Features for Unsupervised Learning on Application Log Files},
  booktitle = {2023 7th Cyber Security in Networking Conference (CSNet)},
  year      = {2023},
  pages     = {219--225},
  doi       = {10.1109/CSNet59123.2023.10339765}
}

Authors