Abstract
Application log files contain rich information about system behavior, but extracting meaningful features for anomaly detection remains challenging. This work compares semantic and syntactic approaches to feature extraction for unsupervised learning on log data.
Key Contributions
- Feature Extraction Comparison: We systematically compare semantic (transformer-based) and syntactic (pattern-based) feature extraction methods.
- Unsupervised Evaluation: We evaluate both approaches in unsupervised settings across multiple log datasets.
- Practical Recommendations: We provide guidance on when to use each approach based on log characteristics.
Methods Compared
- Semantic Features: Transformer-based embeddings that capture the meaning of log messages
- Syntactic Features: Pattern-based extraction using log templates and structure
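To make the syntactic side concrete, the following is a minimal sketch of template-style feature extraction. It is an illustration under stated assumptions, not the method used in this work: the masking rules, the placeholder names (`<IP>`, `<HEX>`, `<NUM>`), the helper functions, and the sample log lines are all hypothetical. Variable tokens are masked so that messages sharing a structure collapse to one template, and the template counts serve as simple features.

```python
import re
from collections import Counter

# Hypothetical masking rules (assumption, not from the paper).
# Order matters: IP addresses and hex values are masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Replace variable tokens with placeholders to obtain a log template."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def template_counts(messages):
    """Count how often each template occurs in a collection of messages."""
    return Counter(to_template(m) for m in messages)


# Illustrative log lines (invented for this sketch).
logs = [
    "Connection from 10.0.0.1 port 5123",
    "Connection from 10.0.0.7 port 6001",
    "Worker 3 failed with code 0x1F",
]
counts = template_counts(logs)
for template, n in counts.items():
    print(n, template)
```

Because each feature is a readable template, a count that looks unusual can be traced back to the message pattern that produced it, which is one reason pattern-based features tend to be easier to interpret than embedding vectors.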
Key Findings
Semantic features excel at capturing complex behavioral patterns, while syntactic features provide more interpretable results. The optimal choice depends on the specific use case and requirements for explainability.
Datasets
Experiments were conducted on application logs from various sources, including web servers and system services.