Abstract
This work analyzes how lightweight syntactical feature extraction techniques from the field of information retrieval affect log abstraction in information security applications.
Key Contributions
- Feature Extraction Techniques: We evaluate three syntactical feature extraction methods for log analysis.
- Clustering Algorithm Comparison: We compare three clustering algorithms for anomaly detection on the extracted features.
- Multi-Dataset Evaluation: We evaluate on four security datasets to assess generalizability.
Feature Extraction Methods
- Traditional vector space models (TF-IDF)
- Log template extraction
- N-gram based features
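To make the three feature types listed above concrete, here is a minimal standard-library sketch. It is not the implementation evaluated in this work: the sample log lines, the whitespace tokenizer, the regex-based masking used as a stand-in for template extraction, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical sample log lines, invented for illustration only.
LOGS = [
    "Failed password for root from 10.0.0.5 port 22",
    "Failed password for admin from 10.0.0.7 port 22",
    "Accepted password for alice from 10.0.0.9 port 22",
]

def tokenize(line):
    """Lowercase a log line and split it on whitespace (assumed tokenizer)."""
    return line.lower().split()

def tfidf(docs):
    """Return one {token: weight} dict per line, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of lines in which each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def to_template(line):
    """Mask variable fields: IPv4 addresses first, then remaining integers."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    return re.sub(r"\b\d+\b", "<NUM>", line)

def char_ngrams(line, n=3):
    """Count overlapping character n-grams in a log line."""
    return Counter(line[i:i + n] for i in range(len(line) - n + 1))

vectors = tfidf(LOGS)
template = to_template(LOGS[0])
trigrams = char_ngrams(LOGS[0])
```

In this sketch, tokens that occur in every line (such as "password") receive a TF-IDF weight of zero, while the masked template collapses lines that differ only in addresses and numbers onto one pattern. Production template extraction is usually more involved than regex masking.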
Clustering Approaches
- K-means clustering
- DBSCAN
- Hierarchical clustering
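As an illustration of how a clustering algorithm can flag anomalous log lines, the following is a minimal DBSCAN sketch in plain Python. It is a hedged example rather than the configuration used in this work: the Jaccard distance over token sets, the values of `eps` and `min_pts`, and the sample log lines are all assumptions.

```python
# Hypothetical sample log lines, invented for illustration only.
LOGS = [
    "Failed password for root from 10.0.0.5 port 22",
    "Failed password for admin from 10.0.0.7 port 22",
    "Failed password for guest from 10.0.0.8 port 22",
    "kernel: segfault at 0 ip 00007f error 4",
]

def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two token sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual formulation.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels

token_sets = [set(line.lower().split()) for line in LOGS]
labels = dbscan(token_sets, eps=0.5, min_pts=2, dist=jaccard_distance)
print(labels)  # [0, 0, 0, -1]
```

Here the three login failures share enough tokens to form one cluster, and the unrelated kernel message is labeled as noise, which is the kind of point an analyst would review as a candidate anomaly. The brute-force neighbor search is quadratic in the number of lines and is kept only for readability.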
Key Findings
Lightweight syntactical features balance computational efficiency against detection performance. This makes them particularly suitable for resource-constrained environments and for real-time analysis.
Practical Impact
The techniques explored can be deployed in production security operations centers, where computational resources and latency are important constraints.