Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark


Network traffic is a critical data medium for security and communication in modern internet infrastructure. While existing analysis methods achieve excellent classification performance, they face two key bottlenecks: (1) they fail to capture multidimensional semantics beyond unimodal sequence patterns, and (2) their black-box nature, i.e., providing only category labels, lacks an auditable reasoning process. We identify a key cause: existing network traffic datasets are designed primarily for classification and inherently lack rich semantic annotations, so they cannot support human-readable evidence reports. To address this data scarcity, this paper proposes the first Byte-Grounded Traffic Description (BGTD) benchmark, which pairs raw bytes with structured expert annotations. BGTD provides the behavioral features and verifiable chains of evidence required for multimodal reasoning toward explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end-to-end traffic-language representation framework (mmTraffic), a multimodal reasoning architecture that bridges physical traffic encoding and semantic interpretation. To alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly optimized perception-cognition architecture. By combining a perception-centered traffic encoder with a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports while maintaining classification accuracy highly competitive with specialized unimodal models (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark
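The "jointly optimized perception-cognition" objective described above can be sketched as a classification loss on the perception head combined with an autoregressive generation loss on the LLM. The exact losses and weighting used by mmTraffic are not specified here; the function names and the mixing weight `lam` are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Numerically stable cross-entropy for a single prediction."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def joint_loss(cls_logits, cls_label, tok_logits, tok_labels, lam=1.0):
    """Sketch of a joint objective: perception (classification) loss plus
    cognition (token-level generation) loss, mixed by an assumed weight lam."""
    l_cls = cross_entropy(cls_logits, cls_label)
    l_gen = np.mean([cross_entropy(l, t) for l, t in zip(tok_logits, tok_labels)])
    return l_cls + lam * l_gen
```

Coupling the two terms is what lets the generated report stay anchored to the predicted category instead of drifting into unsupported narrative.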

Figures Explained

Fig. 1. Comparison of traffic analysis paradigms. (a) Traditional classification methods act as a “black box”, providing only a label and low-level feature weights that lack operational value. (b) Our proposed multimodal reasoning framework, composed of a Traffic Perception Encoder and a Cognitive LLM instructed by Byte-Grounded Knowledge, generates an evidence-grounded report with human-understandable reasoning and executable insights.

Fig. 2. Pipeline of developing the BGTD dataset: (a) session extraction and class balancing from raw PCAP files, (b) fixed-length 10 × 160 NPY array generation via priority-based packet sampling, and (c) LLM-assisted ground-truth synthesis using Claude Opus-4.6 prompted as a senior network security expert.
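The fixed-size array step in (b) can be sketched as follows. The priority-based packet sampling rule belongs to the paper; this minimal sketch substitutes simple first-N selection, and the names `session_to_array`, `MAX_PKTS`, and `PKT_BYTES` are illustrative assumptions.

```python
import numpy as np

MAX_PKTS, PKT_BYTES = 10, 160  # matches the 10 x 160 shape in Fig. 2b

def session_to_array(packets: list) -> np.ndarray:
    """Truncate/zero-pad each packet to PKT_BYTES bytes and the session to
    MAX_PKTS packets, yielding a fixed 10 x 160 uint8 matrix."""
    arr = np.zeros((MAX_PKTS, PKT_BYTES), dtype=np.uint8)
    for i, pkt in enumerate(packets[:MAX_PKTS]):  # stand-in for priority sampling
        payload = pkt[:PKT_BYTES]
        arr[i, :len(payload)] = np.frombuffer(payload, dtype=np.uint8)
    return arr

# Example: a session with two short packets (TLS-like leading bytes)
session = [b"\x16\x03\x01" * 10, b"\x17\x03\x03" * 60]
x = session_to_array(session)
print(x.shape)  # (10, 160)
```

Fixing the shape this way lets heterogeneous sessions feed a single encoder input layer without per-sample padding logic downstream.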

Fig. 3. Statistical overview of the BGTD dataset.

Fig. 4. Overview of the mmTraffic framework. (a) The frozen traffic encoder Tθ extracts high-dimensional features from raw traffic data. (b) The linear connector Cω projects traffic features into the LLM token space, with the CGHF mechanism injecting a class-aware anchor token into the input sequence. (c) The LLM Gϕ autoregressively generates a structured forensic report containing behavioral traits, evidence chain, and diagnostic description.
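The connector stage in (b) can be sketched as a linear projection into the LLM embedding space with a class-aware anchor token prepended to the sequence. The dimensions, the anchor lookup table, and the function `connect` are illustrative assumptions standing in for the CGHF mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, LLM_DIM, N_CLASSES = 256, 512, 8  # illustrative dimensions

W = rng.normal(scale=0.02, size=(FEAT_DIM, LLM_DIM))  # connector C_omega
anchor_table = rng.normal(size=(N_CLASSES, LLM_DIM))  # one anchor per class

def connect(traffic_feats: np.ndarray, pred_class: int) -> np.ndarray:
    """traffic_feats: (seq_len, FEAT_DIM) from the frozen encoder T_theta.
    Returns a token sequence for the LLM with a class-aware anchor first."""
    tokens = traffic_feats @ W                  # project into LLM token space
    anchor = anchor_table[pred_class][None, :]  # class-aware anchor token
    return np.concatenate([anchor, tokens], axis=0)

feats = rng.normal(size=(10, FEAT_DIM))
seq = connect(feats, pred_class=3)
print(seq.shape)  # (11, 512)
```

Conditioning generation on an explicit class anchor is one plausible way to keep the report consistent with the predicted category, mitigating hallucinated diagnoses.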

Fig. 5. Analysis on structural consistency metrics. The semantic-priority constraints in mmTraffic ensure high logical rigor.

Fig. 6. Ablation analysis on ISCX-Tor-2016 and ISCXVPN2016, with respect to the classification and generation metrics for four variants, V1 to V4.
