
Binary Explorer: Agentic RAG over MCP for Vulnerability Analysis

LLM · RAG · MCP · Ghidra · FAISS · Python

Overview

Binary Explorer is my Master's thesis project at the University of Calabria. The goal: build an agentic system that can autonomously analyse compiled binaries for security vulnerabilities — without requiring the analyst to manually run Ghidra, parse assembly, or write queries.

The system chains together:

  1. Ghidra (headless) for decompilation
  2. FAISS for semantic indexing of decompiled functions
  3. An LLM agent (via MCP) that reasons over the index and produces structured vulnerability reports

Architecture

Binary (ELF/PE/Mach-O)
        │
        ▼
  Ghidra Headless        ← decompiles to pseudo-C + extracts function metadata
        │
        ▼
  Chunking + Embedding   ← each function → vector via sentence-transformers
        │
        ▼
  FAISS Index            ← persisted on disk, queryable by semantic similarity
        │
        ▼
  MCP Server             ← exposes tools: search_functions, get_xrefs, run_gdb
        │
        ▼
  LLM Agent              ← Claude/GPT-4o with tool use, produces final report
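The index-and-query core of this pipeline can be sketched in pure Python as a stand-in for FAISS (the real system embeds functions with sentence-transformers and persists a FAISS index; the class and names here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class FunctionIndex:
    """Toy semantic index mapping function names to embedding vectors."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, name: str, vec: list[float]) -> None:
        self.entries.append((name, vec))

    def search(self, query_vec: list[float], k: int = 10) -> list[str]:
        # Rank all stored functions by similarity to the query embedding.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query_vec, e[1]),
                        reverse=True)
        return [name for name, _ in ranked[:k]]
```

FAISS replaces the linear scan with an approximate-nearest-neighbour structure, but the contract is the same: vectors in, top-k function names out.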

Why MCP?

The Model Context Protocol lets the LLM call structured tools rather than receiving a giant context dump. Instead of feeding 200 decompiled functions into the prompt, the agent:

  1. Asks search_functions("buffer overflow") → gets top-10 semantically similar functions
  2. Calls get_xrefs(func_name) → traces call chains
  3. Optionally calls run_gdb(breakpoint, input) → dynamic validation

This keeps token usage low and makes the reasoning traceable.
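The server side of that loop reduces to a name-to-handler dispatch table. A sketch without the MCP SDK (tool names match the ones above; handler bodies are stand-ins, not the real implementations):

```python
from typing import Any, Callable

# Registry mapping tool names (as exposed over MCP) to Python handlers.
TOOLS: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an agent-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_functions(query: str, k: int = 10) -> list[str]:
    # Stand-in: the real handler embeds `query` and searches the FAISS index.
    return [f"func_{i}" for i in range(k)]

@tool
def get_xrefs(func_name: str) -> dict[str, list[str]]:
    # Stand-in: the real handler reads Ghidra's cross-reference metadata.
    return {"calls": [], "called_by": []}

def dispatch(name: str, **kwargs: Any) -> Any:
    """What the server does when the LLM emits a tool call."""
    return TOOLS[name](**kwargs)
```

In the actual system this registration happens through the MCP server, which also handles the JSON-RPC transport between agent and tools.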


Key Technical Decisions

Chunking strategy

Splitting at function boundaries (not arbitrary token counts) preserves semantic coherence. A function is the natural unit of analysis in binary reversing — splitting mid-function destroys context.

def extract_functions(ghidra_output: str) -> list[Function]:
    # Parse Ghidra's pseudo-C output into discrete function objects
    # Each Function has: name, address, pseudo_c, calls[], called_by[]
    ...
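A minimal concrete version of this parser, assuming Ghidra's pseudo-C is emitted one function per block with headers matching a simple `type name(args)` pattern (real output needs a more robust parser, and the `Function` fields are abbreviated here):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Function:
    name: str
    pseudo_c: str
    calls: list[str] = field(default_factory=list)

# Unindented line of the form "ret_type name(args)" — a function header.
HEADER = re.compile(r"^\w[\w\s\*]*?\b(\w+)\s*\([^)]*\)\s*$", re.MULTILINE)

def extract_functions(ghidra_output: str) -> list[Function]:
    """Split pseudo-C at function headers; each chunk becomes one Function."""
    matches = list(HEADER.finditer(ghidra_output))
    funcs = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ghidra_output)
        funcs.append(Function(name=m.group(1),
                              pseudo_c=ghidra_output[m.start():end].strip()))
    return funcs
```

Anchoring on unindented header lines is what makes function-boundary splitting cheap: the body of each function (indented) never matches, so every chunk is exactly one function.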

Embedding model choice

I tested three models for the function → vector step:

Model              Recall@10   Latency   Notes
all-MiniLM-L6-v2   71%         12 ms     Too generic for code
code-search-net    84%         18 ms     Better, trained on code
nomic-embed-code   89%         22 ms     Best results, used in final
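Recall@10 here is the fraction of labelled query→function pairs where the relevant function appears in the top 10 retrieved results. A hypothetical scoring helper (the labelled evaluation set itself is from the thesis, not shown here):

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, str],
                k: int = 10) -> float:
    """results: query -> ranked function names.
    relevant: query -> the single ground-truth function for that query."""
    hits = sum(1 for q, gold in relevant.items()
               if gold in results.get(q, [])[:k])
    return hits / len(relevant)
```

Running this over each candidate embedding model's rankings on the same labelled set produces the Recall@10 column above.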

Vulnerability patterns

The agent uses a prompt that encodes known vulnerability patterns as search intents:

VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
    "integer overflow arithmetic check",
    "format string printf user input",
]

Each query retrieves candidate functions; the agent then reasons about whether the pattern is actually exploitable given the context.
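The retrieval sweep is then a loop over these intents, deduplicating candidates before the agent reasons about them. A sketch using a subset of the queries, with `search_functions` as a canned stand-in for the MCP retrieval tool:

```python
VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
]

def search_functions(query: str, k: int = 10) -> list[str]:
    # Stand-in for the MCP tool backed by the FAISS index; canned results.
    demo = {
        "strcpy strcat unbounded copy": ["copy_name", "log_line"],
        "malloc free use after free": ["free_session", "log_line"],
    }
    return demo.get(query, [])[:k]

def gather_candidates(queries: list[str]) -> list[str]:
    """Union of per-query hits, first-seen order preserved."""
    seen: dict[str, None] = {}
    for q in queries:
        for fn in search_functions(q):
            seen.setdefault(fn, None)
    return list(seen)
```

Deduplication matters because the same function (e.g. a logging helper) often surfaces for several intents; the agent should examine it once, with all the triggering queries in mind.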


Results

Tested on a corpus of 40 intentionally vulnerable binaries (DVWA native builds and custom CTF binaries).

The system performs best on well-structured C code with standard vulnerability patterns. It struggles with obfuscated code or custom allocators.


What I'd do differently

1. Use a code-specific LLM for the reasoning step. GPT-4o is capable but not trained on assembly or decompiled C; models like deepseek-coder or a fine-tuned CodeLlama would likely improve precision.

2. Add a feedback loop. Right now the agent produces a report and stops. A human-in-the-loop step where the analyst validates/rejects findings could feed back into the retrieval weights over time.

3. Persistent cross-binary index. Each binary is indexed in isolation. Indexing a corpus of malware families together would enable cross-binary similarity queries — useful for variant detection.


Repository

The full implementation is on GitHub. The README includes setup instructions for Ghidra headless mode, which is the most painful part of the stack.