A guide to setting up and benchmarking Local LLMs (Large Language Models) for analyzing and reverse engineering complex software projects. In this walkthrough, I set up a system for analyzing Python on macOS. The contents can easily be generalized to other languages (e.g. Java or JavaScript) and other operating systems (e.g. Linux and Windows).
Source code available here: https://github.com/psuzzi/ai.reveng.
Please note: this is a DRAFT article, published so I can share it and gather feedback.
Introduction
In recent years, LLMs (Large Language Models) have become powerful tools for programmers, aiding in tasks like code analysis and reverse engineering. While cloud-based services offer convenience, they come with significant privacy and data security concerns. This document outlines how to set up and benchmark local LLMs for code analysis, offering a solution that ensures privacy by running models locally on your own systems.
Selecting a Local AI Solution
Instead of using cloud services, I chose Ollama paired with CodeLlama as my local AI solution. After comparing it with LM Studio and GPT4All, Ollama proved superior in ease of use, integration capabilities, community support, and code analysis features. The comparison table below shows the details.
| Aspect | Ollama | LM Studio | GPT4All |
| --- | --- | --- | --- |
| Installation | Single-command, Docker-like | Desktop installer | Multiple components |
| Interface | CLI + API | GUI + chat | GUI + API |
| Model management | Simple pull commands, easy switching | Visual model browser, manual download | Limited selection, manual management |
| Code analysis features | Strong (e.g. using CodeLlama) | Basic | Basic |
| Integration ease | High (REST API) | Medium (wrappers) | Medium (Python APIs) |
| Community support | Very active, growing | Active | Moderate |
| Best use case | Automated workflows, development | Interactive exploration | General text analysis |
Set up Ollama and run CodeLlama
Ollama provides a straightforward installation process on macOS using Homebrew. Here is how to get started:
brew install ollama
Note: Execute ollama help to learn the basic commands, e.g. pull, run, list, rm [model], create [model], and show [model].
Start the Ollama service. You have two options:
# to start now without background service
ollama serve
# to start now and restart at login
brew services start ollama
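Ollama also exposes a local REST API (on port 11434 by default), which is what makes it easy to integrate into scripts. Here is a minimal Python sketch to verify the service is reachable; it assumes the third-party requests package and uses the /api/tags endpoint that lists locally installed models:
import requests

# Default Ollama endpoint; adjust if you changed OLLAMA_HOST
OLLAMA_URL = "http://localhost:11434"

def ollama_is_running():
    """Return True if the local Ollama server answers on its REST API."""
    try:
        return requests.get(f"{OLLAMA_URL}/api/tags", timeout=2).ok
    except requests.ConnectionError:
        return False

print("Ollama running:", ollama_is_running())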
Once Ollama is running, install CodeLlama:
# install
ollama pull codellama
Note: You can check your installed models with ollama list, and remove them with ollama rm [model].
Now you have a Local LLM ready to use. Test the model with a simple deterministic task.
# Test model with a simple request
ollama run codellama "Python program to generate fibonacci numbers"
Note: The first time a model is used, it is slower than subsequent uses because the model must be loaded into memory.
The exact output varies between runs and model versions, but for this task the response is essentially a short Fibonacci program.
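As an illustration (a sketch of a typical response, not verbatim model output), the generated program usually amounts to an iterative implementation along these lines:
# Illustrative example of the kind of program CodeLlama returns
def fibonacci(n):
    """Return a list with the first n Fibonacci numbers."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]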
This section demonstrated how to set up and use a Local LLM within a few minutes.
In the next sections we select specific CodeLlama models based on system considerations, and we run a benchmark comparing the different models.
Select Models based on System Requirements
Language models (LLMs) can run in either full precision (FP32) or quantized modes. Quantization compresses 32-bit floating point numbers into 4-bit or 8-bit formats, reducing model size by up to 87%. For instance, a 7B parameter model can be compressed from 28GB (FP32) to 4GB using 4-bit quantization (Q4).
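As a rough back-of-the-envelope check (ignoring overhead such as the KV cache and quantization metadata), the size is approximately parameters × bits per weight / 8:
def model_size_gb(params_billion, bits):
    """Rough model size estimate: parameters * bits per weight, ignoring overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 8, 4):
    print(f"7B at {bits}-bit: ~{model_size_gb(7, bits):.1f} GB")
# 7B at 32-bit: ~28.0 GB
# 7B at 8-bit: ~7.0 GB
# 7B at 4-bit: ~3.5 GB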
Model Variants: CodeLlama provides three quantization options with different tradeoffs:
- K: Standard quantization with maximum accuracy
- K_M: Balanced quantization offering moderate compression with good performance
- K_S: High compression with some accuracy tradeoff
System requirements scale with model size (in billions of parameters); a quick self-check sketch follows the list:
- 7B: 8GB RAM, 4GB storage, 6GB VRAM
- 13B: 16GB RAM, 7GB storage, 8GB VRAM
- 34B: 32GB RAM, 18GB storage, 16GB VRAM
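Based on the RAM figures above, here is a small helper for picking a variant. It relies on the third-party psutil package (pip install psutil), which is an assumption rather than part of the Ollama tooling:
import psutil

# RAM requirements (GB) from the list above
RAM_REQUIRED = {"7b": 8, "13b": 16, "34b": 32}

def supported_variants():
    """Return the CodeLlama sizes whose RAM requirement fits this machine."""
    total_gb = psutil.virtual_memory().total / 1e9
    return [size for size, ram in RAM_REQUIRED.items() if total_gb >= ram]

print(supported_variants())  # e.g. ['7b', '13b', '34b'] on a 32GB system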
Implementation Example
This example demonstrates running CodeLlama on an Apple Silicon M1 with 32GB RAM and 500GB storage. The system supports local LLM inference well.
To balance performance and resource usage, we’ll test the K_M (medium quantization) variants of the 7B, 13B, and 34B models:
# Smallest but still capable
ollama pull codellama:7b-code-q4_K_M
# Better performance
ollama pull codellama:13b-code-q4_K_M
# Best performance
ollama pull codellama:34b-code-q4_K_M
Note: You can check the available models at https://ollama.com/library/codellama/tags.
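Once the pulls complete, you can confirm from Python that the three variants are installed, again via the REST API's /api/tags endpoint (a sketch; the tag names are assumed to match the pull commands above):
import requests

EXPECTED = {
    "codellama:7b-code-q4_K_M",
    "codellama:13b-code-q4_K_M",
    "codellama:34b-code-q4_K_M",
}

# /api/tags lists the locally installed models and their tags
models = requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
installed = {m["name"] for m in models}
print("Missing models:", EXPECTED - installed)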
Benchmarking Models
Common pitfalls in LLM benchmarking
Let’s examine a naive benchmark approach to understand what not to do:
time ollama run codellama:7b-code-q4_K_M "Write a simple function"
# Output: [tokens generated with varying latency]
# Result: 0.05s user 0.05s system 0% cpu 32.273 total
This approach has several limitations:
- It combines model loading and inference in a single timing
- It uses a non-deterministic prompt that produces varying outputs
- It lacks crucial metrics like time-to-first-token
- It doesn't account for run-to-run variance
- GPU utilization isn't properly measured
Best Practices for LLM Benchmarking
Prompt Design: Deterministic Prompts with Testable Output
- Use prompts that consistently produce the same outputs given the same inputs
- Specify structured output formats (e.g. specific JSON format)
- Include validation cases with expected outputs for testing
Measurement Methodology
- Separate cold start (first run) and warm start measurements
- Track key metrics (the sketch after this list shows one way to capture them):
  - Time to first token
  - Total generation time
  - Token throughput [tokens/sec]
  - Peak memory usage
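As an illustration of how these metrics can be collected, here is a minimal sketch that streams tokens from Ollama's REST API (POST /api/generate on the default port 11434) and records time to first token and a rough throughput. The field names (response, done, eval_count) follow the current Ollama API and may change between versions; peak memory is not covered here.
import json
import time

import requests

OLLAMA_GENERATE = "http://localhost:11434/api/generate"

def measure_generation(model, prompt):
    """Stream one generation and record time-to-first-token and throughput."""
    start = time.time()
    first_token_at = None
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(OLLAMA_GENERATE, json=payload, stream=True, timeout=600) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # The first non-empty "response" fragment marks the first token
            if first_token_at is None and chunk.get("response"):
                first_token_at = time.time()
            if chunk.get("done"):
                total = time.time() - start
                tokens = chunk.get("eval_count", 0)
                return {
                    "time_to_first_token_s": (first_token_at or time.time()) - start,
                    "total_time_s": total,
                    # Rough throughput: total time also includes prompt evaluation
                    "tokens_per_s": tokens / total if total else 0.0,
                }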
Applying these criteria to the naive benchmark shown earlier makes its shortcomings concrete:
- Timing the whole process hides the details: on a first execution, the measurement includes loading the model into memory as well as generating tokens.
- The prompt “Write a simple function” is not deterministic, so the output, and therefore the timing, can vary widely from run to run.
- The 0% cpu figure hints that the GPU is doing the actual work, which the time command does not capture.
To compare multiple models fairly, we need stable prompts and a more careful measurement procedure.
For benchmarking different language models, having a stable prompt is crucial. Here are some key principles; a concrete prompt example follows the list:
- Deterministic Tasks: Choose tasks which have clear, unambiguous expected outputs
- Structured Output Requirements: Include explicit format requirements to make outputs comparable
- Context-Independent Prompts: Avoid prompts which can produce different results based on time or context.
- Complex Bounded Tasks: include well-bounded, more complex tasks for testing more comprehensive capabilities
- Standardized Test Cases: Include specific test cases in the prompt.
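To make these principles concrete, here is one possible deterministic prompt with a structured output requirement and an embedded test case (the wording is illustrative, not the exact prompt used by the benchmark script):
# A deterministic, structured prompt: fixed task, fixed output format, known test case
PROMPT = """Write a Python function is_palindrome(s) that returns True if the
string s reads the same forwards and backwards, ignoring case.
Respond with JSON only, in the format:
{"code": "<function source>", "test": "is_palindrome('Level') == True"}"""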
Once a set of stable prompts is defined, we should also:
- Run each prompt multiple times to account for variance
- Measure both cold start and warm start times
- Track time to first token, total generation time, token generation rate, memory usage, and output consistency across runs
Below you can see a sample Python script we could use to benchmark the model
import statistics
import subprocess
import time

def run_model(model, prompt):
    """Shell out to the ollama CLI and return the complete response text."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def benchmark_prompt(model, prompt, runs=10):
    """Time several complete generations and summarize the distribution."""
    times = []
    for _ in range(runs):
        start = time.time()
        run_model(model, prompt)
        times.append(time.time() - start)
    return {
        'min': min(times),
        'max': max(times),
        'mean': statistics.mean(times),
        'median': statistics.median(times),
        'std_dev': statistics.stdev(times)
    }
My actual benchmark script is a bit more complex and interfaces directly with Ollama running CodeLlama. You can find the whole script on GitHub: <TODO-add link>.
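As an illustrative usage (assuming the run_model and benchmark_prompt helpers above and the PROMPT example defined earlier), the harness can be pointed at the three pulled variants:
MODELS = [
    "codellama:7b-code-q4_K_M",
    "codellama:13b-code-q4_K_M",
    "codellama:34b-code-q4_K_M",
]

for model in MODELS:
    # Warm-up run so the timed runs measure warm-start performance
    run_model(model, "print('warm-up')")
    stats = benchmark_prompt(model, PROMPT, runs=5)
    print(model, {k: round(v, 2) for k, v in stats.items()})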
Next, we:
- Download the selected models
- Set up the tools to interface with the codebase
Appendix
Architecture-Specific Performance
Here is a comparison of two mainstream architectures:
Apple Silicon (ARM)
- Unified memory architecture improves memory bandwidth
- Metal API optimizations for ML workloads
- Native support for quantized models via Core ML
- Efficient power consumption
- Performance scales with chip variant (M1, M2, M3, M4)
- Memory limitations based on integrated system design
x86 + NVIDIA GPU (CISC)
- Dedicated VRAM separate from system RAM
- CUDA acceleration for optimal performance
- Higher power consumption but potentially faster inference
- Flexible memory configuration
- Benefits from PCIe bandwidth for data transfer
- Multiple GPU configurations possible
All variants perform optimally with their respective hardware accelerations (Metal for Apple Silicon, CUDA for NVIDIA GPUs) and SSD storage. While CPU-only operation is possible on both architectures, hardware acceleration significantly improves inference speed.
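As a small practical aid, the sketch below guesses which acceleration path applies on the current machine. It is only a heuristic (it checks the CPU architecture and looks for the nvidia-smi tool), not a definitive capability test:
import platform
import shutil

def likely_accelerator():
    """Heuristic guess at the available hardware acceleration."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "Apple Silicon (Metal)"
    if shutil.which("nvidia-smi"):
        return "NVIDIA GPU (CUDA)"
    return "CPU only"

print(likely_accelerator())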