Context
OpenAI added Skills to Codex this week. Anthropic released the Agent Skills spec a few days ago.
The format: reusable prompt files (SKILL.md) that agents can invoke:
code-review.md # Find security vulnerabilities
sql-optimization.md # Optimize queries
api-design.md # Review endpoints

These are 100-200 line markdown files with instructions, examples, and output formats.
A Skill can be as narrow as "Replace text in a PDF," which probably doesn't need any optimization at all. On the other hand, it can be as broad as "Find security issues in the code," which we could optimize to no end.
Skills are structured prompts. DSPy is a framework for optimizing prompts. Nobody's connected these two ecosystems yet.
So the main idea: if Skills are just structured prompts, they should be programmatically optimizable. You could benchmark them, version them, and auto-generate model-specific variants.
I tested this hypothesis with a code security review skill.
Method
DSPy is a prompt optimization framework from Stanford. Instead of manually tweaking prompts, you:
- Define task as a signature (inputs/outputs)
- Provide training examples
- Run optimizer
- Get improved prompt with auto-selected few-shot examples
The optimizer tests different instruction phrasings, example combinations, and reasoning strategies, then picks the best performing variant.
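A minimal sketch of that loop, using a toy summarization task and a placeholder metric (the Ollama model string is just an assumption - point dspy.LM at whatever backend you use):

import dspy

# Assumption: a local Ollama model; any LiteLLM identifier works the same way
dspy.configure(lm=dspy.LM("ollama_chat/qwen3"))

# 1. Define the task as a signature (inputs/outputs)
class Summarize(dspy.Signature):
    """Summarize the document in two sentences."""
    document = dspy.InputField()
    summary = dspy.OutputField()

# 2. Provide training examples
trainset = [
    dspy.Example(document="<long text>", summary="<two-sentence summary>").with_inputs("document"),
]

# 3. Run an optimizer against a quality metric (toy metric here)
def metric(example, prediction, trace=None):
    return 1.0 if prediction.summary else 0.0

optimizer = dspy.BootstrapFewShot(metric=metric)

# 4. Get an improved module with auto-selected few-shot examples
optimized = optimizer.compile(dspy.Predict(Summarize), trainset=trainset)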
The Pipeline
To test if Skills can be optimized with DSPy:
1. Parse SKILL.md → DSPy Signature
2. Create training data (vulnerable code + expected issues)
3. Run DSPy optimization
4. Extract improvements → SKILL-optimized.md

What this proves: If DSPy can improve a Skill's output quality, then Skills aren't just static markdown files - they're improvable programs.
I tested with a code security review skill that finds SQL injection, weak crypto, etc.
Setup
Input: code-review.md - 80-line skill that finds security vulnerabilities
Training data: 10 vulnerable code samples:
- SQL injection (Python, JavaScript)
- Weak crypto (MD5, hardcoded secrets)
- Command injection
- XSS, path traversal, IDOR, etc.
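For illustration, one such sample might look like this (a hypothetical snippet; the actual files live in the repo):

# examples/sql_injection.py (illustrative sample, not the exact repo file)
import sqlite3

def authenticate_user(username, password):
    conn = sqlite3.connect("app.db")
    # Vulnerable: user input is concatenated straight into the SQL string
    query = f"SELECT * FROM users WHERE name = '{username}' AND password = '{password}'"
    return conn.execute(query).fetchone()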
DSPy Signature:
class CodeReview(dspy.Signature):
    """Find security vulnerabilities and bugs in code."""
    code = dspy.InputField()
    language = dspy.InputField()
    critical_issues = dspy.OutputField()
    high_issues = dspy.OutputField()

Optimization:
optimizer = dspy.BootstrapFewShot(metric=quality_metric)
optimized = optimizer.compile(baseline, trainset=examples)

Models tested: GPT-4o (Azure), Qwen3 (Ollama)
Results
Bottom line: Yes, Skills can be programmatically optimized.
I tested on two very different models to see whether the concept holds beyond a single setup:
- GPT-4o (frontier model, Azure)
- Qwen3 (local model, Ollama)
The results show something interesting about Skills themselves.
GPT-4o (Azure)
Baseline: 40.6% quality score
baseline = dspy.Predict(CodeReview)

Optimized variants:
# Chain of Thought
cot = dspy.ChainOfThought(CodeReview)
# 39.2%
# Enhanced signature
enhanced = dspy.ChainOfThought(EnhancedCodeReview)
# 38.8%
# BootstrapFewShot
optimized = optimizer.compile(module, trainset=examples)  # 10 training examples
# ~40% ❌

Result: 0% improvement. Some variants performed worse.
What this tells us: The baseline code-review.md Skill is already well-designed for frontier models. DSPy's optimization suggests the Skill is solid - it can't find ways to improve it further for GPT-4o.
Qwen3 (Ollama, local)
Tested with REAL DSPy BootstrapFewShot optimization on 3 code examples:
BASELINE: No optimization
Testing 1/3... Score: 61.6%
Testing 2/3... Score: 56.0%
Testing 3/3... Score: 10.0%
- Baseline Average: 42.5%
OPTIMIZED: BootstrapFewShot (automated)
• DSPy automatically selected best examples
• Generated optimized prompt with demonstrations
• No manual prompt engineering
Testing 1/3... Score: 77.0%
Testing 2/3... Score: 65.7%
Testing 3/3... Score: 22.5%
- Optimized Average: 55.1%

Result: BootstrapFewShot improved the average by 12.5 percentage points (42.5% → 55.1%).
What this tells us: For a smaller local model, DSPy's automated optimization delivers a clear gain over the unoptimized baseline. The optimizer selected which training examples work best and compiled them into an optimized module - exactly what DSPy is designed to do.
Discussion
What This Enables for Skills
1. Skills can be benchmarked objectively
Right now, Skills are shared as markdown files with no quality metrics. With DSPy optimization:
my-skill/
├── SKILL.md # Baseline
├── TRAINING.json # Test cases
├── METRICS.json # Quality scores
└── OPTIMIZED.md # Auto-generated variants

You could compare Skills objectively: "code-review-v2.md scores 15% better on SQL injection detection"
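A sketch of how that comparison could be scored with DSPy's built-in Evaluate helper - devset, quality_metric, baseline_skill, and optimized_skill are placeholders for your own test cases, metric, and modules:

from dspy.evaluate import Evaluate

# Score two Skill variants on the same held-out test cases
evaluator = Evaluate(devset=devset, metric=quality_metric, display_progress=True)

baseline_score = evaluator(baseline_skill)      # e.g. the module built from SKILL.md
optimized_score = evaluator(optimized_skill)    # e.g. the BootstrapFewShot-compiled module

print(f"baseline: {baseline_score} vs optimized: {optimized_score}")

Depending on your DSPy version, the evaluator call returns either a plain score or a richer result object.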
2. Skills can be auto-adapted per model
Instead of one-size-fits-all:
- code-review-gpt4.md - minimal (frontier models work well at baseline)
- code-review-qwen.md - heavy few-shot (local models need examples)
- Auto-generated based on model capabilities
3. Skills become versionable and improvable
Like code dependencies:
{
  "skill": "code-review",
  "version": "2.1.0",
  "optimized_for": ["gpt-4o", "qwen3", "llama-3"],
  "quality_score": 0.82
}

Skills could evolve over time with community contributions to training data.
4. Skills can self-optimize as models improve
When GPT-6 releases, re-run optimization on existing Skills. No manual rewrites needed.
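A sketch of what that re-run could look like, reusing the CodeReview signature, quality_metric, and trainset from earlier; the model identifier is a placeholder:

import dspy
from dspy.teleprompt import BootstrapFewShot

# Point DSPy at the new model (placeholder identifier - swap in the new
# model's name once it exists)
dspy.configure(lm=dspy.LM("azure/your-deployment"))

# Re-run the exact same optimization - no prompts are rewritten by hand
optimizer = BootstrapFewShot(metric=quality_metric)
reoptimized = optimizer.compile(dspy.Predict(CodeReview), trainset=trainset)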
Try It Yourself (5-Minute Setup)
All code is available at github.com/instavm/skill-optimization.
Quick Start (Local, No API Keys)
# Install Ollama: https://ollama.ai
ollama pull qwen3
# Install dependencies
pip install -r requirements.txt
# Run REAL DSPy BootstrapFewShot optimization
python scripts/optimize_qwen.py

You'll see the before/after quality comparison in ~5 minutes.
With Your Own OpenAI/Azure Key
# Set environment variables
export AZURE_API_KEY="your-key"
export AZURE_API_BASE="your-endpoint"
export AZURE_DEPLOYMENT="your-deployment"
# Run optimization
python scripts/run_azure_optimization.py

What You'll See
Qwen output (before optimization):
1. SQL Injection Vulnerabilities: Functions concatenate user inputs
2. Insecure Password Hashing: Uses MD5

Qwen output (after optimization):
SQL Injection - User input concatenated allows arbitrary SQL execution.
Attackers can inject malicious SQL to bypass authentication or extract data.
Fix: Use parameterized queries: cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))

The difference is in the specificity and actionability.
How to Optimize Your Own Skill.md
Want to optimize your own Skill? Here's the actual workflow:
Step 1: Understand Your Skill's Structure
Look at your my-skill.md and identify:
- Inputs: What data does the skill receive? (e.g., code, user query, document)
- Outputs: What should it produce? (e.g., bug report, summary, recommendations)
- Task: What's the core transformation? (e.g., "find security bugs", "extract key points")
Example - for a code review skill:
Input: code snippet + language
Output: critical_issues, high_issues, medium_issues
Task: "Find security vulnerabilities in code"Step 2: Create Training Examples (Manual)
You need to manually create 5-10 examples with expected outputs. This is the hardest part.
For code review, I created data/training_data.json:
{
  "file_path": "examples/sql_injection.py",
  "expected_issues": [
    {
      "title": "SQL Injection",
      "severity": "Critical",
      "description": "User input concatenated into SQL query",
      "locations": ["authenticate_user:10"],
      "fix": "Use parameterized queries"
    }
  ]
}

You write these by hand, or draft them with an LLM and verify them manually - DSPy doesn't auto-generate training data (yet).
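Once the JSON exists, turning it into DSPy examples is a few lines. A sketch, assuming the file holds a list of records shaped like the one above:

import json
from pathlib import Path
import dspy

records = json.loads(Path("data/training_data.json").read_text())

trainset = []
for rec in records:
    source = Path(rec["file_path"])
    trainset.append(
        dspy.Example(
            code=source.read_text(),                          # the vulnerable sample itself
            language=source.suffix.lstrip("."),               # crude: "py", "js", ...
            expected_issues=json.dumps(rec["expected_issues"]),  # kept for the metric to compare against
        ).with_inputs("code", "language")
    )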
Step 3: Define DSPy Signature
Convert your Skill to DSPy code:
import dspy

class MySkill(dspy.Signature):
    """Your skill's task description."""
    input_field = dspy.InputField(desc="What goes in")
    output_field = dspy.OutputField(desc="What comes out")

# Create baseline module
baseline = dspy.Predict(MySkill)

Step 4: Create Evaluation Metric
Write a function that scores output quality (0-100%):
def quality_metric(example, prediction, trace=None):
    score = 0
    # Check if output has required elements
    if "key_term" in prediction.output_field:
        score += 25
    if len(prediction.output_field) > 100:
        score += 25
    # ... more checks
    return score / 100  # Return 0.0-1.0

Step 5: Run BootstrapFewShot
from dspy.teleprompt import BootstrapFewShot

# Load your training examples
trainset = [dspy.Example(**ex).with_inputs('input_field')
            for ex in training_data]

# Run optimization
optimizer = BootstrapFewShot(
    metric=quality_metric,
    max_bootstrapped_demos=3,
    max_labeled_demos=3
)
optimized = optimizer.compile(baseline, trainset=trainset)

This takes 5-10 minutes and automatically selects which examples improve performance.
Step 6: Extract Results Back to Skill.md
DSPy gives you an optimized module with:
- Better instruction phrasing
- Auto-selected few-shot examples
- Improved reasoning strategies
You manually create a new my-skill-optimized.md by:
- Looking at what examples DSPy selected
- Adding those examples to your Skill.md
- Updating instructions based on what worked
OR keep using the DSPy module directly in your code (no need for .md file).
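If you go the manual route, it helps to see exactly what the optimizer picked. A sketch, assuming a recent DSPy where predictors expose their selected demos:

# Inspect which demonstrations BootstrapFewShot baked into each predictor
for name, predictor in optimized.named_predictors():
    print(f"{name}: {len(predictor.demos)} selected demos")
    for demo in predictor.demos:
        print(demo)

# Persist the compiled module so it can be reloaded without re-optimizing
optimized.save("my_skill_optimized.json")

The demos are ordinary dspy.Example objects, so you can copy them into your Skill.md more or less verbatim.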
What You DON'T Do
- DSPy doesn't parse your Skill.md automatically
- DSPy doesn't generate training examples for you
- DSPy doesn't output a new Skill.md file
What You Do
- You manually convert Skill → DSPy Signature
- You manually create training examples
- DSPy automatically finds which examples work best
- You manually update your Skill.md with the results (or use DSPy module directly)
The automation is in finding WHICH examples to use, not creating them.
What I Learned
1. Skills are programs, not just documentation
They can be optimized, benchmarked, versioned, and improved systematically. They're not static markdown files.
2. Skills can be model-agnostic
Write once, auto-optimize per model. The same Skill works on GPT-4o (baseline) and Qwen (optimized variant).
3. DSPy validates Skill quality
When DSPy can't improve a Skill (like GPT-4o at 0%), that's validation the Skill is well-designed. Optimization isn't just about improvement - it's about measurement.
4. Skills need infrastructure
To make this practical, we need:
- Standardized training data formats
- Quality benchmarks
- Auto-optimization tooling
- Model-specific variant generation
Open Questions
Should Skills ship with training data? Like:
SKILL.md + TRAINING.json + benchmarks. This would enable community optimization.
Can we standardize Skill quality metrics? My metrics are custom. We need standard benchmarks for comparing Skills objectively.
Do optimizations transfer across model families? If I optimize for Qwen, does it work for Llama? Mistral? This would reduce optimization overhead.
Can we auto-generate training data? I hand-wrote 10 vulnerable code examples. Could GPT-4 generate these systematically?
What about Skill versioning? How do we track improvements?
code-review@2.1.0 with changelog and quality deltas?
Conclusion
Skills are exploding. OpenAI yesterday, Anthropic three days ago. Everyone's building new Skills.
But here's what nobody's talking about: Skills can be programmatically optimized.
What This Proves
1. Skills are not just markdown files
They're structured prompts that can be:
- Benchmarked with quality metrics
- Optimized for different models
- Versioned and improved over time
- Auto-generated in model-specific variants
2. The Skills + DSPy connection unlocks infrastructure
Current state: Skills are manually crafted, shared as .md files, no quality metrics
Possible future:
my-skill/
├── SKILL.md # Baseline
├── TRAINING.json # Test cases
├── OPTIMIZED.md # Auto-generated variants
└── METRICS.json # Quality benchmarks

This enables:
- "This Skill scores 82% on security detection benchmarks"
- "Auto-optimized for GPT-4o, Qwen3, Llama 3.1"
- "Version 2.1 - +15% improvement over v2.0"
- Community contributions to training data
3. Skills become model-agnostic
Write the Skill once. Auto-optimize for:
- Frontier models (minimal prompting needed)
- Local models (heavy few-shot guidance)
- Future models (re-optimize when GPT-5 drops)
No manual prompt engineering per model.
Limitations & Next Steps
This is early exploration. I only tested:
- One skill type (code review)
- Two models (GPT-4o, Qwen3)
- Simple metrics and a small training set (10 examples)
Questions to explore:
- Does this work for other skill types? (data analysis, API design, etc.)
- Do optimizations transfer across model families?
- Can we auto-generate training data with GPT-4?
- What's the right standardized format for Skill training data?
I'm sharing this as: "These two ecosystems connect nicely - here's proof it works."
Not as: "This is the final solution."
Try it yourself: https://github.com/instavm/skill-optimization
Appendix: Technical Implementation
DSPy Signatures
class CodeReview(dspy.Signature):
    """Find security vulnerabilities and bugs in code."""
    code = dspy.InputField(desc="Source code to analyze")
    language = dspy.InputField(desc="Programming language")
    critical_issues = dspy.OutputField(desc="Critical security vulnerabilities")
    high_issues = dspy.OutputField(desc="High priority bugs")

Evaluation Metric
def evaluate_quality(output, expected):
    # Parse issues from the model output (parse_issues is a repo helper)
    issues = parse_issues(output)

    # Calculate precision/recall
    # matched = count of predicted issues that overlap with expected ones (computed by repo helpers)
    precision = matched / len(issues)
    recall = matched / len(expected)
    f1 = 2 * precision * recall / (precision + recall)

    # Evaluate explanation quality
    has_impact = check_for_impact_explanation(issues)
    has_fix = check_for_fix_suggestion(issues)

    # Weighted score
    return 0.4 * f1 + 0.3 * has_impact + 0.3 * has_fix

Optimization Loop
# Create optimizer
optimizer = dspy.BootstrapFewShot(
    metric=quality_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4
)

# Compile
optimized = optimizer.compile(
    student=module,
    trainset=training_examples
)

# The optimized module now has:
# - 4 auto-selected few-shot examples
# - Demonstrations of the expected reasoning and output format
# (BootstrapFewShot selects demonstrations; it doesn't rewrite the instructions themselves)

Results Data
{
  "gpt4o": {
    "baseline": 0.406,
    "optimized": 0.388,
    "improvement": -0.044
  },
  "qwen": {
    "baseline": 0.425,
    "optimized": 0.551,
    "improvement": 0.125,
    "relative_improvement": 0.294
  }
}

All data, code, and examples available in the repository.