Vertigo LoRA

Domain-specialized models for Roblox/Luau game development on Apple Silicon

v0.5 · Production · Qwen3.5-4B · MLX / Apple Silicon

Site last updated: March 16, 2026
Production adapter: v0.5-4b-curated — promoted March 15, 2026
Benchmark version: v2 (real MCP tools) — revised March 15, 2026
Training data: 3,301 curated examples — last expanded March 16, 2026
OpenGameEval run: 47 tasks dry-run — March 16, 2026
Base model: Qwen3.5-4B-4bit (Apache 2.0)
Repository: github.com/adpena/vertigo-lora
Next milestone: 128GB machine (March 19) — 9B training, full-rank 4B, Studio execution eval

Models

| Model | Type | Size | Link |
|---|---|---|---|
| Vertigo-Qwen3.5-4B-v0.5-4bit | Fused (ready to use) | 2.2 GB | HuggingFace |
| Vertigo-Qwen3.5-4B-v0.5-lora | LoRA adapter only | 62 MB | HuggingFace |
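The fused model can be run locally with the mlx-lm package. A minimal sketch; the HuggingFace repo id below is assumed from the table above, so verify the exact id via the linked repos before use:

```shell
# Install MLX LM (Apple Silicon) and generate with the fused model.
# Repo id is an assumption based on the model name above.
pip install mlx-lm

mlx_lm.generate \
  --model adpena/Vertigo-Qwen3.5-4B-v0.5-4bit \
  --prompt "Write a Luau function that anchors every Part in workspace." \
  --max-tokens 256
```

The fused variant needs no adapter merging at load time; the 62 MB LoRA adapter is the right choice only if you already have the 4-bit base model on disk.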

Results

Scoring method: pattern matching. These benchmarks check whether model output contains expected keywords and code patterns. They do NOT verify that generated code compiles, runs, or produces correct behavior. See full caveats below.
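To make concrete what pattern matching does (and does not) verify, here is a minimal sketch of this style of scoring. The function and patterns are illustrative, not the actual Vertigo benchmark harness:

```python
import re

def pattern_match_score(output: str, expected_patterns: list[str]) -> float:
    """Fraction of expected regex patterns found in the model output.

    Illustrative sketch of pattern-match scoring: it checks surface
    patterns only and never compiles or executes the generated Luau.
    """
    if not expected_patterns:
        return 0.0
    hits = sum(1 for p in expected_patterns if re.search(p, output))
    return hits / len(expected_patterns)

sample_output = """
local part = Instance.new("Part")
part.Anchored = true
part.Parent = workspace
"""
patterns = [
    r'Instance\.new\("Part"\)',
    r"\.Anchored\s*=\s*true",
    r"\.Parent\s*=\s*workspace",
]
print(pattern_match_score(sample_output, patterns))  # 1.0
```

Note that an output containing all three patterns scores 1.0 even if the surrounding code would not run in Studio, which is exactly the limitation the caveat above describes.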

Vertigo Benchmark v2 (30 tasks)

| Model | Params | Coding | Bugfix | Arch | MCP | Embody | Overall |
|---|---|---|---|---|---|---|---|
| Qwen3.5-27B dense | 27B | 80.6% | 96.7% | 79.1% | 100% | 95.0% | 88.7% |
| Vertigo-Qwen3.5-4B-v0.5 | 4B | 72.5% | 90.0% | 76.6% | 85.8% | 100% | 82.9% |
| Qwen3.5-35B-A3B | 3B active | 79.2% | 75.3% | 66.5% | 96.7% | 77.5% | 79.1% |
| Qwen3.5-4B base | 4B | 63.7% | 83.3% | 67.5% | 97.5% | 75.0% | 75.1% |
| Qwen3.5-2B | 2B | 45.0% | 81.7% | 54.2% | 70.0% | 95.0% | 65.1% |
| Qwen3.5-9B | 9B | 25.6% | 76.7% | 61.6% | 96.7% | 95.0% | 63.5% |

OpenGameEval Dry-Run (47 Roblox game dev tasks)

| Model | Pass@1 (dry) | Method |
|---|---|---|
| Vertigo-Qwen3.5-4B-v0.5 | 83.0% | Pattern-match |
| Qwen3.5-4B base | 72.3% | Pattern-match |
| Qwen3.5-27B dense | 48.9% | Pattern-match |
| Qwen3.5-35B-A3B | 42.6% | Pattern-match |

Published OpenGameEval Leaderboard (different methodology)

Not comparable. The table above uses pattern-match scoring on code generation quality. The table below uses Roblox Studio execution verification. These measure fundamentally different things. A direct comparison between 83.0% (pattern-match) and 55.3% (execution-verified) is invalid.
| Model | Pass@1 | Method |
|---|---|---|
| Gemini 3.1 Pro | 55.3% | Studio execution |
| Claude Opus 4.6 | 51.9% | Studio execution |
| Claude Opus 4.5 | 44.5% | Studio execution |
| GPT-5.4 | 35.1% | Studio execution |

Source: Roblox OpenGameEval Leaderboard

Caveats & Limitations

Benchmark limitations, data contamination, and model limitations are documented in the evaluation methodology document and contamination audit, both available on request (see Contact & Feedback below).

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-4bit |
| Method | QLoRA via Apple MLX |
| Hardware | Apple M5 Max, 36 GB unified memory |
| Rank / Layers | 8 / 8 (of 28) |
| Learning rate | 2e-6 (flat) |
| Iterations | 600 |
| Sequence length | 2048 |
| Training examples | 3,301 curated (from 3,893 raw) |
| Validation loss | 0.857 |
| Training time | ~45 minutes |
| Peak memory | 27.3 GB |
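For reference, mlx-lm's LoRA trainer accepts a YAML config. A sketch with the hyperparameters from the table above; the key names follow recent mlx-lm releases and may differ across versions, and the base-model repo id is an assumption:

```yaml
# Sketch of an mlx_lm.lora config (run with: mlx_lm.lora --config lora.yaml).
# Key names may vary by mlx-lm version; values mirror the table above.
model: "mlx-community/Qwen3.5-4B-4bit"   # assumed repo id for the 4-bit base
train: true
data: "data/"            # directory containing train.jsonl / valid.jsonl
num_layers: 8            # LoRA applied to 8 of the 28 transformer layers
learning_rate: 2e-6      # flat schedule, no warmup or decay
iters: 600
max_seq_length: 2048
lora_parameters:
  rank: 8
```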

Dataset sources (all rights-clean)

| Source | Examples | License |
|---|---|---|
| Own codebase (Vertigo) | 631 | Proprietary (own) |
| OSS Roblox repos | 1,301 | Various OSS |
| Roblox Creator Docs | 806 | CC-BY-4.0 |
| Ecosystem tools (Luau, Rojo, Wally, Selene, roblox-ts) | 61 | MIT / MPL-2.0 |
| Generated (synthetic, STaR, distillation, composition) | 502 | Generated |

No live Roblox experiences, Creator Store assets, player data, or rate-limited content were used. All examples include provenance metadata (source, rights basis, license).
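A per-example provenance check along these lines is easy to enforce mechanically. The record shape and field names below are illustrative, not the project's actual schema; only the three tracked fields (source, rights basis, license) come from the text above:

```python
# Illustrative shape of a per-example provenance record. Field names other
# than the three listed in the writeup (source, rights basis, license) are
# hypothetical.
example = {
    "text": 'local part = Instance.new("Part")\npart.Parent = workspace',
    "provenance": {
        "source": "Roblox Creator Docs",  # where the example came from
        "rights_basis": "open license",   # why it is rights-clean to train on
        "license": "CC-BY-4.0",           # license of the source material
    },
}

def is_rights_clean(ex: dict) -> bool:
    """Reject any example missing one of the three provenance fields."""
    prov = ex.get("provenance", {})
    return all(prov.get(k) for k in ("source", "rights_basis", "license"))

print(is_rights_clean(example))  # True
```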

Key Finding: Quality Over Quantity

The most important discovery from this project: removing 550 low-quality examples improved the overall benchmark score from 63.6% to 82.8%.

| Version | Training examples | Overall |
|---|---|---|
| v0.3 (expanded, uncurated) | 4,852 | 63.6% |
| v0.4 (curated, code-dense only) | 4,302 | 82.8% |

Examples without substantial code (prose-only explanations, tool-calling prompts without code output, gameplay session descriptions) actively degraded the model when included in training data. The curation rule: every example must contain ≥5 lines of Luau code in fenced code blocks.
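The ≥5-lines rule is straightforward to mechanize. A minimal sketch of such a filter; the regex and `keep` helper are illustrative, not the actual curation script:

```python
import re

MIN_CODE_LINES = 5  # curation threshold from the writeup
FENCE = "`" * 3     # triple backtick, built indirectly for readability

def code_line_count(example_text: str) -> int:
    """Count non-blank lines inside fenced code blocks."""
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    blocks = re.findall(pattern, example_text, flags=re.DOTALL)
    return sum(1 for b in blocks for line in b.splitlines() if line.strip())

def keep(example_text: str) -> bool:
    """Curation rule: keep only examples with >= 5 lines of fenced code."""
    return code_line_count(example_text) >= MIN_CODE_LINES

code = "\n".join(f"local v{i} = {i}" for i in range(5))
ex = f"Spawn parts.\n{FENCE}lua\n{code}\n{FENCE}\n"
print(keep(ex), keep("Prose-only explanation."))  # True False
```

A filter like this would pass code-dense examples and drop exactly the prose-only and tool-calling-without-code categories described above.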

Contact & Feedback

This is an early research release. Feedback, questions, and collaboration inquiries are welcome.

Alejandro Peña
GitHub: adpena/vertigo-lora
HuggingFace: @adpena
Email: adpena@vertigo.build

The evaluation methodology document, contamination audit, and full training logs are available on request; please reach out directly.