Vertigo LoRA

Domain-specialized models for Roblox/Luau game development on Apple Silicon

v0.5 · Production · Qwen3.5-4B · MLX / Apple Silicon

Site last updated: March 16, 2026
Production adapter: v0.5-4b-curated — promoted March 15, 2026
Benchmark version: v2 (real MCP tools) — revised March 15, 2026
Training data: 3,301 curated examples — last expanded March 16, 2026
OpenGameEval run: 47 tasks dry-run — March 16, 2026
Base model: Qwen3.5-4B-4bit (Apache 2.0)
Repository: github.com/adpena/vertigo-lora
Next milestone: 128GB machine (March 19) — 9B training, full-rank 4B, Studio execution eval

Models

| Model | Type | Size | Link |
|---|---|---|---|
| Vertigo-Qwen3.5-4B-v0.5-4bit | Fused (ready to use) | 2.2 GB | HuggingFace |
| Vertigo-Qwen3.5-4B-v0.5-lora | LoRA adapter only | 62 MB | HuggingFace |
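The fused model can be run locally with the mlx-lm package. A minimal sketch; the HuggingFace repo id below is assumed from the table above, so verify the exact id via the linked repos before use:

```shell
# Install MLX LM (Apple Silicon) and generate with the fused model.
# Repo id is an assumption based on the model name above.
pip install mlx-lm

mlx_lm.generate \
  --model adpena/Vertigo-Qwen3.5-4B-v0.5-4bit \
  --prompt "Write a Luau function that anchors every Part in workspace." \
  --max-tokens 256
```

The fused variant needs no adapter merging at load time; the 62 MB LoRA adapter is the right choice only if you already have the 4-bit base model on disk.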

Results

Scoring method: pattern matching. These benchmarks check whether model output contains expected keywords and code patterns. They do NOT verify that generated code compiles, runs, or produces correct behavior. See full caveats below.
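To make concrete what pattern matching does (and does not) verify, here is a minimal sketch of this style of scoring. The function and patterns are illustrative, not the actual Vertigo benchmark harness:

```python
import re

def pattern_match_score(output: str, expected_patterns: list[str]) -> float:
    """Fraction of expected regex patterns found in the model output.

    Illustrative sketch of pattern-match scoring: it checks surface
    patterns only and never compiles or executes the generated Luau.
    """
    if not expected_patterns:
        return 0.0
    hits = sum(1 for p in expected_patterns if re.search(p, output))
    return hits / len(expected_patterns)

sample_output = """
local part = Instance.new("Part")
part.Anchored = true
part.Parent = workspace
"""
patterns = [
    r'Instance\.new\("Part"\)',
    r"\.Anchored\s*=\s*true",
    r"\.Parent\s*=\s*workspace",
]
print(pattern_match_score(sample_output, patterns))  # 1.0
```

Note that an output containing all three patterns scores 1.0 even if the surrounding code would not run in Studio, which is exactly the limitation the caveat above describes.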

Vertigo Benchmark v2 (30 tasks)

| Model | Params | Coding | Bugfix | Arch | MCP | Embody | Overall |
|---|---|---|---|---|---|---|---|
| Qwen3.5-27B dense | 27B | 80.6% | 96.7% | 79.1% | 100% | 95.0% | 88.7% |
| Vertigo-Qwen3.5-4B-v0.5 | 4B | 72.5% | 90.0% | 76.6% | 85.8% | 100% | 82.9% |
| Qwen3.5-35B-A3B | 3B active | 79.2% | 75.3% | 66.5% | 96.7% | 77.5% | 79.1% |
| Qwen3.5-4B base | 4B | 63.7% | 83.3% | 67.5% | 97.5% | 75.0% | 75.1% |
| Qwen3.5-2B | 2B | 45.0% | 81.7% | 54.2% | 70.0% | 95.0% | 65.1% |
| Qwen3.5-9B | 9B | 25.6% | 76.7% | 61.6% | 96.7% | 95.0% | 63.5% |

OpenGameEval Dry-Run (47 Roblox game dev tasks)

| Model | Pass@1 (dry) | Method |
|---|---|---|
| Vertigo-Qwen3.5-4B-v0.5 | 83.0% | Pattern-match |
| Qwen3.5-4B base | 72.3% | Pattern-match |
| Qwen3.5-27B dense | 48.9% | Pattern-match |
| Qwen3.5-35B-A3B | 42.6% | Pattern-match |

Published OpenGameEval Leaderboard (different methodology)

Not comparable. The table above uses pattern-match scoring on code generation quality. The table below uses Roblox Studio execution verification. These measure fundamentally different things. A direct comparison between 83.0% (pattern-match) and 55.3% (execution-verified) is invalid.
| Model | Pass@1 | Method |
|---|---|---|
| Gemini 3.1 Pro | 55.3% | Studio execution |
| Claude Opus 4.6 | 51.9% | Studio execution |
| Claude Opus 4.5 | 44.5% | Studio execution |
| GPT-5.4 | 35.1% | Studio execution |

Source: Roblox OpenGameEval Leaderboard

Caveats & Limitations

Benchmark limitations, data contamination, and model limitations are documented in the evaluation methodology document and contamination audit, both available on request (see Contact & Feedback below).

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-4bit |
| Method | QLoRA via Apple MLX |
| Hardware | Apple M5 Max, 36 GB unified memory |
| Rank / Layers | 8 / 8 (of 28) |
| Learning rate | 2e-6 (flat) |
| Iterations | 600 |
| Sequence length | 2048 |
| Training examples | 3,301 curated (from 3,893 raw) |
| Validation loss | 0.857 |
| Training time | ~45 minutes |
| Peak memory | 27.3 GB |
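For reference, mlx-lm's LoRA trainer accepts a YAML config. A sketch with the hyperparameters from the table above; the key names follow recent mlx-lm releases and may differ across versions, and the base-model repo id is an assumption:

```yaml
# Sketch of an mlx_lm.lora config (run with: mlx_lm.lora --config lora.yaml).
# Key names may vary by mlx-lm version; values mirror the table above.
model: "mlx-community/Qwen3.5-4B-4bit"   # assumed repo id for the 4-bit base
train: true
data: "data/"            # directory containing train.jsonl / valid.jsonl
num_layers: 8            # LoRA applied to 8 of the 28 transformer layers
learning_rate: 2e-6      # flat schedule, no warmup or decay
iters: 600
max_seq_length: 2048
lora_parameters:
  rank: 8
```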

Dataset sources (all rights-clean)

| Source | Examples | License |
|---|---|---|
| Own codebase (Vertigo) | 631 | Proprietary (own) |
| OSS Roblox repos | 1,301 | Various OSS |
| Roblox Creator Docs | 806 | CC-BY-4.0 |
| Ecosystem tools (Luau, Rojo, Wally, Selene, roblox-ts) | 61 | MIT / MPL-2.0 |
| Generated (synthetic, STaR, distillation, composition) | 502 | Generated |

No live Roblox experiences, Creator Store assets, player data, or rate-limited content were used. All examples include provenance metadata (source, rights basis, license).
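A per-example provenance check along these lines is easy to enforce mechanically. The record shape and field names below are illustrative, not the project's actual schema; only the three tracked fields (source, rights basis, license) come from the text above:

```python
# Illustrative shape of a per-example provenance record. Field names other
# than the three listed in the writeup (source, rights basis, license) are
# hypothetical.
example = {
    "text": 'local part = Instance.new("Part")\npart.Parent = workspace',
    "provenance": {
        "source": "Roblox Creator Docs",  # where the example came from
        "rights_basis": "open license",   # why it is rights-clean to train on
        "license": "CC-BY-4.0",           # license of the source material
    },
}

def is_rights_clean(ex: dict) -> bool:
    """Reject any example missing one of the three provenance fields."""
    prov = ex.get("provenance", {})
    return all(prov.get(k) for k in ("source", "rights_basis", "license"))

print(is_rights_clean(example))  # True
```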

Key Finding: Quality Over Quantity

The most important discovery from this project: removing 550 low-quality examples improved the overall benchmark score from 63.6% to 82.8%.

| Version | Training examples | Overall |
|---|---|---|
| v0.3 (expanded, uncurated) | 4,852 | 63.6% |
| v0.4 (curated, code-dense only) | 4,302 | 82.8% |

Examples without substantial code (prose-only explanations, tool-calling prompts without code output, gameplay session descriptions) actively degraded the model when included in training data. The curation rule: every example must contain ≥5 lines of Luau code in fenced code blocks.
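The ≥5-lines rule is straightforward to mechanize. A minimal sketch of such a filter; the regex and `keep` helper are illustrative, not the actual curation script:

```python
import re

MIN_CODE_LINES = 5  # curation threshold from the writeup
FENCE = "`" * 3     # triple backtick, built indirectly for readability

def code_line_count(example_text: str) -> int:
    """Count non-blank lines inside fenced code blocks."""
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    blocks = re.findall(pattern, example_text, flags=re.DOTALL)
    return sum(1 for b in blocks for line in b.splitlines() if line.strip())

def keep(example_text: str) -> bool:
    """Curation rule: keep only examples with >= 5 lines of fenced code."""
    return code_line_count(example_text) >= MIN_CODE_LINES

code = "\n".join(f"local v{i} = {i}" for i in range(5))
ex = f"Spawn parts.\n{FENCE}lua\n{code}\n{FENCE}\n"
print(keep(ex), keep("Prose-only explanation."))  # True False
```

A filter like this would pass code-dense examples and drop exactly the prose-only and tool-calling-without-code categories described above.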

Contact & Feedback

This is an early research release. Feedback, questions, and collaboration inquiries are welcome.

Alejandro Peña
GitHub: adpena/vertigo-lora
HuggingFace: @adpena
Email: adpena@vertigo.build

The evaluation methodology document, contamination audit, and full training logs are available on request; please reach out directly.