DreamHouse: How Far Are VLMs from Constructing the Real World?

Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen

Salesforce AI Research


Abstract

We introduce DreamHouse, a benchmark for physical generative reasoning — the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. Grounded in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness, DreamHouse comprises over 26,000 structures spanning 13 architectural styles verified to LOD 350, paired with a deterministic 10-test structural validation framework. Unlike static benchmarks, it supports iterative agentic interaction: models observe intermediate build states, generate construction actions, and receive structured environmental feedback. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps largely invisible on existing leaderboards, establishing physical validity as a critical evaluation axis orthogonal to visual realism.

Structural Validation Suite

Topological Connectivity
T1 · Load path
σ(v)=1 ∀v ∈ V
Every member's load path terminates at a grounded element (z < 0.1 m). AABB adjacency with ε = 0.05 m tolerance.
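The load-path check can be sketched as a graph search over axis-aligned bounding boxes. This is a minimal illustration, not the benchmark's validator: the box representation `((xmin, ymin, zmin), (xmax, ymax, zmax))`, the function names, and the choice to treat adjacency as undirected are assumptions; only ε = 0.05 m and the z < 0.1 m grounding rule come from the description above.

```python
from collections import deque

EPS = 0.05       # AABB adjacency tolerance in metres (from the T1 description)
GROUND_Z = 0.1   # a member whose lowest point is below this counts as grounded

def adjacent(a, b, eps=EPS):
    """True if two AABBs overlap or lie within eps of each other on every axis."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] + eps and bmin[i] <= amax[i] + eps
               for i in range(3))

def grounded_flags(boxes):
    """Return sigma(v) per member: 1 if it reaches a grounded member, else 0."""
    n = len(boxes)
    sigma = [0] * n
    # Seed the search with every member that touches the ground.
    queue = deque(i for i, (lo, _) in enumerate(boxes) if lo[2] < GROUND_Z)
    for i in queue:
        sigma[i] = 1
    # Breadth-first search (assumption: adjacency is treated as undirected).
    while queue:
        i = queue.popleft()
        for j in range(n):
            if not sigma[j] and adjacent(boxes[i], boxes[j]):
                sigma[j] = 1
                queue.append(j)
    return sigma

# Hypothetical scene: a post, a beam resting on it, and a floating member.
boxes = [
    ((0.0, 0.0, 0.0), (0.1, 0.1, 2.4)),   # post standing on the ground
    ((0.0, 0.0, 2.4), (3.0, 0.1, 2.6)),   # beam on top of the post
    ((5.0, 5.0, 3.0), (6.0, 5.1, 3.2)),   # member with no support
]
sigma = grounded_flags(boxes)
t1_pass = all(sigma)
```

Here the floating member keeps σ = 0, so T1 fails for this scene. A stricter variant could restrict the search to supports located below each member.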
IRC Compliance
T2 · Span limits
L ≤ (1+τ) · L*
Joist and rafter spans checked against IRC look-up tables. Rafter span halved when purlins are present. Tolerance τ = 0.03.
IRC Compliance
T3 · On-centre spacing
s ∈ {406, 610} ±50 mm
Joist groups checked for standard 16″ (406 mm) or 24″ (610 mm) on-centre spacing, within 50 mm tolerance.
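A minimal sketch of the spacing rule follows. It assumes each joist group is reduced to a list of positions along the layout axis in millimetres, and that every consecutive gap must individually match a standard spacing. The validator may instead use a mean or median spacing, so treat this as an illustration only; the function name is hypothetical.

```python
STANDARD_SPACINGS_MM = (406.0, 610.0)  # 16 in and 24 in on-centre
TOL_MM = 50.0                          # tolerance from the T3 description

def spacing_ok(positions_mm):
    """True if every gap between neighbouring joists is near a standard spacing."""
    xs = sorted(positions_mm)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return all(
        any(abs(gap - std) <= TOL_MM for std in STANDARD_SPACINGS_MM)
        for gap in gaps
    )

regular = spacing_ok([0, 406, 812, 1218])   # 406 mm gaps
irregular = spacing_ok([0, 500, 1000])      # 500 mm gaps match neither standard
```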
IRC Compliance
T4 · Lumber dimensions
w ∈ {38, 89, 140} mm d ∈ {89…286} mm
Each framing member's two smallest cross-section dimensions matched to standard nominal lumber sizes (±10/20 mm tolerance).
Structural Physics
T5 · Deflection L/360
δ ≤ (1+0.08)·L/360
Mid-span deflection δ = 5wL⁴/(384EI) under a uniform load w = 1900 N/m with E = 12 GPa. Must satisfy δ ≤ 1.08 · L/360.
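As a rough illustration of the deflection check, the sketch below evaluates the simply supported beam formula for a rectangular section. The second moment of area I = b·d³/12, the function names, and the example 38 × 235 mm joist are assumptions added for illustration; only w, E, and the 1.08 · L/360 limit come from the description above.

```python
W_LOAD = 1900.0   # uniform line load in N/m (from the T5 description)
E_MOD = 12e9      # modulus of elasticity in Pa (12 GPa)
SLACK = 0.08      # tolerance on the L/360 limit

def midspan_deflection(span_m, width_m, depth_m):
    """Deflection of a simply supported beam: delta = 5 w L^4 / (384 E I)."""
    inertia = width_m * depth_m ** 3 / 12.0   # rectangular section (assumption)
    return 5.0 * W_LOAD * span_m ** 4 / (384.0 * E_MOD * inertia)

def passes_deflection(span_m, width_m, depth_m):
    """True if delta <= (1 + 0.08) * L / 360."""
    limit = (1.0 + SLACK) * span_m / 360.0
    return midspan_deflection(span_m, width_m, depth_m) <= limit

# Hypothetical 38 mm x 235 mm joist at two spans.
short_ok = passes_deflection(3.6, 0.038, 0.235)  # about 8.4 mm vs a 10.8 mm limit
long_ok = passes_deflection(4.0, 0.038, 0.235)   # about 12.8 mm vs a 12.0 mm limit
```

Because δ grows with L⁴ while the limit grows only linearly in L, a modest increase in span is enough to move the same section from passing to failing.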
Geometric Integrity
T6 · Roof coverage
ρ = |R| / |F| ≥ 0.70
Footprint partitioned into 1 m × 1 m cells. Fraction covered by rafter projections (margin 0.3 m) must be ≥ 70%.
Geometric Integrity
T7 · Gap detection
γ = |Q| / |F| ≤ 0.20
Gap ratio γ = uncovered / total footprint cells must be ≤ 20%. Complement of T6, retained for diagnostic granularity.
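The coverage ratio ρ (T6) and gap ratio γ (T7) can be computed in one pass over the footprint grid. The sketch below assumes a rectangular, axis-aligned footprint, rafter projections given as plan-view rectangles `(x0, y0, x1, y1)`, and a cell counted as covered when its centre lies inside a projection expanded by the 0.3 m margin. These representation choices and the function name are illustrative assumptions, not the benchmark's implementation.

```python
CELL = 1.0     # grid cell size in metres (from the T6 description)
MARGIN = 0.3   # margin added around each rafter projection

def coverage(footprint, rafters, cell=CELL, margin=MARGIN):
    """Return (rho, gamma) for a rectangular footprint (x0, y0, x1, y1)."""
    fx0, fy0, fx1, fy1 = footprint
    nx = max(1, round((fx1 - fx0) / cell))
    ny = max(1, round((fy1 - fy0) / cell))
    covered = 0
    for i in range(nx):
        for j in range(ny):
            cx = fx0 + (i + 0.5) * cell   # cell centre
            cy = fy0 + (j + 0.5) * cell
            if any(x0 - margin <= cx <= x1 + margin and
                   y0 - margin <= cy <= y1 + margin
                   for x0, y0, x1, y1 in rafters):
                covered += 1
    rho = covered / (nx * ny)
    return rho, 1.0 - rho   # gamma is the complement of rho

# Hypothetical 4 m x 4 m footprint with rafters over the left half only.
rho, gamma = coverage((0, 0, 4, 4), [(0, 0, 2, 4)])
t6_pass = rho >= 0.70
t7_pass = gamma <= 0.20
```

With only half the cells covered, ρ = 0.5 and γ = 0.5, so both tests fail for this example.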
Structural Physics
T8 · Cantilever limits
|P_s| ≥ 2 ∧ Δ_max ≤ c_sp, with c_sp = 3.0 m
Elevated sills (z > 1 m) must have ≥ 2 nearby supports within 1.5 m laterally, with max inter-support gap ≤ 3.0 m.
Topological Connectivity
T9 · Stability index
Σ = Σσ(v) / |V| ≥ 1.0
Topological Stability Index: fraction of grounded members must equal 1.0. Strictly stronger than T1 — requires all members simultaneously grounded.
Geometric Integrity
T10 · Dual-end connection
φ_bot = 1 ∧ φ_top = 1
Each rafter and stud must connect at both ends (bottom 20% and top 20% zones). Missing top = hinge failure; missing bottom = floating column.

Evaluation Protocol

All three protocols start from the same multi-view task input I₀ (5 views + building context) and end with the structural validator V, which runs all 10 tests and returns pass/fail.

PLANNER-ATOMIC (T_S)
The VLM generates the full construction script a₁ as a single script, which goes straight to the structural validator V. On failure, the model retries with the full history H_t. No phase feedback.

PLANNER-REACTIVE (T_Q)
The VLM generates all phases (k = 1…K) in one script. A mid-phase check V_mid (load path + stability) runs after phase k, followed by the structural validator V. A mid-phase failure invalidates phases k+1…K and triggers a full regeneration. No external scaffolding.

PLANNER-MANAGED (T_W)
An external phase manager unlocks one phase k at a time, and the VLM generates the script for phase k only. V_mid checks phase k; on failure the scene s_t persists and only that phase is retried. The completed build then goes to the structural validator V. This is the strongest scaffolding: a failure never invalidates prior phases, and each phase is an independent sub-task.
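The Planner-Managed control flow can be sketched as a loop over phases with per-phase retries. Everything named here is hypothetical: the callbacks `generate`, `apply_script`, `validate_mid`, and `validate_full` stand in for the VLM, the build environment, V_mid, and V, and the retry budget `max_retries` is an assumption, since the page does not state one.

```python
def planner_managed(phases, generate, apply_script, validate_mid,
                    validate_full, max_retries=3):
    """Run phases one at a time; a failed phase is retried without losing earlier work."""
    scene = []                                  # persistent scene state s_t
    for phase in phases:                        # phase manager unlocks one phase at a time
        feedback = None
        for _ in range(max_retries):
            script = generate(phase, scene, feedback)   # script for this phase only
            candidate = apply_script(scene, script)
            ok, feedback = validate_mid(candidate)      # V_mid check for phase k
            if ok:
                scene = candidate               # commit the phase
                break
            # On failure the committed scene is untouched; only this phase is retried.
        else:
            return False, scene                 # retry budget exhausted
    return validate_full(scene), scene          # final validator V (all 10 tests)

# Stub callbacks for illustration: the "walls" phase fails once, then succeeds.
def generate(phase, scene, feedback):
    return "bad" if phase == "walls" and feedback is None else phase

def apply_script(scene, script):
    return scene + [script]

def validate_mid(scene):
    return ("bad" not in scene, "unstable" if "bad" in scene else None)

def validate_full(scene):
    return scene == ["foundation", "walls", "roof"]

passed, final_scene = planner_managed(
    ["foundation", "walls", "roof"],
    generate, apply_script, validate_mid, validate_full,
)
```

Under the same assumptions, Planner-Atomic corresponds to a single `generate` call followed by `validate_full`, and Planner-Reactive to one all-phases script that is regenerated in full whenever a mid-phase check fails.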

Results

                     Planner-Atomic            Planner-Reactive          Planner-Managed
Model                Struct.  Visual  Joint    Struct.  Visual  Joint    Struct.  Visual  Joint
GPT-5                0.792    0.312   0.035    0.302    0.293   0.003    0.333    0.179   0.008
Claude 4.5 Opus      0.716    0.406   0.071    0.428    0.239   0.003    0.713    0.278   0.031
Gemini 3 Flash       0.454    0.376   0.031    0.507    0.345   0.019    0.785    0.313   0.043
Qwen3-VL-30B-A3B     0.000    N/A     0.000    0.000    N/A     0.000    0.000    N/A     0.000
Qwen3-VL-8B          0.000    N/A     0.000    0.000    N/A     0.000    0.000    N/A     0.000

Qwen3.5-397B-A17B is listed with a single set of scores: Struct. 0.206, Visual 0.261, Joint 0.016.

BibTeX

@article{dreamhousegpt2026,
  title   = {DreamHouse: How Far Are Vision-Language Models
             from Constructing the Real World?},
  author  = {Yang, Luyu and Dai, Yutong and Yan, An and
             Prabhu, Viraj and Xu, Ran and Chen, Zeyuan},
  journal = {arXiv preprint},
  year    = {2026},
}

Join the Community

Share results, ask questions, stay updated.

Discord: discord.gg/f3yebJFx
