DreamHouse: How Far Are VLMs from Constructing the Real World?

Abstract

We introduce DreamHouse, a benchmark for physical generative reasoning — the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. Grounded in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness, DreamHouse comprises over 26,000 structures spanning 13 architectural styles verified to LOD 350, paired with a deterministic 10-test structural validation framework. Unlike static benchmarks, it supports iterative agentic interaction: models observe intermediate build states, generate construction actions, and receive structured environmental feedback. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps largely invisible on existing leaderboards, establishing physical validity as a critical evaluation axis orthogonal to visual realism.


Structural Validation Suite

Topological Connectivity

T1 · Load path

Every member's load path must terminate at a grounded element (z < 0.1 m). Adjacency between members is determined from their axis-aligned bounding boxes (AABBs) with a tolerance of ε = 0.05 m.
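The load-path check can be read as a graph search over touching members. The following is a minimal sketch, not the benchmark's implementation: it assumes each member is given as an AABB tuple in metres, and the names `boxes_adjacent` and `grounded_mask` are hypothetical.

```python
from collections import deque
from typing import List, Tuple

# Assumed representation: (xmin, ymin, zmin, xmax, ymax, zmax) in metres.
Box = Tuple[float, float, float, float, float, float]

def boxes_adjacent(a: Box, b: Box, eps: float = 0.05) -> bool:
    """True if the boxes overlap, or lie within eps of each other, on every axis."""
    return all(a[i] - eps <= b[i + 3] and b[i] - eps <= a[i + 3] for i in range(3))

def grounded_mask(boxes: List[Box], eps: float = 0.05, ground_z: float = 0.1) -> List[bool]:
    """Breadth-first search from members whose base lies below ground_z.

    A member is marked True if a chain of adjacent members links it to the ground.
    """
    reached = [box[2] < ground_z for box in boxes]
    queue = deque(i for i, ok in enumerate(reached) if ok)
    while queue:
        i = queue.popleft()
        for j, other in enumerate(boxes):
            if not reached[j] and boxes_adjacent(boxes[i], other, eps):
                reached[j] = True
                queue.append(j)
    return reached

sill = (0.0, 0.0, 0.0, 3.0, 0.09, 0.14)       # rests on the ground
stud = (0.0, 0.0, 0.14, 0.04, 0.09, 2.5)      # stands on the sill
floating = (5.0, 5.0, 3.0, 6.0, 5.1, 3.1)     # touches nothing
mask = grounded_mask([sill, stud, floating])
```

Under this reading, T1 passes when every entry of `mask` is true, and the fraction `sum(mask) / len(mask)` would play the role of the stability index described under T9.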

IRC Compliance

T2 · Span limits

Joist and rafter spans checked against IRC look-up tables. Rafter span halved when purlins are present. Tolerance τ = 0.03.

IRC Compliance

T3 · On-centre spacing

Joist groups checked for standard 16″ (406 mm) or 24″ (610 mm) on-centre spacing, within 50 mm tolerance.
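A minimal sketch of the spacing check follows. It assumes joist centrelines are given as positions in millimetres along the axis perpendicular to the joists, and that every consecutive gap is tested individually; whether the benchmark tests each gap or an average is not stated here, and `spacing_ok` is a hypothetical name.

```python
from typing import Sequence

def spacing_ok(
    centres_mm: Sequence[float],
    standards_mm: Sequence[float] = (406.0, 610.0),
    tol_mm: float = 50.0,
) -> bool:
    """True if every consecutive centre-to-centre gap is within tol_mm of a standard spacing."""
    pts = sorted(centres_mm)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    # With fewer than two joists there are no gaps, so the check passes vacuously.
    return all(any(abs(g - s) <= tol_mm for s in standards_mm) for g in gaps)

regular = spacing_ok([0, 406, 812, 1218])   # 16-inch centres throughout
irregular = spacing_ok([0, 406, 1300])      # second gap is 894 mm
```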

IRC Compliance

T4 · Lumber dimensions

Each framing member's two smallest cross-section dimensions matched to standard nominal lumber sizes (±10/20 mm tolerance).
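The matching step might look like the sketch below. Several things here are assumptions rather than details from the benchmark: the table lists dressed (actual) sizes for a few common nominal sections, the two tolerances are read as 10 mm on the smaller dimension and 20 mm on the larger, and `match_lumber` is a hypothetical name.

```python
from typing import Optional, Sequence

# Dressed sizes in mm for common nominal sections (1.5 in = 38 mm, 3.5 in = 89 mm).
# Assumption: the benchmark's actual size table may differ.
STANDARD_SIZES_MM = {
    "2x4": (38.0, 89.0),
    "2x6": (38.0, 140.0),
    "2x8": (38.0, 184.0),
    "2x10": (38.0, 235.0),
    "2x12": (38.0, 286.0),
    "4x4": (89.0, 89.0),
}

def match_lumber(
    extents_mm: Sequence[float],
    tol_small: float = 10.0,   # assumed tolerance on the smaller dimension
    tol_large: float = 20.0,   # assumed tolerance on the larger dimension
) -> Optional[str]:
    """Return the first standard size matching the two smallest extents, else None."""
    small, large = sorted(extents_mm)[:2]
    for name, (thickness, depth) in STANDARD_SIZES_MM.items():
        if abs(small - thickness) <= tol_small and abs(large - depth) <= tol_large:
            return name
    return None

stud = match_lumber((2400.0, 38.0, 89.0))
joist = match_lumber((3000.0, 140.0, 40.0))
odd = match_lumber((1000.0, 60.0, 60.0))
```

Sorting the three extents and keeping the two smallest makes the check independent of the member's orientation, since the longest extent is taken to be its length.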

Structural Physics

T5 · Deflection L/360

Mid-span deflection of a simply supported member under uniform load, δ = 5wL⁴/(384·E·I), evaluated with w = 1900 N/m and E = 12 GPa. Must satisfy δ ≤ 1.08 · L/360.
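As a worked example of this check, the sketch below evaluates the deflection formula for a rectangular section. It assumes bending about the strong axis, so that I = b·h³/12; the function names are hypothetical.

```python
def midspan_deflection(span_m: float, width_m: float, depth_m: float,
                       w: float = 1900.0, E: float = 12e9) -> float:
    """Mid-span deflection (m) of a simply supported beam under uniform load w (N/m)."""
    second_moment = width_m * depth_m ** 3 / 12.0   # I for a rectangular section, m^4
    return 5.0 * w * span_m ** 4 / (384.0 * E * second_moment)

def passes_l360(span_m: float, width_m: float, depth_m: float,
                slack: float = 1.08) -> bool:
    """True if the deflection is within slack * L / 360."""
    return midspan_deflection(span_m, width_m, depth_m) <= slack * span_m / 360.0

# A 38 mm x 235 mm joist at two different spans.
short_ok = passes_l360(3.5, 0.038, 0.235)
long_ok = passes_l360(4.0, 0.038, 0.235)
```

With these inputs the joist deflects about 12.8 mm over a 4.0 m span, above the 12.0 mm allowance, while at 3.5 m it deflects about 7.5 mm against an allowance of 10.5 mm. Because δ grows with L⁴ and the limit only with L, small increases in span quickly turn a pass into a fail.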

Geometric Integrity

T6 · Roof coverage

Footprint partitioned into 1 m × 1 m cells. Fraction covered by rafter projections (margin 0.3 m) must be ≥ 70%.
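The grid-based coverage measure can be sketched as follows. This is an illustration under stated assumptions: the footprint and the rafter projections are axis-aligned rectangles in plan, and a cell counts as covered when its centre falls inside a projection grown by the margin. The name `roof_coverage` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in metres

def roof_coverage(footprint: Rect, rafters_xy: Sequence[Rect],
                  cell: float = 1.0, margin: float = 0.3) -> float:
    """Fraction of footprint cells whose centre lies within a margin-grown rafter projection."""
    x0, y0, x1, y1 = footprint
    nx = max(1, math.ceil((x1 - x0) / cell))
    ny = max(1, math.ceil((y1 - y0) / cell))
    covered = 0
    for i in range(nx):
        for j in range(ny):
            cx = x0 + (i + 0.5) * cell
            cy = y0 + (j + 0.5) * cell
            if any(rx0 - margin <= cx <= rx1 + margin and ry0 - margin <= cy <= ry1 + margin
                   for rx0, ry0, rx1, ry1 in rafters_xy):
                covered += 1
    return covered / (nx * ny)

footprint = (0.0, 0.0, 4.0, 4.0)
# Thin rafters running along y, one per metre of width.
full = [(x - 0.02, 0.0, x + 0.02, 4.0) for x in (0.5, 1.5, 2.5, 3.5)]
full_cov = roof_coverage(footprint, full)
half_cov = roof_coverage(footprint, full[:2])
```

In this sketch the gap ratio γ used by T7 is simply one minus the returned coverage on the same grid.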

Geometric Integrity

T7 · Gap detection

Gap ratio γ = uncovered / total footprint cells must be ≤ 20%. Complement of T6, retained for diagnostic granularity.

Structural Physics

T8 · Cantilever limits

Elevated sills (z > 1 m) must have ≥ 2 nearby supports within 1.5 m laterally, with max inter-support gap ≤ 3.0 m.
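One possible reading of this rule is sketched below. The geometry is simplified and assumed: the sill runs along the x axis at a fixed y and height z, supports are plan-view points, a support is "nearby" if it is within the lateral distance of the sill line and of its ends, and gaps are measured between consecutive supports along x. The name `sill_supported` is hypothetical.

```python
from typing import Sequence, Tuple

def sill_supported(
    sill: Tuple[float, float, float, float],   # (x_start, x_end, y, z) in metres
    supports: Sequence[Tuple[float, float]],   # plan-view (x, y) of posts or studs
    lateral: float = 1.5,
    max_gap: float = 3.0,
    elevated_z: float = 1.0,
) -> bool:
    """True if an elevated sill has at least two nearby supports with no gap above max_gap."""
    x_start, x_end, y, z = sill
    if z <= elevated_z:
        return True  # the rule only applies to elevated sills
    xs = sorted(
        sx for sx, sy in supports
        if abs(sy - y) <= lateral and x_start - lateral <= sx <= x_end + lateral
    )
    if len(xs) < 2:
        return False
    return all(b - a <= max_gap for a, b in zip(xs, xs[1:]))

beam = (0.0, 6.0, 0.0, 2.5)
three_posts = sill_supported(beam, [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)])
two_posts = sill_supported(beam, [(0.0, 0.0), (6.0, 0.0)])
```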

Topological Connectivity

T9 · Stability index

Topological Stability Index: fraction of grounded members must equal 1.0. Strictly stronger than T1 — requires all members simultaneously grounded.

Geometric Integrity

T10 · Dual-end connection

Each rafter and stud must connect at both ends (bottom 20% and top 20% zones). Missing top = hinge failure; missing bottom = floating column.
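The end-zone idea can be illustrated with bounding boxes. The sketch below assumes members are AABBs in metres, that the zones are the lowest and highest 20% of a member's vertical extent, and that a member resting on the ground counts as connected at the bottom; the adjacency tolerance of 0.05 m is borrowed from T1 as an assumption, and all function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (xmin, ymin, zmin, xmax, ymax, zmax)

def touches(a: Box, b: Box, eps: float = 0.05) -> bool:
    """True if two boxes overlap, or lie within eps of each other, on every axis."""
    return all(a[i] - eps <= b[i + 3] and b[i] - eps <= a[i + 3] for i in range(3))

def zone(box: Box, lo_frac: float, hi_frac: float) -> Box:
    """Slice of the box between two fractions of its height."""
    x0, y0, z0, x1, y1, z1 = box
    height = z1 - z0
    return (x0, y0, z0 + lo_frac * height, x1, y1, z0 + hi_frac * height)

def end_connections(idx: int, boxes: List[Box], eps: float = 0.05,
                    ground_z: float = 0.1) -> Tuple[bool, bool]:
    """Return (bottom_connected, top_connected) for the member at position idx."""
    member = boxes[idx]
    others = [b for k, b in enumerate(boxes) if k != idx]
    bottom, top = zone(member, 0.0, 0.2), zone(member, 0.8, 1.0)
    bottom_ok = member[2] < ground_z or any(touches(bottom, o, eps) for o in others)
    top_ok = any(touches(top, o, eps) for o in others)
    return bottom_ok, top_ok

bottom_plate = (0.0, 0.0, 0.0, 3.0, 0.09, 0.04)
stud = (0.0, 0.0, 0.04, 0.04, 0.09, 2.44)
top_plate = (0.0, 0.0, 2.44, 3.0, 0.09, 2.48)

framed = end_connections(1, [bottom_plate, stud, top_plate])
no_top = end_connections(1, [bottom_plate, stud])            # hinge failure
raised = (0.0, 0.0, 1.0, 0.04, 0.09, 3.4)
hanging = end_connections(0, [raised, (0.0, 0.0, 3.4, 3.0, 0.09, 3.44)])  # floating column
```

Slicing by height is a simplification for sloped rafters, whose ends are better located along the member's own axis.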

Evaluation Protocol

Results

Model	Atomic Struct.	Atomic Visual	Atomic Joint	Reactive Struct.	Reactive Visual	Reactive Joint	Managed Struct.	Managed Visual	Managed Joint
GPT-5	0.792	0.312	0.035	0.302	0.293	0.003	0.333	0.179	0.008
Claude 4.5 Opus	0.716	0.406	0.071	0.428	0.239	0.003	0.713	0.278	0.031
Gemini 3 Flash	0.454	0.376	0.031	0.507	0.345	0.019	0.785	0.313	0.043
Qwen3.5-397B	0.231	0.395	0.029	—	—	—	0.206	0.261	0.016
Qwen3.5-397B (no thinking)	0.084	0.382	0.007	—	—	—	0.000	—	—
LLaVA-OV-72B	0.000	—	—	—	—	—	0.000	—	—
InternVL3	0.000	—	—	—	—	—	0.000	—	—
Qwen3-VL-30B-A3B	0.000	N/A	0.000	0.000	N/A	0.000	0.000	N/A	0.000
Qwen3-VL-8B	0.000	N/A	0.000	0.000	N/A	0.000	0.000	N/A	0.000
Kimi K2.5	—	—	—	—	—	—	0.000	—	—

BibTeX

@inproceedings{yang2026dreamhouse,
  title     = {DreamHouse: How Far Are Vision-Language Models
               from Constructing the Real World?},
  author    = {Yang, Luyu and Dai, Yutong and Yan, An and
               Prabhu, Viraj and Xu, Ran and Chen, Zeyuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026},
}

Join the Community

Share results, ask questions, stay updated.

Discord

discord.gg/ZhExR6qnX
