DreamHouse Benchmark
Evaluate vision-language agents on timber-frame construction tasks using the DreamHouse CLI. This tutorial follows the same pattern as common agent benchmarks: install, smoke test, run an agent, and inspect results.
Install the DreamHouse CLI from PyPI. This installs the dreamhouse command, but does not download the task pack, reference images, validator artifact, or Blender.
pip install dreamhouse
dreamhouse --help
DreamHouse runs the validator locally in Blender. The tested default is Blender 4.5.4 LTS, whose bundled Python is 3.11.x. Other Blender versions may work if their bundled Python matches the provided validator artifact.
For local development from a cloned repo, use:
git clone https://github.com/SalesforceAIResearch/DreamHouse.git
cd DreamHouse
pip install -e .
Download and verify the benchmark artifacts. The setup command reassembles the split task pack, checks its SHA-256 checksum, and installs the validator artifact locally.
dreamhouse setup --download-artifacts
dreamhouse doctor
If Blender is not installed, you can ask DreamHouse to install the recommended build:
dreamhouse setup --download-artifacts --install-blender
The smoke test uses a built-in stub agent. It is not a model evaluation; it only verifies that the task pack, image serving, Blender execution, geometry export, local server, and validator all work end-to-end.
dreamhouse smoke-test \
--task BN_01_0003 \
--output-dir ./runs/smoke_BN_01_0003
With the stub agent, you should expect validation to run successfully but fail several structural tests, because the stub only creates four sill plates.
Real evaluation requires a user-supplied agent. DreamHouse loads agents as Python callables using the standard module:function pattern.
dreamhouse run \
--task BN_01_0003 \
--agent my_agent:generate \
--output-dir ./runs/BN_01_0003
Your agent function receives the task prompt, local paths to the five reference images, and feedback history from previous attempts. It must return executable Blender Python.
def generate(prompt: str, images: list[str], feedback: list[dict]) -> str:
# Call your VLM or agent here.
# Return Blender Python code that creates objects in COLLECTION_NAME.
return blender_python_code
Hosted VLM APIs usually require a provider key. The repository includes an OpenAI-hosted example agent.
export OPENAI_API_KEY=<your-key>
export OPENAI_MODEL=gpt-4.1
dreamhouse run \
--task BN_01_0003 \
--agent examples.openai_agent:generate \
--output-dir ./runs/openai_BN_01_0003
Self-hosted models can be exposed through an OpenAI-compatible /v1/chat/completions endpoint, such as vLLM, LiteLLM, Ollama-compatible servers, or an internal model gateway.
export OPENAI_BASE_URL=http://127.0.0.1:8001/v1
export OPENAI_API_KEY=dummy
export OPENAI_MODEL=my-vision-model
dreamhouse run \
--task BN_01_0003 \
--agent examples.openai_compatible_agent:generate \
--output-dir ./runs/local_BN_01_0003
To test this interface without a real model, start the mock endpoint:
python examples/mock_openai_server.py --port 8001
Each run writes a reproducible artifact directory containing the task, downloaded task views, generated code, exported geometry, validation results, and retry history.
| File or folder | Meaning |
|---|---|
task.json | Task description and constraints served to the agent. |
images/ | The five reference views for the selected task. |
attempts/attempt_N/code.py | Blender Python returned by the agent. |
attempts/attempt_N/submission.json | Exported geometry sent to the validator. |
attempts/attempt_N/result.json | Full validation response for that attempt. |
results.json | Latest validation result shortcut. |
summary.json | Task id, session id, final status, and feedback history. |