DreamHouse Benchmark

How to Run DreamHouse

Evaluate vision-language agents on timber-frame construction tasks using the DreamHouse CLI. This tutorial follows the same pattern as common agent benchmarks: install, smoke test, run an agent, and inspect results.

Tested with Blender 4.5.4 LTS Validator matched to Blender Python 1,200 public eval tasks Local validation server
1 · Prerequisites

Install DreamHouse

Install the DreamHouse CLI from PyPI. This installs the dreamhouse command, but does not download the task pack, reference images, validator artifact, or Blender.

pip install dreamhouse
dreamhouse --help

DreamHouse runs the validator locally in Blender. The tested default is Blender 4.5.4 LTS, whose bundled Python is 3.11.x. Other Blender versions may work if their bundled Python matches the provided validator artifact.

For local development from a cloned repo, use:

git clone https://github.com/SalesforceAIResearch/DreamHouse.git
cd DreamHouse
pip install -e .
2 · Setup Artifacts

Set Up Artifacts

Download and verify the benchmark artifacts. The setup command reassembles the split task pack, checks its SHA-256 checksum, and installs the validator artifact locally.

dreamhouse setup --download-artifacts
Best-practice path: use the tested Blender 4.5.4 LTS runtime. Other Blender versions may work, but the validator artifact must match Blender's bundled Python minor version. Run dreamhouse doctor to check your local environment.
dreamhouse doctor

If Blender is not installed, you can ask DreamHouse to install the recommended build:

dreamhouse setup --download-artifacts --install-blender
3 · Smoke Test

Run a Smoke Test

The smoke test uses a built-in stub agent. It is not a model evaluation; it only verifies that the task pack, image serving, Blender execution, geometry export, local server, and validator all work end-to-end.

dreamhouse smoke-test \
  --task BN_01_0003 \
  --output-dir ./runs/smoke_BN_01_0003

With the stub agent, you should expect validation to run successfully but fail several structural tests, because the stub only creates four sill plates.

4 · Run with an Agent

Run with Your Agent

Real evaluation requires a user-supplied agent. DreamHouse loads agents as Python callables using the standard module:function pattern.

dreamhouse run \
  --task BN_01_0003 \
  --agent my_agent:generate \
  --output-dir ./runs/BN_01_0003

Your agent function receives the task prompt, local paths to the five reference images, and feedback history from previous attempts. It must return executable Blender Python.

def generate(prompt: str, images: list[str], feedback: list[dict]) -> str:
    # Call your VLM or agent here.
    # Return Blender Python code that creates objects in COLLECTION_NAME.
    return blender_python_code
See examples/agent_template.py in the repository for a copyable starter file.
5 · Hosted API Model

Run with a Hosted Lab Model

Hosted VLM APIs usually require a provider key. The repository includes an OpenAI-hosted example agent.

export OPENAI_API_KEY=<your-key>
export OPENAI_MODEL=gpt-4.1

dreamhouse run \
  --task BN_01_0003 \
  --agent examples.openai_agent:generate \
  --output-dir ./runs/openai_BN_01_0003
6 · Self-hosted Model

Run with an OpenAI-compatible Endpoint

Self-hosted models can be exposed through an OpenAI-compatible /v1/chat/completions endpoint, such as vLLM, LiteLLM, Ollama-compatible servers, or an internal model gateway.

export OPENAI_BASE_URL=http://127.0.0.1:8001/v1
export OPENAI_API_KEY=dummy
export OPENAI_MODEL=my-vision-model

dreamhouse run \
  --task BN_01_0003 \
  --agent examples.openai_compatible_agent:generate \
  --output-dir ./runs/local_BN_01_0003

To test this interface without a real model, start the mock endpoint:

python examples/mock_openai_server.py --port 8001
7 · Outputs

Understand the Output

Each run writes a reproducible artifact directory containing the task, downloaded task views, generated code, exported geometry, validation results, and retry history.

File or folderMeaning
task.jsonTask description and constraints served to the agent.
images/The five reference views for the selected task.
attempts/attempt_N/code.pyBlender Python returned by the agent.
attempts/attempt_N/submission.jsonExported geometry sent to the validator.
attempts/attempt_N/result.jsonFull validation response for that attempt.
results.jsonLatest validation result shortcut.
summary.jsonTask id, session id, final status, and feedback history.