Text-Guided Generation of Functional 3D Environments in Minecraft (2024)

Sam Earle (New York University, Brooklyn, USA), Filippos Kokkinos (Meta, London, UK), Yuhe Nie (New York University, Brooklyn, USA), Julian Togelius (New York University, Brooklyn, USA), and Roberta Raileanu (Meta, London, UK)


Abstract.

Procedural Content Generation (PCG) algorithms enable the automatic generation of complex and diverse artifacts. However, they do not provide high-level control over the generated content and typically require domain expertise. In contrast, text-to-3D methods allow users to specify desired characteristics in natural language, offering a high degree of flexibility and expressivity. But unlike PCG, such approaches cannot guarantee functionality, which is crucial for certain applications like game design. In this paper, we present a method for generating functional 3D artifacts from free-form text prompts in the open-world game Minecraft. Our method, DreamCraft, trains quantized Neural Radiance Fields (NeRFs) to represent artifacts that, when viewed in-game, match given text descriptions. We find that DreamCraft produces more aligned in-game artifacts than a baseline that post-processes the output of an unconstrained NeRF. Thanks to the quantized representation of the environment, functional constraints can be integrated using specialized loss terms. We show how this can be leveraged to generate 3D structures that match a target distribution or obey certain adjacency rules over the block types. DreamCraft inherits a high degree of expressivity and controllability from the NeRF, while still being able to incorporate functional constraints through domain-specific objectives.

Procedural Content Generation, Neural Radiance Fields, Minecraft

journalyear: 2024; copyright: acmlicensed; conference: Proceedings of the 19th International Conference on the Foundations of Digital Games (FDG 2024), May 21–24, 2024, Worcester, MA, USA; doi: 10.1145/3649921.3649943; isbn: 979-8-4007-0955-5/24/05; ccs: Applied computing → Media arts; Computing methodologies → Neural networks

1. Introduction

Procedural Content Generation (PCG) refers to a class of algorithms that can automatically create content such as video game levels (Shaker et al., 2016a; Hendrikx et al., 2013; Risi and Togelius, 2020; Khalifa et al., 2020; Dahlskog and Togelius, 2014; Liu et al., 2021), 2D or 3D visual assets (Watkins, 2016; Liu et al., 2021; Brocchini et al., 2022; Nair, 2020; Liapis et al., 2013; Sudhakaran et al., 2021), game rules or mechanics (Summerville et al., 2018; Nelson et al., 2016; Togelius et al., 2011; Gravina et al., 2019; Zook and Riedl, 2014), or reinforcement learning (RL) environments (Justesen et al., 2018; Juliani et al., 2019; Risi and Togelius, 2020; Küttler et al., 2020; Dennis et al., 2020; Samvelyan et al., 2021; Team et al., 2021; Jiang et al., 2021; Gisslén et al., 2021; Bontrager and Togelius, 2021; Team et al., 2023; Jiang et al., 2022; Parker-Holder et al., 2022). PCG allows for compression of information (Summerville and Mateas, 2015; Togelius et al., 2011), increased replayability via endless variation (Yannakakis and Togelius, 2011; Smith and Mateas, 2011; Brewer, 2017), expression of particular aesthetics (Liapis et al., 2012; Alvarez et al., 2018; Canossa and Smith, 2015; Guzdial et al., 2017), and reduction of the human labour otherwise required to manually produce artifacts (Shaker et al., 2010; Gao et al., 2022; Dieterich, 2017; Shaker et al., 2016b). These methods are procedural in the sense that they outline sets of procedures or rules for generating artifacts, such as adjacency constraints in Wave Function Collapse, local update rules in cellular automata, or heuristic search in constraint satisfaction. These procedures often leverage domain-specific knowledge in order to guarantee that generated artifacts are functional: for example, that a game environment does not contain structures that violate physics, or that a player is able to navigate between key points within it. However, users cannot generally control such methods via free-form language, and control is limited to those metrics explicitly defined by designers.

In contrast, recent generative models have shown impressive abilities in generating diverse images, videos, or 3D scenes from text prompts describing the desired output in natural language (Rombach et al., 2022; Poole et al., 2022; Singer et al., 2022). These advances allow users to create high-quality content even if they are not domain experts. While these models can produce controllable and open-ended generations, the created content is not guaranteed to be functional. Functionality is particularly important for certain applications such as game design or the creation of RL environments. Some recent efforts leverage language models to generate level representations (Todd et al., 2023; Sudhakaran et al., 2023), but effectively reduce text-based controls to a series of scalar values. Other methods train generative text-image models to produce levels that are made of discrete tiles (Merino et al., 2023b) or controllable by actions (Bruce et al., 2024), but they do not bring any functional guarantees: houses may spawn in disconnected pieces, and birds may turn into bumblebees after a few frames.

In this work, we pursue a hybrid approach, adapting a generative model to operate on discrete 3D assets and incorporating functional constraints into its loss function. We propose a new method for generating functional 3D environments from free-form text prompts in the open-world game Minecraft. Our method, DreamCraft, trains a quantized NeRF to produce an environment layout that, when viewed in-game, matches a given text description (see Figures 1 and 2 for some examples). Experimenting with various quantization schemes, we find that using soft air blocks or annealing them from continuous (soft) to discrete is crucial for learning stability, while using discrete block types leads to the most recognizable structures. We evaluate the fidelity of our approach in matching generated artifacts to descriptions of both generic and domain-specific scenes and objects. We find that DreamCraft produces in-game artifacts that align with inputs more consistently than a baseline that post-processes the output of an unconstrained NeRF.

Thanks to its quantized representation of the game world, DreamCraft can jointly optimize loss terms that enforce local functional constraints on patterns of blocks. We show how this can be instantiated to, for example, generate 3D structures that match a target distribution or obey certain adjacency rules over the block types.By inheriting a high degree of expressivity and controllability from the NeRF, while still being able to incorporate functional constraints through domain-specific objectives, DreamCraft combines the strengths of both PCG and generative AI approaches, representing a first step towards democratizing flexible yet functional content creation. Our method has potential applications in the development of AI assistants for game design, as well as in the production of diverse and controllable environments for training and evaluating RL agents.

To summarize, our paper makes the following contributions:

(1) introduces DreamCraft, a new method for training a quantized NeRF to produce 3D structures that match a given textual description using a set of discrete Minecraft blocks,

(2) studies different quantization schemes, such as whether to use discrete or continuous block densities and types,

(3) shows that the quantized NeRF produces more accurate Minecraft artifacts than an unconstrained NeRF, and

(4) demonstrates how to incorporate functional constraints such as obeying certain target block distributions or adjacency rules.

2. Related Work

Procedural Content Generation (PCG) is becoming increasingly popular for training and evaluating robust RL agents that can generalize across a wide range of settings (Justesen et al., 2018; Juliani et al., 2019; Küttler et al., 2020; Dennis et al., 2020; Samvelyan et al., 2021; Jiang et al., 2021; Team et al., 2021, 2023). Generative models like ours provide a way of biasing environment generation towards human-relevant instances, thus enabling more efficient search through vast environment spaces. Existing works indicate that environment generation can be controlled via computable metrics (Earle et al., 2022, 2021; Khalifa et al., 2020; Jiang et al., 2022; Green et al., 2020; Sarkar and Cooper, 2020; Sarkar et al., 2020; Mott et al., 2019; Shaker et al., 2010). Novel environments can also be generated by learning on human datasets (Siper et al., 2022; Guzdial et al., 2022b, c; Liu et al., 2021; López et al., 2020; Summerville et al., 2018), sometimes with additional functional constraints or post-processing (Guzdial et al., 2022a; Zhang et al., 2020; Lee et al., 2020; Torrado et al., 2020; Karth and Smith, 2019). More recently, Todd et al. (2023) and Sudhakaran et al. (2023) use large language models to generate 2D Sokoban and Mario levels. However, our work is the first to show how multi-modal models can be leveraged to guide the generation of 3D game environments. One of the most popular PCG algorithms is Wave Function Collapse (Gumin, 2016), which generates structurally consistent content from a single sample, such that its output matches tile-frequency and adjacency constraints. In this paper, we show how such constraints can be used in conjunction with text guidance to generate environments where both high-level (e.g. via natural language) and low-level (e.g. via block target distributions or adjacency rules) aspects can be controlled.

Text-to-3D Generation. Our work builds upon the many recent advances in text-to-3D generation (Poole et al., 2022; Mildenhall et al., 2020; Chen et al., 2022). For example, DVGO (Sun et al., 2022) is a supervised NeRF method that, instead of training a multi-layer perceptron (MLP), directly optimizes a voxel grid over the 3D space. Similarly, PureCLIPNeRF (Lee and Chang, 2022) uses a CLIP loss to guide a NeRF using both direct and implicit voxel grids. Our approach, DreamCraft, resembles an implicit voxel grid approach in that it uses MLPs to parameterize activations over discrete grids. But instead of outputting continuous RGB and density values and interpolating between nearest grid vertices (to determine activation at a given point during ray tracing), our approach uses MLPs to produce predictions over block types, and considers only the single nearest grid vertex during ray sampling (to determine within which block a given point resides).

Text-to-Environment Generation. Merino et al. (2023b) train a text-conditioned decoder model on a dataset of hand-made levels in a 2D tile domain. Compared to NeRFs, which train a model to represent a single artifact (here, a level), their "five-dollar" decoder model can represent a distribution of artifacts. It can also potentially achieve a certain amount of generalization thanks to the pre-trained LLM used to encode text prompts.

Minecraft Environment Generation. Several prior works have sought to generate Minecraft environments using both supervised and self-supervised methods. Awiszus et al. (2021) use a 3D GAN (Goodfellow et al., 2020) architecture to generate arbitrarily sized world snippets from a single example. Meanwhile, Hao et al. (2021) envision Minecraft as a potential "sketchbook" for designing more photorealistic 3D landscapes, using an unsupervised neural rendering approach to generate the latter from preexisting Minecraft landscapes. To assist Minecraft players, Merino et al. (2023a) introduce a tool for interactive evolution using both a 3D generative model to generate the structure design and an encoding model for applying Minecraft-specific textures to the structure's voxels. Sudhakaran et al. (2021) have shown that neural cellular automata can be used to grow complex 3D Minecraft artifacts made out of thousands of blocks, such as castles, apartment blocks, and trees. Other works have leveraged search-based methods (Yates, 2021), evolutionary algorithms (Medina et al., 2023; Skjeltorp, 2022), or reinforcement learning (Jiang et al., 2022) to generate Minecraft structures. The game has also been a testbed for PCG algorithms (Salge et al., 2018, 2022), open-endedness (Grbic et al., 2021), artificial life (Sudhakaran et al., 2021), RL agents (Johnson et al., 2016; Milani et al., 2020; Guss et al., 2021; Kanervisto et al., 2022a), and foundation models for decision-making (Kanervisto et al., 2022b; Fan et al., 2022; Baker et al., 2022; Wang et al., 2022). However, our work is the first to generate functional Minecraft environments directly from text prompts, enhancing the high-level controllability of the creation process.

3. DreamCraft: Text-Guided Minecraft Environment Generation


3.1. Quantized NeRFs for Environment Generation

In this section, we introduce DreamCraft, a quantized NeRF which learns to arrange in-game assets during training (see Figure 3 for an overview of our approach). Our text-guided NeRF implementation uses score distillation sampling from a pre-trained image generation model to provide a loss function for the optimization of the NeRF. We use a preliminary version of Emu (Dai et al., 2023), trained only on Shutterstock data. Note that our approach is agnostic to the text-to-3D model used, so it can be applied to any other text-to-3D model architecture and thus benefit from future advances in that field. Also note that we do not use any additional or domain-specific data to train our model apart from the block textures in Table 6.


From Continuous Points to Discrete Blocks. As in PureCLIPNeRF, we use MLPs to predict continuous activations over a grid by feeding them $X, Y, Z$ coordinates which are encoded using a series of sine waves, as in the original NeRF (Mildenhall et al., 2020). We sample $X, Y, Z$ vertices along a cubic 3D grid of width $N$, generating predictions over solid block types at each cell in the grid. We refer to the resulting tensor as the soft solid block grid $\mathbf{B}_{\mathrm{soft}} \in [0,1]^{M \times N \times N \times N}$, of width $N$, comprising $M$ different non-air block types. We take a Gumbel softmax (Jang et al., 2017) over these predictions to yield a discrete grid of solid blocks $\mathbf{B}_{\mathrm{hard}}$ with values in $\{0,1\}$. We feed the same $X, Y, Z$ coordinates to a separate MLP to generate a soft air block grid $\mathbf{A}_{\mathrm{soft}} \in [0,1]^{N \times N \times N}$, and again apply the Gumbel softmax to generate its discrete counterpart $\mathbf{A}_{\mathrm{hard}} \in \{0,1\}^{N \times N \times N}$. The distinction between air and solid blocks is analogous to that between the air and albedo MLPs in standard NeRFs.

During training, we may interpolate between hard and soft variants to obtain the final air and block grids:

$$\mathbf{A} = \alpha \cdot \mathbf{A}_{\mathrm{hard}} + (1-\alpha) \cdot \mathbf{A}_{\mathrm{soft}}, \qquad \mathbf{B} = \beta \cdot \mathbf{B}_{\mathrm{hard}} + (1-\beta) \cdot \mathbf{B}_{\mathrm{soft}},$$

where $\alpha$ and $\beta$ in $[0,1]$ control the hardness of air and solid blocks, respectively. In our experiments, we set these to $1$ when air/solids are hard (discrete) and to $0$ when they are soft (continuous), and linearly scale from $0$ to $1$ to "anneal" the blocks from continuous to discrete. We then mask solid block activations with air activations to yield the final block grid $\mathbf{C} = \mathbf{A} \odot \mathbf{B}$, with $\mathbf{C} \in [0,1]^{M \times N \times N \times N}$, where $\odot$ denotes the element-wise product, broadcasting over the block type dimension in $\mathbf{B}$. Thus, $\mathbf{C}$ can be seen as a 3D image with as many channels as there are solid block types. When a cell in $\mathbf{C}$ has a value of $1$ in a given block channel, and $0$s elsewhere, it contains only this block type. When it has values in $(0,1)$, a block is "partially" present. And when a cell in $\mathbf{C}$ is all zeros, it represents an air block. When exporting structures to the game engine or computing loss from functional constraints, we set $\alpha = \beta = 1$, producing an entirely discrete block grid $\mathbf{C}_{\mathrm{hard}}$.
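To make this concrete, the following is a minimal PyTorch sketch of the quantized grid construction (our own illustration, not the paper's code; tensor shapes and the two-way air logits are simplifying assumptions, and the exponential air-density parameterization actually used is detailed below):

```python
import torch
import torch.nn.functional as F

def quantized_block_grid(b_logits, a_logits, alpha, beta):
    """Build the final block grid C from raw MLP outputs.

    b_logits: (M, N, N, N) logits over M solid block types.
    a_logits: (2, N, N, N) logits over [air, solid] at each cell.
    alpha, beta: hardness of air and solid blocks, each in [0, 1].
    """
    # Soft predictions and their discretized (straight-through) counterparts.
    B_soft = torch.softmax(b_logits, dim=0)
    B_hard = F.gumbel_softmax(b_logits, hard=True, dim=0)
    A_soft = torch.softmax(a_logits, dim=0)[1]            # P(solid) per cell
    A_hard = F.gumbel_softmax(a_logits, hard=True, dim=0)[1]

    # Interpolate soft and hard grids; annealing linearly scales
    # alpha and beta from 0 to 1 over the course of training.
    A = alpha * A_hard + (1 - alpha) * A_soft             # (N, N, N)
    B = beta * B_hard + (1 - beta) * B_soft               # (M, N, N, N)

    # Mask solid block activations with air: an all-zero cell is an air block.
    return A.unsqueeze(0) * B                             # C: (M, N, N, N)
```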

From 2D Textures to 3D Structures

The block grid $\mathbf{C}$ is combined with statically-generated voxel grids to differentiably generate artifacts that visually resemble in-game structures. First, we pre-fabricate $16 \times 16 \times 16$ voxel grids for each block type using in-game textures, applying the game's $16 \times 16$ RGB textures to the appropriate block faces (picking an arbitrary priority order among faces where voxels overlap along the edges of the cube). These grids are frozen and do not pass gradients during learning. During the forward pass, we project our $M \times N \times N \times N$ block grid into a $16N \times 16N \times 16N$ voxelated block grid, where blocks appear in the same arrangement as in the low-resolution block grid, but in their voxelated form. We then apply neural ray tracing to this structure to generate 2D images.

When sampling points along a ray during rendering, each point takes the color and density values of the voxel in which it falls, avoiding interpolation between cells to mimic the sharp, pixelated appearance of textures in-game. To avoid cases where the inside (but not the surface) of a block's voxel grid is sampled during ray tracing, we repeatedly stack face textures within the block, in a pyramidal pattern, to approximate solid objects (Figure 4).
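As a sketch of the projection from block grid to voxelated scene described above (our illustration; the RGBA channel layout and the idea of materializing the full voxel grid, rather than indexing into it lazily during ray sampling, are assumptions made for clarity):

```python
import torch

def voxelate(C, block_voxels):
    """Project the block grid into a high-resolution voxelated scene.

    C:            (M, N, N, N) block grid (soft or hard).
    block_voxels: (M, 4, 16, 16, 16) frozen per-block texture voxels
                  (RGB + density), built from the game's 16x16 textures.
    Returns:      (4, 16N, 16N, 16N) voxelated block grid.
    """
    N = C.shape[1]
    # Mix each cell's (frozen) per-block texture voxels by that cell's
    # block activations; gradients flow into C but not block_voxels.
    out = torch.einsum('mxyz,mcijk->cxiyjzk', C, block_voxels)
    return out.reshape(4, 16 * N, 16 * N, 16 * N)
```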

Given a vertex $\mathbf{x} \in \mathbb{R}^3$ on the solid block grid, we compute a discrete block type, first using an MLP to generate a prediction $\mathbf{b}_{\mathrm{soft}} \in \mathbb{R}^M$ over $M$ block types, then obtaining a onehot vector $\mathbf{b}_{\mathrm{hard}} \in \{e_1, e_2, \dots, e_M\}$ using the Gumbel softmax function:

$$\mathbf{b}_{\mathrm{soft}} = \mathrm{MLP}(\mathbf{x}; \theta_B), \qquad \mathbf{b}_{\mathrm{hard}} = \mathrm{gumbel\_softmax}(\mathbf{b}_{\mathrm{soft}}),$$

where $\theta_B$ denotes the parameters of the solid block type MLP. These vectors serve as the elements of $\mathbf{B}_{\mathrm{soft}}$ and $\mathbf{B}_{\mathrm{hard}}$, respectively.

Again given the block grid vertex $\mathbf{x}$, we compute soft and hard air values, where high values correspond to solid blocks and low values approaching $0$ correspond to air.

The soft air grid is computed in the same way as density in previous works, passing the output $y$ of an MLP (parameterized by $\theta_A$) through an exponential activation function (during the backward pass, the exponential is clipped to avoid exploding gradients: $\sigma_{\mathrm{soft}} = \exp(\mathrm{clamp}(y, 15))$):

$$\sigma_{\mathrm{soft}} = \exp\left(\mathrm{MLP}(\mathbf{x}; \theta_A)\right),$$

where $\theta_A$ denotes the parameters of the air block MLP. To quantize the air grid (effectively computing the presence of air blocks), we first use the soft air values to derive the respective probabilities of an air/solid block appearing at $\mathbf{x}$:

$$\sigma'_{\mathrm{soft}} = \sigma_{\mathrm{soft}} - 10, \qquad p_{\mathrm{air}} = -\sigma'_{\mathrm{soft}}, \qquad p_{\mathrm{solid}} = \sigma'_{\mathrm{soft}}.$$

We then use the Gumbel softmax function to discretize these predictions, obtaining a density value of $0$ in case of air, and $1$ in case of a solid block:

$$\sigma_{\mathrm{hard}} = \sum^{2} \mathrm{gumbel\_softmax}\left(\left[p_{\mathrm{air}}, p_{\mathrm{solid}}\right]\right).$$

These values serve as the elements of $\mathbf{A}_{\mathrm{soft}}$ and $\mathbf{A}_{\mathrm{hard}}$, respectively.
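A sketch of the air-density computation above (our illustration; for simplicity the clamp is applied in the forward pass, whereas the paper clips the exponential only during the backward pass):

```python
import torch
import torch.nn.functional as F

def air_density(x, air_mlp):
    """Compute soft and hard (binary) density for encoded grid vertices x.

    x: (P, 3) encoded coordinates; air_mlp maps each to a scalar.
    """
    y = air_mlp(x).squeeze(-1)
    # Exponential activation, as for density in standard NeRFs.
    sigma_soft = torch.exp(torch.clamp(y, max=15))
    # Center the density so positive values favor solid blocks.
    s = sigma_soft - 10
    logits = torch.stack([-s, s], dim=-1)          # [p_air, p_solid]
    # Gumbel softmax yields a onehot pair; keeping the "solid" entry
    # gives density 0 for air and 1 for solid blocks.
    sigma_hard = F.gumbel_softmax(logits, hard=True, dim=-1)[..., 1]
    return sigma_soft, sigma_hard
```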

We use only opaque solid Minecraft blocks (excluding transparent blocks like glass or ice, and porous ones like grass or flowers).We use a separate, shallow background MLP, which takes as input a viewing angle, to model color at the end of each ray, allowing the model to learn a low-resolution background texture (effectively projected onto the inside of a sphere).

3.2. Functional Constraints

Distributional Constraints. The discrete block grid resulting from the quantized NeRF allows us to optimize DreamCraft to produce a level satisfying a target distribution of block types, producing text-guided objects composed of particular block mixtures. The user sets a target proportion for each block type, and we apply a loss equal to the difference between this target and the actual proportion of (non-air) grid cells in the NeRF's output (after quantization), which we compute by taking the sum over the relevant channel of the discrete onehot block grid and dividing by the size of the grid. Formally, we define the distributional loss as

$$L_D = \sum_{t \in T} \left| G(t) - P(t) \right|,$$

where $t$ is a block type among the set of blocks $T$, $G(t)$ is the target number of occurrences of this block as specified by the user, and $P(t)$ is the number of actual occurrences in the quantized block grid $\mathbf{C}_{\mathrm{hard}}$.
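A sketch of this loss over the discrete grid (our illustration; following the equation, targets are expressed as counts, though they could equivalently be normalized to proportions):

```python
import torch

def distributional_loss(C_hard, targets):
    """L_D: sum over block types of |target count - actual count|.

    C_hard:  (M, N, N, N) discrete onehot block grid.
    targets: (M,) target occurrence count per (non-air) block type.
    """
    counts = C_hard.sum(dim=(1, 2, 3))  # occurrences of each block type
    return (targets - counts).abs().sum()
```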


Adjacency Constraints. We also introduce a loss term corresponding to a penalty or reward incurred whenever a particular configuration of blocks appears in the generated structure. To this end, we construct a convolutional layer that outputs $1$ over any matching patch of blocks, and $0$ everywhere else, then sum the result, yielding the number of occurrences of the relevant pattern in $\mathbf{C}_{\mathrm{hard}}$. We multiply this sum by a user-specified loss coefficient (negative when the pattern is desired, positive when prohibited). Suppose the user wants to apply a loss/reward of $w_p$ to the pattern of blocks $b_0, b_1, \dots, b_{j_p}$ occupying a patch of size $K^3$ (with the number of blocks of interest $j_p \leq K^3$), where each block $b_i$ has relative coordinates $x_i, y_i, z_i$ in the patch. We construct a 3D convolutional weight matrix $W_p$ consisting of the onehot vectors $e_i$ corresponding to each block type, placed at positions $x_i, y_i, z_i$ in the weight matrix. We apply the resulting convolutional layer $\mathrm{conv}_{W_p}$ to the quantized block grid, subtract $j_p - 1$ from the output, and apply a ReLU activation to obtain the binary pattern-activation tensor. Formally,

$$L_P = \sum_{p \in P} w_p \sum_{i}^{K^3} \left( \mathrm{ReLU}\left( \mathrm{conv}_{W_p}\left( \mathbf{C}_{\mathrm{hard}} \right) - j_p + 1 \right) \right)_i,$$

where the inner sum is taken over elements $i$ of the binary activation tensor for pattern $p$.
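A sketch of the adjacency loss (our illustration; the encoding of a pattern as a dict from relative coordinates to block type indices is an assumption):

```python
import torch
import torch.nn.functional as F

def adjacency_loss(C_hard, patterns):
    """L_P: weighted count of occurrences of block patterns in the grid.

    C_hard:   (M, N, N, N) discrete onehot block grid.
    patterns: list of (w_p, blocks) pairs, where blocks maps relative
              coordinates (x, y, z) within a patch to a block type index.
    """
    M = C_hard.shape[0]
    loss = C_hard.new_zeros(())
    for w_p, blocks in patterns:
        K = 1 + max(max(coord) for coord in blocks)  # patch side length
        j_p = len(blocks)
        # Onehot kernel: 1 at (block type, x, y, z) for each pattern block.
        W_p = C_hard.new_zeros((1, M, K, K, K))
        for (x, y, z), b in blocks.items():
            W_p[0, b, x, y, z] = 1.0
        # Each output cell counts how many of the j_p pattern blocks match;
        # subtracting j_p - 1 and applying ReLU leaves 1 only on full matches.
        act = F.relu(F.conv3d(C_hard.unsqueeze(0), W_p) - (j_p - 1))
        loss = loss + w_p * act.sum()
    return loss

# E.g., with hypothetical type indices DIRT and GOLD, penalize gold
# directly above dirt with weight 1.0:
# loss = adjacency_loss(C_hard, [(1.0, {(0, 0, 0): DIRT, (1, 0, 0): GOLD})])
```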

Table 1. R-precision (%) of neural renders for the Unconstrained NeRF and DreamCraft on COCO and Planet Minecraft prompts.

                          COCO                            Planet Minecraft
Model                     CLIP ViT-B/16   CLIP ViT-B/32   CLIP ViT-B/16   CLIP ViT-B/32
Unconstrained NeRF        61.44           66.67           25.17           31.29
DreamCraft                19.74           21.05           11.56           17.01
ratio                     0.32            0.32            0.46            0.54

Table 2. R-precision (%) of in-game renders.

Model                     CLIP ViT-B/16   CLIP ViT-B/32
Unconstrained NeRF        2.72            2.72
DreamCraft                5.44            6.12

4. Experiments

Baselines. We compare the performance of DreamCraft, our quantized NeRF which learns to arrange in-game assets during training, to an Unconstrained NeRF, a baseline that maps continuous outputs to game assets after training. The Unconstrained NeRF trains a text-guided NeRF and then maps its output to Minecraft blocks using nearest neighbors. For text guidance, we use a variant of Emu (Dai et al., 2023) trained only on Shutterstock, a model using text-conditioned latent diffusion to generate images. The nearest RGB value to each Minecraft block is determined by taking the average color over each of the (normally repeated) $16 \times 16$ textures covering each of its 6 faces. We define a width-$N$ grid over the 3D output space of the unconstrained model and query the RGB and density values at the center of each cell in the grid. We calculate the $L_2$ distance between each centerpoint and average Minecraft block color, mapping each cell to the closest block. We then select a density threshold $s = 10$, and place air blocks wherever $\sigma < s$.
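A sketch of this baseline mapping (our illustration; the function and argument names are our own):

```python
import torch

def nearest_block_map(rgb, sigma, block_colors, s=10.0):
    """Map an unconstrained NeRF's grid outputs to Minecraft blocks.

    rgb:          (N, N, N, 3) colors queried at cell centers.
    sigma:        (N, N, N) densities queried at cell centers.
    block_colors: (M, 3) average color of each block's face textures.
    Returns a grid of block indices, with -1 denoting air.
    """
    # L2 distance from each cell's color to each block's average color.
    d = torch.cdist(rgb.reshape(-1, 3), block_colors)  # (N^3, M)
    blocks = d.argmin(dim=-1).reshape(sigma.shape)
    blocks[sigma < s] = -1  # place air wherever density is below threshold
    return blocks
```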

Ablations. We study different quantization schemes in order to understand the best way to map the continuous outputs of the air and solid block MLPs into discrete grids of Minecraft blocks. The output of either MLP can be passed through the Gumbel softmax function to produce a discrete grid of air or solid blocks (see Figure 3). If these values are not discretized, i.e., $\alpha < 1$ or $\beta < 1$, meaning that $\mathbf{b}_{\mathrm{soft}}$ or $\sigma_{\mathrm{soft}}$ are used instead of their "hard" counterparts, then the resulting voxel grid can include solid blocks interpolated with air blocks (i.e., semi-transparent) or with one another (i.e., multi-texture). We also experiment with linearly annealing these values from their soft to hard counterparts over the course of training.

Evaluation Datasets. We evaluate our method on both generic and domain-specific text prompts, using the COCO (Lin et al., 2014) and Planet Minecraft (LLC, [n. d.]) datasets, respectively. For COCO, we use the same 153 prompts as in prior text-to-3D works (Mohammad Khalid et al., 2022; Poole et al., 2022; Lee and Chang, 2022). For Planet Minecraft, we take the names of the top 150 most downloaded assets uploaded by users in 2016 under the "Maps" category. Some examples include "a desperate and lonely wizards tower pmc chunk challenge entry lore", "mario kartgba bowsers castle 2", and "icarly set and nickelodeon studio". A full list of prompts can be found in Section D.

Evaluation Metrics. To quantitatively evaluate the performance of our model, we measure its fidelity using R-precision. More specifically, we query other pre-trained joint text-image encoders, namely CLIP ViT-B/16 and CLIP ViT-B/32 (Radford et al., 2021), and test whether they can recognize the caption responsible for a given NeRF rendering from a set of distractors (other, randomly selected captions from the dataset). For each caption, we repeat the process for 5 different test images and average the result.
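A sketch of one R-precision trial (our illustration, using the open-source CLIP package; the paper does not specify its evaluation code):

```python
import clip
import torch

def r_precision_trial(render, true_caption, distractors, device="cpu"):
    """Return True if CLIP ranks the true caption above all distractors
    for one rendered image; averaging over captions, with 5 test images
    per caption, yields the reported R-precision."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(render).unsqueeze(0).to(device)
    texts = clip.tokenize([true_caption] + distractors).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(texts)
        sims = torch.cosine_similarity(image_feat, text_feat)
    return bool(sims.argmax() == 0)  # true caption is at index 0
```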

Table 3. R-precision (%) of RGB and depth renders under different quantization schemes for block type and block density.

block type   block density   ViT-B/16 RGB   ViT-B/16 depth   ViT-B/32 RGB   ViT-B/32 depth
anneal       anneal          7.73           5.07             9.07           5.73
anneal       hard            2.67           1.20             5.60           2.00
anneal       soft            7.33           7.33             10.40          5.20
hard         anneal          8.53           6.00             8.40           7.33
hard         hard            2.27           1.33             2.80           1.87
hard         soft            10.40          8.93             13.20          8.40
soft         anneal          8.40           3.60             10.00          4.80
soft         hard            4.00           0.80             7.60           2.80
soft         soft            7.87           6.53             11.07          5.47

block type   block density   ViT-B/16 RGB   ViT-B/16 depth   ViT-B/32 RGB   ViT-B/32 depth
anneal       anneal          12.68          4.31             11.37          3.01
anneal       hard            5.36           0.65             6.80           0.92
anneal       soft            22.88          8.89             22.88          8.24
hard         anneal          12.68          5.88             14.12          4.71
hard         hard            4.05           2.22             4.97           1.18
hard         soft            19.61          12.03            21.83          11.11
soft         anneal          17.65          4.84             15.69          4.44
soft         hard            9.80           1.05             7.32           0.92
soft         soft            24.31          8.76             24.97          7.32

4.1. Quality of the Generations

In Figures 1 and 8, we can see that, using only Minecraft blocks, our model produces structures that are visually similar to those of the Unconstrained NeRF, for both generic and domain-specific text prompts. Note that the Unconstrained NeRF model is a strong upper bound because it has a continuous and thus much larger output space (i.e., higher resolution) than our discretized DreamCraft model. DreamCraft's relative performance increases when moving to a set of domain-relevant prompts.

In Table 1, we compare the fidelity of DreamCraft with that of the Unconstrained NeRF on COCO and Planet Minecraft. Note that here, the generations are evaluated using the renders from the neural ray tracing engine rather than in-game renders. As expected, limiting the NeRF's output to a specific set of discretely assembled blocks drastically reduces its space of generations. This is reflected in DreamCraft's lower fidelity with respect to the Unconstrained NeRF when generating objects from both the COCO and Planet Minecraft datasets. However, the performance gap between DreamCraft and the Unconstrained NeRF is reduced when moving from the generic COCO dataset to the domain-specific Planet Minecraft dataset. This suggests that despite its restricted output space, DreamCraft is particularly effective at generating high-quality structures when the input prompts are relevant to the (discrete) domain at hand.

In Table 2, we evaluate the R-precision using 2D captures of generated Minecraft block layouts in the game engine itself. Note that here, the generations are evaluated using the in-game renders rather than the neural renders. For game design applications, in-game render fidelity is more relevant than neural render fidelity, so this is ultimately the metric we care about in our study. In this case, the fidelity of the Unconstrained NeRF is lower than that of DreamCraft for both COCO and Planet Minecraft. This indicates that post-processing the output of a NeRF in order to discretize it using nearest neighbors leads to worse results than learning to use discrete blocks during the generation process. In Figure 9, we see that mapping a discrete set of grid vertices to nearest-neighbor block types via average color leads to sub-optimal results that are particularly bad at maintaining consistency in terms of texture and color. This result demonstrates the difficulty of translating unconstrained generations to a constrained repertoire of domain-specific assets. We conclude that by incorporating these game assets in the learning process, we can generate more faithful in-game structures.

4.2. Quantization Schemes

In Table 3, we compare the effect of applying soft, hard, and annealed quantization schemes to solid and air blocks. Maintaining soft air blocks (continuous-valued block transparency), in combination with hard solid blocks (discrete block types), leads to the highest R-precision at test time. This suggests that learning the topology (in contrast to the color/texture) of a generated structure is a sensitive process in which relaxing the quantization scheme (and presumably simplifying the loss landscape) is crucial.

[Figure: example renders under each quantization scheme. Columns: soft blocks, anneal, hard blocks. Rows: hard air, anneal, soft air.]

Forcing block density to be discrete throughout training leads to poor performance (as can be seen in the first row of Figure 6), with the poorest performance coming from models in which both block type and density are fully discrete. This may be because of noisier learning dynamics resulting from the quantized output space of the model.

Conversely, using soft block density can lead to situations in which apparently solid surfaces are emulated by layering a number of semi-transparent blocks. At test time, when rendering the fully discrete block grid, such surfaces can suddenly be culled from the image, as none of the individual blocks of which they consist is enough to produce a "solid" binary output after quantization.

5. Block Grid Resolution

Table 5. R-precision (%) of RGB and depth renders for varying block grid resolutions.

block grid   ViT-B/16 RGB   ViT-B/16 depth   ViT-B/32 RGB   ViT-B/32 depth
20           5.47           2.40             6.00           2.93
40           6.53           6.13             8.67           6.93
60           7.33           6.80             9.33           4.40
80           11.73          8.27             12.27          6.67
100          10.80          9.07             13.33          8.67


We experiment with the resolution of the block grid, learning an $N \times N \times N$-block representation of text prompts with $N \in \{10, 20, \dots, 100\}$.

In Table 5, we see that increasing block grid resolution leads to an increase in R-precision. We note that the more blocks in the grid, the closer each block comes to being represented by only a single pixel in each 2D render of the 3D block layout. In other words, we can expect the output of these higher-resolution quantized NeRFs to approximate that of their unconstrained counterparts with increasing accuracy. By analogy with visual art, we can say that the model uses blocks in an increasingly pointillistic fashion.

5.1. Functional Constraints

In Figure 6(a), we illustrate the effect of adding distributional constraints to a prompt asking for "a stylish hat". We can see that the model produces a similar structure using entirely gold, entirely redstone, or an even split of both, depending on the target specified by the distributional loss term.

In Figure 6(b), we illustrate the effect of adding an adjacency constraint prohibiting sand from "floating", i.e. being placed directly above an air tile, and ask for "space needle accurate" (a Planet Minecraft prompt). As the weight of the adjacency loss term is increased, the use of sand blocks becomes restricted to the central "bulb" of the tower, where it is more likely to be supported by the circular base of dirt blocks. When the adjacency loss is weighted less heavily, sand is often incorporated into the underside of the bulb (where it is unsupported and will fall to the ground in-game).

These experiments demonstrate that functional constraints can be easily integrated with our text-to-3D model to create controllable Minecraft structures that obey both high-level and low-level user specifications and can be rendered in-game.

6. Conclusion

In this work, we develop a new approach for generating functional game environments in Minecraft from free-form text descriptions. DreamCraft quantizes the output of a text-to-3D NeRF to predict discrete block types which are then mapped to game assets (i.e., voxel grids corresponding to in-game blocks). This allows the NeRF to use the game assets to represent the content described by a text prompt. We demonstrate that our approach has higher fidelity to the text prompt than a baseline that discretizes the output of an Unconstrained NeRF after learning. DreamCraft is, to our knowledge, the first generator capable of generating diverse, functional, and controllable 3D game environments directly from free-form text. Since our model can adapt to the unique appearance of user-supplied modular game assets to produce environments with high-level aesthetic properties, it may be particularly useful for game designers working in new domains that do not yet have large datasets of game layouts.

One limitation of DreamCraft is that it takes a few hours to generate a single structure. However, it could benefit from recent and future speed improvements in NeRFs (Wang et al., 2023; Guo et al., 2022; Sun et al., 2022; Zhang et al., 2020). Another promising direction for future work is to model lighting and shadow, in addition to color and density, which could be achieved using an auxiliary loss to model the in-game lighting effects.

While we focus on Minecraft, DreamCraft could be extended beyond cube-based environments, to any 3D environment involving discrete assets that can be approximated by voxel grids. It could also incorporate more complex functional constraints, assuming these could be implemented differentiably. For example, one could compute the path length between key blocks (e.g. a player spawn block or a treasure chest) using convolutions, and use the difference between the target and achieved path lengths as part of the loss, similar to the reward used in Khalifa et al. (2020) and Earle et al. (2021). It may also be beneficial, especially where functional constraints cannot be made differentiable, to use RL methods to approximate the gradient from such additional functional scores. Vis-a-vis embodied player agents, environments could be generated to result in certain rewards, dynamics, regret, or learnability with respect to these agents.

References

  • (1)
  • Alvarez etal. (2018)Alberto Alvarez, SteveDahlskog, Jose Font, Johan Holmberg,and Simon Johansson. 2018.Assessing aesthetic criteria in the evolutionarydungeon designer. In Proceedings of the 13thInternational Conference on the Foundations of Digital Games.1–4.
  • Awiszus etal. (2021)Maren Awiszus, FrederikSchubert, and Bodo Rosenhahn.2021.World-gan: a generative model for minecraftworlds. In 2021 IEEE Conference on Games (CoG).IEEE, 1–8.
  • Baker etal. (2022)Bowen Baker, Ilge Akkaya,Peter Zhokov, Joost Huizinga,Jie Tang, Adrien Ecoffet,Brandon Houghton, Raul Sampedro, andJeff Clune. 2022.Video pretraining (vpt): Learning to act bywatching unlabeled online videos.Advances in Neural Information ProcessingSystems 35 (2022),24639–24654.
  • Bontrager and Togelius (2021)Philip Bontrager andJulian Togelius. 2021.Learning to generate levels from nothing. In2021 IEEE Conference on Games (CoG). IEEE,1–8.
  • Brewer (2017)Nathan Brewer.2017.Computerized Dungeons and Randomly GeneratedWorlds: From Rogue to Minecraft [Scanning Our Past].Proc. IEEE 105,5 (2017), 970–977.
  • Brocchini etal. (2022)Michele Brocchini, MarcoMameli, Emanuele Balloni, LauraDellaSciucca, Luca Rossi, Marina Paolanti,Emanuele Frontoni, and PrimoZingaretti. 2022.MONstEr: A Deep Learning-Based System for theAutomatic Generation of Gaming Assets. In ImageAnalysis and Processing. ICIAP 2022 Workshops: ICIAP International Workshops,Lecce, Italy, May 23–27, 2022, Revised Selected Papers, Part I. Springer,280–290.
  • Bruce etal. (2024)Jake Bruce, MichaelDennis, Ashley Edwards, JackParker-Holder, Yuge Shi, Edward Hughes,Matthew Lai, Aditi Mavalankar,Richie Steigerwald, Chris Apps,etal. 2024.Genie: Generative Interactive Environments.arXiv preprint arXiv:2402.15391(2024).
  • Canossa and Smith (2015)Alessandro Canossa andGillian Smith. 2015.Towards a procedural evaluation technique: Metricsfor level design. In The 10th InternationalConference on the Foundations of Digital Games. sn, 8.
  • Chen etal. (2022)Anpei Chen, Zexiang Xu,Andreas Geiger, Jingyi Yu, andHao Su. 2022.Tensorf: Tensorial radiance fields. InComputer Vision–ECCV 2022: 17th EuropeanConference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, PartXXXII. Springer, 333–350.
  • Dahlskog and Togelius (2014)Steve Dahlskog andJulian Togelius. 2014.Procedural content generation using patterns asobjectives. In Applications of EvolutionaryComputation: 17th European Conference, EvoApplications 2014, Granada, Spain,April 23-25, 2014, Revised Selected Papers 17. Springer,325–336.
  • Dai etal. (2023)Xiaoliang Dai, Ji Hou,Chih-Yao Ma, Sam Tsai,Jialiang Wang, Rui Wang,Peizhao Zhang, Simon Vandenhende,Xiaofang Wang, Abhimanyu Dubey,etal. 2023.Emu: Enhancing image generation models usingphotogenic needles in a haystack.arXiv preprint arXiv:2309.15807(2023).
  • Dennis etal. (2020)Michael Dennis, NatashaJaques, Eugene Vinitsky, AlexandreBayen, Stuart Russell, Andrew Critch,and Sergey Levine. 2020.Emergent complexity and zero-shot transfer viaunsupervised environment design.Advances in neural information processingsystems 33 (2020),13049–13061.
  • Dieterich (2017)RobertOta Dieterich.2017.Using Proof-Of-Concept Feedback to Explore theRelationship Between Artists and Procedural Content Generation in ComputerGame Development Tools.Ph. D. Dissertation.
  • Earle etal. (2021)Sam Earle, Maria Edwards,Ahmed Khalifa, Philip Bontrager, andJulian Togelius. 2021.Learning controllable content generators. In2021 IEEE Conference on Games (CoG). IEEE,1–9.
  • Earle etal. (2022)Sam Earle, Justin Snider,MatthewC Fontaine, Stefanos Nikolaidis,and Julian Togelius. 2022.Illuminating diverse neural cellular automata forlevel generation. In Proceedings of the Geneticand Evolutionary Computation Conference. 68–76.
  • Fan etal. (2022)Linxi Fan, Guanzhi Wang,Yunfan Jiang, Ajay Mandlekar,Yuncong Yang, Haoyi Zhu,Andrew Tang, De-An Huang,Yuke Zhu, and Anima Anandkumar.2022.Minedojo: Building open-ended embodied agents withinternet-scale knowledge.arXiv preprint arXiv:2206.08853(2022).
  • Gao etal. (2022)Tianhan Gao, Jin Zhang,and Qingwei Mi. 2022.Procedural Generation of Game Levels and Maps: AReview. In 2022 International Conference onArtificial Intelligence in Information and Communication (ICAIIC). IEEE,050–055.
  • Gisslén etal. (2021)Linus Gisslén, AndyEakins, Camilo Gordillo, JoakimBergdahl, and Konrad Tollmar.2021.Adversarial reinforcement learning for proceduralcontent generation. In 2021 IEEE Conference onGames (CoG). IEEE, 1–8.
  • Goodfellow etal. (2020)Ian Goodfellow, JeanPouget-Abadie, Mehdi Mirza, Bing Xu,David Warde-Farley, Sherjil Ozair,Aaron Courville, and Yoshua Bengio.2020.Generative adversarial networks.Commun. ACM 63,11 (2020), 139–144.
  • Gravina etal. (2019)Daniele Gravina, AhmedKhalifa, Antonios Liapis, JulianTogelius, and GeorgiosN Yannakakis.2019.Procedural content generation through qualitydiversity. In 2019 IEEE Conference on Games(CoG). IEEE, 1–8.
  • Grbic etal. (2021)Djordje Grbic, RasmusBergPalm, Elias Najarro, Claire Glanois,and Sebastian Risi. 2021.Evocraft: A new challenge for open-endedness. InApplications of Evolutionary Computation: 24thInternational Conference, EvoApplications 2021, Held as Part of EvoStar 2021,Virtual Event, April 7–9, 2021, Proceedings 24. Springer,325–340.
  • Green etal. (2020)MichaelCerny Green,Luvneesh Mugrai, Ahmed Khalifa, andJulian Togelius. 2020.Mario level generation from mechanics using scenestitching. In 2020 IEEE Conference on Games(CoG). IEEE, 49–56.
  • Gumin (2016)Maxim Gumin.2016.Wave Function Collapse Algorithm.https://github.com/mxgmn/WaveFunctionCollapse
  • Guo etal. (2022)Xiang Guo, Guanying Chen,Yuchao Dai, Xiaoqing Ye,Jiadai Sun, Xiao Tan, andErrui Ding. 2022.Neural Deformable Voxel Grid for Fast Optimizationof Dynamic View Synthesis. In Proceedings of theAsian Conference on Computer Vision. 3757–3775.
  • Guss etal. (2021)WilliamH Guss,MarioYnocente Castro, Sam Devlin,Brandon Houghton, NoboruSean Kuno,Crissman Loomis, Stephanie Milani,Sharada Mohanty, Keisuke Nakata,Ruslan Salakhutdinov, etal.2021.The minerl 2020 competition on sample efficientreinforcement learning using human priors.arXiv preprint arXiv:2101.11071(2021).
  • Guzdial etal. (2017)Matthew Guzdial, DuriLong, Christopher Cassion, and AbhishekDas. 2017.Visual procedural content generation with anartificial abstract artist. In Proceedings of ICCCcomputational creativity and games workshop.
  • Guzdial etal. (2022a)Matthew Guzdial, SamSnodgrass, and AdamJ Summerville.2022a.Constraint-Based PCGML Approaches.In Procedural Content Generation viaMachine Learning: An Overview. Springer,51–66.
  • Guzdial etal. (2022b)Matthew Guzdial, SamSnodgrass, and AdamJ Summerville.2022b.PCGML Process Overview.In Procedural Content Generation viaMachine Learning: An Overview. Springer,35–49.
  • Guzdial etal. (2022c)Matthew Guzdial, SamSnodgrass, and AdamJ Summerville.2022c.Procedural Content Generation Via MachineLearning: An Overview.Springer.
  • Hao etal. (2021)Zekun Hao, Arun Mallya,Serge Belongie, and Ming-Yu Liu.2021.Gancraft: Unsupervised 3d neural rendering ofminecraft worlds. In Proceedings of the IEEE/CVFInternational Conference on Computer Vision. 14072–14082.
  • Hendrikx etal. (2013)Mark Hendrikx, SebastiaanMeijer, Joeri Van DerVelden, andAlexandru Iosup. 2013.Procedural content generation for games: A survey.ACM Transactions on Multimedia Computing,Communications, and Applications (TOMM) 9,1 (2013), 1–22.
  • Jang etal. (2017)Eric Jang, Shixiang Gu,and Ben Poole. 2017.Categorical Reparametrization with Gumble-Softmax.In International Conference on LearningRepresentations (ICLR 2017). OpenReview. net.
  • Jiang etal. (2021)Minqi Jiang, EdwardGrefenstette, and Tim Rocktäschel.2021.Prioritized level replay. InInternational Conference on Machine Learning.PMLR, 4940–4950.
  • Jiang etal. (2022)Zehua Jiang, Sam Earle,Michael Green, and Julian Togelius.2022.Learning Controllable 3D Level Generators. InProceedings of the 17th International Conference onthe Foundations of Digital Games. 1–9.
  • Johnson etal. (2016)Matthew Johnson, KatjaHofmann, Tim Hutton, and DavidBignell. 2016.The Malmo Platform for Artificial IntelligenceExperimentation.. In Ijcai.4246–4247.
  • Juliani etal. (2019)Arthur Juliani, AhmedKhalifa, Vincent-Pierre Berges, JonathanHarper, Ervin Teng, Hunter Henry,Adam Crespi, Julian Togelius, andDanny Lange. 2019.Obstacle tower: A generalization challenge invision, control, and planning.arXiv preprint arXiv:1902.01378(2019).
  • Justesen etal. (2018)Niels Justesen,RubenRodriguez Torrado, PhilipBontrager, Ahmed Khalifa, JulianTogelius, and Sebastian Risi.2018.Illuminating generalization in deep reinforcementlearning through procedural level generation. InNeurIPS Workshop on Deep Reinforcement Learning.
  • Kanervisto etal. (2022a)Anssi Kanervisto,Stephanie Milani, Karolis Ramanauskas,Nicholay Topin, Zichuan Lin,Junyou Li, Jianing Shi,Deheng Ye, Qiang Fu, WeiYang, etal. 2022a.Minerl diamond 2021 competition: Overview, results,and lessons learned.NeurIPS 2021 Competitions and DemonstrationsTrack (2022), 13–28.
  • Kanervisto etal. (2022b)Anssi Kanervisto,Stephanie Milani, Karolis Ramanauskas,Nicholay Topin, Zichuan Lin,Junyou Li, Jianing Shi,Deheng Ye, Qiang Fu, WeiYang, Weijun Hong, Zhongyue Huang,Haicheng Chen, Guangjun Zeng,Yue Lin, Vincent Micheli,Eloi Alonso, François Fleuret,Alexander Nikulin, Yury Belousov,Oleg Svidchenko, and Aleksei Shpilman.2022b.MineRL Diamond 2021 Competition: Overview, Results,and Lessons Learned. In Proceedings of the NeurIPS2021 Competitions and Demonstrations Track(Proceedings of Machine Learning Research,Vol.176), DouweKiela, Marco Ciccone, and BarbaraCaputo (Eds.). PMLR, 13–28.https://proceedings.mlr.press/v176/kanervisto22a.html
  • Karth and Smith (2019)Isaac Karth and AdamMSmith. 2019.Addressing the fundamental tension of PCGML withdiscriminative learning. In Proceedings of the14th International Conference on the Foundations of Digital Games.1–9.
  • Khalifa etal. (2020)Ahmed Khalifa, PhilipBontrager, Sam Earle, and JulianTogelius. 2020.Pcgrl: Procedural content generation viareinforcement learning. In Proceedings of the AAAIConference on Artificial Intelligence and Interactive DigitalEntertainment, Vol.16. 95–101.
  • Küttler etal. (2020)Heinrich Küttler,Nantas Nardelli, Alexander Miller,Roberta Raileanu, Marco Selvatici,Edward Grefenstette, and TimRocktäschel. 2020.The nethack learning environment.Advances in Neural Information ProcessingSystems 33 (2020),7671–7684.
  • Lee and Chang (2022)Han-Hung Lee and AngelXChang. 2022.Understanding pure clip guidance for voxel gridnerf models.arXiv preprint arXiv:2209.15172(2022).
  • Lee et al. (2020) Vivian Lee, Nathan Partlan, and Seth Cooper. 2020. Precomputing Player Movement in Platformers for Level Generation with Reachability Constraints. In AIIDE Workshops.
  • Liapis et al. (2013) Antonios Liapis, Georgios Yannakakis, and Julian Togelius. 2013. Designer modeling for personalized game content creation tools. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 9. 11–16.
  • Liapis et al. (2012) Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius. 2012. Adapting models of visual aesthetics for personalized content creation. IEEE Transactions on Computational Intelligence and AI in Games 4, 3 (2012), 213–228.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
  • Liu et al. (2021) Jialin Liu, Sam Snodgrass, Ahmed Khalifa, Sebastian Risi, Georgios N. Yannakakis, and Julian Togelius. 2021. Deep learning for procedural content generation. Neural Computing and Applications 33, 1 (2021), 19–37.
  • LLC ([n. d.]) Cyprezz LLC. [n. d.]. Planet Minecraft Community: Creative fansite for everything Minecraft! https://www.planetminecraft.com/.
  • López et al. (2020) Christian E. López, James Cunningham, Omar Ashour, and Conrad S. Tucker. 2020. Deep reinforcement learning for procedural content generation of 3D virtual environments. Journal of Computing and Information Science in Engineering 20, 5 (2020).
  • Medina et al. (2023) Alejandro Medina, Melanie Richey, Mark Mueller, and Jacob Schrum. 2023. Evolving Flying Machines in Minecraft Using Quality Diversity. arXiv preprint arXiv:2302.00782 (2023).
  • Merino et al. (2023a) Timothy Merino, M Charity, and Julian Togelius. 2023a. Interactive Latent Variable Evolution for the Generation of Minecraft Structures. In Proceedings of the 18th International Conference on the Foundations of Digital Games. 1–8.
  • Merino et al. (2023b) Timothy Merino, Roman Negri, Dipika Rajesh, M Charity, and Julian Togelius. 2023b. The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 19. 107–115.
  • Milani et al. (2020) Stephanie Milani, Nicholay Topin, Brandon Houghton, William H. Guss, Sharada P. Mohanty, Keisuke Nakata, Oriol Vinyals, and Noboru Sean Kuno. 2020. Retrospective analysis of the 2019 MineRL competition on sample efficient reinforcement learning. In NeurIPS 2019 Competition and Demonstration Track. PMLR, 203–214.
  • Mildenhall et al. (2020) B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. 2020. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision.
  • Mohammad Khalid et al. (2022) Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. 2022. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
  • Mott et al. (2019) Justin Mott, Saujas Nandi, and Luke Zeller. 2019. Controllable and coherent level generation: A two-pronged approach. In Experimental AI in Games Workshop.
  • Nair (2020) Rohit Nair. 2020. Using Raymarched Shaders as Environments in 3D Video Games. Drexel University.
  • Nelson et al. (2016) Mark J. Nelson, Julian Togelius, Cameron Browne, and Michael Cook. 2016. Rules and mechanics. Procedural Content Generation in Games (2016), 99–121.
  • Parker-Holder et al. (2022) Jack Parker-Holder, Minqi Jiang, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette, and Tim Rocktäschel. 2022. Evolving curricula with regret-based environment design. In International Conference on Machine Learning. PMLR, 17473–17498.
  • Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
  • Risi and Togelius (2020) Sebastian Risi and Julian Togelius. 2020. Increasing generality in machine learning through procedural content generation. Nature Machine Intelligence 2, 8 (2020), 428–436.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.
  • Salge et al. (2022) Christoph Salge, Claus Aranha, Adrian Brightmoore, Sean Butler, Rodrigo De Moura Canaan, Michael Cook, Michael Green, Hagen Fischer, Christian Guckelsberger, Jupiter Hadley, et al. 2022. Impressions of the GDMC AI Settlement Generation Challenge in Minecraft. In Proceedings of the 17th International Conference on the Foundations of Digital Games. 1–16.
  • Salge et al. (2018) Christoph Salge, Michael Cerny Green, Rodrigo Canaan, and Julian Togelius. 2018. Generative design in Minecraft (GDMC) settlement generation competition. In Proceedings of the 13th International Conference on the Foundations of Digital Games. 1–10.
  • Samvelyan et al. (2021) Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang, Eric Hambro, Fabio Petroni, Heinrich Küttler, Edward Grefenstette, and Tim Rocktäschel. 2021. MiniHack the Planet: A sandbox for open-ended reinforcement learning research. arXiv preprint arXiv:2109.13202 (2021).
  • Sarkar and Cooper (2020) Anurag Sarkar and Seth Cooper. 2020. Sequential segment-based level generation and blending using variational autoencoders. In Proceedings of the 15th International Conference on the Foundations of Digital Games. 1–9.
  • Sarkar et al. (2020) Anurag Sarkar, Zhihan Yang, and Seth Cooper. 2020. Conditional level generation and game blending. arXiv preprint arXiv:2010.07735 (2020).
  • Shaker et al. (2016a) Noor Shaker, Julian Togelius, and Mark J. Nelson. 2016a. Procedural Content Generation in Games. (2016).
  • Shaker et al. (2016b) Noor Shaker, Julian Togelius, Mark J. Nelson, Antonios Liapis, Gillian Smith, and Noor Shaker. 2016b. Mixed-initiative content creation. Procedural Content Generation in Games (2016), 195–214.
  • Shaker et al. (2010) Noor Shaker, Georgios Yannakakis, and Julian Togelius. 2010. Towards automatic personalized content generation for platform games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 6. 63–68.
  • Singer et al. (2022) Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022).
  • Siper et al. (2022) Matthew Siper, Ahmed Khalifa, and Julian Togelius. 2022. Path of Destruction: Learning an Iterative Level Generator Using a Small Dataset. arXiv preprint arXiv:2202.10184 (2022).
  • Skjeltorp (2022) Ole Edvin Skjeltorp. 2022. 3D Neural Cellular Automata - Simulating Morphogenesis: Shape, Color and Behavior of Three-Dimensional Structures. Master's thesis.
  • Smith and Mateas (2011) Adam M. Smith and Michael Mateas. 2011. Answer set programming for procedural content generation: A design space approach. IEEE Transactions on Computational Intelligence and AI in Games 3, 3 (2011), 187–200.
  • Sudhakaran et al. (2023) Shyam Sudhakaran, Miguel González-Duque, Claire Glanois, Matthias Freiberger, Elias Najarro, and Sebastian Risi. 2023. MarioGPT: Open-Ended Text2Level Generation through Large Language Models. arXiv:2302.05981 [cs.AI]
  • Sudhakaran et al. (2021) Shyam Sudhakaran, Djordje Grbic, Siyan Li, Adam Katona, Elias Najarro, Claire Glanois, and Sebastian Risi. 2021. Growing 3D artefacts and functional machines with neural cellular automata. arXiv preprint arXiv:2103.08737 (2021).
  • Summerville and Mateas (2015) Adam Summerville and Michael Mateas. 2015. Sampling Hyrule: Multi-technique probabilistic level generation for action role playing games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 11. 63–67.
  • Summerville et al. (2018) Adam Summerville, Sam Snodgrass, Matthew Guzdial, Christoffer Holmgård, Amy K. Hoover, Aaron Isaksen, Andy Nealen, and Julian Togelius. 2018. Procedural content generation via machine learning (PCGML). IEEE Transactions on Games 10, 3 (2018), 257–270.
  • Sun et al. (2022) Cheng Sun, Min Sun, and Hwann-Tzong Chen. 2022. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5459–5469.
  • Team et al. (2023) Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, et al. 2023. Human-Timescale Adaptation in an Open-Ended Task Space. arXiv preprint arXiv:2301.07608 (2023).
  • Team et al. (2021) Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. 2021. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808 (2021).
  • Todd et al. (2023) Graham Todd, Sam Earle, Muhammad Umair Nasir, Michael Cerny Green, and Julian Togelius. 2023. Level Generation Through Large Language Models. In Proceedings of the 18th International Conference on the Foundations of Digital Games. 1–8.
  • Togelius et al. (2011) Julian Togelius, Georgios N. Yannakakis, Kenneth O. Stanley, and Cameron Browne. 2011. Search-based procedural content generation: A taxonomy and survey. IEEE Transactions on Computational Intelligence and AI in Games 3, 3 (2011), 172–186.
  • Torrado et al. (2020) Ruben Rodriguez Torrado, Ahmed Khalifa, Michael Cerny Green, Niels Justesen, Sebastian Risi, and Julian Togelius. 2020. Bootstrapping conditional GANs for video game level generation. In 2020 IEEE Conference on Games (CoG). IEEE, 41–48.
  • Wang et al. (2022) Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang, et al. 2022. Is Attention All That NeRF Needs? arXiv preprint arXiv:2207.13298 (2022).
  • Wang et al. (2023) Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, and Wenping Wang. 2023. F2-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories. arXiv preprint arXiv:2303.15951 (2023).
  • Watkins (2016) Ryan Watkins. 2016. Procedural Content Generation for Unity Game Development. Packt Publishing Ltd.
  • Yannakakis and Togelius (2011) Georgios N. Yannakakis and Julian Togelius. 2011. Experience-driven procedural content generation. IEEE Transactions on Affective Computing 2, 3 (2011), 147–161.
  • Yates (2021) Cristopher Yates. 2021. The Use of Poisson Disc Distribution and A* Pathfinding for Procedural Content Generation in Minecraft. Ph.D. Dissertation. Memorial University.
  • Zhang et al. (2020) Hejia Zhang, Matthew Fontaine, Amy Hoover, Julian Togelius, Bistra Dilkina, and Stefanos Nikolaidis. 2020. Video game level repair via mixed integer linear programming. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 16. 151–158.
  • Zook and Riedl (2014) Alexander Zook and Mark O. Riedl. 2014. Generating and adapting game mechanics. In Proceedings of the 2014 Foundations of Digital Games Workshop on Procedural Content Generation in Games.

Appendix A Minecraft Textures

The set of Minecraft textures used to imitate in-game blocks within the neural rendering engine is displayed in Table 6.

Table 6. Block types used in the neural rendering engine (texture images omitted): log oak, stone, dirt, brick, clay, snow, light blue glazed terracotta, yellow glazed terracotta, redstone block, gold block, iron block, diamond block, emerald block, cobblestone, slime.

Appendix B DreamCraft Generations

[Figures omitted: a gallery of example DreamCraft generations.]

Appendix C Limitations

In some cases, the model uses negative space to represent an object: it modulates the background texture to a particular color, then occludes parts of it with foreground blocks/density to give the object an apparent shape. This undesirable swapping of roles between the foreground and background models may be more likely to occur in the quantized NeRF: certain colors or textures may be difficult or impossible to replicate using the provided blocks, whereas the background MLP remains unconstrained. To mitigate this, future work could investigate constraining the background MLP to only use 2D projections of “distant” game assets.
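
To make this failure mode concrete, the sketch below shows the standard foreground-plus-background compositing that such a pipeline computes per ray. This is our own illustration, not DreamCraft's actual implementation; the function name and tensor shapes are assumptions. When the optimizer drives foreground opacity toward zero, the residual transmittance hands the pixel color to the unconstrained background MLP, which is exactly what allows objects to be "painted" into the background.

```python
import torch

def composite(fg_rgb, fg_alpha, bg_rgb):
    """Alpha-composite foreground samples over a learned background color.

    fg_rgb:   (num_rays, num_samples, 3) colors of quantized-block samples
    fg_alpha: (num_rays, num_samples)    opacities along each ray
    bg_rgb:   (num_rays, 3)              color predicted by a background MLP
    """
    # Transmittance: probability that light passes all samples before sample i.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(fg_alpha[:, :1]), 1.0 - fg_alpha + 1e-10], dim=-1),
        dim=-1,
    )[:, :-1]
    weights = fg_alpha * trans                        # (num_rays, num_samples)
    fg_color = (weights[..., None] * fg_rgb).sum(1)   # (num_rays, 3)
    residual = 1.0 - weights.sum(-1, keepdim=True)    # light reaching the background
    # If foreground density collapses, `residual` -> 1 and the background MLP,
    # which is not restricted to block textures, ends up depicting the object.
    return fg_color + residual * bg_rgb
```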

Another potential issue is the lack of semantic grounding with respect to block types. For example, the model may satisfy the prompt “large medieval ship” just as well with a combination of dirt and redstone as with actual wooden logs or planks, so long as these give the appearance of wood. Our preliminary work on functional constraints suggests that this particular problem can be addressed by setting per-block-type targets (e.g. requiring 0% dirt and 50% wood blocks), but a more general approach might lie in “demonstrating” what each block type should be used to represent by adding this information to the prompt.
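
A minimal sketch of such a per-block-type target, assuming the quantized representation exposes soft (differentiable) block probabilities per voxel; the function and tensor names below are illustrative rather than DreamCraft's actual API:

```python
import torch.nn.functional as F

def block_distribution_loss(block_logits, target_dist):
    """Penalize deviation from a target distribution over block types.

    block_logits: (num_voxels, num_block_types) unnormalized per-voxel scores
    target_dist:  (num_block_types,) desired fractions, e.g. 0.0 for dirt
                  and 0.5 for wood in the "large medieval ship" example
    """
    probs = F.softmax(block_logits, dim=-1)  # soft block assignment per voxel
    empirical = probs.mean(dim=0)            # expected fraction of each type
    return F.mse_loss(empirical, target_dist)
```

A weighted version of this term can then be added to the text-alignment objective during training.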

Whereas traditional NeRFs can model lighting and shadows, this is not the case in DreamCraft, where the color at each point in 3D space is derived directly from a voxel grid corresponding to the in-game appearance of a Minecraft block. When structures are rendered inside the neural engine, they thus appear “flat” compared to the shadow and lighting effects produced by the Minecraft game engine. Ideally, we could train an auxiliary model to mimic the effects of in-game lighting, for example by training it on paired datapoints of 3D block grids and their appearance in-game from various angles. This learned renderer could replace the differentiable raycasting component of the NeRF pipeline (as in (Wang et al., 2022)), further sparing us from having to re-implement the rendering of irregular game objects such as plants and glass.
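
A deliberately minimal sketch of how such an auxiliary renderer might be trained, assuming a dataset of (block grid, camera pose, screenshot) triples; the architecture below is a toy (a practical renderer would need a spatial decoder rather than global pooling), and every name here is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedRenderer(nn.Module):
    """Toy model mapping a one-hot block grid and a camera pose to an RGB
    image, intended to mimic in-game lighting and shading."""

    def __init__(self, num_block_types, image_size=64):
        super().__init__()
        self.encoder = nn.Sequential(  # collapse the 3D grid to a feature vector
            nn.Conv3d(num_block_types, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.decoder = nn.Linear(64 + 6, image_size * image_size * 3)
        self.image_size = image_size

    def forward(self, block_grid, camera_pose):
        feats = self.encoder(block_grid)             # (batch, 64)
        x = torch.cat([feats, camera_pose], dim=-1)  # append 6-DoF camera pose
        img = torch.sigmoid(self.decoder(x))
        return img.view(-1, 3, self.image_size, self.image_size)

# Training sketch: regress onto in-game screenshots.
# for grid, pose, screenshot in dataloader:
#     loss = F.mse_loss(renderer(grid, pose), screenshot)
```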

DreamCraft is currently too slow to be feasibly used in an online player-environment generation loop, taking a few hours to generate a single structure. Future versions could benefit from recent and ongoing speed improvements in NeRFs (Wang et al., 2023; Guo et al., 2022; Sun et al., 2022; Zhang et al., 2020). Alternatively, it could be leveraged to generate a training set for a conditional, guidance-free generative model of game worlds.

Appendix D Planet Minecraft Dataset

To test DreamCraft’s ability to generate environments specific to the domain for which it was designed, we source text prompts from Planet Minecraft, a fan-operated site where users can upload and share custom content. We consider a subset of assets uploaded to the “Maps” category in 2016 (the year in which the most such assets were uploaded), and select the top 150 maps of this subset as measured by the number of user downloads. The prompts correspond to the names of these assets. We do not collect the assets themselves or any further data from the site.
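
The selection procedure amounts to a filter-and-sort over the scraped metadata. A sketch, assuming each map record carries hypothetical category, year, downloads, and name fields (names are lowercased, matching the prompts listed below):

```python
def select_prompts(maps, year=2016, top_k=150):
    """Return the names of the most-downloaded "Maps" assets from one year."""
    candidates = [m for m in maps if m["category"] == "Maps" and m["year"] == year]
    candidates.sort(key=lambda m: m["downloads"], reverse=True)
    return [m["name"].lower() for m in candidates[:top_k]]
```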

The set of prompts scraped from Planet Minecraft is given below:

| paris eiffel tower | la valle dor by mrbatou download cinematic | reims cathedral | summit creative house | mexican hacjenda | coruscant senate building | from my house to yours merry christmas | the ziggurat | polaris skyscraper 25 | the craftsmans abode pmc solo contest 4 | rustic fantasy house timelapse download | skyscraper 31 ias | chateau de silveberg | fuminsh the city that never sleeps | ontario tower | dirt modern house | farin rocks | elven tower of the wise | distorsion chunk challenge | tours thiers nancy france skyscraper 3 ias | luxury beach house | space lighthouse | central place modern office complex | minecraft is a small world | tiger ii 101 scale | shurwyth snowlands download weareconquest | bridges | jurassic world v2 for jurassicraft 20 minecraft dinosaurs jurassic park isla nublar | brynwalda survival map | icarly set and nickelodeon studio | battle of hoth map echo base star wars hoth map | dream | the little castle nebelburg | steampunk island | land of azorth | jaws ride and amity village | avatar base | quartz tower 1 | small modern house 3 full interior | minecraft disneyland 1965 | 2012 skyrim | greenfield project neoclassical house | 2012 sea dogs village | eternal haven | patronis | japanese temple | star labs the flash cw | hub spawn | church of the annunciation of the blessed virgin mary inowrocaw | space needle accurate | the dark city hokkaido | a watch tower inspired by the game firewatch | undertale | white snowy castle | avengers tower | fantme villa modern house 2 | abandoned wild west | greenfield building vista creek elementary school | fantasy bundle level 25 special | a modern house 1 | server spawn by infro_ | fantasy inspired village danjgames | classic american farm | the builders shrine chunk challange | der eisendrache | modern house by real architect | ahzvels hq download re upload | download a medieval detached farm showcase | wg tower vice city | calypso a modern villa | tf2 egypt | minigames map atlantis | large medieval ship | small cabin | sustainable city | large medieval ship | sequoia valley 30 | kraehenfels survival version | small hospital | panem mc 1st quarter quell arena download | a desperate and lonely wizards tower pmc chunk challenge entry lore | paper mariocolour splash port prisma | skyscraper planus | hollywood residence | a nordic mountain village | old wizards tree mansion series 2 build 1read description | hidden in the sand | five nights at candys 2 roleplay map | castle ardor | epic server spawn | medium medieval home | medeival windmill 1102 | grand stadium pixelmon | prison mine 2 with download | babylon gardens | the new world trade center 11 | 30days day 28 orcish butchers slaughter house | old west home | ark labs outpost 26 | icebornminigames map | grand university | medieval mountain castle | minecraft maps | victorian manor | 30days day 8 dwarven entrance | spherical greenhouse 18 19 110 | huge minecraft server spawn airidale | big cottage | greenfield typical victorian | old fortress | modern condoapartment | small castle | ss tropic a custom by prestogo | two story old west shop | citadel hill fort george | batman arkham asylum | forest cottage | mountain temple | the conjuring | northwich | the amazing word of gumball the wattersons house 18 | five nights at freddys minecraft map 19 and above | fnaf roleplay map | middle eastern farm | modern house build sorry about no music in the video | a medieval farm | woodland mansion | survival map jairus isle | mario kartgba bowsers castle 2 | the scandinavian townhouse | mesa fort | three cool wood structures | brenttwood estates | toy shop 192 | dota 2 | server shop | store fletchers retreat | custom realistic terrain | the piggy sphinx | shubbles castle building contest | mountain housecastle 1 | roman outpost | fallout 4 red rocket | fantasy house | savanna village in the sky cinematic download | kahuai city | minecraft lets build timelapse fantasy update 12 over hanging house | kent regional airport | medieval house | redstone bunker 13 redstone creations version 1
