Abstract
In Embodied AI, the creation of 3D simulated environments is of paramount
importance, yet it typically demands specialized expertise and substantial
manual labor, which limits the diversity and scale of such environments.
In this thesis, we introduce a novel framework that addresses this limitation
by fully automating the generation of 3D environments from user-supplied
prompts. The framework is versatile: it crafts diverse scenes, adapts designs
to different styles, and comprehends the semantics of intricate queries.
Central to our approach is a large language model (LLM), whose common-sense
knowledge allows the framework to envision plausible scene configurations.
We also draw on the vast collection of 3D assets in Objaverse to populate
scenes with a rich array of objects. The framework is further strengthened by
a feedback agent, powered by multimodal models such as GPT-4 Vision, which
guides the generation process toward the desired outcome. Finally, we employ
Retrieval-Augmented Generation (RAG) to enrich the generation process and
incorporate reference images, leveraging the advanced visual understanding
of GPT-4 Vision.
User evaluations indicate a strong preference for our approach: 75% of users
favored scenes generated with the feedback agent, 55.6% preferred scenes
generated with RAG, and 83.3% agreed that the generated scene resembles the
reference image. Compared with the state of the art, our implementation is
faster and more modular, improving both user experience and system
efficiency.
Additionally, this thesis introduces a novel algorithm that integrates Large
Language Models (LLMs) with Conformal Geometric Algebra (CGA) for
controllable 3D scene editing, particularly object repositioning. Conventional
methods typically rely on large training datasets or lack a formal language
for precise edits. Using CGA as a robust formal language, our framework
precisely models the spatial transformations required for accurate object
repositioning. Leveraging the zero-shot capabilities of pre-trained LLMs, it
translates natural-language instructions into CGA operations, enabling exact
spatial transformations within 3D scenes without the need for specialized
pre-training.
To accurately assess the impact of CGA, we benchmark against robust
Euclidean-based baselines, evaluating both latency and accuracy. Comparative
performance evaluations show that our framework reduces LLM response times by
16% and improves success rates by 9.6% on average relative to these baselines.
These advancements underscore our framework’s potential to democratize 3D
scene generation and editing, enhancing accessibility and fostering innovation
across sectors such as education, entertainment, and virtual reality.