Abstract
In Embodied AI, the creation of 3D simulated environments is of paramount
importance, yet it typically demands specialized expertise and substantial
manual labor, which limits the diversity and scale of such environments.
In this thesis, we introduce a novel framework that addresses this limitation
by fully automating the generation of 3D environments from user-supplied
prompts. The framework is versatile: it crafts diverse scenes, adapts designs
to different styles, and comprehends the semantics of intricate queries.
Central to our approach is a large language model (LLM), whose common-sense
knowledge allows the framework to envision plausible scene configurations.
We also draw on the vast collection of 3D assets in Objaverse to populate
scenes with a rich array of objects. The framework is further strengthened by
a feedback agent, powered by multimodal models such as GPT-4 Vision, which
guides the generation process toward the desired outcome. Finally, we employ
Retrieval-Augmented Generation (RAG) to enrich the generation process and
incorporate reference images, leveraging the advanced visual understanding
of GPT-4 Vision.
User evaluations indicate a strong preference for our approach: 75% of users
favored scenes generated with the feedback agent, 55.6% preferred scenes
generated with RAG, and 83.3% agreed that the generated scene resembles the
reference image. Compared with the state of the art, our implementation is
faster and more modular, improving both user experience and system
efficiency.
Additionally, this thesis introduces a novel algorithm that integrates Large
Language Models (LLMs) with Conformal Geometric Algebra (CGA) for
controllable 3D scene editing, particularly object repositioning. Conventional
methods typically rely on large training datasets or lack a formal language
for precise edits. Using CGA as a robust formal language, our framework
precisely models the spatial transformations required for accurate object
repositioning. Leveraging the zero-shot capabilities of pre-trained LLMs, it
translates natural-language instructions into CGA operations, enabling exact
spatial transformations within 3D scenes without the need for specialized
pre-training.
To accurately assess the impact of CGA, we benchmark against robust
Euclidean-based baselines, evaluating both latency and accuracy. Comparative
performance evaluations show that our framework reduces LLM response times by
16% and improves success rates by 9.6% on average relative to these baselines.
These advancements underscore our framework’s potential to democratize 3D
scene generation and editing, enhancing accessibility and fostering innovation
across sectors such as education, entertainment, and virtual reality.