We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second at inference with around a 10-fold reduction in training cost, and generalizes to unseen prompts. Our key idea is a novel triplane-based text-to-mesh architecture with a two-stage training strategy that ensures stable optimization and scalability. Through extensive experiments on various prompt benchmarks, AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy (on the DF415 dataset) and more distinguishable, higher-quality 3D outputs. AToM demonstrates strong generalizability, offering fine-grained details of 3D content for unseen interpolated prompts, unlike per-prompt solutions.
AToM proposes a triplane-based text-to-mesh architecture with a two-stage amortized optimization strategy that ensures stable training and scalability. AToM is optimized through score distillation sampling without any 3D data.
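To make the amortized setup concrete, the following is a minimal PyTorch sketch of score-distillation training shared across a set of prompts. All names here (TriplaneTextToMesh, sds_grad), the embedding sizes, and the random stand-in for the diffusion gradient are illustrative assumptions rather than the authors' released code; in AToM the generator decodes a text-conditioned triplane into a textured mesh that is rendered, and training proceeds in two stages (a low-resolution volumetric stage followed by mesh refinement).

\begin{verbatim}
# Minimal sketch of amortized score-distillation training over many prompts.
# Module and function names are hypothetical stand-ins, not the authors' API.
import torch

class TriplaneTextToMesh(torch.nn.Module):
    """Hypothetical text-conditioned generator: prompt embedding -> rendering.
    The real model maps the embedding to a triplane, decodes a textured mesh,
    and rasterizes it; here we only emit an RGB image of the right shape."""
    def __init__(self, embed_dim=64, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.decoder = torch.nn.Linear(embed_dim, 3 * image_size * image_size)

    def forward(self, text_embed):
        img = self.decoder(text_embed)
        return img.view(-1, 3, self.image_size, self.image_size)

def sds_grad(rendered, text_embed):
    # Placeholder for the score-distillation gradient from a frozen 2D
    # diffusion prior (predicted noise minus injected noise); random here.
    return torch.randn_like(rendered)

prompts = torch.randn(300, 64)           # stand-in text embeddings for 300 prompts
model = TriplaneTextToMesh()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    # Amortized optimization: each step samples a batch of *different* prompts,
    # so a single network is shared across the whole prompt set.
    idx = torch.randint(0, prompts.shape[0], (8,))
    text_embed = prompts[idx]
    rendered = model(text_embed)
    # SDS supplies a gradient on the rendering rather than a scalar loss;
    # the usual surrogate detaches the target so backprop routes that gradient
    # into the shared generator parameters.
    grad = sds_grad(rendered, text_embed)
    loss = 0.5 * ((rendered - (rendered - grad).detach()) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
\end{verbatim}

Because every optimization step draws a different batch of prompts, the same surrogate objective that drives per-prompt SDS methods here trains one shared generator, which is what removes the per-prompt optimization cost and enables generalization to unseen prompts.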
AToM generalizes to unseen interpolated prompts. We compare AToM to its per-prompt counterpart (AToM Per-Prompt) on the Pig64 compositional prompt set, whose prompts follow the format ``a pig {activity} {theme}'', where each row and column correspond to a different activity and theme, respectively. Models are trained on 56 prompts and tested on all 64, with the 8 unseen test prompts lying on the diagonal.
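For clarity, a small sketch of how such a compositional prompt grid with a held-out diagonal can be constructed is shown below. The activity and theme strings are generic placeholders, not the actual Pig64 entries; only the 8-by-8 layout and the 56/8 train/test split follow the caption above.

\begin{verbatim}
# Sketch of a Pig64-style compositional prompt grid with the diagonal held out.
# Placeholder activities/themes; only the split structure matches the caption.
activities = [f"<activity {i}>" for i in range(8)]
themes = [f"<theme {j}>" for j in range(8)]

train_prompts, test_prompts = [], []
for i, act in enumerate(activities):
    for j, theme in enumerate(themes):
        prompt = f"a pig {act} {theme}"
        # Prompts on the diagonal (activity index == theme index) are unseen.
        (test_prompts if i == j else train_prompts).append(prompt)

assert len(train_prompts) == 56 and len(test_prompts) == 8
\end{verbatim}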
AToM generalizes to unseen prompts (diagonal from top left to bottom right)
Per-prompt text-to-3D cannot generalize to unseen prompts and yields inconsistent results
Trained on only 300 prompts, AToM generalizes to 2,400 interpolated prompts; we show a subset here. Note the consistent identity, orientation, and quality.
AToM produces high-quality textured meshes in less than 1 second at inference. Here we show results of AToM on the DF415 dataset.