Single embodied multimodal LLM up to 562B parameters (PaLM-E-562B).
Decoder-only autoregressive text generation based on PaLM.
Encodes images, robot states, sensor data, and neural 3D representations into the language embedding space.
Treats continuous observations as tokens in multimodal sentences for end-to-end training (see the sketch after this list).
Embodied reasoning across multiple robot embodiments (tabletop and mobile manipulation).
Visual-language generalist performance with state-of-the-art results on OK-VQA and competitive scores on general VQA and image-captioning benchmarks.
Positive transfer via joint training on internet-scale language, vision, and visual-language data.
Zero-shot multimodal chain-of-thought reasoning for navigation questions, arithmetic over handwritten numbers in images, and egocentric question answering.
Textual planning outputs executable by low-level robot policies or planners.
Special tokens for unambiguous object grounding and referencing in prompts.
Maintains strong language capabilities while adding multimodal and embodied skills.
Supports sensor fusion and reasoning in complex, dynamic physical environments.
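The "multimodal sentence" idea above can be illustrated with a minimal sketch. The snippet below is illustrative only, not PaLM-E's actual code: every name in it (ObservationEncoder, build_multimodal_sentence, D_MODEL, OBS_DIM, the four-token budget) is a hypothetical stand-in. It shows the general pattern of projecting a continuous observation into the language model's embedding width and splicing the resulting vectors into the word-embedding sequence so the decoder consumes them like ordinary text tokens.

```python
import torch
import torch.nn as nn

D_MODEL = 512   # hypothetical language-model embedding width
OBS_DIM = 64    # hypothetical robot-state / sensor feature size

class ObservationEncoder(nn.Module):
    """Maps a continuous observation vector to a few 'soft tokens'."""
    def __init__(self, obs_dim: int, d_model: int, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(obs_dim, d_model * n_tokens)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, obs_dim) -> (batch, n_tokens, d_model)
        return self.proj(obs).view(obs.shape[0], self.n_tokens, -1)

def build_multimodal_sentence(text_embeds, obs_embeds, insert_at):
    """Interleave observation 'tokens' into the text embedding sequence."""
    before, after = text_embeds[:, :insert_at], text_embeds[:, insert_at:]
    return torch.cat([before, obs_embeds, after], dim=1)

# Toy usage: a 10-token prompt with one sensor reading spliced in after
# the third word; the result feeds the decoder like any text sequence.
text_embeds = torch.randn(1, 10, D_MODEL)   # stand-in for word embeddings
obs = torch.randn(1, OBS_DIM)               # stand-in for a sensor reading
encoder = ObservationEncoder(OBS_DIM, D_MODEL)
sentence = build_multimodal_sentence(text_embeds, encoder(obs), insert_at=3)
print(sentence.shape)  # torch.Size([1, 14, 512])
```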
Design and execute long-horizon tabletop and mobile manipulation plans from multimodal observations and text prompts.
Program robots via natural-language instructions for pick-and-place, sorting, or tool-use with object grounding tokens.
Conduct VQA, captioning, and scene understanding experiments using a single model across datasets and modalities.
Fuse images, states, and sensor data for embodied reasoning and decision-making in dynamic environments.
Enable natural-language dialogue with robots that converts user requests into executable stepwise action plans (see the sketch after this list).
Perform zero-shot question answering and reasoning over temporally annotated egocentric video streams.
Apply multimodal chain-of-thought reasoning to navigation questions like assessing route feasibility from images.
Use VQA and captioning to check parts, describe scenes, and flag anomalies from multi-sensor inputs.
Create interactive demos that explain visual scenes, perform math on handwritten numbers, and plan embodied tasks.
Transfer policies and reasoning across multiple robot embodiments using a unified multimodal language model.
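For the planning use cases above, here is a rough sketch of how a generated textual plan might be dispatched to low-level skills. It is a hypothetical interface, not PaLM-E's or any robot stack's actual API: the plan format, the step-parsing regex, and the execute_skill / run_plan helpers are assumptions standing in for a real manipulation policy or motion planner.

```python
import re

# Assumed output format: the model emits one skill call per "Step N." line.
GENERATED_PLAN = """\
Step 1. pick(green block)
Step 2. place(green block, red bowl)
Step 3. pick(yellow block)
Step 4. place(yellow block, red bowl)
"""

STEP_PATTERN = re.compile(r"Step \d+\.\s*(\w+)\((.*?)\)")

def execute_skill(name: str, args: list[str]) -> None:
    # Stand-in for invoking a real low-level policy or planner.
    print(f"executing {name} with args {args}")

def run_plan(plan_text: str) -> None:
    """Parse each generated step and hand it to the skill library."""
    for name, arg_str in STEP_PATTERN.findall(plan_text):
        args = [a.strip() for a in arg_str.split(",") if a.strip()]
        execute_skill(name, args)

run_plan(GENERATED_PLAN)
```

As written, the sketch only prints the parsed skill calls; the point where a real system would invoke its grasping or navigation policies is the execute_skill stub.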