Isolate the subject of any image on a transparent background. Generative edges keep hair, glass, and fine detail clean. Outputs a transparent PNG ready for compositing.
Models
Synced foley and ambience for any clip. Optional text, negative prompt, or 2 to 4s reference audio. Control duration, steps, guidance, text-driven mode, and seed.
Upscale images to 2x, 4x, 8x, or 16x using a diffusion engine that adds photorealistic detail. One creativity slider controls strict fidelity vs. texture enhancement. Up to 8K.
Extract ControlNet-ready detection maps from any image. Ten preprocessors in one model: edges, depth, pose, normals, lines, and more. Free to run, instant output.
Uthana Character Rigging (aka Uthana Create Character) lets you upload a 3D model and automatically rig it for animation.
Auto Subtitles by Scenario embeds subtitles directly into your video. It automatically transcribes and translates audio, with control over font, size, color, outlines, and borders.
Generate cinematic videos from text (T2V) or a first frame (I2V). Native audio, 720P/1080P output, 3 to 15 second clips, and strong motion control.
Edit existing videos with text instructions and up to 5 optional image references. Preserve or transform the source clip, up to 15 seconds, with 720P/1080P output.
Create videos from a prompt plus 1 to 9 reference images. Keep characters, props, and products consistent with native audio and 720P/1080P output.
Turn a single photo into a detailed, watertight 3D mesh. Great for game assets, 3D printing, and rapid prototyping. Handles fine details and complex geometry.
Convert 1 to 4 portrait photos into a 3D face model with PBR textures. Up to 2M faces at 1536 cube resolution for skin, hair, and facial detail.
Uthana Text-to-Motion turns simple text prompts into fully animated 3D characters. Just upload your character, describe the motion, and get ready-to-use animated rigs.
Turn text prompts into native 4K cinematic video, up to 15s long, with physics-aware motion that keeps characters grounded and scenes looking genuinely real.
Animate any image into native 4K video at up to 60fps. Cinematic physics, multi-shot scenes, and built-in lip-sync — without the usual AI jitter.
From a few photos to one 3D model on Scenario. Pick full color or a quick untextured version, and add a standing pose when you want to animate later.
Tracks and segments moving objects across video frames into isolated mask tracks. Requires a text prompt. Outputs one video mask per object, up to 16 simultaneous tracks.
Segments any image into isolated object masks. Accepts a text prompt or up to 10 bounding boxes to guide detection. Outputs one PNG mask per object, up to 90+ in complex scenes.
ERNIE Image Turbo is the fast, lower-cost variant of ERNIE Image - same text-in-image rendering strengths at roughly 6x the speed.
ERNIE Image is Baidu's text-to-image model built for accurate text rendering inside images. Great for posters, signage, infographics, and UI mockups.
Automatically enhance and upscale images while preserving facial identity. No prompt required. Drop in an image and receive a sharper, higher-quality version.
Edit photos with text instructions. Upload up to 10 reference images, describe the change, and Phota transforms the scene while keeping the subject recognizable.
Generate photorealistic images from text. Built for human subjects: portraits, lifestyle, fashion, and group shots. Outputs at 1K or 4K across six aspect ratios.
Generate video driven by an audio clip. Voice cadence controls pacing, and musical energy shapes motion. Up to 20 seconds, 1080p, precise audio-visual sync.
Transform any song into a new genre or style. Preserves the original melody while reimagining vocals, instruments, and arrangement. Outputs at up to 44,100 Hz and 256 kbps.
Generate studio-grade music from a text prompt. Control duration up to 3 minutes, toggle vocals off, and export in up to Opus 48kHz 192kbps quality.
Translate video or audio into 30 languages using dubbing (not true lip-sync), preserving each speaker's voice via cloning. Supports up to 10 speakers and removes background noise.
Animate any image into a cinematic video clip. Describe the motion and PixVerse V6 does the rest.
Generate cinematic videos from text in five artistic styles, up to 15 seconds at 1080p.
Ideogram V3 Layerize Text on Scenario: split text from flat graphics into layers plus a clean base, optional prompt, font names or font URLs per tier, seed.
Generate images with a native alpha channel - no background removal needed. Four speed tiers from "Flash" to "Quality". Supports logos, icons, stickers, overlays, or UI assets.
Google Veo 3.1 Lite: Google’s AI for realistic physics-based video with integrated audio and music generation.
Magnific Video Upscaler Precision: A high-fidelity upscaling model focused on accuracy, strength blending, and detail preservation.
Magnific Video Upscaler Creative: An AI tool for creative video enhancement with 4K support, "flavor" controls, and a creativity slider for adding detail.
Turn one to eight photos of the same object into a textured 3D model on Scenario with ReconViaGen 0.5. Tune mesh detail, texture sharpness, and how multiple views combine.
Professional-grade AI lipsync. Drop in a video and any audio, and Sync-3 aligns mouth movements to the sound. Built for dubbing, voice-over, ADR, and high-fidelity video editing.
JoyAI Image Edit on Scenario: natural-language edits to one uploaded image, optional negative prompt, guidance and inference steps.
Pruna's P-Image Upscale: Fast, precise upscaling (1-8x, 1-8MP) with optional detail enhancement.
Edit video with natural language — swap backgrounds, shift lighting, apply style transfers, and restyle with a reference image while preserving original motion. 2–10s at 720p.
Alibaba Wan 2.7 text-to-video — 720p/1080p, 2–15s, optional synced audio, prompt expansion.
Wan 2.7 Image Pro: Alibaba’s advanced AI for 4K text-to-image generation and multi-image editing.
Alibaba Wan 2.7 Image on Scenario: text or up to nine reference images for edits, 1K or 2K sizes, image sets, thinking mode for text-only runs, up to four outputs per job.
Animate images into cinematic video with first/last frame, or clip-continuation modes. Direct multi-shot sequences with temporal brackets. 2–15s at 720p or 1080p.
Generate full songs from lyrics and style prompts. Supports 14 structure tags for arrangement control, vocal and instrumental modes, and CD-quality output at 44.1kHz / 256kbps.
Fast text-to-speech with the same 17 voices, emotions, and 40+ languages as HD, optimized for speed and cost. For real-time apps, assistants, or interactive content.
Premium text-to-speech with 17 voices, 10 emotions, 40+ languages, and natural interjections. Fine-tune speed, pitch, and volume for broadcast-ready narration and voice-overs.
Transfer outfits, swap faces, and blend textures from any reference photo. No fine-tuning needed. Up to 3 images and Thinking mode for higher quality.
Google Lyria 3 Clip is the short-form variant of the Lyria 3 family, purpose-built for generating tight, expressive 30-second music clips in MP3 format.
Grok Edit Video is a video-to-video editing model powered by xAI's Grok Imagine technology, integrated directly into Scenario.
Grok Extend Video on Scenario: upload a clip, describe how it continues, generate two to ten seconds of new AI footage. Prompt up to ten thousand characters; video affects cost.
Produces high-quality 3D renders of character outfit sets and equipment displayed on invisible figures with detailed textures and studio lighting.