LTX 2.3 Audio: How Native Sound, Lipsync & Moaning Generation Works

virtuavixen

The single biggest reason to use LTX 2.3 over WAN 2.2 or any other open-source video model is native audio. Instead of generating silent video and adding sound in post, LTX 2.3 outputs a synced audio track in the same forward pass as the video — moaning that matches the rhythm, breathing that tracks the camera, ambient sound that fits the scene, and dialogue that lipsyncs to whatever you put in the [SPEECH]: tag of your prompt.

This page explains how the audio side of LTX 2.3 works: why it sometimes fails, what to write in the prompt to drive it, and how to fix common issues like silent output or off-sync lips. Try the live workflows on VirtuaVixen Studio, or run them locally with our ComfyUI Workflow Pack. Support is available on our Discord.

How LTX 2.3 Generates Audio

LTX 2.3 was trained on paired audio+video data — every clip in the training set had its audio track. The model learned to predict video latents and audio latents jointly. At inference time, the sampler runs a unified diffusion process over a combined “AV latent” that encodes both modalities.

The technical pipeline:

  1. LTXVConcatAVLatent combines a video latent and an audio latent into a single tensor.
  2. SamplerCustomAdvanced runs the unified diffusion process over the combined tensor.
  3. LTXVSeparateAVLatent splits the result back into video and audio latents.
  4. VAEDecodeTiled decodes the video latent to RGB frames.
  5. LTXVAudioVAEDecode decodes the audio latent to a waveform using a separate audio VAE.
  6. CreateVideo combines the frames and the waveform into an mp4 with audio track.
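
The data flow in steps 1 to 3 can be pictured with a toy sketch. This is not ComfyUI or LTX code: the function names, the flat list-of-numbers "latent" layout, and the placeholder denoiser are all assumptions made for illustration. The real nodes operate on tensors whose layout is not described on this page. The point is only that the sampler sees one combined sequence, so every update touches both modalities at once.

```python
# Toy illustration of the concat -> joint sampling -> separate flow.
# All names and the list-based "latent" layout are hypothetical.

def concat_av(video_latent, audio_latent):
    """Join both latents into one sequence and remember where to split."""
    return video_latent + audio_latent, len(video_latent)

def joint_denoise(av_latent, steps=4):
    """Placeholder for the sampler: each step uses the whole sequence,
    so video and audio values are updated together."""
    for _ in range(steps):
        mean = sum(av_latent) / len(av_latent)
        # Pull each value halfway toward the shared mean (a stand-in for
        # a denoising update that sees both modalities at once).
        av_latent = [0.5 * (x + mean) for x in av_latent]
    return av_latent

def separate_av(av_latent, split_index):
    """Split the combined sequence back into video and audio parts."""
    return av_latent[:split_index], av_latent[split_index:]

video = [1.0, 2.0, 3.0, 4.0]   # stand-in for video latent values
audio = [10.0, 20.0]           # stand-in for audio latent values

combined, split_at = concat_av(video, audio)
denoised = joint_denoise(combined)
video_out, audio_out = separate_av(denoised, split_at)
print(len(video_out), len(audio_out))  # prints: 4 2
```

In this toy version the audio values shift because of the video values and vice versa, which is the intuition behind the tighter sync of joint generation compared with adding sound after the fact.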

This is fundamentally different from MMAudio (the post-processing approach used with WAN 2.2). MMAudio analyses the finished video and infers what sound it should have. LTX 2.3 generates the sound during the same diffusion that generated the video — it has access to the underlying video plan, so the sync is much tighter.

What LTX 2.3 Audio Sounds Like

  • Moaning — pitched and paced to match the on-screen action. Tracks intensity changes well.
  • Breathing — synced to chest movement and effort. Sounds natural rather than canned.
  • Skin sounds — slap, impact, friction. Driven by the visual motion.
  • Ambient — environmental layer (water, wind, room tone) appropriate to the scene.
  • Speech / dialogue — when a [SPEECH]: tag is in the prompt, the talking-head LoRA lipsyncs the character's mouth to the words.

Driving the Audio with Prompts

Three prompt mechanisms control the audio:

[SPEECH]: tag

Whatever you put after [SPEECH]: becomes spoken dialogue with lipsync. Keep lines short — the LTX-2.3-22b-AV-LoRA-talking-head LoRA performs whole sentences cleanly, but long monologues drift off-sync. See the prompting guide for sentence patterns that work.

[SOUNDS]: tag

The [SOUNDS]: tag drives ambient and foley audio. Phrases like moaning, sensual, heavy breathing, skin slapping against skin emphasise specific sound layers. The nsfwsks tag (used in many of our workflows) activates explicit sound generation.

Negative prompt

Adding music and silent or muted audio to the negative prompt prevents two common failures: unwanted background music being injected, and the audio decoder collapsing to silence.
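
As a rough sketch of how the three mechanisms fit together, the helper below assembles a positive and a negative prompt string. The tag spellings come from this page; the function name, the ordering of the parts, the separators, and the exact negative terms are assumptions, so adjust them to whatever your workflow expects.

```python
def build_prompts(scene, speech=None, sounds=None, extra_negative=()):
    """Assemble (positive, negative) prompt strings.

    Hypothetical helper: tag order and separators are assumptions.
    """
    parts = [scene.strip()]
    if speech:
        # One short line of dialogue for the lipsync path.
        parts.append(f"[SPEECH]: {speech.strip()}")
    if sounds:
        # Ambient and foley layers, comma separated.
        parts.append(f"[SOUNDS]: {', '.join(sounds)}")
    # Specific suppressions only; avoid broad terms that mute everything.
    negative = ["music", "silent or muted audio", *extra_negative]
    return " ".join(parts), ", ".join(negative)

positive, negative = build_prompts(
    "A woman stands on a windy beach at dusk.",
    speech="I missed this place.",
    sounds=["waves", "wind", "heavy breathing"],
)
print(positive)
print(negative)
```

Keeping prompt assembly in one place makes it easier to test a change to a single layer, for example dropping the speech line while leaving the sound list untouched.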

Why Your Audio Is Silent

If your output video has no audio, walk through these in order:

  1. Audio VAE not loaded — the audio side requires LTX23_audio_vae_bf16.safetensors. Without it, audio decode silently produces zeros. Verify the VAELoaderKJ node is configured.
  2. ImpactSwitch select=2 with broken input — some workflows have a switch between “normal audio” and “user-uploaded audio”. If select=2 and the LoadAudio file doesn't exist, you get silence.
  3. Audio latent not concatenated — LTXVConcatAVLatent must combine both video_latent and audio_latent before the sampler runs. If audio_latent is empty, the sampler produces a video-only result.
  4. Negative prompt too aggressive — overly broad negatives like sound or audio can suppress the audio track entirely. Stick to specific suppressions, such as music.
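
Before walking the checklist, it can help to confirm that the track really is all zeros rather than just very quiet. The sketch below uses only the Python standard library and assumes you have a 16-bit PCM WAV file to inspect, for example the output of an audio-only run or an audio track extracted from the mp4 with a tool such as ffmpeg. The function names and the threshold are hypothetical, and the demo writes its own one-second silent file so the block runs on its own.

```python
import array
import os
import sys
import tempfile
import wave

def wav_peak(path):
    """Return the peak absolute sample value of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return max((abs(s) for s in samples), default=0)

def is_silent(path, threshold=0):
    """True when no sample exceeds the threshold (0 means exact zeros)."""
    return wav_peak(path) <= threshold

# Demo: write one second of zeros and check it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silent.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 16000)
    print(is_silent(demo_path))  # prints: True
```

A peak of exactly zero points toward the decode or switch problems in items 1 to 3, while a small but non-zero peak suggests the audio was generated and then suppressed, which fits item 4 better.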

Why Lipsync Is Off

  • Talking-head LoRA strength too low — we run it at 0.88. Below 0.7 the lips stop tracking words.
  • Dialogue too long — break monologues into shorter sentences. The model performs ~5–8-word lines well.
  • Multiple speakers in prompt — only one set of lips can sync. Don't write dialogue from a second character.
  • Wrong language — primary trained language is English. Add english native as a style modifier to lock the accent.
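
Two of the checks above are easy to automate before a run. The sketch below is a hypothetical pre-flight helper, not part of any workflow: it splits dialogue into sentences, chunks anything longer than eight words (the upper end of the range mentioned above), and compares a LoRA strength against the 0.7 floor. The word-count chunking is deliberately crude and can break mid-phrase, so treat its output as a prompt to rewrite the line rather than as finished dialogue.

```python
import re

MIN_LORA_STRENGTH = 0.7   # below this, the text above says lips stop tracking
MAX_WORDS_PER_LINE = 8    # upper end of the 5 to 8 word range above

def split_dialogue(text, max_words=MAX_WORDS_PER_LINE):
    """Split text into sentences, then chunk long sentences by word count."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lines = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(0, len(words), max_words):
            lines.append(" ".join(words[i:i + max_words]))
    return lines

def lora_strength_ok(strength):
    """True when the talking-head LoRA strength is at or above the floor."""
    return strength >= MIN_LORA_STRENGTH

for line in split_dialogue(
    "I missed this place. It has been far too long since we last came here together."
):
    print(line)
print(lora_strength_ok(0.88))
```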

Audio Without Video (Audio-Only Workflow)

LTX 2.3 also supports audio-only generation, for example to produce moaning audio loops or dialogue tracks. Our ltx23_audio_only workflow does this: same model, same encoder, but only the audio path is decoded. The output is a wav file, which you can add to footage generated by other models.

User-Uploaded Audio (Lipsync Workflow)

Our ltx23_lipsync and ltx23_lipsync_action workflows take a user-uploaded audio file (mp3, wav, flac) and animate the character's lips to it. The audio is encoded to a latent via LTXVAudioVAEEncode, then mixed into the sampling process. This is how you get a character to “say” specific pre-recorded text — voice clone a podcast host, then have your AI character lipsync to it.

Try It

The full LTX 2.3 audio stack is live in our Studio's Cinema tab. Every workflow we ship runs the abliterated text encoder and the talking-head LoRA, so audio works out of the box. You get 160 free tokens daily.

For local control over the audio side (your own audio VAE, your own speech LoRAs, your own voice samples), the ComfyUI Workflow Pack ships the full audio stack pre-configured. It includes the audio VAE, the talking-head LoRA, the lipsync workflow with audio upload support, and the audio-only generator.
