Speech Generation Flow

AI Text-to-Speech Pipeline Overview

Lexora is an AI text-to-speech platform built for modern publishing workflows. It supports two generation modes: Classic Mode and Auto Mode.

Both modes use the same neural speech synthesis engine, but they differ in how and when audio is generated. This allows teams to choose between controlled pre-generation and on-demand AI voice generation.

Classic Mode vs Auto Mode

Classic Mode

In Classic Mode, generation is started manually from the configurator. Lexora processes the text, renders AI speech, and stores a final reusable audio asset.

Auto Mode

In Auto Mode, text-to-speech AI generation happens on demand when users visit your page. Lexora reads content from a target container (or direct script input), generates chunks progressively, and starts playback as soon as the first chunk is available.

Classic Mode: Step-by-Step AI Speech Generation

  1. Text validation: checks content quality, length, and processability.
  2. Credit estimation: shows expected cost before generation starts.
  3. Neural rendering: converts text into natural AI voice output.
  4. Asset creation: stores the final MP3 as a reusable speech asset.
  5. Audio ID assignment: enables embedding and programmatic reuse.

This flow is ideal when you want predictable production and a stable final audio file ready for distribution.
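The five steps above can be sketched as a simple pipeline. This is an illustrative sketch only: the function and type names (validateText, estimateCredits, SpeechAsset) and the cost model are assumptions, not Lexora's actual API.

```typescript
type SpeechAsset = { audioId: string; voiceId: string; text: string };

// 1. Text validation: reject empty or oversized input.
function validateText(text: string, maxChars = 20_000): string {
  const trimmed = text.trim();
  if (trimmed.length === 0) throw new Error("Text is empty");
  if (trimmed.length > maxChars) throw new Error("Text exceeds length limit");
  return trimmed;
}

// 2. Credit estimation, shown before generation starts.
// Assumed for illustration: 1 credit per 100 characters, rounded up.
function estimateCredits(text: string): number {
  return Math.ceil(text.length / 100);
}

// 3-5. Neural rendering, asset creation, and audio ID assignment,
// stubbed here as one step that returns a reusable asset.
function generateClassic(text: string, voiceId: string): SpeechAsset {
  const clean = validateText(text);
  const cost = estimateCredits(clean);
  console.log(`Estimated cost: ${cost} credit(s)`);
  // A real implementation would call the synthesis engine and store an MP3.
  const audioId = `audio_${Math.random().toString(36).slice(2, 10)}`;
  return { audioId, voiceId, text: clean };
}
```

Because every step runs before publication, the returned asset (and its audio ID) is stable and can be embedded or reused without further generation.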

Auto Mode: Real-Time Text-to-Speech AI Delivery

  1. Project and domain validation to authorize generation requests.
  2. Input acquisition via content selector or direct script text.
  3. Chunk-based rendering for faster initial playback.
  4. Progressive playback while additional chunks are generated.
  5. Reuse logic when matching generation already exists.

Auto Mode is designed for dynamic websites, blogs, docs, and frequently updated content where pre-generating every audio file is not practical.

Credit Logic in AI Text-to-Speech

Credit consumption is based on generation, not on playback. This rule applies across both modes.

In Auto Mode, generation depends on the pair (text + voice_id):

  • If text + voice_id is unchanged, Lexora reuses existing audio and does not consume credits again.
  • If text changes, a new AI speech generation is required and credits are consumed.
  • If the voice_id changes, generation is treated as new and credits are consumed again.

Since Auto Mode can be triggered by real traffic, high-volume pages may consume credits faster than Classic Mode.
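The reuse rule above amounts to caching generated audio under the (text, voice_id) pair. A minimal sketch, assuming a hypothetical in-memory cache and the illustrative 1-credit-per-100-characters cost model:

```typescript
// Cache of generated audio, keyed by the (text, voice_id) pair.
const audioCache = new Map<string, string>(); // key -> audioId
let creditsUsed = 0;

function getOrGenerate(text: string, voiceId: string): string {
  const key = `${voiceId}::${text}`;
  const cached = audioCache.get(key);
  if (cached) return cached; // unchanged pair: reuse, no credits consumed

  // Text or voice changed: a new generation is required and costs credits.
  creditsUsed += Math.ceil(text.length / 100);
  const audioId = `audio_${audioCache.size + 1}`;
  audioCache.set(key, audioId);
  return audioId;
}
```

Calling getOrGenerate twice with the same text and voice consumes credits once and returns the same audio ID; changing either value produces a new generation and a new charge. This is why playback volume alone never drives cost, but content churn does.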

What Improves AI Voice Quality

High-quality AI speech synthesis starts with clean, structured text and the correct voice selection.

  • Choose the right language for accurate pronunciation.
  • Select a voice model aligned with your content tone.
  • Use punctuation and formatting to improve rhythm and pacing.
  • Follow the Text Guidelines for best output quality.

Why This Architecture Matters for Modern TTS

  • Scalable AI text-to-speech for both static and dynamic content.
  • Clear separation between generation workflow and playback delivery.
  • Reusable outputs that reduce unnecessary regeneration.
  • Flexible integration for publishers, SaaS products, and content-heavy websites.

Lexora is not just a text-to-speech tool: it is an AI voice infrastructure layer for teams that need quality, scalability, and control.