AI Digital Avatar System

Overview

The system is built around a simple interaction loop: the user speaks, the browser transcribes locally, the backend orchestrates an LLM response, and Cloud Digital Human animates an avatar that speaks the answer back in real time. UI, language intelligence, and avatar rendering are kept as separate, independently managed concerns.

System Architecture

The AI Brain is the central orchestrator. The Digital Human Channel is the bidirectional interface between the user and that intelligence — receiving voice input and presenting rendered avatar output over a persistent stream.

AI Brain

Backend API + LLM

Persona & Prompt ConfigBusiness RulesSession ControlAPI Integration

Answer Text via API

Digital Human ChannelWebRTC / RTMP Stream

InputUser → AI Brain

1User speaks at kiosk

2Browser captures microphone

3Web Speech API → text

4Text sent to AI Brain

OutputAI Brain → User

1AI Brain returns answer text

2Cloud TTS synthesizes voice

3Avatar animates in sync

4Stream delivered to display

Kiosk / Web—Captures user voice · STT converts to text · Sends transcribed text to backend server

Cloud Digital Human—Receives answer text · Synthesizes speech · Renders animation · Streams back via RTMP / WebRTC

Deployment Use Cases

The same AI Brain + Digital Human Channel can be deployed across offline kiosk and online contexts by swapping the avatar persona and knowledge configuration.

Shopping Mall Info Kiosk

Offline

Greets visitors and answers questions about stores, events, promotions, and directions — 24/7 without a help desk.

Product Promoter

Offline

Placed near product displays to explain features, answer comparisons, and upsell — acting as a knowledgeable brand ambassador.

Online Sales Assistant

Online

Embedded on a website or app to guide prospects through offerings, answer pricing questions, and qualify leads in real time.

Product & Customer Support

Online

Handles first-line support, troubleshooting FAQs, and order status — reducing ticket volume while keeping a human-like tone.

Core Components

Kiosk App & Web Frontend

Customer-facing browser UI

—Render the digital avatar stream
—Provide push-to-talk button interaction
—Manage UI states: listening, processing, speaking
—Subscribe to WebRTC / RTMP playback stream

Browser Speech-to-Text

In-browser voice capture

—Capture microphone input on user action
—Perform in-browser transcription locally
—Return plain text for backend processing
—Reduce backend complexity for voice input

Backend Orchestrator

Control center & integration hub

—Receive transcribed user text from frontend
—Attach predefined system prompt and persona
—Send final prompt package to OpenAI
—Call cloud APIs: start, stop, speak

OpenAI LLM Layer

Natural language intelligence

—Interpret and understand user questions
—Follow avatar persona and guardrails
—Generate concise, spoken-friendly responses
—Return text-only output to backend

Cloud Digital Human

Avatar rendering & speech synthesis

—Accept start / stop session commands
—Synthesize speech from answer text
—Animate avatar facial and body performance
—Publish stream via RTMP or WebRTC

Streaming Delivery Layer

Real-time media transport

—WebRTC for low-latency interactive playback
—RTMP for traditional streaming environments
—Deliver avatar video and voice to kiosk display
—Frontend subscribes and renders stream inline

End-to-End Request Flow

Session Init

—Kiosk page loads and initializes avatar player
—Backend calls Cloud API to start avatar session
—Cloud prepares the stream endpoint
—Frontend connects over WebRTC or RTMP

User Question

—User taps talk button and speaks
—Browser STT converts speech to transcribed text
—Backend combines avatar prompt + user question
—OpenAI returns answer text
—Backend sends answer to Cloud Digital Human
—Avatar speaks; stream updates on kiosk UI

Session End

—Interaction ends or kiosk times out
—Backend calls Cloud API to stop session
—Stream closes and frontend state resets

API Boundary Design (Sample)

Frontend → Backend

POST/api/avatar/chat
Send transcribed user text
POST/api/avatar/session/start
Init Cloud avatar session
POST/api/avatar/session/stop
Terminate avatar session

Backend → OpenAI

System:

You are the company's digital service assistant. Speak clearly, politely, and briefly.

User:

What services do you offer?

System prompt enforced on the server — never exposed to the browser.

Design Decisions

STT in the Browser

Reduces backend audio handling complexity. Sending text instead of raw audio keeps payloads small and improves perceived responsiveness.

Prompt Engineering on the Server

Prevents prompt leakage to the client. Keeps avatar behavior consistent across kiosks and makes policy updates easier to manage centrally.

Cloud Handles Rendering

Offloads lip sync, speech synthesis, and animation from the backend. Allows switching between RTMP and WebRTC without restructuring the pipeline.

Non-Functional Requirements

Performance

—Fast in-browser STT turnaround
—LLM responses optimized for short spoken answers
—Sub-second to near-real-time perceived end-to-end latency

Reliability

—Graceful fallback on STT or LLM failure
—Reconnecting state on stream disconnect
—Safe default message when LLM is unavailable
—Retry cloud session creation on failure

Security

—API keys kept on backend only
—System prompts never exposed to browser
—All frontend requests validated before upstream calls
—Rate-limiting to prevent kiosk abuse