AI Digital Avatar System
A voice-driven customer interaction platform for kiosk environments — combining browser STT, LLM orchestration, and Cloud Digital Human to deliver 24/7 human-like conversations without on-site staffing.
Overview
The system is built around a simple interaction loop: the user speaks, the browser transcribes locally, the backend orchestrates an LLM response, and Cloud Digital Human animates an avatar that speaks the answer back in real time. UI, language intelligence, and avatar rendering are kept as separate, independently managed concerns.
System Architecture
The AI Brain is the central orchestrator. The Digital Human Channel is the bidirectional interface between the user and that intelligence — receiving voice input and presenting rendered avatar output over a persistent stream.
Backend API + LLM
Deployment Use Cases
The same AI Brain + Digital Human Channel can be deployed across offline kiosk and online contexts by swapping the avatar persona and knowledge configuration.
Shopping Mall Info Kiosk
Offline: Greets visitors and answers questions about stores, events, promotions, and directions, 24/7 and without a help desk.
Product Promoter
Offline: Placed near product displays to explain features, answer comparisons, and upsell, acting as a knowledgeable brand ambassador.
Online Sales Assistant
Online: Embedded on a website or app to guide prospects through offerings, answer pricing questions, and qualify leads in real time.
Product & Customer Support
Online: Handles first-line support, troubleshooting FAQs, and order status, reducing ticket volume while keeping a human-like tone.
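The persona-and-knowledge swap behind these use cases could be captured as a per-deployment configuration. The sketch below is illustrative only: the type names, fields, deployment IDs, and prompt wording are all assumptions, not part of the system described here.

```typescript
// Hypothetical per-deployment configuration. Field names and values are
// illustrative assumptions, not taken from the real system.
type Channel = "offline-kiosk" | "online";

interface DeploymentConfig {
  channel: Channel;
  personaPrompt: string;   // system prompt describing the avatar's role
  knowledgeSource: string; // identifier of the knowledge base to load
}

const deployments: Record<string, DeploymentConfig> = {
  "mall-info-kiosk": {
    channel: "offline-kiosk",
    personaPrompt: "You are a friendly mall concierge. Keep answers short.",
    knowledgeSource: "mall-directory",
  },
  "online-sales-assistant": {
    channel: "online",
    personaPrompt: "You are a sales assistant. Help prospects compare offerings.",
    knowledgeSource: "product-catalog",
  },
};

// Look up a deployment by id. The orchestration code stays the same for
// every entry; only the configuration differs.
function resolveDeployment(id: string): DeploymentConfig {
  const config = deployments[id];
  if (config === undefined) {
    throw new Error(`Unknown deployment: ${id}`);
  }
  return config;
}
```

Under this assumption, adding a new use case means adding one entry to the table rather than changing the pipeline.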
Core Components
Kiosk App & Web Frontend
Customer-facing browser UI
- Render the digital avatar stream
- Provide push-to-talk button interaction
- Manage UI states: listening, processing, speaking
- Subscribe to WebRTC / RTMP playback stream
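The listening, processing, and speaking states listed above could be managed with a small transition table. This is a minimal sketch: the `idle` state and all event names are assumptions added for illustration.

```typescript
// UI states from the component list, plus an assumed "idle" resting state.
type UiState = "idle" | "listening" | "processing" | "speaking";

// Event names are hypothetical; they mark the points where the UI changes.
type UiEvent =
  | "talkPressed"
  | "transcriptReady"
  | "answerStarted"
  | "answerFinished"
  | "failed";

// Allowed transitions. Any event not listed for a state is ignored.
const transitions: Record<UiState, Partial<Record<UiEvent, UiState>>> = {
  idle: { talkPressed: "listening" },
  listening: { transcriptReady: "processing", failed: "idle" },
  processing: { answerStarted: "speaking", failed: "idle" },
  speaking: { answerFinished: "idle", failed: "idle" },
};

// Pure transition function: returns the next state, or the current state
// when the event is not valid in that state.
function nextState(state: UiState, event: UiEvent): UiState {
  return transitions[state][event] ?? state;
}
```

Ignoring invalid events means, for example, that a second tap on the talk button while the avatar is speaking leaves the UI unchanged.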
Browser Speech-to-Text
In-browser voice capture
- Capture microphone input on user action
- Perform in-browser transcription locally
- Return plain text for backend processing
- Reduce backend complexity for voice input
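One way to keep the rest of the frontend independent of the STT engine is to wrap it behind a promise that resolves to plain text. The `RecognizerLike` interface below is an assumption made for this sketch; in a browser, an adapter (not shown) would map the chosen engine, for example the Web Speech API's `SpeechRecognition`, onto it.

```typescript
// Minimal shape of a speech recognizer. This interface is hypothetical and
// only loosely modelled on browser speech recognition objects.
interface RecognizerLike {
  onresult: ((transcript: string) => void) | null;
  onerror: ((message: string) => void) | null;
  start(): void;
}

// Start one recognition pass and resolve with the trimmed transcript,
// or reject if the recognizer reports an error.
function transcribeOnce(recognizer: RecognizerLike): Promise<string> {
  return new Promise((resolve, reject) => {
    recognizer.onresult = (transcript) => resolve(transcript.trim());
    recognizer.onerror = (message) => reject(new Error(message));
    recognizer.start();
  });
}
```

The push-to-talk handler would call `transcribeOnce` on user action and send the resulting text to the backend.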
Backend Orchestrator
Control center & integration hub
- Receive transcribed user text from frontend
- Attach predefined system prompt and persona
- Send final prompt package to OpenAI
- Call cloud APIs: start, stop, speak
OpenAI LLM Layer
Natural language intelligence
- Interpret and understand user questions
- Follow avatar persona and guardrails
- Generate concise, spoken-friendly responses
- Return text-only output to backend
Cloud Digital Human
Avatar rendering & speech synthesis
- Accept start / stop session commands
- Synthesize speech from answer text
- Animate avatar facial and body performance
- Publish stream via RTMP or WebRTC
Streaming Delivery Layer
Real-time media transport
- WebRTC for low-latency interactive playback
- RTMP for traditional streaming environments
- Deliver avatar video and voice to kiosk display
- Frontend subscribes and renders stream inline
End-to-End Request Flow
A. Session Init
- Kiosk page loads and initializes avatar player
- Backend calls Cloud API to start avatar session
- Cloud prepares the stream endpoint
- Frontend connects over WebRTC or RTMP
B. User Question
- User taps talk button and speaks
- Browser STT converts speech to transcribed text
- Backend combines avatar prompt + user question
- OpenAI returns answer text
- Backend sends answer to Cloud Digital Human
- Avatar speaks; stream updates on kiosk UI
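The backend side of this step can be sketched as a single function with its two external calls injected. This is an assumption-laden outline, not the actual implementation: `QuestionDeps`, `askLlm`, `speak`, and `handleQuestion` are hypothetical names standing in for the OpenAI call and the Cloud Digital Human speak call.

```typescript
// Hypothetical dependencies: one call to the LLM, one to the avatar cloud.
interface QuestionDeps {
  askLlm(systemPrompt: string, userText: string): Promise<string>;
  speak(sessionId: string, text: string): Promise<void>;
}

// Handle one user question: validate the transcript, get an answer from the
// LLM using the server-side avatar prompt, then ask the avatar to speak it.
async function handleQuestion(
  deps: QuestionDeps,
  sessionId: string,
  avatarPrompt: string,
  userText: string,
): Promise<string> {
  const question = userText.trim();
  if (question.length === 0) {
    throw new Error("Empty transcript");
  }
  const answer = await deps.askLlm(avatarPrompt, question);
  await deps.speak(sessionId, answer);
  return answer;
}
```

Injecting the two calls keeps the flow testable without network access and leaves the choice of HTTP client or SDK open.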
C. Session End
- Interaction ends or kiosk times out
- Backend calls Cloud API to stop session
- Stream closes and frontend state resets
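Session start, idle timeout, and stop could be tracked by a small lifecycle object. The sketch below is a simplification under stated assumptions: the `AvatarCloud` interface and its method names are hypothetical, the cloud calls are shown as synchronous for brevity (real calls would be asynchronous), and the current time is passed in rather than read from a timer.

```typescript
// Hypothetical client for the avatar cloud's start and stop commands.
interface AvatarCloud {
  startSession(): string; // returns a session id
  stopSession(sessionId: string): void;
}

class AvatarSession {
  private cloud: AvatarCloud;
  private idleTimeoutMs: number;
  private sessionId: string | null = null;
  private lastActivityMs = 0;

  constructor(cloud: AvatarCloud, idleTimeoutMs: number) {
    this.cloud = cloud;
    this.idleTimeoutMs = idleTimeoutMs;
  }

  get active(): boolean {
    return this.sessionId !== null;
  }

  // Start a session if none is running and record the activity time.
  start(nowMs: number): void {
    if (this.sessionId === null) {
      this.sessionId = this.cloud.startSession();
    }
    this.lastActivityMs = nowMs;
  }

  // Record user activity so the idle timeout is pushed back.
  touch(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  // Stop the running session, if any. Safe to call more than once.
  stop(): void {
    if (this.sessionId !== null) {
      this.cloud.stopSession(this.sessionId);
      this.sessionId = null;
    }
  }

  // Stop the session when it has been idle for at least the timeout.
  // Returns true only when this call actually stopped it.
  expireIfIdle(nowMs: number): boolean {
    if (this.sessionId !== null && nowMs - this.lastActivityMs >= this.idleTimeoutMs) {
      this.stop();
      return true;
    }
    return false;
  }
}
```

A periodic check would call `expireIfIdle` with the current time, while each user question would call `touch`.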
API Boundary Design (Sample)
Frontend → Backend
- POST /api/avatar/chat: Send transcribed user text
- POST /api/avatar/session/start: Init Cloud avatar session
- POST /api/avatar/session/stop: Terminate avatar session
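As an illustration of this boundary, the body of `POST /api/avatar/chat` could be validated before any upstream call is made. The request shape (`sessionId`, `text`) and the length limit are assumptions for this sketch; the source does not specify the payload.

```typescript
// Hypothetical request body for POST /api/avatar/chat.
interface ChatRequest {
  sessionId: string;
  text: string;
}

// Assumed upper bound for one spoken question; tune to the deployment.
const MAX_TEXT_LENGTH = 500;

// Validate an untrusted JSON body and return a typed, trimmed request.
// Throws an Error describing the first problem found.
function parseChatRequest(body: unknown): ChatRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("Body must be a JSON object");
  }
  const { sessionId, text } = body as Record<string, unknown>;
  if (typeof sessionId !== "string" || sessionId.length === 0) {
    throw new Error("sessionId must be a non-empty string");
  }
  if (typeof text !== "string") {
    throw new Error("text must be a string");
  }
  const trimmed = text.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_TEXT_LENGTH) {
    throw new Error(`text must be 1 to ${MAX_TEXT_LENGTH} characters`);
  }
  return { sessionId, text: trimmed };
}
```

Only the fields read here reach the rest of the backend, so extra properties sent by a client are dropped.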
Backend → OpenAI
System prompt enforced on the server — never exposed to the browser.
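A minimal sketch of that rule: the backend assembles the message list itself, taking only the user's text from the client. The role/content message shape follows the general form used by chat-style LLM APIs; `ChatMessage`, `buildMessages`, and the prompt text are hypothetical.

```typescript
// Message shape in the style of chat-based LLM APIs.
type ChatMessage = { role: "system" | "user"; content: string };

// Build the prompt package on the server. The system prompt comes from
// server-side configuration; the client contributes only the user text.
function buildMessages(systemPrompt: string, userText: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userText },
  ];
}
```

Because the client never supplies roles or a message list, it has no way to read or replace the system prompt through this path.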
Design Decisions
STT in the Browser
Reduces backend audio handling complexity. Sending text instead of raw audio keeps payloads small and improves perceived responsiveness.
Prompt Engineering on the Server
Prevents prompt leakage to the client. Keeps avatar behavior consistent across kiosks and makes policy updates easier to manage centrally.
Cloud Handles Rendering
Offloads lip sync, speech synthesis, and animation from the backend. Allows switching between RTMP and WebRTC without restructuring the pipeline.
Non-Functional Requirements
Performance
- Fast in-browser STT turnaround
- LLM responses optimized for short spoken answers
- Sub-second to near-real-time perceived end-to-end latency
Reliability
- Graceful fallback on STT or LLM failure
- Reconnecting state on stream disconnect
- Safe default message when LLM is unavailable
- Retry cloud session creation on failure
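Two of these requirements, retrying session creation and falling back to a safe default message, can be sketched as small helpers. The function names, the exponential backoff policy, the default delay, and the fallback wording are all assumptions for illustration.

```typescript
// Retry an async operation up to `attempts` times with exponential backoff.
// `sleep` is injected so callers (and tests) control how waiting happens.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts: number,
  sleep: (ms: number) => Promise<void>,
  baseDelayMs = 200,
): Promise<T> {
  if (attempts < 1) {
    throw new RangeError("attempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next try, but not after the final failure.
      if (attempt < attempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Hypothetical wording for the safe default message.
const SAFE_DEFAULT_ANSWER =
  "Sorry, I can't answer right now. Please try again in a moment.";

// Return the LLM answer, or the safe default if the call fails.
async function answerOrFallback(askLlm: () => Promise<string>): Promise<string> {
  try {
    return await askLlm();
  } catch {
    return SAFE_DEFAULT_ANSWER;
  }
}
```

Session creation would be wrapped in `withRetry`, while the LLM call would go through `answerOrFallback` so the avatar always has something to say.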
Security
- API keys kept on backend only
- System prompts never exposed to browser
- All frontend requests validated before upstream calls
- Rate-limiting to prevent kiosk abuse
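Rate limiting per kiosk could be as simple as a fixed-window counter. This is one possible approach, not necessarily the one used here: the class name, the keying by kiosk id, and the in-memory storage are assumptions, and a multi-instance backend would need shared storage instead.

```typescript
// Fixed-window rate limiter keyed by kiosk id. The current time is passed
// in explicitly so the logic is deterministic and easy to test.
class FixedWindowRateLimiter {
  private limit: number;
  private windowMs: number;
  private windows = new Map<string, { startMs: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    if (limit < 1 || windowMs <= 0) {
      throw new RangeError("limit must be >= 1 and windowMs must be > 0");
    }
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the kiosk has already
  // used up its quota for the current window.
  allow(kioskId: string, nowMs: number): boolean {
    const current = this.windows.get(kioskId);
    if (current === undefined || nowMs - current.startMs >= this.windowMs) {
      // First request, or the previous window has ended: open a new one.
      this.windows.set(kioskId, { startMs: nowMs, count: 1 });
      return true;
    }
    if (current.count < this.limit) {
      current.count += 1;
      return true;
    }
    return false;
  }
}
```

A request handler would call `allow` before validation and upstream calls, and reject the request when it returns false.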