Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named the Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated utterances across acoustic dimensions; (2) the five analyzed ITTS systems tend to generate adult voices even when instructed to use child or elderly voices; and (3) fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.
Listen to examples showing how different models interpret the same instruction. How well does each sample match what you would expect from the instruction?
Models are instructed to generate speech with varying emotional intensity using adverbs like "extremely," "very," and "slightly."
| Context | Transcript | Model | Extremely | Very | - | Slightly |
|---|---|---|---|---|---|---|
| Teacher-Student | Our professor outlines assignment expectations while the student records the key instructions. | gpt-4o-mini-tts | ||||
| | | Parler-TTS-large-v1 | | | | |
| | | Parler-TTS-mini-v1 | | | | |
| | | PromptTTS++ | | | | |
| | | UniAudio | | | | |
| Context | Transcript | Model | Extremely | Very | - | Slightly |
|---|---|---|---|---|---|---|
| Normal | I'm looking at the forecast, and it says there's a 50% chance of rain this afternoon. | gpt-4o-mini-tts | ||||
| | | Parler-TTS-large-v1 | | | | |
| | | Parler-TTS-mini-v1 | | | | |
| | | PromptTTS++ | | | | |
| | | UniAudio | | | | |
| Context | Transcript | Model | Extremely | Very | - | Slightly |
|---|---|---|---|---|---|---|
| Lover | We decide on a meeting location with coffee and a reserved table tonight. | gpt-4o-mini-tts | ||||
| | | Parler-TTS-large-v1 | | | | |
| | | Parler-TTS-mini-v1 | | | | |
| | | PromptTTS++ | | | | |
| | | UniAudio | | | | |
| Context | Transcript | Model | Extremely | Very | - | Slightly |
|---|---|---|---|---|---|---|
| Friends | We gather near the park to discuss plans for the upcoming weekend. | gpt-4o-mini-tts | ||||
| | | Parler-TTS-large-v1 | | | | |
| | | Parler-TTS-mini-v1 | | | | |
| | | PromptTTS++ | | | | |
| | | UniAudio | | | | |
Models are instructed to generate speech using adjectives that represent different levels of emotional intensity.
| Context | Transcript | Model | Outraged | Angry | Irritated | Frustrated | Upset |
|---|---|---|---|---|---|---|---|
| Customer | I provide directions to the service desk where products are available for purchase. | gpt-4o-mini-tts | |||||
| | | Parler-TTS-large-v1 | | | | | |
| | | Parler-TTS-mini-v1 | | | | | |
| | | PromptTTS++ | | | | | |
| | | UniAudio | | | | | |
| Context | Transcript | Model | Ecstatic | Overjoyed | Happy | Content | Satisfied |
|---|---|---|---|---|---|---|---|
| Family | I arrange the dining setup with chairs, plates, cups, and napkins for dinner. | gpt-4o-mini-tts | |||||
| | | Parler-TTS-large-v1 | | | | | |
| | | Parler-TTS-mini-v1 | | | | | |
| | | PromptTTS++ | | | | | |
| | | UniAudio | | | | | |
| Context | Transcript | Model | Heartbroken | Sad | Unhappy | Disappointed | Gloomy |
|---|---|---|---|---|---|---|---|
| Friends | We gather near the park to discuss plans for the upcoming weekend. | gpt-4o-mini-tts | |||||
| | | Parler-TTS-large-v1 | | | | | |
| | | Parler-TTS-mini-v1 | | | | | |
| | | PromptTTS++ | | | | | |
| | | UniAudio | | | | | |
| Context | Transcript | Model | Surprised | Stunned | Amazed | Unexpected | Intrigued |
|---|---|---|---|---|---|---|---|
| Lover | We decide on a meeting location with coffee and a reserved table tonight. | gpt-4o-mini-tts | |||||
| | | Parler-TTS-large-v1 | | | | | |
| | | Parler-TTS-mini-v1 | | | | | |
| | | PromptTTS++ | | | | | |
| | | UniAudio | | | | | |
We carefully design a framework to quantify the alignment between instruction and perception. We define 4 key control dimensions, collect a new dataset (E-VOC) for analysis, and run large-scale human evaluations.
We first measure the objective acoustic properties (loudness, pitch, speaking rate) of the generated speech. Figure 1 shows that most models have good control over pitch and speaking rate when guided by adverbs of degree, but consistently struggle to modulate loudness based on instructions like "slightly quiet" or "extremely loud".
Fig 1. Loudness (LUFS), pitch (Hz), and speaking rate (words/s) across ITTS models for Task I. Adverbs of Degree.
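
As a rough illustration of the objective measures, the sketch below computes speaking rate in words per second from a transcript and an utterance duration. The second helper is a plain RMS level in dBFS, included only as a simple stand-in: it is not the LUFS loudness reported in Figure 1, which uses a perceptual weighting that this sketch does not implement. The function names are illustrative and do not come from the authors' code.

```python
import math
from typing import Sequence


def speaking_rate(transcript: str, duration_s: float) -> float:
    """Words per second, counting whitespace-separated tokens as words."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcript.split()) / duration_s


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale, for samples in [-1, 1].

    Assumption: a crude loudness proxy for illustration, NOT LUFS.
    """
    if len(samples) == 0:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)
```

For example, a 12-word transcript spoken in 4 seconds gives a rate of 3.0 words/s.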
Human evaluations reveal the gap between instructions and perception. Figure 2 shows the perceived emotion intensity for different models across four emotions. While some models follow the general trend (e.g., "ecstatic" is perceived as more intense than "happy"), there are significant discrepancies and overlaps, showing that fine-grained emotional control remains a major challenge.
Fig 2. Averaged perceptual emotion intensity of ITTS models across 4 emotions, analyzed by Adverbs of Degree and Emotion-Intensity Adjectives.
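
One simple way to look at whether perceived intensity follows the intended level is to average ratings per level and count how often the mean rises from one level to the next. The sketch below does this; the record layout `(model, intended level, rating)` and the "fraction of rising steps" summary are assumptions made for illustration, not the analysis used to produce Figure 2.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (model name, intended level, perceived rating).
Rating = Tuple[str, int, float]


def mean_by_level(ratings: Iterable[Rating]) -> Dict[str, Dict[int, float]]:
    """Average the perceived ratings for each (model, intended level) pair."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, level, score in ratings:
        buckets[model][level].append(score)
    return {
        model: {level: mean(scores) for level, scores in sorted(levels.items())}
        for model, levels in buckets.items()
    }


def monotonic_fraction(level_means: Dict[int, float]) -> float:
    """Fraction of adjacent level pairs whose mean rating strictly increases."""
    levels = sorted(level_means)
    pairs = list(zip(levels, levels[1:]))
    if not pairs:
        raise ValueError("need at least two levels")
    rising = sum(level_means[lo] < level_means[hi] for lo, hi in pairs)
    return rising / len(pairs)
```

A value of 1.0 would mean every step up in intended level was also heard as more intense on average; overlaps between neighboring levels pull the value down.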
Task: Generate 10-15 word sentences as "Text prompts," describing life conditions in specific contexts without using inherently polar or sentimental words. The generated sentences are naturally spoken in interaction, for evaluating how well state-of-the-art text-to-speech (TTS) models synthesize emotion.
Steps:
1) Select Context: family, friends, customer, lover, or teacher–student.
2) Sentence Construction: create a 10–15 word sentence describing the context.
3) Polarity Check: exclude inherently polar or sentimental words.
4) Repetition: generate sentences across various contexts for diversity.
Output Format: List the interaction context followed by the 10–15 word sentence, neutrally described.
Examples:
- Friends: I plan to buy plates, forks, knives, and glasses arranged on the table for the meal. Would you want to come?
- Traveling: Schedule of our trip includes flight departure, hotel check-in procedure, museum visit, and city tour.
Notes: Sentences must remain descriptive and contextually relevant, with neutral language. The prompt design ensures that TTS evaluation focuses on emotional style alignment.
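
The two mechanical constraints in the prompt above (sentence length and the polarity check) can be verified automatically on generated candidates. The sketch below is one possible checker; the tiny word list is a hypothetical placeholder, not the lexicon used for E-VOC, and words are counted by whitespace, which is an assumption about how "10-15 words" is measured.

```python
import string
from typing import FrozenSet, List

# Hypothetical mini-lexicon for illustration only; a real check would use
# a proper sentiment or emotion lexicon.
POLAR_WORDS: FrozenSet[str] = frozenset(
    {"love", "hate", "happy", "sad", "angry", "terrible", "wonderful"}
)


def check_text_prompt(
    sentence: str,
    polar_words: FrozenSet[str] = POLAR_WORDS,
    min_words: int = 10,
    max_words: int = 15,
) -> List[str]:
    """Return a list of constraint violations; an empty list means the sentence passes."""
    tokens = sentence.split()
    problems: List[str] = []
    if not (min_words <= len(tokens) <= max_words):
        problems.append(f"word count {len(tokens)} is outside {min_words}-{max_words}")
    # Strip surrounding punctuation before matching against the lexicon.
    normalized = {t.strip(string.punctuation).lower() for t in tokens}
    flagged = sorted(normalized & polar_words)
    if flagged:
        problems.append("polar words: " + ", ".join(flagged))
    return problems
```

A candidate that returns an empty list satisfies both constraints under these assumptions; anything else can be regenerated or reviewed by hand.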
| Happy | Intensity | WF WIKI | Sad | Intensity | WF WIKI | Angry | Intensity | WF WIKI | Surprised | Intensity | WF WIKI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ecstatic | 0.954 | 2,979 | Heartbroken | 0.969 | 2,254 | Outraged | 0.964 | 6,784 | Surprised | 0.930 | 51,083 |
| Overjoyed | 0.909 | 1,921 | Sad | 0.864 | 6,819 | Angry | 0.824 | 34,184 | Stunned | 0.820 | 6,254 |
| Happy | 0.788 | 80,205 | Unhappy | 0.750 | 16,934 | Irritated | 0.706 | 2,860 | Amazed | 0.781 | 1,255 |
| Content | 0.688 | 182,702 | Disappointed | 0.636 | 19,109 | Frustrated | 0.636 | 17,278 | Unexpected | 0.711 | 24,728 |
| Satisfied | 0.500 | 22,700 | Gloomy | 0.578 | 2,672 | Upset | 0.439 | 39,299 | Intrigued | 0.430 | 4,679 |
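
The intensity scores above order each emotion's adjectives from weakest to strongest. Assuming the five emotion levels are assigned purely by rank of intensity score (an assumption, though it agrees with the level table further down), the mapping can be sketched as follows, using the Happy column as input.

```python
from typing import Dict

# Intensity scores copied from the Happy column of the table above.
HAPPY_INTENSITY: Dict[str, float] = {
    "Ecstatic": 0.954,
    "Overjoyed": 0.909,
    "Happy": 0.788,
    "Content": 0.688,
    "Satisfied": 0.500,
}


def intensity_levels(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each adjective to a level, 1 = lowest intensity score."""
    ordered = sorted(scores, key=scores.get)
    return {adjective: level for level, adjective in enumerate(ordered, start=1)}
```

Applied to `HAPPY_INTENSITY`, this yields Satisfied at level 1 and Ecstatic at level 5.
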
| Context | Transcription |
|---|---|
| Family | You always make breakfast on Sundays. |
| Friends | Let's explore downtown tonight without plans. |
| Customer | Your order is ready for pickup. |
| Lover | I adore every moment with you. |
| Teacher | Submit your project before class tomorrow. |
| Sibling | You might borrow my car later. |
| Colleagues | Our meeting starts at nine sharp. |
| Neighbor | Please return my gardening tools soon. |
| Task | Adjective (Adj.) / Level | ||||
|---|---|---|---|---|---|
| Pitch | Low | High | |||
| Loudness | Quiet | Loud | |||
| Speaking Rate | Slow | Fast | |||
| Age | Child | Teenage | Adult | Elderly | |
| Level | 1 | 2 | 3 | 4 | |
| Adv. Deg. | Slightly | Very | Extremely | ||
| Emotion Level | 1 | 2 | 3 | 4 | 5 |
| Happy-I.A. | Satisfied | Content | Happy | Overjoyed | Ecstatic |
| Sad-I.A. | Gloomy | Disappointed | Unhappy | Sad | Heartbroken |
| Angry-I.A. | Upset | Frustrated | Irritated | Angry | Outraged |
| Surprised-I.A. | Intrigued | Unexpected | Amazed | Stunned | Surprised |
| Task | Template 1 | Example | Template 2 |
|---|---|---|---|
| Pitch | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Very High tone. | Voice: "Adv. Deg." "Adj." |
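| Loudness | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Slightly Quiet tone. | Tone: "Adv. Deg." "Adj." |
| Speaking Rate | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Very Fast tone. | Pacing: "Adv. Deg." "Adj." |
| Emotion | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Happy tone. | Emotion: "Adv. Deg." "Adj." |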
| Emotion-I.A. | Speak in a/an "Adj." tone. | Speak in an Ecstatic tone. | Emotion: "Adj." |
| Emphasis | Articulate clearly, placing special stress on the term "word". | | Pronunciation: Clear and precise, emphasize the keyword "word". |
| Age | Use a/an "age group"'s voice. | Use a Child's voice. | Delivery: A classic "age group" tone. |
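
The templates in the table above can be filled in programmatically. The following is a minimal sketch, assuming the adverb is optional (the plain condition has none) and that "a" versus "an" is chosen by a first-letter vowel heuristic; the function names are illustrative and not taken from the authors' code.

```python
from typing import Optional


def _article(phrase: str) -> str:
    """Pick 'a' or 'an' from the first letter (a heuristic, not full English rules)."""
    if not phrase:
        raise ValueError("phrase must be non-empty")
    return "an" if phrase[0].lower() in "aeiou" else "a"


def _phrase(adjective: str, adverb: Optional[str]) -> str:
    """Join the optional adverb of degree with the adjective."""
    return f"{adverb} {adjective}" if adverb else adjective


def template_1(adjective: str, adverb: Optional[str] = None) -> str:
    """Template 1: 'Speak in a/an <Adv. Deg.> <Adj.> tone.'"""
    phrase = _phrase(adjective, adverb)
    return f"Speak in {_article(phrase)} {phrase} tone."


def template_2(label: str, adjective: str, adverb: Optional[str] = None) -> str:
    """Template 2: '<label>: <Adv. Deg.> <Adj.>', e.g. label 'Voice' or 'Pacing'."""
    return f"{label}: {_phrase(adjective, adverb)}"
```

For instance, `template_1("High", "Very")` reproduces the Pitch example in the table, and `template_1("Ecstatic")` reproduces the Emotion-I.A. example.
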
@inproceedings{Lin_2026,
title = {Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-to-Speech Systems},
author = {Lin, Yi-Cheng and Chou, Huang-Cheng and Wei, Tzu-Chieh and Chen, Kuan-Yu and Lee, Hung-yi},
booktitle = {Submission to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = {2026}
}