Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-to-Speech Systems

Submission to ICASSP 2026

¹National Taiwan University, ²University of Southern California, ³University of Michigan

We systematically evaluate the gap between natural language instructions and listener perception in modern Text-to-Speech systems across 4 expressive and 3 acoustic dimensions.

Abstract

Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide the Expressive VOice Control (E-VOC) corpus, a data collection with large-scale human evaluations. Our analysis reveals that (1) gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated utterances across acoustic dimensions; (2) the 5 analyzed ITTS systems tend to generate adult voices even when instructed to use child or elderly voices; and (3) fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.

Audio Examples

Listen to examples showing how different models interpret the same instruction. How well do they align with your perception based on the instructions?

Task I. Adverbs of Degree (Adv. Deg.)

Models are instructed to generate speech with varying emotional intensity using adverbs like "extremely," "very," and "slightly."

Emotion: Angry

Context: Teacher-Student
Transcript: Our professor outlines assignment expectations while the student records the key instructions.
Adverbs: Extremely, Very, -, Slightly
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Emotion: Happy

Context: Normal
Transcript: I'm looking at the forecast, and it says there's a 50% chance of rain this afternoon.
Adverbs: Extremely, Very, -, Slightly
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Emotion: Sad

Context: Lover
Transcript: We decide on a meeting location with coffee and a reserved table tonight.
Adverbs: Extremely, Very, -, Slightly
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Emotion: Surprised

Context: Friends
Transcript: We gather near the park to discuss plans for the upcoming weekend.
Adverbs: Extremely, Very, -, Slightly
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Task II. Emotion–Intensity Adjective (Emo-I.A.)

Models are instructed to generate speech using adjectives that represent different levels of emotional intensity.

Angry Emotion

Context: Customer
Transcript: I provide directions to the service desk where products are available for purchase.
Adjectives: Outraged, Angry, Irritated, Frustrated, Upset
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Happy Emotion

Context: Family
Transcript: I arrange the dining setup with chairs, plates, cups, and napkins for dinner.
Adjectives: Ecstatic, Overjoyed, Happy, Content, Satisfied
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Sad Emotion

Context: Friends
Transcript: We gather near the park to discuss plans for the upcoming weekend.
Adjectives: Heartbroken, Sad, Unhappy, Disappointed, Gloomy
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Surprised Emotion

Context: Lover
Transcript: We decide on a meeting location with coffee and a reserved table tonight.
Adjectives: Surprised, Stunned, Amazed, Unexpected, Intrigued
Models: gpt-4o-mini-tts, Parler-TTS-large-v1, Parler-TTS-mini-v1, PromptTTS++, UniAudio

Evaluation Framework

We carefully design a framework to quantify the alignment between instruction and perception. We define 4 key control dimensions, collect a new dataset (E-VOC) for analysis, and run large-scale human evaluations.

Control Dimensions

Our evaluation covers 4 expressive dimensions:
  • Adverbs of Degree (Adv. Deg.): Can models interpret modifiers like "very" or "slightly" (e.g., "speak very happily")?
  • Emotion-Intensity Adjective (Emo-I.A.): Can models distinguish between graded emotions (e.g., "content" vs. "happy" vs. "ecstatic")?
  • Speaker Age (Age): Can models generate speech that sounds like a child, teenager, adult, or elderly person?
  • Word-level Emphasis (Emphasis): Can models selectively highlight a target word in a sentence?

Results

Objective Acoustic Control

We first measure the objective acoustic properties (loudness, pitch, speaking rate) of the generated speech. Figure 1 shows that most models have good control over pitch and speaking rate when guided by adverbs of degree, but consistently struggle to modulate loudness based on instructions like "slightly quiet" or "extremely loud".


Fig 1. Loudness (LUFS), pitch (Hz), and speaking rate (words/s) across ITTS models for Task I. Adverbs of Degree.
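
As a rough illustration of the speaking-rate measurement (words/s), the sketch below computes the rate for a single utterance and averages it per instruction label. It assumes whitespace tokenization and a known clip duration; the function names and the durations in the usage example are hypothetical and are not taken from the paper's pipeline.

```python
from collections import defaultdict
from statistics import mean


def speaking_rate(transcript: str, duration_s: float) -> float:
    """Words per second, counting whitespace-separated tokens (an assumption)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcript.split()) / duration_s


def mean_rate_by_instruction(utterances):
    """Average speaking rate per instruction label.

    `utterances` is an iterable of (instruction_label, transcript, duration_s).
    """
    grouped = defaultdict(list)
    for label, transcript, duration_s in utterances:
        grouped[label].append(speaking_rate(transcript, duration_s))
    return {label: mean(rates) for label, rates in grouped.items()}


# Transcript from the audio examples; the durations are made up for illustration.
TEXT = "We gather near the park to discuss plans for the upcoming weekend."
clips = [
    ("Slightly Fast", TEXT, 6.0),
    ("Extremely Fast", TEXT, 3.0),
]
print(mean_rate_by_instruction(clips))
```

If a model follows the instruction, the mean rate should increase from "Slightly Fast" to "Extremely Fast"; a flat or reversed trend would indicate weak control.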


Perceptual Emotion Intensity

Human evaluations reveal the gap between instructions and perception. Figure 2 shows the perceived emotion intensity for different models across four emotions. While some models follow the general trend (e.g., "ecstatic" is perceived as more intense than "happy"), there are significant discrepancies and overlaps, showing that fine-grained emotional control remains a major challenge.


Fig 2. Averaged perceptual emotion intensity of ITTS models across 4 emotions, analyzed by Adverbs of Degree and Emotion-Intensity Adjectives.
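
One way to summarize how well perceived intensity tracks the instructed level is a rank correlation. The sketch below computes Spearman's rho between instructed levels and mean listener ratings in plain Python. This is an illustrative choice and not necessarily the statistic used in the paper; the function names are hypothetical and the ratings in the example are made up.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Instructed levels 1..5 and hypothetical mean perceived intensity per level.
levels = [1, 2, 3, 4, 5]
ratings = [2.1, 2.0, 3.0, 3.4, 4.2]  # levels 1 and 2 are swapped perceptually
print(round(spearman(levels, ratings), 3))
```

A value near 1 means listeners rank the utterances in the instructed order; overlapping or reversed neighbouring levels, as in the example, pull the value down.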

Appendix

Prompt for Transcription Generation

Task: Generate 10-15 word sentences as "Text prompts," describing life conditions in specific contexts without using inherently polar or sentimental words. The generated sentences are naturally spoken in interaction, for evaluating how well state-of-the-art text-to-speech (TTS) models synthesize emotion.

Steps:
1) Select Context: family, friends, customer, lover, or teacher–student.
2) Sentence Construction: create a 10–15 word sentence describing the context.
3) Polarity Check: exclude inherently polar or sentimental words.
4) Repetition: generate sentences across various contexts for diversity.

Output Format: List the interaction context followed by the 10–15 word sentence, neutrally described.

Examples:
- Friends: I plan to buy plates, forks, knives, and glasses arranged on the table for the meal. Would you want to come?
- Traveling: Schedule of our trip includes flight departure, hotel check-in procedure, museum visit, and city tour.

Notes: Sentences must remain descriptive and contextually relevant, with neutral language. The prompt design ensures that TTS evaluation focuses on emotional style alignment.
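
The length and polarity steps of the prompt above can be approximated with a simple post-hoc filter on generated sentences. The sketch below is a minimal illustration under stated assumptions: `POLAR_WORDS` is a tiny hypothetical lexicon (a real check would need a proper sentiment lexicon), words are counted as whitespace-separated tokens, and `check_sentence` is not part of the authors' pipeline.

```python
import re

# Hypothetical, deliberately small list of polar words for illustration only.
POLAR_WORDS = {"love", "hate", "wonderful", "terrible", "happy", "sad", "angry"}


def check_sentence(sentence: str, min_words: int = 10, max_words: int = 15) -> dict:
    """Check the 10-15 word length rule and the absence of listed polar words."""
    tokens = sentence.split()
    # Lowercase and strip punctuation so "Happy," matches "happy".
    words = {re.sub(r"[^a-z']", "", token.lower()) for token in tokens}
    return {
        "length_ok": min_words <= len(tokens) <= max_words,
        "neutral_ok": words.isdisjoint(POLAR_WORDS),
    }


print(check_sentence("We gather near the park to discuss plans for the upcoming weekend."))
print(check_sentence("I love this."))
```

Sentences failing either check could be regenerated before synthesis, so that any perceived emotion comes from the speaking style and not from the wording.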

Word Frequency and Emotion Intensity

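| Happy | Intensity | WF WIKI | Sad | Intensity | WF WIKI | Angry | Intensity | WF WIKI | Surprised | Intensity | WF WIKI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ecstatic | 0.954 | 2,979 | Heartbroken | 0.969 | 2,254 | Outraged | 0.964 | 6,784 | Surprised | 0.930 | 51,083 |
| Overjoyed | 0.909 | 1,921 | Sad | 0.864 | 6,819 | Angry | 0.824 | 34,184 | Stunned | 0.820 | 6,254 |
| Happy | 0.788 | 80,205 | Unhappy | 0.750 | 16,934 | Irritated | 0.706 | 2,860 | Amazed | 0.781 | 1,255 |
| Content | 0.688 | 182,702 | Disappointed | 0.636 | 19,109 | Frustrated | 0.636 | 17,278 | Unexpected | 0.711 | 24,728 |
| Satisfied | 0.500 | 22,700 | Gloomy | 0.578 | 2,672 | Upset | 0.439 | 39,299 | Intrigued | 0.430 | 4,679 |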
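
The five-step emotion levels used in Task II appear consistent with sorting each emotion's adjectives by the intensity score above. The sketch below illustrates this for the Happy column, using the scores as read from the table; the helper name `intensity_levels` is hypothetical, and the sorting rule is an inference from the listed values, not a procedure stated by the authors.

```python
# Intensity scores for the Happy adjectives, as read from the table above.
HAPPY_INTENSITY = {
    "Ecstatic": 0.954,
    "Overjoyed": 0.909,
    "Happy": 0.788,
    "Content": 0.688,
    "Satisfied": 0.500,
}


def intensity_levels(scores: dict) -> dict:
    """Map each adjective to a level, 1 = weakest, by ascending intensity score."""
    ordered = sorted(scores, key=scores.get)
    return {word: level for level, word in enumerate(ordered, start=1)}


levels = intensity_levels(HAPPY_INTENSITY)
print(levels)
```

The resulting order (Satisfied, Content, Happy, Overjoyed, Ecstatic) matches the Happy-I.A. row in the Adjective and Adverb Mappings table.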

Transcription Examples

| Context | Transcription |
|---|---|
| Family | You always make breakfast on Sundays. |
| Friends | Let's explore downtown tonight without plans. |
| Customer | Your order is ready for pickup. |
| Lover | I adore every moment with you. |
| Teacher | Submit your project before class tomorrow. |
| Sibling | You might borrow my car later. |
| Colleagues | Our meeting starts at nine sharp. |
| Neighbor | Please return my gardening tools soon. |

Adjective and Adverb Mappings

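| Task | Adjective (Adj.) / Level | | | | |
|---|---|---|---|---|---|
| Pitch | Low | High | | | |
| Loudness | Quiet | Loud | | | |
| Speaking Rate | Slow | Fast | | | |
| Age | Child | Teenage | Adult | Elderly | |
| Level | 1 | 2 | 3 | 4 | |
| Adv. Deg. | Slightly | - | Very | Extremely | |
| Emotion Level | 1 | 2 | 3 | 4 | 5 |
| Happy-I.A. | Satisfied | Content | Happy | Overjoyed | Ecstatic |
| Sad-I.A. | Gloomy | Disappointed | Unhappy | Sad | Heartbroken |
| Angry-I.A. | Upset | Frustrated | Irritated | Angry | Outraged |
| Surprised-I.A. | Intrigued | Unexpected | Amazed | Stunned | Surprised |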

Prompt Template Examples

| Task | Template 1 | Example | Template 2 |
|---|---|---|---|
| Pitch | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Very High tone. | Voice: "Adv. Deg." "Adj." |
| Loudness | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Slightly Quiet tone. | Tone: "Adv. Deg." "Adj." |
| Speaking Rate | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Very Fast tone. | Pacing: "Adv. Deg." "Adj." |
| Emotion | Speak in a/an "Adv. Deg." "Adj." tone. | Speak in a Happy tone. | Emotion: "Adv. Deg." "Adj." |
| Emotion-I.A. | Speak in a/an "Adj." tone. | Speak in an Ecstatic tone. | Emotion: "Adj." |
| Emphasis | Articulate clearly, placing special stress on the term "word". | | Pronunciation: Clear and precise, empathize on keyword "word". |
| Age | Use a/an "age group"'s voice. | Use a/an Child's voice. | Delivery: A classic "age group" tone. |
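
To make the templates concrete, the sketch below fills Template 1 and Template 2 for the pitch, loudness, speaking rate, and emotion tasks. The template wording and the Template 2 labels come from the table above; the function names and the simple vowel-based "a/an" rule are assumptions for illustration, not the authors' code.

```python
# Template 2 labels per task, copied from the table above.
TEMPLATE_2_LABEL = {
    "Pitch": "Voice",
    "Loudness": "Tone",
    "Speaking Rate": "Pacing",
    "Emotion": "Emotion",
}


def article_for(phrase: str) -> str:
    """Choose 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"


def style_phrase(adjective: str, adverb: str = "") -> str:
    """Join an optional adverb of degree with the adjective."""
    return f"{adverb} {adjective}" if adverb else adjective


def degree_instruction(adjective: str, adverb: str = "") -> str:
    """Template 1: Speak in a/an "Adv. Deg." "Adj." tone."""
    phrase = style_phrase(adjective, adverb)
    return f"Speak in {article_for(phrase)} {phrase} tone."


def labelled_instruction(task: str, adjective: str, adverb: str = "") -> str:
    """Template 2: "<label>: "Adv. Deg." "Adj."" for the given task."""
    return f"{TEMPLATE_2_LABEL[task]}: {style_phrase(adjective, adverb)}"


print(degree_instruction("High", "Very"))
print(degree_instruction("Ecstatic"))
print(labelled_instruction("Speaking Rate", "Fast", "Extremely"))
```

Leaving the adverb empty covers both the no-adverb condition of Task I and the adjective-only prompts of Task II.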

BibTeX

@inproceedings{Lin_2026,
  title     = {Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-to-Speech Systems},
  author    = {Lin, Yi-Cheng and Chou, Huang-Cheng and Wei, Tzu-Chieh and Chen, Kuan-Yu and Lee, Hung-yi},
  booktitle = {Submission to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2026}
}