Multimodal AI is changing how people ask for help. A user may point at a screenshot, speak a request, upload a document, or ask the app to explain a chart.
The best interface combines modes without forcing any single one on the user. Voice is great for intent, images are great for reference, text is great for precision, and structured app context keeps the answer grounded.
Designers should think in handoffs: voice to action draft, screenshot to diagnosis, document to summary, summary to workflow. The interaction should end with something useful happening.
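
One way to make the handoff idea concrete is to write the four handoffs above as an explicit map from input kind to output kind. The sketch below is illustrative only: the names `Kind`, `Artifact`, `HANDOFFS`, `hand_off`, and `run_chain` are assumptions introduced here, and the transformation step is a placeholder string where a real app would call a model or service.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    VOICE = "voice"
    SCREENSHOT = "screenshot"
    DOCUMENT = "document"
    SUMMARY = "summary"
    ACTION_DRAFT = "action_draft"
    DIAGNOSIS = "diagnosis"
    WORKFLOW = "workflow"


# The four handoffs named in the text: each input kind maps to the
# artifact it should produce next.
HANDOFFS = {
    Kind.VOICE: Kind.ACTION_DRAFT,
    Kind.SCREENSHOT: Kind.DIAGNOSIS,
    Kind.DOCUMENT: Kind.SUMMARY,
    Kind.SUMMARY: Kind.WORKFLOW,
}


@dataclass(frozen=True)
class Artifact:
    kind: Kind
    payload: str


def hand_off(item: Artifact) -> Artifact:
    """Advance an artifact one step along its handoff.

    Hypothetical placeholder: it only labels the payload and makes no model call.
    """
    target = HANDOFFS.get(item.kind)
    if target is None:
        raise ValueError(f"no handoff defined for {item.kind.value}")
    return Artifact(target, f"{target.value} from {item.kind.value}: {item.payload}")


def run_chain(item: Artifact) -> list[Artifact]:
    """Follow handoffs until none is defined, returning every step in order."""
    steps = [item]
    while steps[-1].kind in HANDOFFS:
        steps.append(hand_off(steps[-1]))
    return steps
```

Calling `run_chain(Artifact(Kind.DOCUMENT, "report.pdf"))` walks document to summary to workflow, so the chain stops on an actionable artifact instead of an intermediate one. Keeping the map explicit also makes it easy to spot an input kind that has no useful endpoint yet.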
