Microsoft has announced three new AI models, MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2, aimed at improving speech-to-text, voice generation and image generation. According to the company, the models are now available through Microsoft Foundry and the MAI Playground (the latter in the US only), with a focus on faster performance, efficiency and competitive pricing. The rollout brings upgrades to transcription accuracy, voice generation and image creation, and Microsoft is also integrating these capabilities into its own products.


Transcription, voice and image models


MAI-Transcribe-1: Microsoft said this model is designed for speech-to-text tasks and supports transcription across the 25 most-used languages, based on the FLEURS benchmark. The company said the model is built to handle real-world audio conditions and delivers batch transcription speeds 2.5 times faster than its existing Azure Fast offering.

 
 


MAI-Voice-1: This model focuses on voice generation, producing speech with natural tone, emotional range and consistency across longer content. Microsoft has also added support for creating custom voices using a short audio sample. The model can generate up to 60 seconds of audio in one second, with the company highlighting efficient GPU usage for cost-effective performance.


MAI-Image-2: According to Microsoft, this model offers at least twice the generation speed of earlier systems on Foundry and Copilot, based on production data. Microsoft said the model is designed to deliver realistic lighting, accurate skin tones and clear text rendering for visual content. It is also being rolled out in phases across services such as Bing and PowerPoint.


Availability and pricing


Microsoft said all three models are available starting today on Microsoft Foundry, with MAI Playground access currently limited to users in the US. The company has positioned the models as offering competitive price-to-performance across cloud providers.

 


Pricing starts at $0.36 per hour of audio for MAI-Transcribe-1 and $22 per one million characters for MAI-Voice-1. For MAI-Image-2, pricing starts at $5 per one million tokens of text input and $33 per one million tokens of image output.
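
To get a rough sense of what these starting prices mean for a workload, the arithmetic can be sketched in a few lines of Python. This is only an illustration based on the figures quoted above: the function names are hypothetical, the rates are the announced starting prices rather than a guaranteed bill, and it assumes simple linear pricing with no tiers, minimums or discounts.

```python
# Starting prices quoted in the announcement (USD).
# Assumption: billing scales linearly with usage; real invoices may differ.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS = 33.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text input and image output tokens."""
    input_cost = text_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS
    output_cost = image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    # Example workloads (made-up volumes, for illustration only).
    print(f"100 hours of audio:      ${transcribe_cost(100):.2f}")
    print(f"500,000 characters:      ${voice_cost(500_000):.2f}")
    print(f"1M input + 1M output:    ${image_cost(1_000_000, 1_000_000):.2f}")
```

The announcement does not say how many tokens a single generated image consumes, so the image estimate takes token counts directly instead of a number of images.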


Microsoft added that these models are also being used within its own products and are available for developers to build applications and services.

 


