Tuesday, February 18, 2025

Infrastructure engineering for AI projects often deals with text-based inputs for analysis and prediction, whether sourced from customer-facing chatbots, service-ticket notes, or a variety of data stores and warehouses. The ability to convert text into audio is also helpful in many scenarios, including accessibility and DEI requirements; it is easy to set up and offers the convenience of listening when screens are small, inadequate for usability, or difficult to read. Well-known audio formats can be played on virtually any device, not just phones.

Although many dedicated text-to-speech applications and online services are available, some free and some programmable via web requests, the public cloud portfolios include built-in services that make these capabilities mainstream and on par with the rest of the AI/ML pipeline. This article includes one such implementation towards the end, but first an introduction to the feature and its commercially available capabilities.

Digital audio is commonly characterized by its bit depth, preferably at least 24-bit, its bit rate, preferably at least 192 kbps, and dynamic-range processing such as a limiter. Different voices can be generated with variations in amplitude, pitch, and tempo, much as singing is described. Free text-to-speech software, whether standalone or online, offers conveniences in input and output formats. For example, Natural Reader allows you to load documents and convert them to audio files. Balabolka can save narrations in a variety of audio file formats, with customizations for pronunciations and voice settings. Panopreter Basic, also free software, adds more input formats as well as mp3 output. TTSMaker supports 100+ languages and 600+ AI voices for commercial purposes. Murf AI, although not entirely free, has a converter that supports 200+ realistic AI voices and 20+ languages. Licensing and distribution terms vary with each software maker.
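As a sketch of how such variations are expressed programmatically, most cloud text-to-speech services, including Azure's, accept SSML, where pitch and tempo are controlled with the `<prosody>` element. The helper below only builds the SSML string; the voice name and prosody values are illustrative:

```python
def build_ssml(text: str, voice: str = "en-US-GuyNeural",
               pitch: str = "+10%", rate: str = "0.9") -> str:
    """Wrap text in an SSML document that raises pitch and slows tempo."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody pitch="{pitch}" rate="{rate}">{text}</prosody>'
        '</voice></speak>'
    )

print(build_ssml("Hello, world."))
```

A document built this way can be passed to the Speech SDK's `speak_ssml_async` method instead of `speak_text_async` in the implementation below.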

Public-cloud text-to-speech capabilities can be instantiated by initializing a resource from the corresponding service in the provider's portfolio. The following explains just how to do that.

Sample implementation:

1. Accept text input over a web API and return the synthesized audio:

from flask import Flask, request, jsonify, send_file
import azure.cognitiveservices.speech as speechsdk

app = Flask(__name__)

# Azure Speech Service configuration
SPEECH_KEY = "<your-speech-api-key>"
SERVICE_REGION = "<your-region>"

speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SERVICE_REGION)
speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3)
speech_config.speech_synthesis_voice_name = "en-US-GuyNeural"  # Set desired voice

@app.route('/text-to-speech', methods=['POST'])
def text_to_speech():
    try:
        # Accept text either as a form field or as an uploaded file
        if 'text' in request.form:
            text = request.form['text']
        elif 'file' in request.files:
            file = request.files['file']
            text = file.read().decode('utf-8')
        else:
            return jsonify({"error": "No text or file provided"}), 400

        # Generate speech from text; AudioOutputConfig writes the mp3
        # directly to disk, so no separate write of result.audio_data is needed
        audio_filename = "output.mp3"
        file_config = speechsdk.audio.AudioOutputConfig(filename=audio_filename)
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=file_config)
        result = synthesizer.speak_text_async(text).get()

        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            return send_file(audio_filename, as_attachment=True)
        else:
            return jsonify({"error": f"Speech synthesis failed: {result.reason}"}), 500

    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

2. Prerequisites to run the script:

a. pip install flask azure-cognitiveservices-speech

b. Create an Azure Speech resource in the Azure portal, retrieve the SPEECH_KEY and SERVICE_REGION from the resource's Keys and Endpoint section, and use them in place of `<your-speech-api-key>` and `<your-region>` above

c. Save the script as `app.py` and run it on any host with `python app.py`
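Step b can also be done from the Azure CLI instead of the portal; a sketch is below, where the resource and group names are placeholders and the region and SKU are examples:

```shell
# Create a Speech resource (names are hypothetical; --yes accepts the terms)
az cognitiveservices account create \
  --name my-speech-resource \
  --resource-group my-rg \
  --kind SpeechServices \
  --sku S0 \
  --location eastus \
  --yes

# Retrieve the key to use as SPEECH_KEY in the script above
az cognitiveservices account keys list \
  --name my-speech-resource \
  --resource-group my-rg
```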

3. Sample trial

a. With curl request as `curl -X POST -F "text=Hello, this is a test." http://127.0.0.1:5000/text-to-speech --output output.mp3`

b. Or as file attachment with `curl -X POST -F "file=@example.txt" http://127.0.0.1:5000/text-to-speech --output output.mp3`

c. The mp3 audio file generated can be played.
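The same trial can be scripted instead of using curl; here is a minimal Python client using only the standard library, assuming the service above is running locally (the URL and output path are placeholders):

```python
import urllib.request
import urllib.parse

def synthesize(text: str,
               url: str = "http://127.0.0.1:5000/text-to-speech",
               out_path: str = "output.mp3") -> str:
    """POST text to the text-to-speech endpoint and save the returned mp3."""
    # The endpoint reads request.form, so a urlencoded body is sufficient
    data = urllib.parse.urlencode({"text": text}).encode()
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Calling `synthesize("Hello, this is a test.")` mirrors the first curl example above.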

Sample output: https://b67.s3.us-east-1.amazonaws.com/output.mp3

Pricing: Perhaps the single most sought-after feature of text-to-speech is a natural-sounding voice, and service providers often mark up the price, or even withhold programmability options, for their range of natural voices. This severely limits the automation of audio books. A comparison of costs also illustrates the differences between service providers. Public-cloud text-to-speech services typically charge $4 and $16 per million characters for standard and neural voices respectively, where a million characters is about 4-5 audio books. Custom voices run about $30 per million characters, while dedicated providers such as Natural Voice, with a more readily available portfolio of voices, charge about $60/month as a subscription fee with limits on word counts. This is still costly, but automation of audio production for books is here to stay, simply because of the time and effort saved.
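As a rough illustration of these numbers, the per-book arithmetic can be checked with a one-line cost function; the rates come from the estimates above, and the 250,000-characters-per-book figure is an assumption derived from the roughly four-books-per-million-characters estimate:

```python
def tts_cost(characters: int, rate_per_million: float) -> float:
    """Dollar cost of synthesizing the given number of characters."""
    return characters / 1_000_000 * rate_per_million

AVG_BOOK_CHARS = 250_000  # assumed average book length

# One book with a neural voice at $16 per million characters
print(tts_cost(AVG_BOOK_CHARS, 16.0))   # → 4.0
# One book with a standard voice at $4 per million characters
print(tts_cost(AVG_BOOK_CHARS, 4.0))    # → 1.0
```

At these rates, even the neural tier prices a full-length book well below the cost of studio narration.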


