Real time specification

Scribe provides a WebSocket-based API that lets you stream audio in real time, in chunks as short as 0.25 seconds. Scribe then streams back the transcribed text within approximately 15 to 30 seconds. The real time API is accessible at:

wss://scribe.kensho.com/ws

Workflow

After connecting to the Real Time API, issue an Authenticate request and wait for an Authenticated response from the server before sending transcription requests. To start transcribing audio, send a StartTranscription request and wait for a TranscriptionStarted response from the server before sending audio. You can then upload chunks of audio, each between 250 milliseconds and 15 seconds long, using the AddData message. Uploads must not exceed 1.5x real time, with a 30 second buffer. The server acknowledges each upload with a DataAdded response. When transcribed text is available, the server sends an AddTranscript message. After the last chunk of audio has been uploaded, the client must send an EndOfStream message, after which the server transcribes all remaining audio and issues an EndOfTranscript message.
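The upload limit can be turned into a simple pacing calculation. The sketch below assumes one possible reading of the rule: after t seconds of wall-clock time, at most 1.5 * t + 30 seconds of audio may have been uploaded. The source does not spell out how the server measures the limit, so the formula and the helper names (max_uploadable_seconds, seconds_to_wait) are illustrative assumptions.

```python
def max_uploadable_seconds(elapsed_s: float, rate: float = 1.5,
                           buffer_s: float = 30.0) -> float:
    """Audio seconds allowed after elapsed_s seconds of wall-clock time.

    Assumption: the budget grows at `rate` times real time on top of a
    fixed `buffer_s` head start.
    """
    return rate * elapsed_s + buffer_s


def seconds_to_wait(uploaded_s: float, next_chunk_s: float, elapsed_s: float,
                    rate: float = 1.5, buffer_s: float = 30.0) -> float:
    """Time to sleep before sending the next chunk without exceeding the budget."""
    needed = uploaded_s + next_chunk_s
    # Solve rate * t + buffer_s >= needed for t, then subtract elapsed time.
    earliest_t = (needed - buffer_s) / rate
    return max(0.0, earliest_t - elapsed_s)
```

Under this reading, a client that has already uploaded 30 seconds of audio at time zero would wait 10 seconds before sending a further 15 second chunk.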

Client initiated messages

Authenticate

Authenticate the websocket connection. This must be the first method called after opening the websocket.

{
    "message": "Authenticate",
    "token": str, # The access token
}

StartTranscription

Start a transcription request. This must be the first method called after authentication.

{
    "message": "StartTranscription",
    "audio_format": {
        "type": str,  # Only 'RAW' supported
        "encoding": str,  # Only 'pcm_s16le' supported
        "sample_rate_hz": int,  # Only 16000 supported
        "num_channels": int,  # Only 1 supported
    },
    "hotwords": List[str],  # Optional list of up to 1024 words to weight more heavily during transcription
}
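As a minimal sketch, the request above could be built in Python as follows. The function name build_start_transcription is hypothetical, and leaving out the "hotwords" key when no hotwords are given is an assumption based on the field being described as optional.

```python
import json
from typing import List, Optional


def build_start_transcription(hotwords: Optional[List[str]] = None) -> str:
    """Serialize a StartTranscription request using the only supported format."""
    if hotwords is not None and len(hotwords) > 1024:
        raise ValueError("at most 1024 hotwords are allowed")
    message = {
        "message": "StartTranscription",
        "audio_format": {
            "type": "RAW",
            "encoding": "pcm_s16le",
            "sample_rate_hz": 16000,
            "num_channels": 1,
        },
    }
    if hotwords:
        # Assumption: the key may be omitted entirely when unused.
        message["hotwords"] = list(hotwords)
    return json.dumps(message)
```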

ResumeTranscription

Resume a transcription request after the connection was dropped before the transcription completed. When resuming a transcription, this must be the first and only message sent before AddData.

{
    "message": "ResumeTranscription",
    "request_id": str, # The request id returned from the previous TranscriptionStarted message
    "token": str, # The authentication token
}

AddData

Upload a chunk of audio to the server.

{
    "message": "AddData",
    "audio": str,  # base64 encoded string representing audio
    "sequence_number": int  # Starting at 0, each `AddData` must increment by 1
}
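The following sketch shows one way to split raw audio into AddData messages. It assumes the audio is already in the only supported format (16 kHz, 16-bit little-endian PCM, mono), so one second is 32000 bytes. The function name add_data_messages and its parameters are hypothetical.

```python
import base64
import json
from typing import Iterator

BYTES_PER_SECOND = 16000 * 2 * 1  # 16 kHz, 2 bytes per sample, 1 channel


def add_data_messages(pcm: bytes, chunk_seconds: float = 1.0,
                      first_sequence_number: int = 0) -> Iterator[str]:
    """Yield serialized AddData messages with consecutive sequence numbers."""
    if not 0.25 <= chunk_seconds <= 15:
        raise ValueError("chunks must be between 250 ms and 15 s long")
    chunk_bytes = int(chunk_seconds * BYTES_PER_SECOND)
    chunk_bytes -= chunk_bytes % 2  # keep whole 16-bit samples together
    for index, start in enumerate(range(0, len(pcm), chunk_bytes)):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "message": "AddData",
            "audio": base64.b64encode(chunk).decode("ascii"),
            "sequence_number": first_sequence_number + index,
        })
```

Note that the final chunk may be shorter than 250 milliseconds; how the server treats such a chunk is not specified here, so a real client may need to pad or merge it.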

EndOfStream

Sent when the client has finished uploading audio data to the server.

{
    "message": "EndOfStream",
    "last_sequence_number": int
}

Server initiated messages

Authenticated

Called when the server has received an Authenticate request from a client with a valid authentication token. If the request fails, the server will send an Error instead of Authenticated.

{
   "message": "Authenticated",
}

TranscriptionStarted

Called when the server has received a StartTranscription request from a client. If the request fails, the server will send an Error instead of TranscriptionStarted.

{
   "message": "TranscriptionStarted",
   "request_id": str
}

TranscriptionResumed

Called when the server has received a ResumeTranscription request from a client. If the request fails, the server will send an Error instead of TranscriptionResumed.

{
   "message": "TranscriptionResumed",
   "request_id": str,
   "sequence_number": int # The sequence number expected with the next `AddData` message
}
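A resume round trip might be sketched as below. It assumes the client keeps every chunk it has sent in a list, with the chunk at index i having been sent with sequence_number i, so the sequence_number in TranscriptionResumed tells it where to continue. Both function names are hypothetical.

```python
import json
from typing import List


def build_resume_transcription(request_id: str, token: str) -> str:
    """Serialize a ResumeTranscription request for a dropped connection."""
    return json.dumps({
        "message": "ResumeTranscription",
        "request_id": request_id,
        "token": token,
    })


def chunks_to_resend(sent_chunks: List[bytes], resumed_message: str) -> List[bytes]:
    """Return the chunks the server still expects, per TranscriptionResumed."""
    resumed = json.loads(resumed_message)
    if resumed.get("message") != "TranscriptionResumed":
        raise ValueError("expected a TranscriptionResumed message")
    # Assumption: sent_chunks[i] was uploaded with sequence_number i.
    return sent_chunks[resumed["sequence_number"]:]
```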

DataAdded

Acknowledge data added by the client to the server.

{
    "message": "DataAdded",
    "sequence_number": int
}
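A client may want to track which uploads have not yet been acknowledged. The sketch below assumes each DataAdded message acknowledges exactly the one AddData message with the same sequence_number; the AckTracker class is a hypothetical helper, not part of the API.

```python
import json
from typing import Set


class AckTracker:
    """Remember which AddData sequence numbers still await a DataAdded reply."""

    def __init__(self) -> None:
        self.pending: Set[int] = set()

    def sent(self, sequence_number: int) -> None:
        """Record that an AddData message with this number was sent."""
        self.pending.add(sequence_number)

    def handle(self, raw: str) -> None:
        """Clear the matching entry when a DataAdded message arrives."""
        message = json.loads(raw)
        if message.get("message") == "DataAdded":
            self.pending.discard(message["sequence_number"])
```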

AddTranscript

Sends transcribed text back to the client.

{
    "message": "AddTranscript",
    "transcript": "<transcript format defined below>"
}

Transcript format

{
    "transcript": str,
    "accuracy": float(0, 1),
    "sequence_number": int,
    "speaker_id": int,
    "speaker_accuracy": float,
    "token_meta": [
        {
            "transcript": str,
            "accuracy": float(0, 1),
            "start_ms": float,
            "duration_ms": float,
            "align_success": bool
        },
        ...
    ]
}
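To illustrate how the token metadata might be consumed, the sketch below extracts each token's text with its start and end time, where the end time is computed as start_ms + duration_ms. Skipping tokens whose align_success is false is an assumption about what that flag means, and the function name token_spans is hypothetical.

```python
import json
from typing import List, Tuple


def token_spans(add_transcript_message: str) -> List[Tuple[str, float, float]]:
    """Return (text, start_ms, end_ms) for each aligned token in an AddTranscript."""
    message = json.loads(add_transcript_message)
    transcript = message["transcript"]
    spans = []
    for token in transcript["token_meta"]:
        if not token["align_success"]:
            # Assumption: timings of unaligned tokens are not reliable.
            continue
        start = token["start_ms"]
        spans.append((token["transcript"], start, start + token["duration_ms"]))
    return spans
```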

EndOfTranscript

Signals to the client that all audio has been transcribed and returned.

{
   "message": "EndOfTranscript",
}

Error

{
    "message": "Error",
    "type": str,
    "reason": str
}

Error Handling

A message can raise an error for a number of reasons; exceeding the rate limit and uploading invalid audio are two examples. If any message uploaded by a client triggers an error, the server emits an Error message with information about why the error happened. Once an Error has been emitted to a client, all subsequent requests from that client will also fail with an Error.
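Because an Error makes every later request fail, a client will usually want to stop as soon as it sees one. A minimal sketch, assuming every server message is a JSON object with a "message" field as in the schemas above; the ScribeError class and parse_server_message function are hypothetical names.

```python
import json


class ScribeError(Exception):
    """Raised when the server sends an Error message."""

    def __init__(self, error_type: str, reason: str) -> None:
        super().__init__(f"{error_type}: {reason}")
        self.error_type = error_type
        self.reason = reason


def parse_server_message(raw: str) -> dict:
    """Decode a server message, raising ScribeError if it is an Error."""
    message = json.loads(raw)
    if message.get("message") == "Error":
        raise ScribeError(message.get("type", ""), message.get("reason", ""))
    return message
```

A receive loop would call parse_server_message on each incoming frame and close the connection when ScribeError is raised, rather than continuing to send audio.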

Real Time API Example Usage

For example usage, see the Real Time Streaming Development Guide.