Stereo
Recording and processing conversations (such as phone calls) in stereo can significantly improve transcription accuracy and analytical insight. To realize the benefit, each speaker must be recorded on a different channel (left or right), and the speaker metadata must be provided to VoiceBase when uploading the recording.
Enabling stereo transcription
To enable one-speaker-per-channel stereo transcription, add the "ingest" configuration when making a POST request to the /media resource, and specify the label to use for each channel.
{
  "ingest": {
    "stereo": {
      "left": { "speakerName": "Customer" },
      "right": { "speakerName": "Agent" }
    }
  }
}
- `ingest` : the ingest sub-section.
  - `stereo` : specification of the stereo channels. Both child sections are required.
    - `left` : specification of the left channel.
      - `speakerName` : the name of the left channel speaker.
    - `right` : specification of the right channel.
      - `speakerName` : the name of the right channel speaker.
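The configuration described above can also be built programmatically. The following is a minimal Python sketch that assumes only the structure shown on this page; the helper name `build_stereo_config` is hypothetical and is not part of any VoiceBase SDK.

```python
import json


def build_stereo_config(left_speaker: str, right_speaker: str) -> str:
    """Return the stereo ingest configuration as a JSON string.

    Hypothetical helper for illustration; not a VoiceBase API.
    """
    # Both child sections are required, so reject empty labels early.
    if not left_speaker or not right_speaker:
        raise ValueError("both left and right speaker names are required")
    config = {
        "ingest": {
            "stereo": {
                "left": {"speakerName": left_speaker},
                "right": {"speakerName": right_speaker},
            }
        }
    }
    return json.dumps(config)


# The resulting string is what would be sent as the "configuration" form field.
configuration = build_stereo_config("Customer", "Agent")
print(configuration)
```

Validating the labels before serializing keeps a missing channel from reaching the API, where the failure would be harder to trace.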
Effects on Transcripts
When stereo processing is enabled, the word-by-word JSON transcript will return both a "words" array and a "turns" array, offering different options for parsing the data.
The "words" array contains nodes that mark each change of speaker. These turn nodes are identified by the attribute "m" set to "turn", and their attribute "w" is set to the speaker label provided in the configuration for the corresponding channel. The words following a turn node belong to the speaker it identifies until the next turn node appears in the transcript.
{
  "transcript" : {
    "confidence": 0.958,
    "words": [
      {
        "p": 0,
        "c": 1.0,
        "s": 1420,
        "e": 4260,
        "m": "turn",
        "w": "Agent"
      },
      {
        "p": 1,
        "c": 0.486,
        "s": 1420,
        "e": 1620,
        "w": "Hi"
      },
      {
        "p": 2,
        "c": 0.917,
        "s": 1630,
        "e": 1790,
        "w": "this"
      }
    ]
  }
}
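One way to consume the turn nodes is to walk the "words" array once and start a new segment each time a turn node appears. The Python sketch below illustrates this; the function name `words_by_speaker` is hypothetical, and treating every non-turn node as a word is a simplifying assumption.

```python
def words_by_speaker(words):
    """Group transcript words under the speaker named by the preceding turn node."""
    segments = []
    for node in words:
        if node.get("m") == "turn":
            # A turn node starts a new segment; "w" holds the speaker label.
            segments.append({"speaker": node["w"], "words": []})
        elif segments:
            # Assumption: any other node is a word spoken by the current speaker.
            segments[-1]["words"].append(node["w"])
        # Nodes that appear before the first turn node are skipped.
    return segments


# The "words" array from the example transcript above.
sample = [
    {"p": 0, "c": 1.0, "s": 1420, "e": 4260, "m": "turn", "w": "Agent"},
    {"p": 1, "c": 0.486, "s": 1420, "e": 1620, "w": "Hi"},
    {"p": 2, "c": 0.917, "s": 1630, "e": 1790, "w": "this"},
]

for segment in words_by_speaker(sample):
    print(segment["speaker"] + ":", " ".join(segment["words"]))  # prints "Agent: Hi this"
```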
Each entry in the "turns" array has a "text" key whose value is a block of text spoken by a single speaker, as in the following example:
"turns": [
  {
    "speaker": "Agent",
    "text": "Hi this is c.s.v. shipping company Brian speaking how can I help you.",
    "s": 1420,
    "e": 4260
  }
]
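Because each turn carries a speaker and a time range, the "turns" array lends itself to per-speaker summaries. The following Python sketch totals talk time per speaker. It assumes that "s" and "e" are start and end offsets in milliseconds, which the example values suggest but this page does not state; `talk_time_by_speaker` is a hypothetical name.

```python
from collections import defaultdict


def talk_time_by_speaker(turns):
    """Sum the duration of every turn, keyed by speaker label."""
    totals = defaultdict(int)
    for turn in turns:
        # Assumption: "s" and "e" are start and end offsets in milliseconds.
        totals[turn["speaker"]] += turn["e"] - turn["s"]
    return dict(totals)


# The "turns" array from the example above.
sample_turns = [
    {
        "speaker": "Agent",
        "text": "Hi this is c.s.v. shipping company Brian speaking how can I help you.",
        "s": 1420,
        "e": 4260,
    }
]

print(talk_time_by_speaker(sample_turns))  # prints {'Agent': 2840}
```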
The plain text version of the transcript shows each segment of the conversation prefixed with the speaker name (e.g. 'Agent:' or 'Customer:').
curl https://apis.voicebase.com/v3/media/${MEDIA_ID}/transcript/text \
--header "Accept: text/plain" \
--header "Authorization: Bearer ${TOKEN}"
Agent: Hi this is c.s.v. shipping company Brian speaking how can I help you. Customer: This is Henry we spoke earlier I got a quote from you.
The SRT version of the transcript will also contain the speaker names provided in
the configuration.
curl https://apis.voicebase.com/v3/media/${MEDIA_ID}/transcript/srt \
--header "Accept: text/srt" \
--header "Authorization: Bearer ${TOKEN}"
1
00:00:02,76 --> 00:00:05,08
Agent: Well this is Taneisha
thank you for calling A.B.C.

2
00:00:05,08 --> 00:00:07,03
Cable services. How may I help you today.

3
00:00:08,28 --> 00:00:11,93
Customer: Hi I'm calling because I'm
interested in buying new cable services.

4
00:00:12,64 --> 00:00:16,43
Agent: OK great let's get started.
Effects on Keywords and Topics
When processing a file in stereo, you will get keywords and topics detected
independently on each channel. The occurrences of the keywords are grouped by
speaker.
{
  "knowledge": {
    "keywords": [
      {
        "name": "cable service",
        "relevance": "0.953",
        "mentions": [
          {
            "speakerName" : "Agent",
            "occurrences" : [
              { "s" : 5090, "e": 6070, "exact": "cable service" }
            ]
          },
          {
            "speakerName" : "Customer",
            "occurrences" : [
              { "s" : 234550, "e": 235700, "exact": "cable product" },
              { "s" : 347567, "e": 349000, "exact": "cable services" }
            ]
          }
        ]
      }
    ]
  }
}
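Since occurrences are grouped by speaker, counting how often each speaker mentioned a keyword takes a single pass over the "mentions" array. A minimal Python sketch, assuming the response shape shown above; `occurrences_by_speaker` is a hypothetical helper name.

```python
def occurrences_by_speaker(keyword):
    """Map each speaker name to the number of occurrences of one keyword."""
    counts = {}
    for mention in keyword.get("mentions", []):
        speaker = mention["speakerName"]
        counts[speaker] = counts.get(speaker, 0) + len(mention.get("occurrences", []))
    return counts


# The keyword entry from the example response above.
sample_keyword = {
    "name": "cable service",
    "relevance": "0.953",
    "mentions": [
        {
            "speakerName": "Agent",
            "occurrences": [{"s": 5090, "e": 6070, "exact": "cable service"}],
        },
        {
            "speakerName": "Customer",
            "occurrences": [
                {"s": 234550, "e": 235700, "exact": "cable product"},
                {"s": 347567, "e": 349000, "exact": "cable services"},
            ],
        },
    ],
}

print(occurrences_by_speaker(sample_keyword))  # prints {'Agent': 1, 'Customer': 2}
```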
Effects on Audio Redaction
Audio redaction of PCI will redact the audio in both channels, irrespective of which channel contains the detected PCI data.
Examples
Processing in Stereo
curl https://apis.voicebase.com/v3/media \
--form media=@${MEDIA_FILE} \
--form configuration='{
"ingest": {
"stereo": {
"left" : { "speakerName": "Customer" },
"right": { "speakerName": "Agent" }
}
}
}' \
--header "Authorization: Bearer ${TOKEN}"