Tech in EdTech

Transforming Live Event Accessibility with AI

Magic EdTech Season 1 Episode 53

In this episode, Chris Zhang, Senior Solutions Architect at AWS Elemental, joins Rishiraj Gera in a conversation about multilanguage automatic captions and audio dubbing for live events. Chris discusses his career and current role, focusing on making live events more accessible using AI and automatic speech recognition (ASR) technologies. The conversation covers the technical aspects of embedding captions and the broader implications for EdTech, emphasizing inclusivity and improved user experience. Chris also advises educators to leverage modern AI tools to reduce costs and logistical challenges, ultimately making content more accessible to a global audience.



00:01.56

Rishi

Welcome everyone to the Tech in EdTech podcast. It's a pleasure to welcome you once again to our newest episode. Today we delve into a thought-provoking discussion - Breaking Down Multilanguage Automatic Captions and Audio Dubbing for Live Events. I'm your host Rishi from Magic EdTech, and our guest for today's podcast is Chris Zhang, Senior Solutions Architect from AWS Elemental. Chris is an evangelist with almost three decades of experience, and his efforts have been pivotal in helping improve the live event language caption experience. Chris, welcome on board.


00:40.11

Chris

Thank you, Rishi. Thank you for inviting me to this podcast. Really appreciate it.


00:45.48

Rishi

Chris, why don't we start today's session with you telling us about yourself and your career journey so far?


00:51.89

Chris

Yeah, absolutely. My name is Chris, Chris Zhang. I started out in network programming - I specialized in it and had a deep background in networking. Then I moved into media technologies, starting at a smaller startup, and later I was fortunate to join Elemental Technologies, which was acquired by AWS a few years ago. So I'm a Senior Solutions Architect at AWS focusing on media services. Most of my job is helping customers architect a variety of streaming workloads on AWS. From the live event perspective, there are live events for social media, customers doing town hall meetings, public events, live training, education, sports, you name it. That's essentially my current job, and I'm very passionate about bringing live events to audiences. Recently, I've been working a lot on one problem: how to bring accessibility to live streams. The reason is that when people talk about bringing accessibility to a live stream, there's a lot of heavy lifting involved, and there isn't an easy solution available in the industry, especially when you talk about bringing multiple language captions to the customer. So that's basically my background - let's dive into it.


02:31.56

Rishi

Thanks! Thanks, Chris! This is really an interesting journey, and I think you have come a long way. Working on live event language and accessibility speaks to a great career so far. So Chris, can you talk about the applications of multi-language automatic captions and audio dubbing in edtech?


02:54.58

Chris

Totally. As people might already know, in education more and more people are providing live streaming in a classroom fashion, and since the pandemic, people have become accustomed to remote learning, or at least a hybrid setup. But when it comes to learning in edtech, people come from very different backgrounds. For example, at a lot of universities in America, students come from all over the world, right? Some come from Korea, some from Japan, some from elsewhere in Asia or maybe Europe. Their native language might not be English - normally English is their second language - but in class the teachers are generally speaking in English. So how do we help those students have an immersive experience in the classroom? I have seen students sometimes pull up an application to automatically transcribe whatever the teacher is saying, to get a better experience or understand some of the words more clearly. But what if we could bring a live stream into the classroom with multi-language captioning enabled? Then students would not have to spend extra time doing that heavy lifting or go looking for different tools or tutoring to help them better understand the classroom teaching. If we can make live captions available in the language of their choice, right at their fingertips, I think that is going to give the students a much better experience. They won't have to focus on the technology part but rather on the content being taught, on whatever they are trying to learn.


05:01.47

Rishi

No, this is wonderful, and I believe these live captions, by adding to diversity and belonging, are definitely something to look forward to. So let's do a deep dive into understanding the challenges today around embedding captions. Where do you think the world is going, and what are the typical challenges around caption embedding?


05:25.81

Chris

In a live event scenario, the challenges are different compared to VOD, so today we're going to focus on live events and live streaming. In live streaming, the traditional way of providing multi-language captions is to hire stenographers - people we call captionists. These are humans who are fluent in listening to English - for example, if the main content is spoken in English - and they listen to the voice and translate it into different languages. So for a live event, if you have, let's say, six different language outputs like Spanish, German, Japanese, or Korean, you need to hire multiple captionists to actively listen, translate, and type in those different languages. Also, captionists need to take breaks. If you have a longer event, say four hours or a whole-day event, you have to hire multiple captionists to take turns so they can get a break. That typically adds a lot of complexity, not only in the streaming setup but also logistically and from a cost perspective. So it's very, very challenging for all customers - whether edtech, the public sector, or even enterprise - to provide multi-language live caption solutions today. But if you look at the technology available today, we have generative AI and a lot of automatic speech recognition (ASR) engines, and they are very mature. So how do we leverage those ASR engines to automatically transcribe the voice into, let's say, English, and then translate the English into different languages? These technologies are available to us, but the difficult part when we apply them to live streaming is the end-to-end pipeline for live events: from the camera, to the transcoding engine, to the CDN distribution, and all the way to the viewers. That particular process is pretty easy to set up in the cloud today. But when you try to merge a caption workflow into the existing live streaming workflow, that's where the difficulty comes in, and that's where the industry does not have a proper solution. That's what is preventing people from delivering such a solution today.
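As an illustration of the transcribe-then-translate step Chris describes, here is a minimal sketch, assuming Amazon Translate (via boto3) as the translation engine; the sample segment, target languages, and region are hypothetical, and the ASR step that produces the English text is left as a placeholder rather than a specific AWS service.

```python
# Minimal sketch: fan one English caption segment out to several languages.
# The segment dict stands in for whatever an ASR engine emits upstream.
import boto3

translate = boto3.client("translate", region_name="us-east-1")  # illustrative region

# Hypothetical ASR output for one caption segment (start/end in seconds).
segment = {"start": 12.0, "end": 15.5,
           "text": "Welcome to today's lecture on photosynthesis."}

TARGET_LANGUAGES = ["es", "de", "ja", "ko"]  # Spanish, German, Japanese, Korean

def translate_segment(seg, targets):
    """Return {language_code: translated_text} for one caption segment."""
    out = {}
    for lang in targets:
        resp = translate.translate_text(
            Text=seg["text"],
            SourceLanguageCode="en",
            TargetLanguageCode=lang,
        )
        out[lang] = resp["TranslatedText"]
    return out

print(translate_segment(segment, TARGET_LANGUAGES))
```

In a real live workflow, each translated segment would then be timed against the video and pushed into the caption delivery path, which is the part Chris goes on to describe.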


08:22.75

Rishi

That's wonderful! Can you help me relate these challenges to how you are seeing them at AWS and how they impact a classroom? For example, how will embedding captions in a live stream in a classroom environment revolutionize education? Can you talk about that?


08:42.79

Chris

Yeah, totally. Let me first give a little bit of background on the live caption world, right? Traditionally in America, the way most people do captions today is to hire a captioner and embed those captions into the live stream itself before it hits the transcoding process. Those are normally hardware-based solutions, and you need hardware on premises, where your event is happening - they call it a caption encoder or something like that. Once you have the captions embedded into your video, you do the transcoding and then the delivery. But one of the limitations is that the caption encoders and the protocols they use, called 608 and 708, only support seven languages, which are Latin-based languages. So they support English, they support Spanish, but when you talk about Korean, Japanese, or Russian, they cannot support that - the protocol support simply isn't there. So when you look at the stream, from caption generation through distribution, I like to consider the problem in two separate phases. The first phase is signal generation and contribution to the cloud. The second part is, once you have your signal there, how do you distribute it to your viewers using current streaming technology? What I think the solution will be is to help customers leverage cloud-based ASR engines and generative AI technology to remove the heavy lifting they have to deal with in on-premises workflows. By doing that, I think we will enable the industry to take advantage of modern technologies and achieve its goals of delivering accessible content. So the solution I'm imagining is a late-binding solution for your captions. Captions, at the end of the day, are just text in different languages, and the 608/708 protocols try to embed that text inside your video delivery stream. There are two problems. One problem is that the protocol is limited today - we do not have a proper protocol to carry multiple languages per se. 708 does enable that particular capability, but if you're an industry expert, you will know that end-to-end support for 708 is not there; it would be another round of heavy lifting to support it. So what is an efficient way to make it happen relatively easily, without a lot of heavy lifting? The second part is that captions are text-based, so look at how we deliver live events over the internet today. If you're watching a live show on YouTube, or a live sports event, most likely it's delivered using the HTTP Live Streaming (HLS) protocol. That protocol supports VTT sidecar files for captions. So we can easily imagine a solution: leverage the ASR engine to generate the captions in different languages, and bind those captions, synchronized with your video, presented in the VTT format. By default, with this particular architecture, your live stream will be CDN agnostic, because every CDN supports it out of the box, and it will be player agnostic, because most players ubiquitously support this kind of streaming.
Specifically for edtech - depending on the scenario, whether it's a classroom or a hybrid session setup - by enabling this technology within your live streaming, edtech can leverage it to implement the solution in the cloud. Essentially, today most edtech companies may be enabling closed captions using on-premises hardware like caption encoders. But with this technology moving forward, they just need to send a stream to the cloud and leverage the cloud to generate the multi-language captions. Then they can provide those live captions either to students in the classroom or in remote locations, with the accessibility features turned on. I think that's going to be revolutionary for how edtech can enable their viewer experience.
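To make the late-binding idea more tangible, here is a minimal sketch that renders translated caption segments as a WebVTT sidecar file and notes, in a trailing comment, how an HLS master playlist can expose each language as a subtitle rendition; the segment data, file names, and group ID are illustrative assumptions, not a specific AWS product workflow.

```python
# Minimal sketch: late binding keeps caption text out of the video stream and
# delivers it as WebVTT sidecar files that standard HLS players can select.

def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, e.g. 00:00:12.000."""
    millis = int(round((seconds - int(seconds)) * 1000))
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02}.{millis:03}"

def build_webvtt(segments) -> str:
    """Render [(start, end, text), ...] as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)

# Hypothetical Japanese segments produced by ASR plus machine translation.
ja_segments = [(12.0, 15.5, "本日の光合成の講義へようこそ。")]
print(build_webvtt(ja_segments))

# In the HLS master playlist, each language then appears as a subtitle
# rendition the player lists under its CC menu, along the lines of:
#   #EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",LANGUAGE="ja",NAME="日本語",URI="captions_ja.m3u8"
```

Because the captions ride alongside the video rather than inside it, adding or dropping a language only changes the playlist and the sidecar files, which is what makes the approach CDN and player agnostic.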


14:20.74

Rishi

Chris, your insight into 608 and 708 for embedding captions is definitely intriguing. I would say you are really an expert in this area. Can you talk about how AI tools integrated into live video captioning will help the learning environment?


14:39.43

Chris

Oh, absolutely. Accessibility, I think, is going to be part of a lot of strategic discussions today for every enterprise and every segment of the industry, right? You want to reach more people, you want to reach more audiences. Coming back to the education segment, these multiple language captions are the first step. That first step will help viewers or students learn more effectively, and your content will be able to reach global audiences. For example, if you have English-speaking content, of course that is available to English speakers. But if you only offer English, you cannot reach folks in Japan, for example, who might be interested in your content. How do you reach them? By providing multiple language captions - that's one way. Further, we also talk about audio dubbing. In today's world, we need to provide an experience to viewers rather than just feeding them information. From the experience perspective - as a customer or viewer myself - I sometimes prefer to watch course material in my native language, or I might want to watch the course in English but turn Chinese captions on, so I can pick up the interesting words or the words I'm not very familiar with, get a better experience, and learn better from the content I want to learn. So I think with captions and audio dubbing, you will be able to grow your audience all over the world, provide a better user experience, and give users a choice of either watching the captions or selecting the audio dubbing option available to them.


17:02.81

Rishi

How can educators be trained to effectively use these AI tools in their teaching practices?


17:11.39

Chris

That's a good question. On the training part, because the solution is really a late-binding one, the accessibility feature is already built into most video players. If you watch a video player - Netflix or Amazon Prime or really any other player - there's a CC option in there. If you move your mouse over the video player area, you'll normally see a toolbar show up. In that toolbar you can turn on captions; you might see a CC icon, and when you click it you can select different languages or turn CC on or off. Whatever languages you provide to your viewers, those options are going to show up there. So customers already know how to select them.


18:09.47

Rishi

And are there any ethical considerations to keep in mind when these teachers or institutions are implementing AI-driven language tools in the educational setting?


18:20.24

Chris

For the teachers, I think it's going to be transparent, because all the technology we're trying to integrate is actually on the backend, right? For live streaming, we're leveraging AI and the ASR engine to automatically transcribe the teacher's voice. Today there may be a little bit of concern - for example, if I'm teaching in English and there are some words I don't pronounce the way a native English speaker would, those words might be mistaken by the ASR engine. But those models are being fine-tuned and continuously improved across the different ASR engines. So as the technology matures, it should get much better at picking up voices and doing a better job there.


19:23.14

Rishi

So basically, based on your extensive experience, Chris, what advice would you give to edtech companies looking to innovate with AI-driven language tools?


19:36.59

Chris

My advice is to always be open-minded, because when we talk about captions, a lot of people go directly to how we do captions today - we do 608, we use caption encoders. But when I look at this particular problem, I try to ask: what is the outcome I'm trying to drive? What is the customer experience I want to deliver? I do not start from what's available from a technology perspective; rather, I ask what you want to do for your customers. What are the features that improve your customer experience and can make an impact? Then I work backward from there: okay, those are the identified features or functions we need to deliver. Then we look at how we're going to deliver them using existing technologies - what's available, what's not available, and what the integration points are. If you look at live captions and ask why they weren't adopted before, the reason is that crafting a proper system is so hard, so difficult, and not scalable. But once you're able to work backward from the outcome, from the customer perspective, you'll have a better understanding of where you should be focusing and which problems to solve, rather than just leaning on the older, existing technology. There's a lot of baggage in the design process that we need to break.


21:37.80

Rishi

Yeah, so basically what you're saying is that, given the evolution and the technological advancements in precision and outcomes, humans are still influencing live captioning a lot, right?


21:52.77

Chris

Totally right.


21:55.90

Rishi

Yeah. So, on a personal-insight level, what advice would you give the industry, particularly faculty and institutions at large? Would you recommend they embrace AI-based live captioning now, or is this an area to watch and wait on?


22:16.80

Chris

Yeah, that's absolutely right. With this particular approach and the availability of the solution, I think this is going to get a lot easier for the customer moving forward, and also much cheaper. One of the reasons multi-language captions are not available for live events today is the logistics and the complexity of setting them up, and all the cost associated with hiring stenographers to do the job for your live event. It's not only the scheduling of your stenographers but also the complexity of setting up the system to deliver your event - it's just so difficult. But with this late-binding technology, in the current proposal as I see it moving forward, we are able to remove all that heavy lifting from the customer. We not only enable the customer to provide this particular experience, but we also make life a lot easier for the IT technicians or professionals who set up and manage this part of the process. You are probably already familiar with how to set up a live stream using AWS services or our partner solutions - it's basically a single click to bring up your video and get your delivery. We want to make the backend experience of delivering multiple language captions just as easy: one single click, select the languages you want to deliver to your customers, and be done with it. As the technology matures, and because we are leveraging more and more AI-based technology in the cloud, I can see the cost also coming down significantly. So with that, I definitely encourage people to start looking at live captioning features and solutions, and to use them to open up their content to global audiences. Another part of it is this - I hear from a lot of customers that they really, really want to deliver their content to their audiences, not only from a global reach perspective, but you also need to think about inclusion, right? What if people are really in need of accessibility for your content? For example, people who are visually impaired are really going to need an audio dubbing feature, so you're going to have to provide your content with audio dubbing so they can hear whatever content is being taught, because they cannot see it very clearly. And how do you enable that? Or, for example, if you have a Korean or Japanese student who wants to learn from your content - how good would the experience be if they could listen to the lecturer speaking in Japanese for your course? That would be an awesome experience. I think the need is coming not only from the enterprise's strategic perspective but is also driven by regulation; sometimes your content needs to be delivered in such a way. Both of those are shaping the strategy and the direction moving forward. So the technology we are developing here is definitely going to help customers remove the heavy lifting, reduce costs, and deliver the best possible accessible product to their customers.


26:10.54

Rishi

Thanks, Chris, for your parting thoughts and advice for leaders in the edtech and education space. We would love to hear more from you as you delve deeper into this live captioning area, and I believe you should definitely share your experience in educational settings with our audience. And a message to our growing community: if you are focused on improving accessibility in education, do follow us on LinkedIn and Twitter, and do connect with Chris Zhang for his wonderful views. Thanks! Thanks, Chris, appreciate your time today.


26:43.44

Chris

Thank you, Rishi, for having me. Glad to be able to share the solution here, and I hope we can provide more help in achieving those outcomes for your customers. Thank you.


26:52.50

Rishi

Thank you, thank you, Chris.