Today, Modulate launched ToxMod, a new service that uses AI to scan voice chat in video games for toxic speech or other bad behavior. It flags everything from racist speech to predatory behavior, taking into account how people say words, to tell game developers what needs their attention.
Modulate says the service is the world’s first voice-native moderation tool, enabling game companies to monitor voice chat and detect hateful speech in real time. It complements other voice-related technologies at the Cambridge, Massachusetts-based company, which is using machine learning techniques to create customizable “voice skins” for games. These let players modify their voice so they can have funny voices or disguise their identities.
ToxMod helps developers detect toxic, disruptive, or otherwise problematic speech in real time and automatically take nuanced actions, like blocking individual words, such as racial slurs, or identifying information, such as a phone number. Of course, this is gaming we’re talking about, and many games have rough talk as the ante for multiplayer play. ToxMod uses sophisticated machine learning models to understand not just what each player is saying but how they are saying it — including their emotion, volume, prosody, and more. In short, the company says, it knows the difference between “f*** off” and “f*** yeah!”
The goal is to help eliminate toxic community members on a large scale so that developers can keep up with the numbers of offenders and create real change in gaming communities.
Modulate raised $2 million from 2Enable Partners and Hyperplane Venture Capital, and it raised another $4 million earlier this year.
“Modulate’s mission statement is making voice chat more inclusive and immersive for online socialization,” Modulate CEO Mike Pappas said in an interview with GamesBeat. “At the core of what we’ve done is use machine learning techniques to process audio, whether for changing the player experience through voice skins or through better analyzing what’s actually happening within the game.”
Roots in machine learning
Modulate cofounders Carter Huffman (now chief technology officer) and Pappas met each other in college at MIT when Pappas stopped to help solve a physics problem that Huffman was pondering on a hallway chalkboard. Huffman went on to polish his chops in machine learning for spacecraft at the Jet Propulsion Laboratory, and he became interested in generative adversarial networks, a neural network technology that would later become useful in converting human voices. Huffman conceived of Modulate in 2015 and incorporated it in the fall of 2017. Pappas joined as a cofounder, and Terry Chen, vice president of audio, also helped get the company off the ground.
In 2019, Modulate introduced the concept of “voice skins” to the world. The VoiceWear service enables players to take on their chosen character’s authentic voice, transcending old-school voice changers. Of all the feedback the company received about voice skins, one comment intrigued it the most. Players from all demographics reported that voice skins were the one thing that allowed them to participate in voice chat at all. In speaking with these players, Modulate realized that many simply don’t feel comfortable putting their real voice out there, given the toxicity and harassment that’s all too prevalent in these communities. And it was clear this wasn’t merely anecdotal — studies show that 48% of all in-game toxicity now takes place through voice. Given the increasing importance of voice chat for socializing and coordinating in-game, this was an obviously crucial problem.
The voice skins have gained a lot of traction, and the company found that much of the neural network tech behind them could be used for other purposes as well.
“We started building out not only voice skins but seeing if we could moderate voice chat directly when people are being toxic in voice chat,” Huffman said. “We want to help community managers and moderators take action based on that voice chat proactively, and that’s where this new ToxMod product came from.”
By feeding voice signals into a moderation tool, ToxMod could assess voice chats for toxicity as they happen, with much greater accuracy than any other tools out there. The key here is the capability to analyze not just what is being said but also how it’s being said, including the emotion, prosody, and volume it’s spoken with.
ToxMod keeps an eye out for bad actors to ensure that nobody is damaging others’ experiences. ToxMod can do all of this directly on each player’s device, in real time, unlocking two unique capabilities. The first is that ToxMod can react in real time to offensive speech — not just shutting down whole conversations, but also taking more nuanced actions like blocking racial slurs or personal information.
The second is that ToxMod can preserve player privacy better than other voice moderation tools, the company said. Since it is processing the data on the device, the only reason it would send any data where anyone else can hear it would be if it detects a high probability of toxicity. Even then, the first stop for that data would be Modulate’s secure servers, which run even more sophisticated algorithms to validate the suspicion of toxicity. Only when there is a strong sense that something problematic is occurring will any of the audio be shared with a human moderation team. Because this chain of command is necessary to ensure accuracy, the detection and moderation can’t be entirely automated.
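The tiered escalation described above — an on-device check that gates whether audio ever leaves the player's machine, a server-side validation, and finally human review — could be sketched roughly like this. All function names and thresholds here are illustrative assumptions, not Modulate's actual API or values:

```python
from typing import Optional

# Hypothetical thresholds; the article does not disclose Modulate's values.
ON_DEVICE_THRESHOLD = 0.6   # audio is sent to the server only above this probability
SERVER_THRESHOLD = 0.85     # audio reaches human moderators only above this

def triage(on_device_score: float, server_score: Optional[float] = None) -> str:
    """Return where in the moderation pipeline a voice clip stops."""
    if on_device_score < ON_DEVICE_THRESHOLD:
        # Most audio never leaves the player's device, preserving privacy.
        return "discarded on device"
    if server_score is None or server_score < SERVER_THRESHOLD:
        # The server's heavier models did not validate the suspicion.
        return "dropped after server check"
    # Only strongly suspect clips are ever heard by a human moderator.
    return "queued for human review"
```

The design rationale matches the privacy claim in the article: because scoring happens on-device first, the vast majority of conversations are never transmitted anywhere.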
“There’s a trade-off between latency and accuracy here,” Huffman said. “That’s one of the big problems that we’re solving. And we’re pouring a lot of research and our machine learning chops into where, on the one hand, you have to be fast enough to be running in real time and be accurate enough to detect problems with no errors. We already have kind of a lot of expertise. But when you’re starting to detect these swear words or these racial slurs or this personal contact information, if you jump too early, you’re going to get a bunch of false positives and bleep out things that you shouldn’t.”
Still, the automation of the detection will help community teams enormously, as those groups can be inundated with work, especially if they have to find a way to transcribe a questionable gaming session.
Some teams will want to modify the threshold for toxicity. If you’re playing an adult-focused game like Call of Duty, you’ll hear a lot of swear words. But Modulate will be able to parse whether those swear words translate to serious threats or not, Huffman said. That’s where an individual player’s record is important. If the player has a history of toxicity, then the community manager can act more quickly to ban that player.
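A per-game threshold combined with a player's history, as Huffman describes, might look something like the following minimal sketch. The function name, the 10%-per-offense adjustment, and the floor value are all hypothetical, chosen only to illustrate the idea:

```python
def should_flag(score: float, game_threshold: float, prior_offenses: int) -> bool:
    """Flag a toxicity score against a game-specific threshold.

    A mature-rated game (e.g. Call of Duty) can set a high baseline so
    casual swearing passes, while a player's record of prior offenses
    lowers their effective threshold.
    """
    # Each prior offense lowers the effective threshold by 0.1, floored at 0.3.
    # These numbers are illustrative assumptions, not Modulate's logic.
    effective = max(0.3, game_threshold - 0.1 * prior_offenses)
    return score >= effective
```

With a lenient baseline of 0.9, a score of 0.75 would pass for a first-time player but be flagged for a repeat offender whose effective threshold has dropped.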
“If you hear the emotion of the speaker and everyone is having a good time, then you can predict there is a lower probability that this is actually going to be problematic,” Huffman said. “But if the speaker is sounding very loud and aggravated, and otherwise problematic, then it could be a toxic situation. And the moderator would want to jump on that.”
Modulate has been testing ToxMod for a while with its community and its own team. The company is talking to a number of big studios about using the technology for tasks like protecting kids from predators. Conceivably, a platform such as YouTube could use this to screen videos as they’re being uploaded to its service, just as it can screen for copyrighted music before a post goes up.
“All of these developers deeply understand how important it is to solve toxicity and voice chat,” Pappas said. “And so as soon as we came to them with this, the response was really overwhelmingly exciting, and we’ve seen extremely rapid movement from all of these studios. We’re very interested in the livestreaming applications from this as well.”
ToxMod might be able to help other AI startups as well. Alithea AI is using OpenAI’s technology to create animated avatars that can hold conversations with people. But to guard against any abuses of that system, Alithea AI would have to monitor what the avatars are used for, and that means monitoring their speech. With lots of avatars created, automating that process of monitoring for hate speech would be necessary.
ToxMod can use some of the same data that the voice skins use in modifying speech in real time, and this enables Modulate to detect hate speech as it happens. But because human moderators have to get involved before the action happens, game developers will still lag behind in intercepting the bad speech and toxic players. The challenge is that Modulate has to keep up with gamers who change their words so they can avoid being snagged by keyword detectors while still delivering a toxic message.
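The keyword-evasion problem mentioned above — players respelling words to slip past exact-match filters — is often countered by normalizing common character substitutions before matching. This is a generic sketch of that well-known technique, not Modulate's method, and the blocklist word is a placeholder:

```python
# Map common "leetspeak" substitutions back to letters.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

def normalize(word: str) -> str:
    """Lowercase, undo leetspeak substitutions, and strip separators."""
    cleaned = word.lower().translate(LEET_MAP)
    return "".join(ch for ch in cleaned if ch.isalpha())

def matches_blocklist(word: str, blocklist: set) -> bool:
    return normalize(word) in blocklist

# "id10t" and "i.d.i.o.t" both normalize to "idiot"
```

Of course, exact normalization only goes so far — which is precisely why the article emphasizes analyzing tone, emotion, and context rather than relying on keyword lists alone.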
The tool could also help call center employees deal with toxic callers, Huffman said. Modulate is part of the Fair Play Alliance, a consortium of game companies that want to solve problems such as toxic speech. “Many of the studios we are working with are members of the Fair Play Alliance as well,” Pappas said.
Overall, Modulate wants to create a single platform that can solve everything related to making voice chat better, from the voice skins to ToxMod. “We want to make people comfortable using voice chat,” Pappas said. “There are a lot of people, whether it’s because they’re worried they might get harassed, or they simply don’t like the sound of their voice, who don’t use voice chat today. Studios are interested in unlocking voice chat for more people.”