To address the need for more life-like social gatherings during the COVID-19 pandemic lockdown(s), I developed an online social meeting platform, similar to spatial.chat, with the help of some friends along the way.
During the COVID-19 pandemic, many people grew increasingly frustrated with video chat services such as Zoom and Teams. For a long time these were the only means through which we communicated with colleagues, friends and family. However, while the existing video chat applications were effective in the context of work, I found the interactions generally lacking for more informal social events (both at work and with friends). I noticed time and time again that online gatherings would end up with 10 to 20 people silently lingering in a video chat while one or two people in the group discussed a topic. This happens because the chat gets very chaotic as soon as two or more people start talking at the same time. You have to wait for your turn (which, in a chat with many people, can take a while), and that is not very engaging. Hence, I found these events would generally not last very long.
Now let’s compare that to a ‘real’ social gathering. Many conversations happen at the same time. Contrary to the online setting, we are able to filter out background noise and focus on one specific conversation (the cocktail party effect). How do we do this? Colin Cherry showed in 1953 that many variables play a role, including the gender of the speaker, the direction the sound comes from, the pitch, and the rate of speech [Cherry, 1953].
Some of these variables carry over to traditional video chat platforms. We can certainly identify gender, pitch and rate of speech in a video call, although the quality of the microphone, speakers, and compression algorithms inhibits our ability ever so slightly. Other variables, such as direction and loudness, are absent entirely.
To address this, early on during the pandemic I envisioned a video chat system where the audio volume and direction would depend on the speaker’s location relative to your own avatar. I hypothesised that, by adding sufficient visual and auditory cues, I could recreate the cocktail party effect in an online setting and leverage it to improve social gatherings. In addition, by mimicking the inverse-square law of sound, it would be easy to form new groups or join existing ones, further eliminating passive idling in video chats.
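The core of this idea, mapping the distance between two avatars to a playback gain, can be sketched as a small pure function. This is a minimal sketch, assuming illustrative constants and names (`gainForDistance`, `REFERENCE_DISTANCE`) that are not taken from the actual project:

```typescript
// Hypothetical distance-to-gain mapping for proximity audio.
const REFERENCE_DISTANCE = 100; // px: within this distance, gain is 1.0
const MAX_DISTANCE = 600;       // px: beyond this, the stream is fully muted

function gainForDistance(distance: number): number {
  if (distance >= MAX_DISTANCE) return 0;
  if (distance <= REFERENCE_DISTANCE) return 1;
  // Inverse-square falloff relative to the reference distance,
  // clamped so nearby speakers are never amplified above 1.
  return Math.min(1, (REFERENCE_DISTANCE / distance) ** 2);
}
```

In the browser, the resulting gain would typically be applied to each remote stream via `HTMLMediaElement.volume` or a Web Audio `GainNode`.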
After the idea had lingered in the back of my mind for a year, I finally set out to build a WebRTC prototype. It was inspired by the “proximity chat” mods for “Among Us”, a very popular game at the time, as well as a prior experience with the service spatial.chat.
Since then, many (new) companies have addressed this need using a variety of approaches. My thoughts on some of these:
- gather.town uses 8-bit characters walking around a virtual world. There is a great deal of customisation available, as well as secondary interactions like games or a shared whiteboard. I find the voice chat implementation a bit jarring: there is no easing in, audio and video just suddenly appear on the screen. There is also no clear spatial connection between a character and its video. In the end, when five or so people group together, it looks exactly like another Teams call.
- wonder.me is a great attempt too, and adds the ability to join an ‘area’ that fully isolates you from other users, even nearby ones. This is a nice take; however, to me it somewhat misses the point. Seeing other people close by and hearing them talk in the background is what creates a lively social scene, and the rooms take that benefit away again. I also could not get used to the odd way of moving your avatar around (holding down a click in the direction you want to go).
- spatial.chat was a source of inspiration for my attempt, and I think they nailed the UX design. Moving your avatar is intuitive direct manipulation (i.e., you manipulate the thing you want to change, rather than using external, non-associated controls such as keyboard keys or clicking the location you want to go to), there are no restrictions on how you can move, and all changes sync seemingly instantaneously between users.
The solution: Proximity Chat
The system was originally built using peer.js for WebRTC communication. I liked the idea of a fully peer-to-peer network with no central server (except for the initial handshake). However, during the first tests it was immediately clear that this approach does not scale to more than a handful of users. At around six or seven participants, some video and audio feeds started cutting out for some users (but not all). My guess is that their upload bandwidth was saturated by having to send their own video separately to every other peer, leaving no headroom for downloading the streams of others (although I never managed to verify this).
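The suspected bottleneck is easy to reason about with a back-of-envelope model (function names here are illustrative, not from the project): in a full mesh, each peer uploads its own video once per other participant, while an SFU needs only a single upload per peer.

```typescript
// Per-peer upload stream count in a full-mesh topology:
// each peer sends its own video separately to every other peer.
function meshUploadStreams(participants: number): number {
  return participants - 1;
}

// With an SFU, each peer uploads exactly one copy;
// the server fans it out to everyone else.
function sfuUploadStreams(_participants: number): number {
  return 1;
}

// e.g. at 7 participants with a 2 Mbit/s video stream:
// mesh: 6 × 2 = 12 Mbit/s of upload per peer — beyond many home connections.
// SFU:  1 × 2 =  2 Mbit/s of upload per peer, regardless of group size.
```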
Hence, version two switched to mediasoup as the WebRTC backend. This is a very sophisticated (and, frankly, rather complicated) Selective Forwarding Unit (SFU) server and client implementation. Because a central server forwards each video feed to the other participants, peers upload their video only once, cutting down on bandwidth requirements. Additionally, since the SFU only forwards streams rather than re-encoding or compressing them, the server-side load is relatively small, allowing it to run on a cheap or even free VPS of choice.
Next, to share user positions with the least amount of latency, we used uWebSockets.js (uws). The WebSocket protocol easily handles position updates at 30 frames per second from around 30 users simultaneously. WebRTC signalling and other gimmick features (such as ordering and drinking ‘beers’) are also handled over uws.
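Position updates can be kept tiny by packing them into a fixed binary layout before sending them over the socket. The wire format below is purely hypothetical (not the project’s actual protocol), but illustrates why 30 updates per second from 30 users is cheap:

```typescript
// Hypothetical 10-byte wire format: [uint16 userId][float32 x][float32 y]
function encodePosition(userId: number, x: number, y: number): ArrayBuffer {
  const buf = new ArrayBuffer(10);
  const view = new DataView(buf);
  view.setUint16(0, userId); // big-endian by default
  view.setFloat32(2, x);
  view.setFloat32(6, y);
  return buf;
}

function decodePosition(buf: ArrayBuffer): { userId: number; x: number; y: number } {
  const view = new DataView(buf);
  return {
    userId: view.getUint16(0),
    x: view.getFloat32(2),
    y: view.getFloat32(6),
  };
}
```

At 10 bytes per update, 30 users broadcasting at 30 Hz works out to roughly 9 KB/s of raw payload per receiver, which is negligible next to the video streams.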
In addition to adjusting the volume of streams based on others’ positions relative to you, initial prototypes also adjusted the stereo balance to indicate whether people were to your left or right. Even though this was nice when it worked, it often did not, due to users’ poor audio setups. Bluetooth headsets with a microphone in particular will switch from A2DP (good sound quality, but no microphone) to HSP/HFP (poor quality for both input and output), which reduces the audio quality significantly and outputs mono only. Some early beta testers also mistook the feature for a bug, reporting “it is broken, the left audio does not work”.
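The stereo balance itself can be derived from the horizontal offset between speaker and listener. A minimal sketch, assuming an illustrative `PAN_RANGE` constant; the output range of -1 (fully left) through +1 (fully right) matches what the Web Audio API’s `StereoPannerNode.pan` parameter expects:

```typescript
// px of horizontal offset at which a speaker is panned fully to one side
// (illustrative value, not from the project).
const PAN_RANGE = 400;

// dx = speaker's x position minus the listener's x position.
// Returns a pan value in [-1, 1]: negative = left, positive = right.
function panForOffset(dx: number): number {
  return Math.max(-1, Math.min(1, dx / PAN_RANGE));
}
```

In the browser, the result would be fed to a `StereoPannerNode` inserted between each remote stream’s source node and the destination.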
As a final touch, I got access to the building blueprints of the clubhouse of the student association I frequently visit (or used to, at least). Using Blender, I mocked up an isometric render of the building to serve as the background. A recognisable, skeuomorphic background is, in my opinion, truly what makes the difference between a boring video chat and a fun social drink.
Initially, we used the system privately within a friend group of the student association, but as more people got interested, we scaled it up for use within the entire fraternity first, and later the entire student association. Some other associations and groups at the university have used the system for social events too, and even added their own 3D renders of their buildings.
To some extent I think I succeeded in creating an online cocktail party effect and, more generally, an effective online social video chat platform. At its peak, the system supported over 35 simultaneously connected users. However, as the COVID-19 shelter-in-place measures are relaxed, interest in the platform is waning too. It was a fun and educational project though! For anyone interested, the code is hosted on GitHub:
Cherry, E. Colin (1953). “Some Experiments on the Recognition of Speech, with One and with Two Ears”. The Journal of the Acoustical Society of America, 25(5), 975–979.