If you wanted to build a multi-user session (multiple people connecting to a website with their camera and audio on), would you use WebRTC or WebSocket + getUserMedia? Why?
WebRTC is going to be better for most use cases (IMO). It provides congestion control so the video stays real-time, which you'd have to build yourself on top of a WebSocket (and TCP's head-of-line blocking works against you there). WebRTC works pretty well with very little effort.
One-to-one video/audio is fine and end-to-end encrypted out of the box with vanilla WebRTC (media always goes over DTLS-SRTP).
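To give a feel for how little plumbing a basic 1-to-1 call needs, here's a rough sketch (browser TypeScript). The `signaling` object is a stand-in for whatever channel you use to exchange offers/answers/ICE candidates (a plain WebSocket is fine for that part), and only the offering side is shown.

```ts
// Minimal 1-to-1 call sketch (offering side only). `signaling` is a hypothetical
// wrapper around your own signaling channel; a plain WebSocket works fine here.
declare const signaling: {
  send(msg: object): void;
  onMessage(cb: (msg: any) => void): void;
};

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

async function startCall(): Promise<void> {
  // Capture camera + mic and send every track to the peer.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  for (const track of stream.getTracks()) pc.addTrack(track, stream);

  // Trickle ICE candidates to the other side as they are discovered.
  pc.onicecandidate = (e) => {
    if (e.candidate) signaling.send({ candidate: e.candidate });
  };

  // Render whatever the remote peer sends back.
  pc.ontrack = (e) => {
    (document.querySelector("#remote") as HTMLVideoElement).srcObject = e.streams[0];
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ sdp: pc.localDescription });
}

// Apply the answer and remote candidates as they arrive.
signaling.onMessage(async (msg) => {
  if (msg.sdp) await pc.setRemoteDescription(msg.sdp);
  if (msg.candidate) await pc.addIceCandidate(msg.candidate);
});
```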
But for group conversations this won't scale, because with p2p (mesh) WebRTC each client has to encode an outgoing stream for, and decode an incoming stream from, every other participant, so CPU and upstream bandwidth grow with the size of the group (which quickly becomes too much for the client).
For that reason, many group video-conferencing apps switch to a centralized router (an SFU or MCU) once there are more than a handful of people; the comparison below shows why.
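A back-of-envelope comparison makes the difference concrete; the per-stream bitrate below is an assumption (roughly a 720p encode), not a spec.

```ts
// Illustrative upstream cost per participant, mesh vs. SFU.
// PER_STREAM_MBPS is an assumption (a typical 720p-ish encode), not a spec.
const PER_STREAM_MBPS = 1.5;

function upstreamMbps(participants: number, topology: "mesh" | "sfu"): number {
  // Mesh: each peer connection has its own encoder, so you encode and upload
  // a copy of your video for every other participant.
  // SFU: you upload a single copy and the server fans it out.
  const copies = topology === "mesh" ? participants - 1 : 1;
  return copies * PER_STREAM_MBPS;
}

for (const n of [2, 4, 8, 16]) {
  console.log(
    `${n} people: mesh ${upstreamMbps(n, "mesh")} Mbps up, SFU ${upstreamMbps(n, "sfu")} Mbps up`
  );
}
// At 8 people a mesh already asks each client for ~10.5 Mbps of upload and
// 7 simultaneous encodes; that is the point where apps move to a central router.
```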
Privacy then becomes an issue: most group video-conferencing services (Zoom, Skype, etc.) are only encrypted hop-by-hop, not end-to-end, so the provider can in principle see and log everything (your face, your voice, chat, files exchanged, etc.), because the centralized router handles streams it is able to decrypt.
Jitsi is an exception here: it adds an extra layer of end-to-end encryption on top of the SFU using Insertable Streams (https://jitsi.org/blog/e2ee/).
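Conceptually, Insertable Streams let you intercept each encoded frame before it leaves the browser and encrypt the payload yourself, so the SFU only ever forwards ciphertext. Here's a rough sketch of the idea using Chromium's non-standard `encodedInsertableStreams` option; the real Jitsi implementation is more involved, and `encryptFrame` is a placeholder for your own key management and cipher.

```ts
// Conceptual sketch of frame-level E2EE with insertable streams (Chromium's
// non-standard encodedInsertableStreams API, hence the `as any` casts).
// encryptFrame() is a placeholder for real key exchange + AEAD encryption.
declare function encryptFrame(payload: ArrayBuffer): Promise<ArrayBuffer>;

const pc = new RTCPeerConnection({ encodedInsertableStreams: true } as any);

function protectSender(sender: RTCRtpSender) {
  // Tap into the stream of *encoded* frames leaving the encoder.
  const { readable, writable } = (sender as any).createEncodedStreams();

  const encrypter = new TransformStream({
    async transform(frame: any, controller) {
      // Replace the encoded payload with ciphertext. The SFU can still route
      // the packets, but it can no longer decode the media inside them.
      frame.data = await encryptFrame(frame.data);
      controller.enqueue(frame);
    },
  });

  readable.pipeThrough(encrypter).pipeTo(writable);
}

// Wrap every outgoing sender; the receiving side needs a matching decrypt
// transform via receiver.createEncodedStreams().
navigator.mediaDevices.getUserMedia({ video: true, audio: true }).then((stream) => {
  for (const track of stream.getTracks()) {
    protectSender(pc.addTrack(track, stream));
  }
});
```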
So in summary I'd say:
- for 1-to-1, or just a few more people, go with normal p2p WebRTC;
- for multi-party sessions, host your own Jitsi server (or rent one from 8x8 or another provider) and use the Jitsi SDK/API to integrate it into your app (see the sketch after this list).
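If you go the Jitsi route, the quickest integration is the IFrame API. A minimal sketch, assuming `meet.example.com` is your own deployment (it could also be meet.jit.si or an 8x8 tenant) and that its `external_api.js` script is already loaded on the page:

```ts
// Minimal embed using Jitsi's IFrame API. "meet.example.com" is a placeholder
// for your own deployment; the external_api.js it serves defines the global
// JitsiMeetExternalAPI used below.
declare const JitsiMeetExternalAPI: any;

const api = new JitsiMeetExternalAPI("meet.example.com", {
  roomName: "my-team-standup",                  // placeholder room name
  parentNode: document.querySelector("#meet")!, // container element in your app
  width: "100%",
  height: 600,
  userInfo: { displayName: "Alice" },
});

// React to conference events from your own UI code.
api.addEventListener("videoConferenceJoined", () => console.log("joined"));
api.addEventListener("videoConferenceLeft", () => api.dispose());
```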
Also keep in mind that WebRTC gives you NAT traversal (ICE/STUN/TURN) and the peer-to-peer connectivity for free, which you would otherwise have to solve yourself.
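On top of a raw WebSocket you'd have to handle that yourself; with WebRTC it's mostly configuration. A minimal sketch, where the TURN URL and credentials are placeholders for your own coturn server or a hosted one:

```ts
// NAT traversal is mostly configuration: STUN discovers your public address,
// TURN relays media when a direct path cannot be punched through.
// The turn: URL and credentials are placeholders for your own server.
const config: RTCConfiguration = {
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      urls: "turn:turn.example.com:3478",
      username: "demo-user",
      credential: "demo-secret",
    },
  ],
};

const pc = new RTCPeerConnection(config);

// Handy while debugging connectivity: watch the ICE state changes.
pc.oniceconnectionstatechange = () => console.log("ICE state:", pc.iceConnectionState);
```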
IMO, don't forget to look at things like TokBox (Vonage), Twilio, and Agora. If you're just trying to get an MVP/PoC out, it's a lot easier to start with someone else's API, then dig deeper into the details when you're ready.
WebRTC has some difficult-to-debug corner cases that these APIs already handle for you. (Today I had to convince the boss to try one of them for an MVP because I hit a bug where Chrome and Edge just stopped listening to the microphone. Thanks to these samples, I learned it wasn't our code.)