This is based on an invited talk I gave this January at IS&T’s Electronic Imaging 2018, in the Photography, Mobile, and Immersive Imaging conference. The slides are uploaded here (PDF, 12MB), but since I tend to put more content in words than bullet points, I’ve also translated the talk into an “article” format in this post.
At Occipital, I lead the hardware and systems engineering aspects of everything discussed here; every video and screenshot represents a real system or piece of hardware that I’ve worked on, which is either available for consumer purchase today, or has otherwise been publicly demonstrated. Of course, none of this would be possible without the efforts of everyone at Occipital, from computer vision to industrial design, software engineering, operations, and everything else.
Recommended background material:
– Alex Vlachos, Advanced VR Rendering (GDC Vault)
– Simultaneous Localization and Mapping (Wikipedia)
– Rolling Shutter vs Global Shutter (Q Imaging)
The heart of this discussion is tightly-integrated systems for SLAM (Simultaneous Localization and Mapping): the design confluence of hardware and software for inside-out positional tracking and mapping, applied to the general problems of augmented and virtual reality systems. And really, it’s a full system problem. Imaging and sensing hardware is one of the key inputs, and design decisions there can ripple through the entire experience, but joint optimization of everything from cameras to displays and software is required to make a compelling mixed-reality product.
I’ve always found it difficult to explain the computer-vision systems I work on in text or conversation. The performance of a real-time VR or mixed-reality system is so visual that I always find it best to have someone experience it live. Videos don’t quite give you the “gut feeling” that’s important for VR, in which your stomach can clearly inform you whether the tracking is good or not, but they’re probably best for a blog format.
This is a demo Occipital recently showed at CES 2018 – it’s a real-time positional tracking and mapping system for VR. The hardware is two inexpensive, VGA, global shutter, wide FOV, monochrome cameras – here running at about 60Hz – plus one cheap mobile 6-DoF IMU. This has been strapped to a Vive (which we’re only using for the display, no Lighthouse), and data is being streamed live to the host PC, where a SLAM system is driving the positional tracking and building a sparse map of the environment. We’re compositing this tracked map onto a pass-through stereo feed from the cameras to illustrate the tracking quality. (More on the pass-through later.)
This one is a more dense reconstruction. Here, a Structure Sensor – a mobile structured-light camera – is clipped onto an iPad. The depth feed from the sensor is being fused with the iPad’s camera and IMU, in real time, to make 3D scans like this, on-device. Compared to the previous video: this is still a SLAM system, but where the first video focused on your position in the world, this one’s primary interest is the reconstruction itself. The point of this product is object and room scanning…
…But that’s not to say the same technology cannot be used for VR/AR, as well. In this video, the Structure Sensor is paired with an iPhone inside a Bridge headset, driving a fully self-contained mixed reality experience. Having a dense mesh as the basis of the tracking system can enable compositing virtual characters onto real scenes, with proper occlusions and scene awareness. You can even have little game-dev accidents, like some clipping @ 0:04, because from the perspective of the game developer or player, meshes for the environment and the character are both actually in the scene.
The hardware that powers these systems has a lot of variety – but, under the hood, these are all still SLAM systems; just applied in different ways. Also, at a high level, the hardware has some similarities: Everything that powers the above experiences has multiple cameras from multiple perspectives, and they all seek to extract 3D information in some way from what they see, so it can be used in a SLAM system, which in turn is used to do something interesting – like scanning or VR.
Fundamental for any 3D understanding of a scene – at the sensor and at the SLAM system level – is a deep and accurate understanding of the relationships between the individual cameras and sensors in any sensing product. Broadly speaking, that falls into three categories: Photometric, Geometric, and Temporal calibration.
The photometric side is largely your standard fare for calibrating camera response.
For a computer vision system that relies on stereo camera correspondences, it helps if the thing the system is looking at looks the same in both cameras – or, more generally, that the system can predict what a feature viewed from one eye would look like in the other.
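As a toy illustration of why this matters (my sketch, not a description of Occipital’s matcher), zero-normalized cross-correlation is a common way to score patch correspondences precisely because it tolerates gain and offset differences between the two cameras:

```python
import numpy as np

def zncc(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """Zero-normalized cross-correlation between two image patches.

    Subtracting the mean and dividing by the standard deviation makes the
    score tolerant to per-camera gain and offset differences -- one reason
    matching can still work when the two cameras aren't photometrically
    identical.
    """
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < 1e-12:
        return 0.0  # flat patches carry no correlation information
    return float(np.dot(a, b) / denom)
```

The better the photometric match between the cameras, the less this kind of normalization has to paper over.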
More generally, the quality of the camera matters. A better camera – higher quantum efficiency, higher shutter efficiency, better SNR, a sharper lens, etc – generally means a better vision system; it also usually means a more expensive or larger system, so there’s a bit of a balance to be maintained for consumer products.
For an augmented or mixed reality display, we need to go a step higher into the system. Say we have pass-through video from cameras and we want to render some augmented reality experience on top of it – the rendered content probably needs to look like it belongs with the background image in the headset. So long as cameras and displays that perfectly match the human eye remain a thing of the future, we can at least calibrate the color response of the cameras and displays (among other things), and make the renderer aware of these non-idealities.
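A minimal sketch of that idea, using a made-up gamma exponent in place of a measured response curve: the renderer can push its linear output through the camera’s calibrated response so composited content lands in the same space as the pass-through pixels.

```python
import numpy as np

# Hypothetical response exponent standing in for a measured calibration curve.
CAMERA_GAMMA = 2.2   # approximate scene-radiance -> camera-pixel response

def match_rendered_to_passthrough(rendered_linear: np.ndarray) -> np.ndarray:
    """Push linear rendered radiance through the camera's (calibrated) response.

    The pass-through background already carries the camera's response, so
    applying the same curve to rendered content keeps composited pixels
    visually consistent with the video behind them. A similar per-channel
    correction, built from the display's characterization, can be applied
    to the final composited frame before scanout.
    """
    return np.clip(rendered_linear, 0.0, 1.0) ** (1.0 / CAMERA_GAMMA)
```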
This might be familiar from general vision processing. For reasons ranging from geometric triangulation to processing efficiency, we must calibrate:
…plus a few more things, depending on the system. This is a lot of degrees of freedom to calibrate – more than 60 for the list above, depending on the model(s) used for each component – but I can attest that such a calibration can be realized in a mass-production factory context.
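To give a feel for how the degrees of freedom add up, here is a purely illustrative parameter grouping for a two-camera-plus-IMU rig (my grouping and counts, not Occipital’s calibration model):

```python
from dataclasses import dataclass

@dataclass
class CameraCalibration:
    intrinsics: list    # fx, fy, cx, cy                                -> 4 DoF
    distortion: list    # e.g. a radial/tangential lens model           -> ~5 DoF
    extrinsics: list    # pose relative to a rig reference frame        -> 6 DoF

@dataclass
class ImuCalibration:
    biases: list                  # gyro + accelerometer biases          -> 6 DoF
    scale_and_misalignment: list  # per-axis scale and cross-axis terms  -> up to 18 DoF
    extrinsics: list              # IMU pose in the rig reference frame  -> 6 DoF

# Two cameras (2 x ~15 DoF) plus the IMU (~30 DoF) already lands around
# 60 parameters, before timing terms or any model-specific extras.
```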
However, the world is a lot squishier and bendier than we all like to admit. Even “flat” things (like the monitor/phone/tablet/whatever you are reading this on) will bend a surprising amount under load; this will ruin any factory calibration. To throw out some ballpark numbers, let’s invent a reasonable 2-camera system with a metal stiffener:
If I were to hold one end of the bar and push down on the other, it would only require about 0.7N of force to tilt the cameras apart more than 0.25deg; that’s half a pixel at the center of the FOV – more elsewhere in the image – and probably more error than many vision systems can bear.
0.7N is about the force required to press a key on a computer keyboard, and that’s enough to break a fixed calibration. Compare that against the force required to, say, stretch the straps on a VR headset. Additionally, this is with a 5mm-thick piece of metal; that’s already a bit thick for many applications, and doesn’t even count the thickness of the cameras themselves, PCB(s), cosmetic plastic/glass, fasteners, mount points, or anything else that would need to fit into a real product.
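For intuition on the half-pixel figure, here is the back-of-the-envelope conversion from tilt angle to pixel displacement at the image center, assuming a simple pinhole model and a wide-FOV VGA camera (the 140° FOV is my assumption, chosen only to make the arithmetic concrete):

```python
import math

def pixels_per_degree_at_center(image_width_px: float, fov_deg: float) -> float:
    """Approximate pixel shift at the image center per degree of camera tilt.

    Uses the pinhole relation f = (W/2) / tan(FOV/2); near the optical axis,
    a small rotation of dtheta radians shifts features by roughly f * dtheta.
    """
    f_px = (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return f_px * math.radians(1.0)

# Assumed numbers: VGA width, ~140 degree horizontal FOV.
shift = 0.25 * pixels_per_degree_at_center(640, 140.0)
print(f"0.25 deg of tilt ~ {shift:.2f} px at the image center")
# ~0.5 px here; points away from the center (or narrower lenses) see more.
```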
The only real solutions are to design systems where deformation is unimportant (errors do not impact the final experience), or to ensure such changes are accounted for in some way – i.e., over time, the system is able to learn changes to its calibration, or is otherwise tolerant to them.
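One common pattern for the “learn changes over time” route (a sketch of the general idea, not a statement about Occipital’s estimator) is to fold the calibration terms into the estimator’s state, so they keep being refined alongside the pose:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OnlineState:
    """Illustrative estimator state that carries calibration terms.

    Keeping the stereo and camera-IMU extrinsics in the state lets the SLAM
    back end keep refining them as the device flexes, instead of trusting a
    factory value forever.
    """
    pose: np.ndarray            # 6 DoF head pose
    velocity: np.ndarray        # 3 DoF linear velocity
    imu_biases: np.ndarray      # 6 DoF gyro + accelerometer biases
    cam1_from_cam0: np.ndarray  # 6 DoF stereo extrinsics, refined online
    imu_from_cam0: np.ndarray   # 6 DoF camera-IMU extrinsics, refined online
```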
If a system has more than one camera, chances are they need to be synchronized – to each other, and/or potentially to some external timing reference. Likewise, if a system combines multiple types of sensors – visual and IMU, for example – we will eventually want to correlate these measurements on some unified time scale.
In practice, this means calibrating everything about these sensors’ internal timings and group delays; for a full AR/VR system, though, our understanding of the timing has to go much higher – all the way to the displays.
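As one example from the temporal side (a sketch of a generic technique, not our production pipeline), a constant camera-vs-IMU offset can be recovered by comparing the gyro’s angular speed with the rotation rate estimated from the images, and sliding one signal against the other until they line up:

```python
import numpy as np

def estimate_time_offset(gyro_speed: np.ndarray,
                         visual_speed: np.ndarray,
                         sample_dt: float) -> float:
    """Estimate a constant camera-vs-IMU time offset by cross-correlation.

    Both inputs are angular-speed magnitudes resampled onto the same uniform
    clock with spacing `sample_dt` seconds. Returns the estimated delay of
    the visual measurements relative to the gyro (positive if the camera
    timestamps lag), which can then be folded into the timing model.
    """
    g = gyro_speed - gyro_speed.mean()
    v = visual_speed - visual_speed.mean()
    corr = np.correlate(g, v, mode="full")
    # Index (len(v) - 1) is zero lag; peaks to the left of it mean the
    # visual signal is a delayed copy of the gyro signal.
    offset_samples = (len(v) - 1) - int(np.argmax(corr))
    return offset_samples * sample_dt
```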
To illustrate, let’s invent another pass-through system, modeled after the one from the CES demo above. The goal will be to make something that passes through video feeds to a modern VR display, compositing some rendered content on top to make a mixed-reality experience.
The hardware might look like this:
The timing of this system might look like the following.
Right away, we have a problem: Following the timing diagram directly, there’s about a 50 millisecond delay – four and a half frames! – between the camera exposure and when the user might see it. A user’s stomach will inform them about that sort of latency in a hurry, so we need to pull some tricks here to make the experience feel great.
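The exact breakdown depends on the hardware, but a purely illustrative budget (my stage durations, not the measured numbers behind the diagram) shows how quickly exposure, readout, transfer, processing, rendering, and scanout stack up to roughly 50 ms at 90 Hz:

```python
FRAME_MS = 1000.0 / 90.0  # ~11.1 ms per display frame at 90 Hz

# Purely illustrative stage durations (ms); real numbers depend on the system.
stages = {
    "exposure (mid-exposure to end)": 2.0,
    "sensor readout + USB transfer": 10.0,
    "SLAM / image processing": 11.0,
    "game simulation + render submission": 11.0,
    "GPU render": 11.0,
    "display scanout + persistence": 6.0,
}

total_ms = sum(stages.values())
print(f"total ~{total_ms:.0f} ms = {total_ms / FRAME_MS:.1f} frames at 90 Hz")
```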
The VR/AR “game” being driven by the SLAM system needs a pose at two points in time: during the game simulation, and right before the GPU begins rendering. I’ll focus on rendering, since that’s of critical importance for the experience.
At the time of rendering, the SLAM system is working off of a camera exposure that’s more than two frames old, plus any integrated data from the faster IMU that it might have received in the meantime. The pose that is desired, however, is the one two frames in the future – the location of the user’s head when they get to see the image that the GPU rendered. This sort of data integration + prediction already works very well today, but this doesn’t help us with pass-through camera data.
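A minimal sketch of the prediction step, assuming a short horizon and a constant-velocity model in place of proper IMU integration:

```python
import numpy as np

def predict_pose(position: np.ndarray,
                 rotation: np.ndarray,          # 3x3 rotation, world-from-head
                 linear_velocity: np.ndarray,   # m/s, world frame
                 angular_velocity: np.ndarray,  # rad/s, body frame (e.g. gyro)
                 dt: float):
    """Extrapolate a tracked pose forward by dt seconds.

    A constant-velocity model is a crude stand-in for IMU integration, but it
    captures the idea: render with the pose the head will have when the photons
    reach the eye, not the pose at the last camera exposure.
    """
    predicted_position = position + linear_velocity * dt

    # First-order rotation update: R <- R * (I + [w*dt]_x); good enough for a
    # sketch over a few milliseconds, though it doesn't stay exactly orthonormal.
    wx, wy, wz = angular_velocity * dt
    skew = np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])
    predicted_rotation = rotation @ (np.eye(3) + skew)
    return predicted_position, predicted_rotation
```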
For the pass-through imagery, we have to use a combination of more-recent data and prediction. Even though the SLAM system has not had time to process the most recent frame (marked in the timing diagram), we still have it in memory when we want to send data to the GPU, and we have some idea (via IMU integration) of where this image was taken from. This way, we 1) can transform this image based on our pose prediction, and 2) reduce the amount of forward prediction required to make the transformed composite image.
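One simplified way to express that transform (rotation only, ignoring translation and lens distortion) is the homography H = K * R_delta * K^-1, where R_delta is the predicted rotation between the frame’s capture pose and the display pose:

```python
import numpy as np

def rotation_only_reprojection(K: np.ndarray, R_delta: np.ndarray) -> np.ndarray:
    """Homography that re-renders an image as if the camera had rotated.

    K is the 3x3 camera intrinsic matrix; R_delta rotates rays from the
    capture pose into the predicted display pose. Translation and lens
    distortion are ignored in this sketch, which is part of why pure-rotation
    ("timewarp"-style) corrections are comparatively cheap.
    """
    return K @ R_delta @ np.linalg.inv(K)

# The resulting 3x3 matrix can be handed to any image-warping routine (or
# applied in a shader) to shift the pass-through frame toward the predicted
# head pose before compositing.
```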
From a temporal perspective, anything that shrinks the system timing diagram is great for the user experience of the system. Faster cameras are a part of this; something that could produce compelling video at not 90 FPS, but 180 or 270 FPS could easily shave off a frame of delay between exposure and readout.
That said, such increases need to come with photometric improvements in sensor quantum efficiency / better lenses / improved SNR; faster images won’t do any good if they are too noisy or dark to use. In fact, most of my photometric wish list boils down to “better cameras” – the challenge being to do so in the form factor and cost of current miniature fixed-focus plastic lenses.
As far as geometric improvements are concerned: VR display resolutions won’t be decreasing in the future; cameras for pass-through systems are already trying to catch up in terms of resolution. The trend in the mobile small-camera market has been towards floating auto-focus lenses, which further compound the mechanical problems detailed in this article. I would argue that new calibration & SLAM system models are needed to properly account for the variances those sorts of systems might cause.
I also have a “wish list” of improvements to sensor silicon that could also help systems like these… but that might be something for a future post.