Computer System Challenges for BiReality


Presentation Transcript


  1. Computer System Challenges for BiReality July 9, 2003

  2. Motivation • Business travel is time consuming and expensive: $100 billion a year • Current alternatives (audio and video conferences) leave a lot to be desired

  3. Project Overview • Goal: • To create, to the greatest extent practical, both for the user and for the people at the remote location, the sensory experience of the user actually being in the remote location, as relevant for business interactions.

  4. What Sensory Experiences are Relevant? • Visual perception: • Wide visual field • High resolution visual field • Everyone is “life size” • Gaze is preserved • Colors are accurately perceived • Audio perception: • High-dynamic range audio • Full frequency range • Directional sound field • Mobility: • Able to move around the remote location • Sitting/standing position and height • Other senses (smell, taste, touch, kinesthetic, vestibular) are not as important for most types of business

  5. System Using First-Generation Prototype • What can we do with current technology? • User in immersion room, connected over high-bandwidth, low-cost internet to surrogate at remote location

  6. Some Core Technologies • Common to both models 1 and 2: • Audio Telepresence • Gaze Preservation

  7. Audio Telepresence • Near CD-quality dynamic range & frequency range • Users should enjoy directional hearing • Enables listening to one person in room full of speaking people • aka the “Cocktail Party Effect” • Users should have directional output • Users can whisper into someone else’s ear in a meeting • This enables selective attention to and participation in parallel conversations • A compelling sense of presence is created (“spooky”) • Challenge: Full-duplex with minimum feedback • Limitation: Latency between local and remote location
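
As an illustration of the directional-hearing requirement (this is not the project's actual audio pipeline), the Python sketch below spatializes a mono talker by applying an approximate interaural time and level difference; the head radius, gains, and sample rate are illustrative assumptions.

import numpy as np

SAMPLE_RATE = 16_000          # Hz, illustrative
HEAD_RADIUS = 0.09            # meters, average head radius (assumption)
SPEED_OF_SOUND = 343.0        # m/s

def spatialize(mono, azimuth_deg):
    """Pan a mono signal to stereo using a crude interaural time/level model.

    azimuth_deg: source direction, 0 = straight ahead, +90 = hard right.
    This is only a sketch of directional rendering, not a real HRTF.
    """
    az = np.radians(azimuth_deg)
    # Woodworth approximation of the interaural time difference (seconds).
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (abs(az) + np.sin(abs(az)))
    delay = int(round(itd * SAMPLE_RATE))          # samples of delay at the far ear
    # Simple interaural level difference: attenuate the far ear.
    near_gain, far_gain = 1.0, 0.6
    delayed = np.concatenate([np.zeros(delay), mono])[: len(mono)]
    if azimuth_deg >= 0:          # source on the right
        left, right = far_gain * delayed, near_gain * mono
    else:                          # source on the left
        left, right = near_gain * mono, far_gain * delayed
    return np.stack([left, right], axis=1)

if __name__ == "__main__":
    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
    voice = 0.5 * np.sin(2 * np.pi * 440 * t)       # stand-in for a talker
    stereo = spatialize(voice, azimuth_deg=45)       # talker 45 degrees to the right
    print(stereo.shape)                               # (16000, 2)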

  8. Gaze Preservation • Fundamental means of human communication • Very intuitive: toddlers can do it before talking • Not preserved in traditional videoconferencing • Gaze is very useful • Signals focus of attention (on presenter or clock?) • Turn taking in conversations • Remote and local participants must be presented life size for total gaze preservation (harder than 1:1 eye contact) • Otherwise angles don’t match • Life-size presentation also necessary for: • Reading facial expressions • Making everyone an equal participant
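
A small worked example of why life-size presentation matters for gaze (the distances below are illustrative assumptions): when a remote participant is shown at reduced scale, on-screen offsets shrink but the local viewer's distance to the screen does not, so perceived gaze angles no longer match the true ones.

import math

# Illustrative geometry: a remote participant looks at a colleague seated
# 1.0 m to their left, with everyone about 2.0 m from the camera/display plane.
lateral_offset_m = 1.0
viewing_distance_m = 2.0
true_angle = math.degrees(math.atan2(lateral_offset_m, viewing_distance_m))

for scale in (1.0, 0.5):          # life-size vs. half-size presentation
    # At reduced scale the on-screen offset shrinks, so the angle the local
    # viewer perceives shrinks with it and the gaze cue is distorted.
    perceived = math.degrees(math.atan2(scale * lateral_offset_m, viewing_distance_m))
    print(f"scale {scale:.1f}: true {true_angle:.1f} deg, perceived {perceived:.1f} deg")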

  9. Video

  10. 2nd Generation MIMT Advances • Two drivers of advances: • Improvements based on experience with model 1 • Improvements from base technology • More anthropomorphic surrogate • Closer to a human footprint • 360-degree surround video • Enables local rotation (model 3 feature) • Near-Infrared head tracking • Can’t use blue screen anymore • Eliminates blue screen halo in user’s hair • Preserves the user’s head height • User can sit or stand at remote location

  11. Model 2 Surrogate

  12. IR 360

  13. 360-Degree Surround Video • Improved video quality over model 1 • Four 704x480 MPEG-2 streams in each direction • Hardware capture and encoder • Software decoder (derived from HPLabs decoder) • Very immersive - after several minutes: • Users forget where door is on display cube • But users know where doors at remote location are

  14. Problem: Head Tracking in IR 360 • Can’t rely on chromakey • Heads of remote people projected on screens too

  15. Visible Images

  16. IR 360: Track via NIR Difference Keying • Projectors output 3 colors: R, G, B (400-700nm) • 3 LCD panels with color band pass filters • Projectors do not output NIR (700-1000nm) • Projection screens & room would appear black in NIR • Add NIR illuminators to evenly light projection screens • People look fairly normal in NIR (monochrome) • Use NIR cameras to find person against unchanging NIR background
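
A minimal sketch of the difference-keying step, assuming a stored reference image of the evenly NIR-lit background; the array sizes and threshold are assumptions, not the project's actual values.

import numpy as np

def nir_person_mask(nir_frame, nir_background, threshold=25):
    """Segment the user from the NIR camera image by background differencing.

    nir_frame, nir_background: uint8 arrays (H, W) from the NIR camera.
    Because the projectors emit no NIR and the screens are evenly NIR-lit,
    the background stays constant and any large difference is the person.
    """
    diff = np.abs(nir_frame.astype(np.int16) - nir_background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)          # 1 = person, 0 = background

# Illustrative use with synthetic data:
background = np.full((480, 704), 120, dtype=np.uint8)   # evenly NIR-lit screens
frame = background.copy()
frame[100:300, 200:320] = 60                            # darker region where the user stands
mask = nir_person_mask(frame, background)
print(mask.sum(), "foreground pixels")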

  17. Near-Infrared Images

  18. Near-Infrared Differences

  19. Preserving the User’s Head Height • Lesson from model 1: Hard to see “eye-to-eye” with someone if you are sitting and they are standing • Formal interactions sitting (meeting in conference room) • Casual conversations standing (meet in hallway) • Model 2 system supports both: • Computes head height of user by NIR triangulation • Surrogate servos height so user’s eyes at same level on display • User controls just by sitting or standing • No wires on user => very natural
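
The sketch below shows one way head-height triangulation and the height servo command could work, assuming two NIR cameras on the same wall at a known height and spacing; the geometry, angles, and eye/head offset are all illustrative assumptions rather than the project's actual calibration.

import math

def head_height_from_nir(baseline_m, az1_deg, az2_deg, elev1_deg, cam_height_m):
    """Estimate the user's head height by triangulation from two NIR cameras.

    Both cameras sit at the same height on one wall, baseline_m apart.  Each
    reports the horizontal bearing (azimuth) to the head measured from the
    baseline; camera 1 also reports the elevation angle above horizontal.
    """
    a1, a2 = math.radians(az1_deg), math.radians(az2_deg)
    # Sine rule in the horizontal plane gives camera 1's ground range to the head.
    ground_range = baseline_m * math.sin(a2) / math.sin(a1 + a2)
    return cam_height_m + ground_range * math.tan(math.radians(elev1_deg))

def surrogate_height_command(user_eye_height_m, display_eye_row_m=1.2):
    """Map the user's eye height to a surrogate lift position so the user's eyes
    on the surrogate's display sit at the user's real eye level.  The offset is
    an assumed calibration constant."""
    return user_eye_height_m - display_eye_row_m

if __name__ == "__main__":
    head = head_height_from_nir(baseline_m=2.0, az1_deg=70, az2_deg=65,
                                elev1_deg=20, cam_height_m=1.0)
    eyes = head - 0.10                      # eyes ~10 cm below top of head (assumption)
    print(f"head {head:.2f} m, lift command {surrogate_height_command(eyes):.2f} m")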

  20. Preserving Gaze and Height Simultaneously • Since the user can stand or sit down, a single camera won’t preserve gaze vertically • Similar problem to camera on top of monitor • Solution: • Use 4 color cameras in each corner of display cube • Select between them using video switcher based on user’s eye height • Eye height computed from head height via NIR • Tilt cameras in head of surrogate at same angle as user’s eyes to center of screen • Angle computed via NIR head tracking • Warp video in real time so that adjacent videos still match for panorama • When user stands or sits down, their perspective changes as if the screen were a window
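
A rough sketch of the selection and tilt logic, modeling the vertically separated cameras in one corner of the display cube as a column of known mounting heights; all dimensions are illustrative assumptions.

import math

# Assumed geometry of the display cube (all numbers illustrative).
CORNER_CAMERA_HEIGHTS_M = [0.9, 1.2, 1.5, 1.8]   # the stacked cameras in one corner
SCREEN_CENTER_HEIGHT_M = 1.35
USER_TO_SCREEN_M = 1.2

def select_camera(eye_height_m):
    """Pick the corner camera mounted closest to the user's eye height."""
    return min(range(len(CORNER_CAMERA_HEIGHTS_M)),
               key=lambda i: abs(CORNER_CAMERA_HEIGHTS_M[i] - eye_height_m))

def surrogate_tilt_deg(eye_height_m):
    """Tilt the surrogate's head cameras at the same angle at which the user's
    eyes look at the center of the screen (positive = looking down)."""
    return math.degrees(math.atan2(eye_height_m - SCREEN_CENTER_HEIGHT_M,
                                   USER_TO_SCREEN_M))

if __name__ == "__main__":
    for eyes in (1.15, 1.65):                     # seated vs. standing (assumed)
        cam = select_camera(eyes)
        print(f"eyes {eyes:.2f} m -> camera {cam}, tilt {surrogate_tilt_deg(eyes):+.1f} deg")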

  21. Enhancing Mobility with Model 3 • Many important interactions take place outside meeting rooms • Need mobility to visit offices, meet people in common areas • Lesson from model 1: teleoperated mechanical motion is unimmersive • Holonomic platform for Model 3 in development • Can move in any direction without rotation of platform • User rotates in display cube, not at remote location • No latency or feedback delay • Natural and immersive • Base will move perpendicular to plane of user’s hips • People usually move perpendicular to hip plane when walking • Speed will be controlled by wireless handgrip
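
A minimal sketch of the intended control mapping, assuming a tracker reports the hip-plane orientation and the handgrip reports a scalar speed; the coordinate convention is an assumption.

import math

def base_velocity(hip_yaw_deg, grip_speed_mps):
    """Translate the user's intent into a holonomic base command (sketch).

    The base moves perpendicular to the plane of the user's hips, at the speed
    set by the wireless handgrip, and never rotates: rotation happens in the
    display cube instead.  hip_yaw_deg is the hip-plane orientation reported
    by the tracker (an assumed input), 0 = facing the front screen.
    """
    heading = math.radians(hip_yaw_deg)            # direction of travel
    vx = grip_speed_mps * math.cos(heading)        # m/s along the room's x axis
    vy = grip_speed_mps * math.sin(heading)        # m/s along the room's y axis
    omega = 0.0                                    # the platform itself never rotates
    return vx, vy, omega

print(base_velocity(hip_yaw_deg=30, grip_speed_mps=0.8))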

  22. Model 3 Base • Same approximate form factor as model 2 base • Holonomic design with six wheels for enhanced stability

  23. Overview of System-Level Challenges • Mobile computing in the large • Model 1 and Model 3 surrogates run off batteries • Only have 1-2KWh (1000-2000X AAA cell) • Extreme computation demands from multimedia • 3.06GHz HT machines only capable of two 704x480 at 10fps • Already use hardware MPEG-2 encoders • Would like frame rates of 15 to 30fps • 15fps should be possible with 0.09um processors • Still only provides 20/200 vision • 20/20 vision would require 100X pixels, ? bits, ? MIPS
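
A back-of-envelope check of the pixel budget, assuming 20/20 acuity corresponds to roughly one arcminute per pixel and the current four streams cover 360 degrees horizontally by about 60 degrees vertically: this lands near 60X at one pixel per arcminute, and about four times that at two pixels per arcminute, bracketing the 100X figure above.

# Rough sanity check of the pixel budget for 20/20 surround vision.
# Assumptions: 20/20 acuity ~ 1 arcminute per pixel; the four 704x480 streams
# cover 360 degrees horizontally and roughly 60 degrees vertically.
current_h, current_v = 4 * 704, 480
needed_h = 360 * 60            # 360 deg * 60 arcmin/deg = 21,600 pixels
needed_v = 60 * 60             # ~60 deg vertical field  =  3,600 pixels
ratio = (needed_h * needed_v) / (current_h * current_v)
print(f"~{ratio:.0f}x more pixels")   # ~58x; sampling at 2 px/arcmin would quadruple it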

  24. Model 2 vs. Model 3 Surrogate • Model 2 is powered via a wall plug • Not mobile, but • Avoids power issues for now • Moore’s Law leads to more computing per Watt with time • Stepping stone to mobile model 3 • Only one model 2 prototype • Learn as much as possible before model 3

  25. Power Issues • Reduction in PCs from 4 to 2 saves power • Model 1 power dissipation = 550W • Model 2 power dissipation = 250W • Motion uses relatively little power • Good news – most power scales with Moore’s Law

  26. CPU and Graphics Requirements • Performance highly dependent on both • Graphics card performance • Dual screen display, 2048x768 • Most texture mapping for games uses semi-static textures • We have to download new textures for each new video frame • 720x480x15x4 = 21MB/sec per video stream • 42MB/sec per card • CPU performance • Currently use 3.06GHz hyperthreaded Pentium 4, DDR333 • MPEG decoders have to produce bits at 42MB/sec • Currently uses 25% of CPU per video stream • 50% of CPU for two • Makes you really appreciate a 1940s tube TV design
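
Extending the arithmetic above to the frame rates mentioned earlier (the 4 bytes per pixel assumes RGBA textures):

# Texture download budget per graphics card, extending the slide's arithmetic.
# Two 720x480 video streams per card, 4 bytes per pixel (RGBA, as assumed here).
width, height, bytes_per_pixel, streams_per_card = 720, 480, 4, 2
for fps in (10, 15, 30):
    mb_per_s = width * height * bytes_per_pixel * fps * streams_per_card / 1e6
    print(f"{fps:2d} fps -> {mb_per_s:5.1f} MB/s of new textures per card")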

  27. WLAN Challenges • Model 2 bandwidth requirements • About 21Mbit/s total • 8 MPEG-2 full D1 (720x480) video streams (95% of bandwidth) • 5 channels of near CD-quality audio (only 5% of total bandwidth) • Getting these to work with low latency over 802.11a is challenging • Packet loss and lack of QoS • Like all WLAN, bandwidth is a strong function of distance • Vendors don’t like to advertise this • 1/9 bandwidth at larger distances common • Currently use 2nd generation Atheros 802.11a parts via Proxim • 108 Mb/sec max in Turbo mode • 12Mb/s at longer ranges
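
A quick budget check, with per-stream rates back-derived from the percentages above (so approximate): the offered load of roughly 21 Mbit/s leaves essentially no headroom once long-range 802.11a throughput drops toward 12 Mbit/s.

# Model 2 wireless bandwidth budget (approximate rates derived from the slide).
video_streams, video_mbps_each = 8, 2.5        # ~95% of ~21 Mbit/s across 8 streams
audio_channels, audio_mbps_each = 5, 0.2       # near CD-quality, ~5% of the total
total = video_streams * video_mbps_each + audio_channels * audio_mbps_each
print(f"offered load ~{total:.0f} Mbit/s")     # ~21 Mbit/s
print("long-range 802.11a throughput ~12 Mbit/s -> not enough headroom")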

  28. UDP • Model 1 & 2 systems developed with TCP • TCP plus WLAN at near capacity doesn’t work • TCP can add delay • Small video artifacts better than longer latency for interactive use • Converting to UDP now • Requires adding error detection and concealment to MPEG-2 decoder
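
A minimal sketch of the UDP approach: each packet carries a sequence number so the receiver can detect loss and conceal it (here by reusing the previous chunk) instead of stalling on retransmission the way TCP does. The port, packet layout, and concealment policy are assumptions, not the project's actual protocol.

import socket
import struct

HOST, PORT = "127.0.0.1", 9500        # illustrative endpoint
HEADER = struct.Struct("!I")           # 4-byte sequence number, network byte order

def send_chunk(sock, seq, payload):
    """Prefix each media chunk with a sequence number and fire-and-forget it."""
    sock.sendto(HEADER.pack(seq) + payload, (HOST, PORT))

def receive_chunk(sock, expected_seq, last_payload):
    """Return (next_expected_seq, payload), concealing any detected loss."""
    data, _ = sock.recvfrom(65536)
    seq = HEADER.unpack_from(data)[0]
    payload = data[HEADER.size:]
    if seq > expected_seq:
        # Packets were lost: conceal by reusing the previous chunk rather than
        # waiting for a retransmission (a small artifact beats added latency).
        print(f"lost {seq - expected_seq} packet(s), concealing")
    return seq + 1, payload or last_payload

if __name__ == "__main__":
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind((HOST, PORT))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in (0, 1, 3):                         # packet 2 "lost" on purpose
        send_chunk(tx, seq, b"frame-bytes")
    expected, last = 0, b""
    for _ in range(3):
        expected, last = receive_chunk(rx, expected, last)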

  29. Low Latency Video and Audio • Necessary for interactive conversations • Problems: • Buffering needed is a function of system and network load • Lightly loaded PC on internet 2: little buffering needed • Heavily loaded PC on WLAN: more buffering needed • Windows 2K is not a real-time OS • Hyperthreading or dual CPUs help responsiveness • Model 1 used dual CPU systems • Model 2 uses HT CPUs
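
One way to make the buffering track system and network load is to size the playout delay from recently observed arrival jitter, so a lightly loaded Internet2 path gets almost no added delay while a loaded WLAN gets more; the window and safety factor below are illustrative assumptions.

from collections import deque

class AdaptiveJitterBuffer:
    """Choose playout delay from recently observed arrival jitter (a sketch)."""

    def __init__(self, window=50, safety_factor=2.0, min_ms=5.0):
        self.inter_arrival_ms = deque(maxlen=window)   # recent packet gaps
        self.safety_factor = safety_factor
        self.min_ms = min_ms
        self.last_arrival_ms = None

    def on_packet(self, arrival_ms):
        if self.last_arrival_ms is not None:
            self.inter_arrival_ms.append(arrival_ms - self.last_arrival_ms)
        self.last_arrival_ms = arrival_ms

    def playout_delay_ms(self, nominal_gap_ms=33.3):
        """Buffer just enough to absorb the jitter seen so far."""
        if not self.inter_arrival_ms:
            return self.min_ms
        jitter = max(abs(g - nominal_gap_ms) for g in self.inter_arrival_ms)
        return max(self.min_ms, self.safety_factor * jitter)

buf = AdaptiveJitterBuffer()
for t in (0, 33, 67, 105, 133):        # a late packet arrives at t = 105 ms
    buf.on_packet(t)
print(f"playout delay ~{buf.playout_delay_ms():.0f} ms")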

  30. Summary • We are close to achieving a useful immersive experience • This is significantly better for unstructured human communication • Many key qualities not preserved in videoconferencing, including: • Gaze • Directional hearing • 360 degree surround vision • BiReality implementation technologies (PCs, projectors, etc.) are not that expensive and are getting cheaper, faster, and better • Enables lots of interesting research

  31. MIMT Project Team • Norm Jouppi, Wayne Mack, Subu Iyer, Stan Thomas and April Slayden (intern) in Palo Alto • Shylaja Sundar Rao, Jacob Augustine, Shivaram Rao Kokrady, and Deepa Kuttipparambil of the India STS

  32. Demo in 1L Lab • Across from Sigma conference room
