Integrating Depth-Based and Deep Learning Techniques for Real-Time Video Matting Without Green Screens
Traditional virtual production often relies on green screens, which can be expensive and cumbersome to set up. To overcome these challenges, we have developed a green screen-free virtual production system that uses a 3D tracker for camera tracking, allowing for the seamless compositing of virtual and real-world images from a moving camera with varying perspectives. To tackle the key challenge of video matting in virtual production, we introduce a novel Boundary-Selective Fusion (BSF) technique, which combines alpha mattes generated by deep learning-based and depth-based methods, effectively leveraging the strengths of both approaches.
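As an illustration of the boundary-selective idea, the minimal sketch below fuses the two mattes by trusting the deep learning-based matte inside a narrow band around the foreground boundary and the depth-based matte elsewhere; which matte dominates the band, the band width, and the morphology parameters are illustrative assumptions rather than the system's actual settings.

```python
import cv2
import numpy as np

def boundary_selective_fusion(alpha_deep, alpha_depth, band_width=15):
    """Fuse two alpha mattes: trust the deep-learning matte near the
    foreground boundary and the depth-based matte elsewhere.
    Both inputs are float32 arrays in [0, 1] of the same size."""
    # Binarize the depth-based matte to locate the foreground region.
    fg = (alpha_depth > 0.5).astype(np.uint8)
    # The boundary band is the difference between a dilated and an
    # eroded version of the foreground mask.
    kernel = np.ones((band_width, band_width), np.uint8)
    band = cv2.dilate(fg, kernel) - cv2.erode(fg, kernel)
    # Per-pixel selection: deep matte inside the band, depth matte outside.
    fused = np.where(band > 0, alpha_deep, alpha_depth)
    return fused.astype(np.float32)
```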
Deep Learning-Based Sea Surface Obstacle Detection
Unmanned Surface Vehicles (USVs) are widely used in maritime monitoring tasks, such as search and rescue at sea, maritime patrols, oil spill detection, detection of ship wastewater leakage, debris detection, illegal fishing prevention, wildlife monitoring, wind farm and drilling platform inspections, and coral reef surveillance. A key element for the autonomous operation of USVs is environmental perception, which presents challenges such as water surface variations, changes in lighting, obstacle size, hardware limitations, and real-time requirements. We utilized YOLOv7 to detect surface objects and enhanced accuracy by filtering out low-risk objects that would not cause collisions. This approach led us to secure third place in the 2023 IEEE WACV MaCVi Competition for Surface Obstacle Detection.
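A minimal sketch of the post-detection filtering step, assuming YOLOv7 detections arrive as (x1, y1, x2, y2, score, class) tuples and that an estimated horizon row separating water from sky is available; the thresholds and the specific risk rules are illustrative assumptions, not the competition configuration.

```python
def filter_low_risk(detections, horizon_y, min_box_area=400):
    """Keep only detections that could plausibly cause a collision.
    Each detection is (x1, y1, x2, y2, score, cls); horizon_y is the
    image row of the estimated water/sky boundary (assumed given)."""
    kept = []
    for x1, y1, x2, y2, score, cls in detections:
        area = (x2 - x1) * (y2 - y1)
        if y2 < horizon_y:        # entirely above the horizon: not on the water
            continue
        if area < min_box_area:   # too small / too distant to pose a risk
            continue
        kept.append((x1, y1, x2, y2, score, cls))
    return kept
```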
Appearance-Based Multimodal Human Tracking and Identification
One of the key challenges in realizing the potential of smart home services is the detection, tracking, and identification of individuals within the home. To address this, we model and record multi-view facial features, full-body colors, and shapes of family members in an appearance database using two Kinects positioned at the home’s entrance. These Kinects, along with additional color cameras installed throughout the house, are then utilized to detect, track, and identify people by matching the captured images with the registered templates in the appearance database. Detection and tracking are performed through multisensor fusion (combining data from Kinects and color cameras) using a Kalman filter, which effectively handles duplicate or partial measurements. Identification is achieved through multimodal fusion (integrating face, body appearance, and silhouette) using track-based majority voting. Additionally, the modules for appearance-based human detection, tracking, and identification work together seamlessly, enhancing each other's performance.
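The identification step can be illustrated with a small sketch of track-based majority voting, where per-frame identity guesses from the face, body-appearance, and silhouette matchers are accumulated along a track; the confidence-weighted voting shown here is one plausible variant, not necessarily the exact rule used in the system.

```python
from collections import Counter

def track_majority_vote(frame_predictions):
    """Fuse per-frame identity guesses along a track into one label.
    frame_predictions: list of (identity, confidence) pairs produced by
    the per-frame face/body/silhouette matchers (assumed given)."""
    votes = Counter()
    for identity, confidence in frame_predictions:
        votes[identity] += confidence   # confidence-weighted vote
    identity, _ = votes.most_common(1)[0]
    return identity

# Example: predictions along a short track agree on "Alice".
track = [("Alice", 0.9), ("Bob", 0.4), ("Alice", 0.7), ("Alice", 0.6)]
print(track_majority_vote(track))   # -> Alice
```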
Fall Risk Assessment and Early-Warning for Toddler Behaviors at Home
Accidental falls are a leading cause of serious injuries in toddlers, with most incidents occurring at home. To address this, we propose an early-warning childcare system designed to monitor behaviors that may lead to falls, rather than relying solely on immediate fall detection based on short-term observations. Utilizing 3D human skeleton tracking and floor plane detection through depth images captured by a Kinect system, we have developed eight fall-prone behavioral modules for toddlers, organized according to four key criteria: posture, motion, balance, and altitude. The system's final fall risk assessment is determined through multi-modal fusion, using either weighted mean thresholding or support vector machine (SVM) classification.
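A minimal sketch of the weighted-mean-thresholding variant of the final fusion, assuming each of the eight behavioral modules outputs a risk score in [0, 1]; the weights and the alarm threshold are illustrative placeholders. The SVM variant would instead feed the same eight-dimensional score vector to a classifier trained on labeled sessions.

```python
import numpy as np

# Illustrative weights for the eight behavioral modules; the real
# system's weights and threshold are not specified here.
WEIGHTS = np.array([0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10])
THRESHOLD = 0.5

def fall_risk(module_scores):
    """module_scores: eight per-module risk scores in [0, 1], one per
    fall-prone behavior. Returns (risk_value, alarm_flag)."""
    scores = np.asarray(module_scores, dtype=float)
    risk = float(np.dot(WEIGHTS, scores))   # weighted mean of module scores
    return risk, risk > THRESHOLD           # raise an alarm above the threshold
```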
Multi-Viewpoint Human Action Analysis
Human action analysis and recognition are essential for various vision-based applications, including video surveillance, human-computer interfaces, and video indexing and searching. We propose a method to analyze and recognize human actions in an image sequence by extracting features from key points detected in both the spatial and temporal domains.
A Multimodal Fusion System for Human Detection and Tracking
A person detection system that relies on a single feature can become unstable in dynamic environments. To address this, we have designed a multimodal fusion system that detects and tracks people with accuracy, robustness, speed, and flexibility. Each detection module, which focuses on a single feature, operates independently, and their outputs are integrated and tracked using a Kalman filter. The integration results are then fed back into each individual module to update their parameters, reducing false alarms and improving detection accuracy.
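A minimal constant-velocity Kalman filter sketch of the tracking core, where measurements from different detection modules are fused by applying the update step once per available measurement with a module-specific noise level; the state model and noise values are illustrative assumptions, not the system's actual parameters.

```python
import numpy as np

class PersonTrack:
    """Constant-velocity Kalman filter over image position (x, y).
    Measurements from different detection modules are fused by applying
    the update step once per available measurement."""

    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])           # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                          # state covariance
        self.F = np.array([[1, 0, dt, 0],                  # state transition
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],                   # we observe (x, y)
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                          # process noise

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, z, meas_var):
        """z: (x, y) from one module; meas_var: that module's noise level."""
        R = np.eye(2) * meas_var
        y = np.asarray(z, dtype=float) - self.H @ self.state     # innovation
        S = self.H @ self.P @ self.H.T + R
        K = self.P @ self.H.T @ np.linalg.inv(S)                 # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```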
Moving Cast Shadow Detection
Cast shadows from moving objects can create challenges for various applications, including video surveillance, obstacle tracking, and Intelligent Transportation Systems (ITS). To address this, we have developed a segmentation algorithm that detects and removes shadows in both static and dynamic scenes by considering factors such as color, shading, texture, neighborhood relationships, and temporal consistency.
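The full algorithm combines color, shading, texture, neighborhood, and temporal cues; the sketch below illustrates only the color/shading cue in HSV space, labeling foreground pixels whose brightness drops while chromaticity stays close to the background model. All thresholds are illustrative values, not the system's actual settings.

```python
import cv2
import numpy as np

def shadow_mask(frame_bgr, background_bgr,
                v_lo=0.4, v_hi=0.9, s_tol=40, h_tol=30):
    """Label pixels whose value (brightness) drops while hue and saturation
    stay close to the background model -- the classic HSV shadow cue."""
    fr = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    bg = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    v_ratio = fr[..., 2] / (bg[..., 2] + 1e-6)
    s_diff = np.abs(fr[..., 1] - bg[..., 1])
    h_diff = np.abs(fr[..., 0] - bg[..., 0])
    h_diff = np.minimum(h_diff, 180 - h_diff)        # hue wraps around
    shadow = (v_ratio > v_lo) & (v_ratio < v_hi) & \
             (s_diff < s_tol) & (h_diff < h_tol)
    return shadow.astype(np.uint8) * 255
```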
Augmented Reality Based Interactive Cooking Guide
We introduce a new cooking assistance system where the user only needs to wear an all-in-one augmented reality (AR) headset, eliminating the need for external sensors or devices in the kitchen. Leveraging the built-in camera and advanced computer vision (CV) technology, the AR headset can recognize available food ingredients simply by having the user look at them. Based on the identified ingredients, the system suggests suitable recipes. A step-by-step video tutorial for the chosen recipe is then displayed through the AR glasses. The user can interact with the system effortlessly using eight natural hand gestures, without ever needing to touch any devices during the entire cooking process. Experimental results indicate that, compared with classification models such as ResNet and ResNeXt, YOLOv5 achieves slightly lower accuracy in ingredient recognition but excels at locating and classifying multiple ingredients in a single scan, making the process more convenient for users.
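A minimal sketch of how ingredient detection could be run with an off-the-shelf YOLOv5 model from the official Ultralytics hub; the image filename and the fine-tuned ingredient weights are hypothetical, and the system's actual model configuration may differ.

```python
import torch

# Load a pretrained YOLOv5 model from the official hub; in practice the
# model would be fine-tuned on a food-ingredient dataset (the weights
# path in the commented line is hypothetical).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
# model = torch.hub.load("ultralytics/yolov5", "custom", path="ingredients.pt")

results = model("kitchen_counter.jpg")        # single scan of the countertop
detections = results.pandas().xyxy[0]         # boxes, confidences, class names
ingredients = sorted(set(detections["name"]))
print("Detected ingredients:", ingredients)
```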
Comparison of Tracking Techniques on 360-Degree Video
With the rise of 360-degree cameras, 360-degree videos have become increasingly popular. To attach a virtual tag to a physical object in these videos for augmented reality applications, automatic object tracking is essential, ensuring the virtual tag follows the corresponding physical object throughout the 360-degree footage. Unlike ordinary videos, 360-degree videos in an equirectangular format present unique challenges such as viewpoint changes, occlusion, deformation, lighting variation, scale changes, and camera shake. Tracking algorithms designed for standard videos may not perform well in this context. To address this, we thoroughly evaluate eight modern tracking algorithms, assessing both accuracy and speed specifically on 360-degree videos.
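A minimal sketch of the per-sequence evaluation loop, assuming an OpenCV-style tracker interface (init/update) and axis-aligned ground-truth boxes; accuracy is measured as mean overlap (IoU) and speed as frames per second. The actual benchmark protocol and metrics of the study are not reproduced here.

```python
import time
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(tracker, frames, ground_truth, init_box):
    """Run one tracker over a sequence; return mean overlap and FPS."""
    tracker.init(frames[0], tuple(init_box))
    overlaps, start = [], time.time()
    for frame, gt in zip(frames[1:], ground_truth[1:]):
        ok, box = tracker.update(frame)
        overlaps.append(iou(box, gt) if ok else 0.0)
    fps = len(frames[1:]) / (time.time() - start)
    return float(np.mean(overlaps)), fps
```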
Virtual Tour for Smart Guide & Language Learning (Demo)
This project focuses on creating an immersive VR and AR tour guide with gesture-based interactions for Eastern Taiwan. We develop a VR tour guide app featuring panoramic videos on the Google Cardboard platform, establish a VR tour guide experience station built with 3D reconstruction technology on the HTC Vive platform, and create an AR tour guide experience station with gesture-based interactions on the Microsoft HoloLens platform.
Virtual English Classroom with Augmented Reality (VECAR)
The integration of physical-virtual immersion and real-time interaction is crucial for effective cultural and language learning. Augmented Reality (AR) technology allows for the seamless blending of virtual objects with real-world environments, enhancing the sense of immersion. In tandem, computer vision (CV) technology can detect and interpret free-hand gestures from live images, facilitating natural and intuitive interactions. To harness these capabilities, we have developed a Virtual English Classroom, named VECAR, which incorporates the latest AR and CV algorithms to foster an immersive and interactive language learning experience. By using mobile computing glasses, users can engage with virtual content in a three-dimensional space through intuitive free-hand gestures.
Multi-perspective Interactive Displays using Virtual Showcases
We propose a virtual showcase that presents virtual models within a real-world environment. Additionally, learners can interact with these virtual models of earth science materials by utilizing multiple user perspectives combined with augmented reality display technologies. Learners control and view the stereoscopic virtual learning materials through head movements and hand gestures. Our experiments indicate that this virtual showcase significantly enhances learners' interest and improves their efficiency in the learning process.
AR-Based Surface Texture Replacement
We propose a texture replacement system that utilizes AR markers to identify the contours of a surface object and seamlessly apply a new texture within the detected area. This system analyzes input images, detects the contours of the target surface, and realistically applies new material to it. This technology has broad applications in digital image and video post-production, as well as in interior design.
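A minimal sketch of the texture application step, assuming the contour detection has already reduced the target surface to the four corner points of a planar quadrilateral; the perspective warp and binary-mask compositing shown here are a simplified stand-in for the system's full rendering pipeline.

```python
import cv2
import numpy as np

def replace_texture(frame, quad_corners, texture):
    """Warp a new texture onto a planar surface bounded by four corners.
    quad_corners: (4, 2) float array ordered TL, TR, BR, BL, e.g. obtained
    from the AR-marker-guided contour detection (assumed given)."""
    h, w = texture.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    H = cv2.getPerspectiveTransform(src, np.float32(quad_corners))
    warped = cv2.warpPerspective(texture, H, (frame.shape[1], frame.shape[0]))
    mask = cv2.warpPerspective(np.ones((h, w), np.uint8) * 255, H,
                               (frame.shape[1], frame.shape[0]))
    out = frame.copy()
    out[mask > 0] = warped[mask > 0]    # composite the warped texture
    return out
```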
An Interactive 3D Input Device: 3D AR Pen
Augmented Reality (AR) technology is ideal for creating a more natural and intuitive Human-Computer Interface (HCI), blending real-world scenes with virtual objects to offer users a completely new experience. As many emerging applications require 3D user control, current input devices like the mouse, stylus, and multi-touch screens are limited to 2D data input. To address this, we have developed a 3D AR pen by combining a cube with AR markers and a pen equipped with a button, LED, and scrollbar. This 3D AR pen is designed to enable users to easily and reliably create and manipulate 3D content. We demonstrate its potential through various applications, including 3D drawing, 3D carving, and 3D interface control, all of which are simple to operate with the proposed 3D AR pen.
Vision-Based Trajectory Analysis for Table Tennis Game
We introduce a vision-based system designed for recording and analyzing table tennis games. Our system captures detailed information on ball trajectories, hit positions, and spin directions to document every point in a match. Additionally, we demonstrate how to record the game's content and analyze strategic elements. The insights generated by our system can serve as valuable references for players, coaches, and audiences alike.
Visual Analysis for Offense-Defense Patterns in Basketball Game
We analyze the relative 2D positions of players, the basketball, and the basket to identify offensive and defensive patterns in basketball videos captured from a fixed viewpoint. This system provides audiences with deeper insights into game strategies and other details that are often overlooked in traditional basketball broadcasts.
Pitching Type Recognition for Baseball Game
We propose a baseball analysis system that identifies pitching types by reconstructing the baseball's trajectory from broadcast videos. This system enhances the viewing experience by helping audiences recognize different pitches during the broadcast. Additionally, it provides valuable insights for players, allowing them to study pitchers' styles and develop effective batting strategies.
On-road Collision Warning Based on Multiple FOE Segmentation using a Dashboard Camera
Many accidents can be prevented if drivers are alerted just a few seconds before a collision. We investigate the potential of a visual collision-warning system that relies solely on a single dashboard camera, which is both widely available and easy to install. Traditional vision-based collision warning systems typically detect specific targets, such as pedestrians, vehicles, and bicycles, using pre-trained statistical models. In contrast, our proposed system focuses on detecting the general motion patterns of any approaching object. Since all motion vectors from points on an approaching object converge at a point known as the focus of expansion (FOE), we design a cascade-like decision tree to filter out false detections early on and develop a multiple FOE segmentation algorithm to classify optical flows based on the FOEs of individual objects. Further analysis is conducted on objects located in high-risk areas, referred to as the danger zone.
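A minimal sketch of how a single FOE could be estimated from a set of optical-flow vectors by least squares, using the constraint that each flow vector is collinear with the line from the FOE to its point; the multiple-FOE segmentation and the cascade-like decision tree are not reproduced here.

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares estimate of the focus of expansion (FOE).
    points: (N, 2) array of pixel positions; flows: (N, 2) optical-flow
    vectors at those positions. For an approaching object each flow vector
    points away from the FOE, so the FOE lies on the line through the
    point along its flow direction."""
    x, y = points[:, 0], points[:, 1]
    u, v = flows[:, 0], flows[:, 1]
    # Collinearity constraint: v * (foe_x - x) - u * (foe_y - y) = 0
    A = np.stack([v, -u], axis=1)            # (N, 2)
    b = v * x - u * y                        # (N,)
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe                               # (foe_x, foe_y)
```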
Vision-Based Traffic Surveillance
Vision-based traffic surveillance is crucial for effective traffic management. However, challenges such as varying outdoor illumination, cast shadows, and vehicle diversity often complicate video analysis. To address these issues, we propose a real-time traffic monitoring system that analyzes spatial-temporal profiles to reliably estimate traffic flow and speed while simultaneously classifying vehicle types.
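A minimal sketch of how a spatial-temporal profile could be built by stacking a thin virtual detection strip from each frame over time; vehicles crossing the strip appear as slanted streaks whose slope relates to speed and whose count relates to flow. The strip location is assumed to be calibrated beforehand.

```python
import numpy as np

def build_time_space_profile(frames, row_band, col_span):
    """Stack the pixels under a virtual detection strip over time.
    frames: iterable of grayscale frames; row_band and col_span are slices
    selecting a thin strip across one lane (assumed calibrated beforehand).
    Returns a (num_frames, strip_width) time-space image."""
    profile = [f[row_band, col_span].mean(axis=0) for f in frames]
    return np.stack(profile, axis=0)
```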
Guitar Tutoring System Based on Markerless Augmented Reality
We present a real-time guitar teaching system based on augmented reality. Using color images captured by a webcam, the system employs the Hough transform to detect lines and circles, identifying the guitar's position without the need for special markers. Virtual indicators are overlaid on the guitar's neck and soundhole through a display, guiding the user on left-hand chord fingerings and right-hand strumming rhythms. This allows for intuitive learning as users follow the instructions. Additionally, skin tone detection is used to verify the correctness of the left-hand position, while the right-hand rhythm is assessed using optical flow and sound recognition to ensure the user's timing is accurate.
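A minimal sketch of the markerless localization step using OpenCV's Hough transforms, where detected line segments serve as candidates for the neck and strings and detected circles as candidates for the soundhole; the filename and all parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

frame = cv2.imread("guitar_frame.jpg")            # hypothetical webcam frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

# Straight lines: candidate strings and fret edges along the neck.
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                        minLineLength=100, maxLineGap=10)

# Circles: candidate soundhole (radius range is illustrative).
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=200,
                           param1=100, param2=40, minRadius=40, maxRadius=120)

if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(frame, (x, y), r, (0, 0, 255), 2)
```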
Note-Taking for 3D Curricular Contents using Markerless Augmented Reality
For e-Learning with 3D interactive curricular content, an ideal note-taking method should be intuitive and closely integrated with the learning material. Augmented reality (AR) technology, which can overlay virtual content onto real-world images, offers unique opportunities and challenges to enhance note-taking for contextual learning in real-life environments. By combining head-mounted displays with cameras and wearable computers, AR enables an innovative approach to note-taking. We propose an AR-based note-taking system specifically designed for 3D curricular content. Learners can take notes on a physical tabletop by writing with their fingers, manipulate curricular content using hand gestures, and embed their complete notes directly into the corresponding content within a 3D space.
Virtual Reality Navigation using Image-Based Rendering (IBR) Techniques
We have designed a visualization and presentation system that creates a virtual space using image-based rendering (IBR) techniques, allowing new images to be generated from captured images rather than geometric primitives. With panorama technology, users can explore a virtual scene from any viewing angle, while object movie technology enables them to view a virtual object from any perspective. We have further extended these technologies to explore the possibilities of stereo panorama and interactive object movies.
Finding the Next Best View in 3D Scene Reconstruction using Panorama Arrays
We present a 3D modeling system for scene reconstruction by capturing a set of panoramas, referred to as a panorama array. The system begins by dividing an initial volume containing the target scene into fixed-size cubes, known as voxels. The projections of each voxel onto every panorama are efficiently determined using an index coloring method. By gradually carving away voxels that fail the color consistency test, the structure of the scene is progressively revealed. Additionally, we have developed a smart sampling strategy that identifies the optimal next viewpoint for capturing new panoramas, even without prior knowledge of the scene layout. The goal is to reconstruct the maximum area of the scene using the fewest possible panoramas. The reconstructed scenes and panorama arrays create a virtual space where users can move freely with six degrees of freedom.
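A minimal sketch of the per-voxel color-consistency test that drives the carving, assuming the index-coloring projections have already collected the RGB samples observed for a voxel across the panoramas in which it is visible; the standard-deviation threshold is an illustrative choice rather than the system's actual criterion.

```python
import numpy as np

def is_color_consistent(voxel_samples, threshold=25.0):
    """Color-consistency test for one voxel during carving.
    voxel_samples: (K, 3) array of the RGB values observed for this voxel
    in the K panoramas where it is visible. The voxel is kept only if the
    observations agree; otherwise it is carved away."""
    if len(voxel_samples) < 2:
        return True                       # too few views to reject the voxel
    std = np.asarray(voxel_samples, dtype=float).std(axis=0)
    return bool(np.all(std < threshold))  # agree on every color channel
```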
Vision-Based Mobile Robot Navigation
We employ a mobile robot equipped with an active camera and a laptop computer to serve in a digital home environment. One of the key challenges in realizing the potential of autonomous robots is their ability to accurately perceive their surroundings. An intelligent robot should be capable of answering several crucial questions: Where is the robot? What or who is in the environment? What actions should be taken? And how should these actions be performed? We propose to address these questions using vision-based techniques. The potential applications of autonomous robots in an intelligent digital home include home security, entertainment, e-learning, and home care.
Facial Expression Recognition for Learning Status Analysis
Facial expressions offer crucial insights for teachers to assess students' learning progress. Therefore, expression analysis is valuable not only in Human-Computer Interfaces but also in e-Learning environments. This research aims to analyze nonverbal facial expressions to determine students' learning status in distance education. The Hidden Markov Model (HMM) is used to estimate the probabilities of six expressions (Blink, Wrinkle, Shake, Nod, Yawn, and Talk), which are combined into a six-dimensional vector. A Gaussian Mixture Model (GMM) is then applied to evaluate three learning scores: understanding, interaction, and consciousness. These scores provide a comprehensive view of a student's learning status, offering valuable feedback for both teachers and students to enhance teaching and learning outcomes.
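A minimal sketch of how the six HMM-derived expression probabilities could be mapped to one learning score with Gaussian mixture models, assuming one GMM trained on positive sessions and one on negative sessions for that score; the equal-prior posterior and the training-data shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learning_score(x, gmm_pos, gmm_neg):
    """Map a 6-d expression-probability vector (Blink, Wrinkle, Shake,
    Nod, Yawn, Talk) to a score in [0, 1] via the posterior of the
    positive model under equal priors (an illustrative choice)."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    ll_pos = gmm_pos.score_samples(x)[0]     # log-likelihood, positive model
    ll_neg = gmm_neg.score_samples(x)[0]     # log-likelihood, negative model
    return 1.0 / (1.0 + np.exp(ll_neg - ll_pos))

# Training sketch (data shapes and labels are hypothetical):
# gmm_pos = GaussianMixture(n_components=2).fit(positive_vectors)
# gmm_neg = GaussianMixture(n_components=2).fit(negative_vectors)
```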
3D Virtual English Classroom (VEC3D)
We have designed a campus-like virtual reality environment that provides an authentic experience for English learners in Taiwan. This project stands out for its visual and audio representation, synchronous communication, and real-time interaction, all of which encourage students to actively engage in the learning process. VEC3D enables students to create their own virtual rooms using 3D graphics, choose from various avatars to visually express their identity, and navigate freely within the virtual world. Upon entering this virtual environment, students encounter a mentoring avatar and can interact with anonymous online partners. They can see and communicate with each other through real-time text and voice in various virtual spaces, such as the student center, classroom, multimedia lab, cafeteria, or their individual rooms.
Lip Contour Extraction
We developed a lip contour extraction system that leverages color, shape, and motion features. The extracted facial animation parameters are used to animate a virtual talking head on the remote side. This solution can be applied to e-learning, teleconferencing, and various other applications.
Adaptive Multi-part People Detection, Segmentation & Tracking
E-learning and teleconferencing have become essential and widely-used applications of information technology. A key challenge in these applications is the need to transmit image sequences over networks with limited bandwidth. A vision-based human feature detection and tracking system can significantly reduce network traffic by transmitting only the region of interest (ROI) or feature parameters, rather than the entire image.