Dev Overview
The software is split into 4 components: Client Server, Modified MediaPipe, Gesture Interpreter, and the Formal Grammar Module. The client streams video to the server over UDP, the server detects points on the hand using the MediaPipe framework, those points are fed into our trained model to predict a gesture, the resulting words are grammar-corrected, and the command is performed if it is in our library of commands.
Background:
Class project for Missouri State University's 450 Software Development class. The project was done with 5 team members total, using Agile Scrum with two-week sprints. It involved Python and C++ programming, UDP and TCP networking, threading, TensorFlow, and machine learning. The project was mostly completed, with 14 different detectable ASL signs and 98.3% accuracy. The project only runs on Ubuntu and requires a graphics card.
Architecture:
Our design involves 4 main components:
Client Server Component:
- Client reads webcam frames and sends them over UDP
- Server receives the UDP packets and reconstructs the frames
- Server forwards frames to MediaPipe over local TCP (Python-to-C++ IPC)
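As a rough illustration of the client side, here is a minimal sketch (assuming OpenCV for capture) that JPEG-encodes each webcam frame and sends it as a single UDP datagram; the address, JPEG quality, and one-frame-per-datagram framing are placeholders, not the project's exact protocol.

```python
# Minimal client sketch: grab webcam frames with OpenCV, JPEG-encode them,
# and send each frame as one UDP datagram. Host/port and quality are placeholders.
import cv2
import socket

SERVER_ADDR = ("127.0.0.1", 9000)  # placeholder server address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cap = cv2.VideoCapture(0)          # default webcam

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Encode the frame as JPEG to keep the datagram small.
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
    if ok and len(jpeg) < 65000:   # stay under the UDP datagram size limit
        sock.sendto(jpeg.tobytes(), SERVER_ADDR)
```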
Modified MediaPipe Component:
- Listens for JPEG images over TCP
- Uses MediaPipe (Google's framework) to detect hand landmarks
- Outputs the landmarks over TCP
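On the Python side, the hand-off to this process could look roughly like the sketch below: send one JPEG frame over local TCP and read back one line of landmark output. The length-prefix framing and comma-separated reply are assumptions, not our exact wire format.

```python
# Sketch of the Python-to-C++ IPC: send one length-prefixed JPEG frame over
# local TCP, read back one CSV line of landmark values. Framing and reply
# format are assumptions.
import socket
import struct

def get_landmarks(jpeg_bytes, host="127.0.0.1", port=9100):
    with socket.create_connection((host, port)) as conn:
        # 4-byte big-endian length header followed by the JPEG payload.
        conn.sendall(struct.pack(">I", len(jpeg_bytes)) + jpeg_bytes)
        reply = conn.makefile().readline()        # e.g. "x0,y0,z0,x1,..."
    values = [float(v) for v in reply.strip().split(",") if v]
    # 21 landmarks * 3 coordinates = 63 floats per detected hand.
    return [values[i:i + 3] for i in range(0, len(values), 3)]
```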
Gesture Interpreter Component:
- Receives the landmark input
- Predicts the gesture using the trained model
- Records gestures until a timeout to form words
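A minimal sketch of that loop, assuming a saved Keras model and hypothetical landmark_stream() and handle_word() helpers; the label list, model path, and 2-second timeout are placeholders.

```python
# Sketch of the interpreter loop: predict a sign from each 5x5x3 landmark
# array and append it to the current word until no hand is seen for a timeout.
import time
import numpy as np
import tensorflow as tf

LABELS = list("abcdefghiklmno")  # placeholder ordering for the 14 trained signs
model = tf.keras.models.load_model("gesture_model.h5")  # placeholder path

def predict_sign(landmarks):
    """landmarks: 5x5x3 numpy array built from the 21 detected points."""
    probs = model.predict(landmarks[np.newaxis, ...], verbose=0)[0]
    return LABELS[int(np.argmax(probs))]

word = []
last_seen = time.time()
for landmarks in landmark_stream():      # hypothetical generator of 5x5x3 arrays (or None)
    if landmarks is None:                # no hand detected in this frame
        if word and time.time() - last_seen > 2.0:   # timeout ends the current word
            handle_word("".join(word))   # hypothetical hand-off to the grammar module
            word = []
        continue
    word.append(predict_sign(landmarks))
    last_seen = time.time()
```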
Formar Grammar Module Component:
- Receives words as input
- Autocorrects the words and fixes grammar
- Calls the smart home REST API
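A rough sketch of this last hand-off, where correct() stands in for whatever spelling/grammar correction is applied; the endpoint, payload shape, and command names are assumptions.

```python
# Sketch of the grammar module's final step: correct the recognized phrase,
# match it against a small command library, and call the smart home REST API.
import requests

COMMANDS = {
    "lights on": {"device": "lights", "state": "on"},    # placeholder commands
    "lights off": {"device": "lights", "state": "off"},
}

def run_command(words):
    phrase = correct(" ".join(words))    # hypothetical spelling/grammar correction
    payload = COMMANDS.get(phrase)
    if payload is None:
        return False                     # not in our library of commands
    resp = requests.post("http://smart-home.local/api/command", json=payload)  # placeholder endpoint
    return resp.ok
```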
More in-depth view:
Development: Development required the following steps:
- Prepare Dataset
- Train Model
- Build System
Prepare Dataset: We obtained a dataset from Kaggle containing images of ASL hand gestures. We modified Google's MediaPipe framework to take an image as input and output a CSV file of the 21 detected points on any hand contained in the image. We ran this on the entire image dataset to create a corresponding dataset of CSV files containing hand coordinate data instead of hand image data.
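The conversion step can be sketched with MediaPipe's off-the-shelf Python Hands solution standing in for our modified C++ binary: each image becomes one CSV row of 63 values (21 landmarks x 3 coordinates). Paths and parameters here are placeholders.

```python
# Sketch of image-to-CSV conversion using MediaPipe's Python Hands solution
# as a stand-in for the modified C++ binary.
import csv
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def image_to_row(image_path):
    image = cv2.imread(image_path)
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None                       # no hand detected; skip this image
    lm = results.multi_hand_landmarks[0].landmark
    return [coord for p in lm for coord in (p.x, p.y, p.z)]  # 63 values

with open("landmarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    row = image_to_row("dataset/a/example.jpg")   # placeholder path
    if row:
        writer.writerow(row)
```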
This figure shows an image from the dataset with its 21 detected points graphed over it. The points are not supposed to lie directly on the hand in the image; rather, they are normalized to the bounding box around the hand. We manually went through the data and removed outliers caused by blurry photos, bad lighting, etc.
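A rough sketch of how one CSV row can be visualized for this manual cleaning pass, assuming matplotlib; plotting just the normalized (x, y) coordinates is enough to make obviously bad detections stand out.

```python
# Sketch of a manual-cleaning view: scatter one CSV row's normalized (x, y)
# landmark coordinates so outliers from blurry photos or bad lighting are visible.
import matplotlib.pyplot as plt

def show_row(row):                     # row: 63 floats (x, y, z per landmark)
    xs = row[0::3]
    ys = row[1::3]
    plt.scatter(xs, ys)
    plt.gca().invert_yaxis()           # image y grows downward
    plt.title("21 hand landmarks")
    plt.show()
```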
Train Model: Using the dataset of coordinates, we trained a Convolutional Neural Network with TensorFlow using an input shape of 5x5x3, formed by grouping the 21 points by finger and 3D coordinate. Our model predicts 14 different signs with 98.3% accuracy.
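A minimal Keras sketch of a CNN over that 5x5x3 landmark grid with 14 output classes; the specific layers and hyperparameters below are assumptions, not our exact architecture.

```python
# Sketch of a small CNN on the 5x5x3 landmark grid with 14 sign classes.
# Layer sizes and hyperparameters are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5, 5, 3)),
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(14, activation="softmax"),   # 14 ASL signs
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```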
TensorFlow Architecture:
Build System: Building the system was straightforward development. We used Python to create the UDP video stream, C++ to modify MediaPipe to receive its input over TCP and send its output over TCP, and Python once again to classify that output. We wrote install scripts to help any new user get set up, and we documented everything.
My Contribution: Though all team members put in work, here is a list of my personal contributions:
- Customized MediaPipe to accept images over TCP as input and output landmarks over TCP
- Created the Python process that uses the model to predict the letter from detected landmarks and keeps track of the current predicted word
- Helped train the model
- Created the MediaPipe customization binary and the Python process to convert the image dataset into the landmark CSV dataset
- Created the Python process to manually visualize the dataset and removed bad data
- Set up Amazon Web Services (unused)
- Created the server install and run script
- Organized and cleaned the GitHub repository and created READMEs
- Created and edited the demonstration videos
- Created the final presentation
- Tested and ran the entire project on my Ubuntu machine with a graphics card
- Installed the entire project on an MSU computer
- Connected the final modules together to configure the final project
- Created the client install and run scripts for Linux
- Finalized the project and testing; I was the only team member able to run the entire project (because it requires Ubuntu with a graphics card)