Sign Language Smart Home Control

Class project for Missouri State University. Perform signs in front of a webcam to spell a command and control a smart home device. Uses machine learning and Google's MediaPipe framework to predict gestures in real time (14 ASL signs supported).

Dev Overview
The software is split into 4 components: Client Server, Modified MediaPipe, Gesture Interpreter, and Formal Grammar Module. Video is streamed to the server over UDP, points on the hand are detected using the MediaPipe framework, those points are fed to our trained model to predict a gesture, grammar correction is applied, and the command is performed if it is in our library of commands.

Class project for Missouri State University's 450 Software Development class. The project was completed by a team of 5, using Agile Scrum with 2-week sprints. It involved Python and C++ programming, UDP and TCP networking, threading, TensorFlow, and machine learning. The project was mostly completed, detecting 14 different ASL signs with 98.3% accuracy. It only runs on Ubuntu and requires a graphics card.

Our design involves 4 main components:

Client Server Component:

  • Client reads webcam frames and sends them over UDP
  • Server receives UDP datagrams and reconstructs frames
  • Sends frames to MediaPipe over local TCP (Python-to-C++ IPC)
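A JPEG frame is usually larger than a single UDP datagram, so the client has to chunk each frame and the server has to reassemble it. A minimal sketch of that transport; the header layout and chunk size here are illustrative, not the project's exact protocol:

```python
import struct

MAX_PAYLOAD = 1400  # keep header + payload under a typical Ethernet MTU

def chunk_frame(frame_id, data):
    """Split one JPEG frame into UDP-sized datagrams.
    12-byte header: frame id, chunk index, total chunks (network byte order)."""
    total = (len(data) + MAX_PAYLOAD - 1) // MAX_PAYLOAD
    for i in range(total):
        payload = data[i * MAX_PAYLOAD:(i + 1) * MAX_PAYLOAD]
        yield struct.pack("!III", frame_id, i, total) + payload

class FrameAssembler:
    """Server-side reassembly of datagrams back into complete frames."""
    def __init__(self):
        self.parts = {}  # frame_id -> {chunk_index: payload}

    def add(self, datagram):
        frame_id, idx, total = struct.unpack("!III", datagram[:12])
        chunks = self.parts.setdefault(frame_id, {})
        chunks[idx] = datagram[12:]
        if len(chunks) == total:  # all chunks arrived: emit the frame
            del self.parts[frame_id]
            return b"".join(chunks[i] for i in range(total))
        return None  # frame still incomplete
```

In a real deployment the assembler would also drop stale partial frames, since UDP offers no delivery guarantee.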

Modified MediaPipe Component:
  • Listens for JPEG image over TCP
  • Uses MediaPipe (Google's framework) to detect hand landmarks
  • Outputs landmarks over TCP

Gesture Interpreter Component:
  • Receives landmark input
  • Predicts gesture using trained model
  • Records gesture until timeout to form words
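The record-until-timeout logic can be sketched as a small state machine: a letter is committed once the same prediction has been held steadily, and the word is emitted after a quiet period with no hand detected. The hold and timeout thresholds below are illustrative placeholders, not the project's tuned values:

```python
import time

class WordBuilder:
    """Accumulate per-frame letter predictions into words."""
    def __init__(self, hold=1.0, timeout=3.0, clock=time.monotonic):
        self.hold = hold          # seconds a letter must be held to commit
        self.timeout = timeout    # quiet seconds before the word is emitted
        self.clock = clock
        self.current = None       # letter currently being held
        self.since = None         # when the current letter first appeared
        self.last_seen = None     # last time any letter was predicted
        self.letters = []

    def update(self, letter):
        """Feed one per-frame prediction (None = no hand detected).
        Returns a finished word, or None."""
        now = self.clock()
        if letter is None:
            if self.letters and self.last_seen is not None \
                    and now - self.last_seen >= self.timeout:
                word = "".join(self.letters)
                self.letters, self.current = [], None
                return word
            return None
        self.last_seen = now
        if letter != self.current:
            self.current, self.since = letter, now  # new candidate letter
        elif now - self.since >= self.hold:
            self.letters.append(letter)             # held long enough: commit
            self.current = None                     # require a fresh hold
        return None
```

Injecting the clock as a parameter keeps the timing logic testable without real delays.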

Formal Grammar Module Component:
  • Receives input of words
  • Autocorrects and fixes grammar
  • Calls smart home REST API
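A sketch of the autocorrect-then-dispatch step, assuming a hypothetical two-command vocabulary and REST endpoint (the real project reads these from its command library and smart-home configuration):

```python
import json
import urllib.request
from difflib import get_close_matches

# Hypothetical command vocabulary -- placeholder for the project's library.
COMMANDS = {
    "IDEA": {"device": "light", "state": "on"},
    "BLACK": {"device": "light", "state": "off"},
}

def autocorrect(word):
    """Snap a fingerspelled word to the closest known command,
    or return None if nothing is close enough."""
    matches = get_close_matches(word.upper(), list(COMMANDS), n=1, cutoff=0.6)
    return matches[0] if matches else None

def send_command(word, base_url="http://localhost:8123/api"):
    """POST the corrected command to a (hypothetical) smart-home REST API."""
    cmd = autocorrect(word)
    if cmd is None:
        return None  # not in the command library: ignore
    req = urllib.request.Request(
        base_url + "/command",
        data=json.dumps(COMMANDS[cmd]).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

A fuzzy match like this tolerates one or two mis-signed letters per word, which matters when individual sign predictions occasionally miss.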

More in-depth view:

Development required multiple steps:
  1. Prepare Dataset
  2. Train Model
  3. Build System

Prepare Dataset:
We obtained a dataset from Kaggle containing images of ASL hand gestures. We modified Google's MediaPipe framework to take an image as input and output a CSV file of the 21 points detected on any hand in the image. We ran this over the entire image dataset to create a corresponding dataset of CSV files containing hand coordinate data instead of hand image data.
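The CSV side of that conversion might look like the following sketch, with one row per image: a label plus 63 coordinate columns (the landmark extraction itself happens in the modified MediaPipe binary):

```python
import csv

def landmarks_to_row(label, landmarks):
    """Flatten 21 (x, y, z) landmarks into one CSV row: label, x0, y0, z0, ..."""
    row = [label]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row

def write_dataset(path, samples):
    """samples: iterable of (label, landmarks) pairs from the MediaPipe pass."""
    header = ["label"] + [axis + str(i) for i in range(21) for axis in "xyz"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for label, landmarks in samples:
            writer.writerow(landmarks_to_row(label, landmarks))
```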

This figure shows an image from the dataset with its 21 detected points graphed over it. The points are not supposed to lie on the hand image; rather, they are normalized to the bounding box around the hand. We manually went through and removed outlier data caused by blurry photos, bad lighting, etc.
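Bounding-box normalization of this kind can be sketched as rescaling x and y so the hand spans the unit square (a simplified illustration of the idea; MediaPipe's own coordinate conventions differ in the details):

```python
def normalize_to_bbox(landmarks):
    """Rescale (x, y) so the hand's bounding box spans [0, 1] in both axes.
    Depth (z) is passed through unchanged."""
    xs = [x for x, _, _ in landmarks]
    ys = [y for _, y, _ in landmarks]
    min_x, min_y = min(xs), min(ys)
    width = (max(xs) - min_x) or 1.0   # avoid division by zero
    height = (max(ys) - min_y) or 1.0
    return [((x - min_x) / width, (y - min_y) / height, z)
            for x, y, z in landmarks]
```

Normalizing this way makes the model invariant to where the hand sits in the frame and how close it is to the camera.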

Train Model:
Using the dataset of coordinates, we trained a Convolutional Neural Network with TensorFlow, using an input shape of 5x5x3 formed by grouping the 21 points by finger and 3D coordinate. Our model predicts 14 different signs with 98.3% accuracy.
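One plausible way to pack 21 landmarks into that 5x5x3 input is one row per finger, with the wrist repeated as each row's first entry; the project's exact ordering may differ:

```python
# MediaPipe hand landmark indices: 0 = wrist, then 4 points per finger
# (thumb 1-4, index 5-8, middle 9-12, ring 13-16, pinky 17-20).
FINGERS = [range(1, 5), range(5, 9), range(9, 13), range(13, 17), range(17, 21)]

def to_5x5x3(landmarks):
    """Arrange 21 (x, y, z) landmarks into a 5x5x3 grid: one row per
    finger, wrist repeated as each row's first column. The result has
    the shape expected by a Conv2D input layer."""
    assert len(landmarks) == 21
    grid = []
    for finger in FINGERS:
        row = [landmarks[0]] + [landmarks[i] for i in finger]
        grid.append([list(point) for point in row])
    return grid  # shape (5, 5, 3)
```

Grouping by finger keeps anatomically adjacent joints adjacent in the grid, which is what lets small convolution kernels pick up per-finger shape.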
TensorFlow Architecture:

Build System
Building the system was straightforward development. We used Python to create the UDP video stream, C++ to modify MediaPipe to receive its input and send its output over TCP, and Python again to classify that output. We wrote install scripts to help new users get set up, and we documented everything.

My Contribution:
Though all team members put in work, here is my personal contribution list:
  • customized MediaPipe to accept a TCP image as input and output landmarks over TCP
  • created Python process that uses the model to predict the letter for detected landmarks and keep track of the current predicted word
  • helped train the model
  • created the MediaPipe customization binary and a Python process to convert the image dataset to a landmark CSV dataset
  • created a Python process to manually visualize the dataset and remove bad data
  • set up Amazon Web Services (unused)
  • created server installer and run script
  • organized and cleaned the GitHub repo and created READMEs
  • created and edited demonstration videos
  • created the final presentation
  • tested and ran the entire project on my Ubuntu machine with a graphics card
  • installed the entire project on an MSU computer
  • connected the final modules together to configure the final project
  • created client install and run scripts for Linux
  • finalized the project and testing; only team member able to run the entire project (because it requires Ubuntu with a graphics card)


Demonstration videos of the software in action. Signs included: 'A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'K', 'L', 'O', 'P', 'Q', 'W'.

This video shows the whole system in action. Sign 'Idea' to turn the light on and 'Black' to turn it off. See the terminal in the bottom right for output of the current readings.

This video shows 'Hi Iqbal' being spelled out to our professor, Dr. Iqbal.