Dev Overview
The software is split into 4 components: Client Server, Modified MediaPipe, Gesture Interpreter, and the Formal Grammar Module. The client streams video to the server over UDP, the server detects points on the hand using the MediaPipe framework, those points are fed into our trained model to predict a gesture, the resulting words are grammar-corrected, and the command is performed if it is in our library of commands.
Background:
Class project for Missouri State University's 450 Software Development class. The project was done with 5 team members total, using Agile Scrum with two-week sprints. It involved Python and C++ programming, UDP and TCP networking, threading, TensorFlow, and machine learning. The project was mostly completed, with 14 different detectable ASL signs and 98.3% accuracy. The project only runs on Ubuntu and requires a graphics card.
Architecture:
Our design involves 4 main components:
Client Server Component:
- Client reads webcam frames and sends them over UDP
- Server receives the UDP packets and reconstructs the frames
- Server forwards frames to MediaPipe over local TCP (Python-to-C++ IPC)
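As a rough illustration of the client side, here is a minimal sketch (assuming OpenCV for capture) that JPEG-encodes each webcam frame and sends it as a single UDP datagram; the address, JPEG quality, and one-frame-per-datagram framing are placeholders, not the project's exact protocol.

```python
# Minimal client sketch: grab webcam frames with OpenCV, JPEG-encode them,
# and send each frame as one UDP datagram. Host/port and quality are placeholders.
import cv2
import socket

SERVER_ADDR = ("127.0.0.1", 9000)  # placeholder server address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cap = cv2.VideoCapture(0)          # default webcam

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Encode the frame as JPEG to keep the datagram small.
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
    if ok and len(jpeg) < 65000:   # stay under the UDP datagram size limit
        sock.sendto(jpeg.tobytes(), SERVER_ADDR)
```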
Modified MediaPipe Component:
- Listens for JPEG images over TCP
- Uses MediaPipe (Google's framework) to detect hand landmarks
- Outputs the landmarks over TCP
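On the Python side, the hand-off to this process could look roughly like the sketch below: send one JPEG frame over local TCP and read back one line of landmark output. The length-prefix framing and comma-separated reply are assumptions, not our exact wire format.

```python
# Sketch of the Python-to-C++ IPC: send one length-prefixed JPEG frame over
# local TCP, read back one CSV line of landmark values. Framing and reply
# format are assumptions.
import socket
import struct

def get_landmarks(jpeg_bytes, host="127.0.0.1", port=9100):
    with socket.create_connection((host, port)) as conn:
        # 4-byte big-endian length header followed by the JPEG payload.
        conn.sendall(struct.pack(">I", len(jpeg_bytes)) + jpeg_bytes)
        reply = conn.makefile().readline()        # e.g. "x0,y0,z0,x1,..."
    values = [float(v) for v in reply.strip().split(",") if v]
    # 21 landmarks * 3 coordinates = 63 floats per detected hand.
    return [values[i:i + 3] for i in range(0, len(values), 3)]
```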
Gesture Interpreter Component:
- Receives the landmark input
- Predicts the gesture using the trained model
- Records gestures until a timeout to form words
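A minimal sketch of that loop, assuming a saved Keras model and hypothetical landmark_stream() and handle_word() helpers; the label list, model path, and 2-second timeout are placeholders.

```python
# Sketch of the interpreter loop: predict a sign from each 5x5x3 landmark
# array and append it to the current word until no hand is seen for a timeout.
import time
import numpy as np
import tensorflow as tf

LABELS = list("abcdefghiklmno")  # placeholder ordering for the 14 trained signs
model = tf.keras.models.load_model("gesture_model.h5")  # placeholder path

def predict_sign(landmarks):
    """landmarks: 5x5x3 numpy array built from the 21 detected points."""
    probs = model.predict(landmarks[np.newaxis, ...], verbose=0)[0]
    return LABELS[int(np.argmax(probs))]

word = []
last_seen = time.time()
for landmarks in landmark_stream():      # hypothetical generator of 5x5x3 arrays (or None)
    if landmarks is None:                # no hand detected in this frame
        if word and time.time() - last_seen > 2.0:   # timeout ends the current word
            handle_word("".join(word))   # hypothetical hand-off to the grammar module
            word = []
        continue
    word.append(predict_sign(landmarks))
    last_seen = time.time()
```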
Formar Grammar Module Component:
- Receives words as input
- Autocorrects the words and fixes grammar
- Calls the smart home REST API
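A rough sketch of this last hand-off, where correct() stands in for whatever spelling/grammar correction is applied; the endpoint, payload shape, and command names are assumptions.

```python
# Sketch of the grammar module's final step: correct the recognized phrase,
# match it against a small command library, and call the smart home REST API.
import requests

COMMANDS = {
    "lights on": {"device": "lights", "state": "on"},    # placeholder commands
    "lights off": {"device": "lights", "state": "off"},
}

def run_command(words):
    phrase = correct(" ".join(words))    # hypothetical spelling/grammar correction
    payload = COMMANDS.get(phrase)
    if payload is None:
        return False                     # not in our library of commands
    resp = requests.post("http://smart-home.local/api/command", json=payload)  # placeholder endpoint
    return resp.ok
```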
More in-depth view:
Development: Development required the following steps:
- Prepare Dataset
- Train Model
- Build System
Prepare Dataset: We obtained a dataset from Kaggle containing images of ASL hand gestures. We modified Google's MediaPipe framework to take an image as input and output a CSV file of the 21 detected points on any hand contained in the image. We ran this on the entire image dataset to create a corresponding dataset of CSV files containing hand coordinate data instead of hand image data.
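The conversion step can be sketched with MediaPipe's off-the-shelf Python Hands solution standing in for our modified C++ binary: each image becomes one CSV row of 63 values (21 landmarks x 3 coordinates). Paths and parameters here are placeholders.

```python
# Sketch of image-to-CSV conversion using MediaPipe's Python Hands solution
# as a stand-in for the modified C++ binary.
import csv
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def image_to_row(image_path):
    image = cv2.imread(image_path)
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None                       # no hand detected; skip this image
    lm = results.multi_hand_landmarks[0].landmark
    return [coord for p in lm for coord in (p.x, p.y, p.z)]  # 63 values

with open("landmarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    row = image_to_row("dataset/a/example.jpg")   # placeholder path
    if row:
        writer.writerow(row)
```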
This figure shows an image from the dataset with its 21 detected points graphed over it. The points are not supposed to lie directly on the hand in the image; rather, they are normalized to the bounding box around the hand. We manually went through the data and removed outliers caused by blurry photos, bad lighting, etc.
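A rough sketch of how one CSV row can be visualized for this manual cleaning pass, assuming matplotlib; plotting just the normalized (x, y) coordinates is enough to make obviously bad detections stand out.

```python
# Sketch of a manual-cleaning view: scatter one CSV row's normalized (x, y)
# landmark coordinates so outliers from blurry photos or bad lighting are visible.
import matplotlib.pyplot as plt

def show_row(row):                     # row: 63 floats (x, y, z per landmark)
    xs = row[0::3]
    ys = row[1::3]
    plt.scatter(xs, ys)
    plt.gca().invert_yaxis()           # image y grows downward
    plt.title("21 hand landmarks")
    plt.show()
```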
Train Model: Using the dataset of coordinates, we trained a Convolutional Neural Network with TensorFlow using an input shape of 5x5x3, formed by grouping the 21 points by finger and 3D coordinate. Our model predicts 14 different signs with 98.3% accuracy.
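A minimal Keras sketch of a CNN over that 5x5x3 landmark grid with 14 output classes; the specific layers and hyperparameters below are assumptions, not our exact architecture.

```python
# Sketch of a small CNN on the 5x5x3 landmark grid with 14 sign classes.
# Layer sizes and hyperparameters are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5, 5, 3)),
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(14, activation="softmax"),   # 14 ASL signs
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```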
TensorFlow Architecture:
Build System: Building the system was straightforward development. We used Python to create the UDP video stream, C++ to modify MediaPipe to receive its input over TCP and send its output over TCP, and Python once again to classify that output. We wrote install scripts to help any new user get set up, and we documented everything.
My Contribution: Though all team members put in work, here is a list of my personal contributions:
- Customized MediaPipe to accept images over TCP as input and output landmarks over TCP
- Created the Python process that uses the model to predict the letter from detected landmarks and keeps track of the current predicted word
- Helped train the model
- Created the MediaPipe customization binary and the Python process to convert the image dataset into the landmark CSV dataset
- Created the Python process to manually visualize the dataset and removed bad data
- Set up Amazon Web Services (unused)
- Created the server install and run script
- Organized and cleaned the GitHub repository and created READMEs
- Created and edited the demonstration videos
- Created the final presentation
- Tested and ran the entire project on my Ubuntu machine with a graphics card
- Installed the entire project on an MSU computer
- Connected the final modules together to configure the final project
- Created the client install and run scripts for Linux
- Finalized the project and testing; I was the only team member able to run the entire project (because it requires Ubuntu with a graphics card)