Real-Time Object Detection Using the YOLO Model: A Navigation Assistant for the Visually Impaired Using Speech to Text, Object Detection, and Text to Speech

Shreya Singh
Mar 10, 2019

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system, with pre-trained models available for several common object classes such as chair, person, car, and bottle.

The GitHub repository for this project is available at: https://github.com/singh-shreya6/Navigation-Assistant

We used it to build a navigation assistant that tells the user, in degrees, how far to the left, right, or straight ahead the searched-for object lies. The entire system has three modules:

  1. Speech to Text using the Google Speech API: the user gives input as speech, e.g. “Find me a chair” or “Find bottle” (a minimal sketch follows this list).
  2. Object Detection: the detection model locates the requested object. We modified it to report how many degrees to the left or right the object lies.
  3. Text to Speech: Python’s pyttsx library speaks the output back, e.g. “Chair found at 21 degrees left” or “Bottle found ahead”.
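A minimal sketch of the first module, assuming the speech_recognition package as the Google API client (the repository’s actual client code may differ; the wake phrase “help me” and the query phrasing come from the workflow described below, everything else is an assumption):

    import speech_recognition as sr

    recognizer = sr.Recognizer()

    def listen():
        # Record one utterance from the default microphone
        with sr.Microphone() as source:
            audio = recognizer.listen(source)
        try:
            # Transcribe via the Google Web Speech API
            return recognizer.recognize_google(audio).lower()
        except sr.UnknownValueError:
            return ""  # speech was unintelligible

    # Stay in sleep mode until the wake phrase is heard
    while "help me" not in listen():
        pass
    query = listen()            # e.g. "find me a chair"
    target = query.split()[-1]  # crude parse: take the last word as the object class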
[Figure: Workflow of the system]

Requirements:
The system is primarily Python-based.

[Figure: Requirements of the system]
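The exact dependency list lived in the screenshot above; a plausible reconstruction from the libraries this post names (OpenCV’s DNN module, imutils, pyttsx, and a Google speech client) would be:

    pip install opencv-python numpy imutils pyttsx SpeechRecognition pyaudio

(On Python 3, pyttsx3 is the maintained fork of pyttsx; pyaudio is only needed for microphone capture.)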

Run Command:

python code.py \
    --prototxt MobileNetSSD_deploy.prototxt.txt \
    --model MobileNetSSD_deploy.caffemodel
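The flags suggest code.py loads the Caffe-format MobileNet SSD through OpenCV’s DNN module; a minimal sketch of that loading step (the argument names come from the command above, the rest is assumed):

    import argparse
    import cv2

    ap = argparse.ArgumentParser()
    ap.add_argument("--prototxt", required=True, help="path to the Caffe deploy prototxt")
    ap.add_argument("--model", required=True, help="path to the pre-trained Caffe weights")
    args = ap.parse_args()

    # Load the network definition and weights into OpenCV's DNN module
    net = cv2.dnn.readNetFromCaffe(args.prototxt, args.model)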

Workflow Explained:

  1. The entire system starts in sleep mode; speaking “Help Me” activates it (see the speech sketch above).
  2. The user can then give a voice query such as “find me a bottle” or “find bed”.
  3. A video stream is started using Python’s imutils library.
  4. The object detection module activates and matches each frame against the model’s set of predefined classes. A deep neural network (DNN) is used as the model; the run command above loads the MobileNet SSD Caffe model (see the detection sketch after this list).
  5. If the object is found, a bounding box is drawn around it and its coordinates are fetched. As the object’s reference point from the user’s viewpoint, we take the center of the bottom edge of the box. If the requested object is not among the model’s trained classes, the system returns an “object not found” output.
  6. The bearing is computed from the horizontal and vertical offsets between that reference point (centerX, centerY) and the user’s viewpoint (user_x, user_y): math.degrees(math.atan(abs(centerX - user_x) / abs(centerY - user_y))) (a worked example follows this list).
  7. The video stream stops.
  8. Finally, the text output is converted to speech using Python’s pyttsx library (sketched after this list).
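Steps 3-5 as a minimal sketch, reusing the net loaded earlier and the target parsed from the voice query; the class list is the standard MobileNet SSD one, while the 0.5 confidence threshold and 400 px frame width are assumptions:

    from imutils.video import VideoStream
    import imutils
    import numpy as np
    import cv2
    import time

    # Standard MobileNet SSD class labels
    CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
               "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
               "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
               "sofa", "train", "tvmonitor"]

    vs = VideoStream(src=0).start()  # step 3: start the video stream
    time.sleep(2.0)                  # give the camera time to warm up

    frame = imutils.resize(vs.read(), width=400)
    (h, w) = frame.shape[:2]

    # step 4: pass the frame through the DNN
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()

    # step 5: look for the queried class and take the bottom-center of its box
    found = None
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        label = CLASSES[int(detections[0, 0, i, 1])]
        if confidence > 0.5 and label == target:
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")
            found = ((startX + endX) // 2, endY)  # center of the bottom edge
            break

    vs.stop()  # step 7: stop the stream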
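Step 6 as a worked example: taking the user’s viewpoint to be the bottom-center of the frame (the post does not pin this down, so it is an assumption), a box whose base sits 80 px to the left of and 120 px above that point gives roughly 34 degrees left:

    import math

    user_x, user_y = 200, 300    # assumed viewpoint: bottom-center of a 400x300 frame
    centerX, centerY = 120, 180  # bottom-center of the detected box

    angle = math.degrees(math.atan(abs(centerX - user_x) / abs(centerY - user_y)))
    side = "left" if centerX < user_x else "right"
    print("Chair found at %d degrees %s" % (round(angle), side))
    # -> Chair found at 34 degrees left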
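And step 8 with pyttsx (on Python 3 the import would be pyttsx3; the init/say/runAndWait interface is the same):

    import pyttsx

    engine = pyttsx.init()  # initialise the text-to-speech engine
    engine.say("Chair found at 34 degrees left")
    engine.runAndWait()     # block until the utterance has been spoken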

Output:

[Figure: Video stream and outputs on terminal]

