Voice Output of Object Detection Results

Introduction

This lesson creates a program that subscribes to the YOLO object detection results implemented in Lesson7 and announces the names of detected objects by voice. This enables the robot to tell you “what it sees”.

Learning Objectives

By the end of this lesson, you will be able to:

  • Integrate multiple ROS2 nodes (object detection and voice output)
  • Subscribe to detection results and process them
  • Combine sensor processing with robot interaction capabilities
  • Create interactive robot behaviors using ROS2 communication

Program Overview

This program has the following two main functions:

  1. Subscribe to YOLO object detection results from Lesson7
  2. Output detected object names by voice

These functions are implemented in one node:

  • kachaka_speak_detection: Subscribes to object detection results and outputs by voice

Required Packages

The following packages are required to run this program:

  • image_yolo_detection: Object detection package implemented in Lesson7
  • kachaka_interfaces: Package that provides interfaces for Kachaka robot

Program Implementation

Voice Output Node (kachaka_speak_detection.py)

kachaka_speak_detection.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from kachaka_interfaces.action import ExecKachakaCommand
from kachaka_interfaces.msg import KachakaCommand
from rclpy.action import ActionClient
import json


class KachakaSpeakDetection(Node):
    """
    Node for speaking detected objects
    """
    def __init__(self):
        super().__init__('kachaka_speak_detection')
        
        # Set QoS profile for object detection data
        qos = rclpy.qos.QoSProfile(
            reliability=rclpy.qos.ReliabilityPolicy.BEST_EFFORT,
            durability=rclpy.qos.DurabilityPolicy.VOLATILE,
            history=rclpy.qos.HistoryPolicy.KEEP_LAST,
            depth=10
        )
        
        # Subscribe to object detection information from Lesson7
        self.subscription = self.create_subscription(
            String,
            '/image_yolo_detection/objects',  # Topic from Lesson7
            self.object_callback,
            qos
        )

        # Action client for speech output
        self._action_client = ActionClient(
            self, ExecKachakaCommand, "/kachaka/kachaka_command/execute"
        )
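        # Block here until the Kachaka action server becomes available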
        self._action_client.wait_for_server()
        
        # Dictionary mapping YOLO class names to the names used for voice output
        self.object_names = {
            'person': 'person',
            'car': 'car',
            'chair': 'chair',
            'bottle': 'bottle',
            'cup': 'cup',
            'laptop': 'laptop',
            'mouse': 'mouse',
            'keyboard': 'keyboard',
            'cell phone': 'smartphone',
            'book': 'book',
            # Add more as needed
        }
        
        # Store the last detected objects
        self.last_objects = []
        
        # Create a timer to periodically speak the detected objects
        self.timer = self.create_timer(5.0, self.timer_callback)
        
        self.get_logger().info("Kachaka speak detection node has started.")

    def object_callback(self, msg):
        """
        Process detected object information and store it
        """
        try:
            # Convert JSON string to Python object
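            # Expected payload format (assumed from Lesson7): {"objects": ["person", ...]}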
            object_info = json.loads(msg.data)
            self.last_objects = object_info.get("objects", [])
            
        except Exception as e:
            self.get_logger().error(f"Error processing object information: {str(e)}")
            
    def timer_callback(self):
        """
        Periodically speak the detected objects.
        
        This callback is triggered every 5 seconds by the timer. It creates
        a natural-sounding sentence describing the detected objects and
        sends it to the robot for voice output.
        """
        # If no objects are detected, don't say anything
        if not self.last_objects:
            return
            
        # Convert all detected objects and create a summary message
        # Map English object names to display names (or keep original if not mapped)
        object_names = []
        for obj in self.last_objects:
            # Get the display name from dictionary, or use original name if not found
            object_name = self.object_names.get(obj, obj)
            object_names.append(object_name)
        
        # Create a natural-sounding summary message
        # Handle singular vs plural cases for better speech
        if len(object_names) == 1:
            # Single object: "I detected person"
            text = f"I detected {object_names[0]}"
        else:
            # Multiple objects: "I detected person, car and chair"
            # Join all but the last with commas, then add "and" before the last one
            text = f"I detected {', '.join(object_names[:-1])} and {object_names[-1]}"
        
        # Execute speech output by sending the text to the robot
        self.send_goal(text)
        
        # Log what was said for debugging
        self.get_logger().info(f"Speech output: {text}")
            
    def send_goal(self, text):
        """
        Send a speech command to Kachaka using the action server.
        
        This method creates a command to make the robot speak the given text.
        The command is sent asynchronously, so it doesn't block the node.
        
        Parameters:
            text: The text string that the robot should speak
        """
        # Create a KachakaCommand message
        command = KachakaCommand()
        # Set command type to SPEAK_COMMAND (tells robot to use voice output)
        command.command_type = KachakaCommand.SPEAK_COMMAND
        # Set the text content to be spoken
        command.speak_command_text = text

        # Create a goal message for the action
        goal_msg = ExecKachakaCommand.Goal()
        # Attach the command to the goal
        goal_msg.kachaka_command = command

        # Send the goal asynchronously (non-blocking)
        # This returns a Future that will complete when the server responds
        future = self._action_client.send_goal_async(goal_msg)
        # Set a callback to handle when the server accepts or rejects the goal
        future.add_done_callback(self.goal_response_callback)

    def goal_response_callback(self, future):
        """
        Callback for when the goal is accepted or rejected
        """
        goal_handle = future.result()
        if not goal_handle.accepted:
            self.get_logger().info("Goal rejected")
            return
        self.get_logger().info("Goal accepted")
        result_future = goal_handle.get_result_async()
        result_future.add_done_callback(self.get_result_callback)

    def get_result_callback(self, future):
        """
        Callback for when the goal is completed
        """
        result = future.result().result
        self.get_logger().info(f"Result: {result}")


def main(args=None):
    rclpy.init(args=args)
    node = KachakaSpeakDetection()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        node.get_logger().info("Shutting down...")
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()

Program Explanation

Voice Output Node (kachaka_speak_detection.py)

This node has the following functions:

  1. Subscribe to YOLO object detection results from Lesson7
  2. Process detected object names
  3. Output object names by voice

Main processing flow:

  1. Receive information from Lesson7’s object detection node
  2. Convert the JSON format string to a Python object (an example payload is shown after this list)
  3. Process detected object names
  4. Output object names by voice
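
The exact format of the message published by Lesson7 is an assumption here, inferred from how object_callback parses it: a JSON object whose "objects" key holds a list of class-name strings. A payload on /image_yolo_detection/objects would therefore look like this:

{"objects": ["person", "cup", "laptop"]}

Each entry is expected to be a plain class name, which is why timer_callback can look it up directly in the object_names dictionary.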

Implementation of Voice Output

Voice output is implemented using the Kachaka robot’s action server. The send_goal method performs voice output in the following steps (a quick way to verify that this action server is available is shown after the list):

  1. Create KachakaCommand message
  2. Set command type to SPEAK_COMMAND
  3. Set text to speak
  4. Create goal message
  5. Send goal asynchronously
  6. Set callback functions to handle goal acceptance and results
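
Before running the node, you can confirm that the action server named in the code is actually available. This assumes the Kachaka ROS2 bridge is already running:

# List all action servers currently available
ros2 action list

# Show the action type for the speech command server
ros2 action info /kachaka/kachaka_command/execute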

Callback Processing

Action client callback processing is implemented as follows:

  1. goal_response_callback: Handles whether the goal was accepted or rejected

    • If the goal is accepted, set a callback to get results
    • If the goal is rejected, log it
  2. get_result_callback: Handles goal execution results

    • Log the results

Periodic Voice Output

To periodically output detected objects by voice, the following functions are implemented (a variation that avoids repeating the same announcement is sketched after this list):

  1. last_objects variable: Stores the latest detection results
  2. Timer: Calls timer_callback method every 5 seconds
  3. timer_callback method: Outputs stored detection results by voice
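
Note that the timer fires every 5 seconds whether or not the scene has changed, so the same sentence is repeated while the detections stay the same. A minimal variation (not part of the lesson code) is to remember what was last announced and skip unchanged results; self.spoken_objects is a new attribute you would initialize to set() in __init__:

    def timer_callback(self):
        if not self.last_objects:
            return
        # Skip the announcement if the detected set has not changed
        current = set(self.last_objects)
        if current == self.spoken_objects:
            return
        self.spoken_objects = current
        # ... build and speak the message as before ...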

Usage

Preparation

  1. Create package
# Move to workspace src directory
cd ~/ros2_ws/src

# Create package
ros2 pkg create --build-type ament_python kachaka_speak_detection --dependencies rclpy std_msgs kachaka_interfaces --node-name kachaka_speak_detection

The --node-name option generates a stub file at kachaka_speak_detection/kachaka_speak_detection/kachaka_speak_detection.py together with its entry point in setup.py; replace the stub’s contents with the code shown above before building.
  2. Build package
# Move to workspace root directory
cd ~/ros2_ws

# Build package
colcon build --packages-select kachaka_speak_detection

# Set up environment
source install/setup.bash
  3. Confirm that Lesson7’s YOLO object detection node is running
ros2 run image_yolo_detection image_yolo_detection
  4. Run the voice output node (example log output is shown below)
ros2 run kachaka_speak_detection kachaka_speak_detection
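
If everything is set up correctly, the node’s own log calls produce output along these lines (timestamps omitted; the detected objects depend on the camera image):

[INFO] [kachaka_speak_detection]: Kachaka speak detection node has started.
[INFO] [kachaka_speak_detection]: Speech output: I detected person and cup
[INFO] [kachaka_speak_detection]: Goal accepted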

Exercises

  1. Add object names

    • Add new object names to the object_names dictionary
  2. Display detection probability

    • Modify to also output detection probability by voice
  3. Detect only specific objects

    • Modify to detect and output only specific objects (e.g., person or car) by voice (a hint sketch follows this list)
  4. Add object position information

    • Modify to also output object position information (where in the screen they were detected) by voice
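
As a hint for Exercise 3, a whitelist filter at the top of timer_callback is enough; target_objects is a new name introduced here only for illustration:

        # Keep only the objects we want to announce (Exercise 3)
        target_objects = {'person', 'car'}
        detected = [obj for obj in self.last_objects if obj in target_objects]
        if not detected:
            return
        object_names = [self.object_names.get(obj, obj) for obj in detected]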

Notes

  • The voice output node will start but stay silent if Lesson7’s YOLO object detection node is not publishing results (a stand-in test publisher is sketched below)
  • The node blocks at startup in wait_for_server() until the Kachaka action server becomes available, and the robot may not be able to execute other commands while it is speaking
  • Extend the object_names mapping as needed
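
To test the voice output node without running the full Lesson7 pipeline, a stand-in publisher like the following can feed it fake detections. It publishes the JSON format assumed throughout this lesson to the same topic; the class names used here are only examples:

#!/usr/bin/env python3
# Minimal stand-in for Lesson7's detection publisher (testing only)
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class FakeDetectionPublisher(Node):
    def __init__(self):
        super().__init__('fake_detection_publisher')
        # Publish to the topic kachaka_speak_detection subscribes to
        self.publisher = self.create_publisher(
            String, '/image_yolo_detection/objects', 10
        )
        self.timer = self.create_timer(2.0, self.timer_callback)

    def timer_callback(self):
        msg = String()
        # Same JSON schema the voice output node expects
        msg.data = json.dumps({"objects": ["person", "cup"]})
        self.publisher.publish(msg)


def main(args=None):
    rclpy.init(args=args)
    node = FakeDetectionPublisher()
    try:
        rclpy.spin(node)
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()

A default (reliable) publisher is compatible with the node’s best-effort subscription, so no special QoS settings are needed here.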