Module 3.6: Troubleshooting and Debugging


Duration	2 hours
Day	7 of 7

Learning Objectives

By the end of this module, students will be able to:

Diagnose common voice AI issues
Use debugging tools effectively
Implement error handling strategies
Build resilient agents

Topics

1. Common Issues and Symptoms (25 min)

Issue Classification

Category	Symptoms	Common Causes
Connectivity	No response, timeouts	Network, auth, SSL
Recognition	Wrong understanding	Accent, noise, jargon
Logic	Wrong actions taken	Prompt issues, missing context
Performance	Slow responses	API delays, cold starts
Integration	Function failures	API errors, data issues

Diagnostic Decision Tree

Issue Detected
     │
     ├─── Agent not responding?
     │         │
     │         ├── Check: Network connectivity
     │         ├── Check: Authentication
     │         └── Check: Health endpoint
     │
     ├─── Wrong understanding?
     │         │
     │         ├── Check: STT configuration
     │         ├── Check: Language settings
     │         └── Check: Hints/vocabulary
     │
     ├─── Wrong actions?
     │         │
     │         ├── Check: Prompt clarity
     │         ├── Check: Function parameters
     │         └── Check: Context state
     │
     └─── Slow responses?
               │
               ├── Check: Function latency
               ├── Check: External API times
               └── Check: Cold start issues
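
The tree above can also be kept as a small lookup table, so an on-call script or runbook prints the same checks every time. This is only a sketch: the symptom keys and the `checks_for` helper are hypothetical names chosen for illustration, not part of any SDK.

```python
# Hypothetical lookup table mirroring the diagnostic decision tree.
DIAGNOSTIC_CHECKS = {
    "not_responding": ["Network connectivity", "Authentication", "Health endpoint"],
    "wrong_understanding": ["STT configuration", "Language settings", "Hints/vocabulary"],
    "wrong_actions": ["Prompt clarity", "Function parameters", "Context state"],
    "slow_responses": ["Function latency", "External API times", "Cold start issues"],
}


def checks_for(symptom: str) -> list:
    """Return the ordered checks for a symptom, or an empty list if unknown."""
    return list(DIAGNOSTIC_CHECKS.get(symptom, []))


# Print the checklist for one symptom.
for step, check in enumerate(checks_for("slow_responses"), start=1):
    print(f"{step}. Check: {check}")
```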

2. Using swaig-test (30 min)

Basic Testing

# Dump SWML output
swaig-test my_agent.py --dump-swml

# List all registered tools
swaig-test my_agent.py --list-tools

# Test specific function
swaig-test my_agent.py --exec get_order --order_id "12345"

# Test with metadata
swaig-test my_agent.py --exec process_payment \
  --amount "100" \
  --meta customer_id="C001" \
  --meta verified="true"

Debugging Function Issues

# Verbose output
swaig-test my_agent.py --exec my_function --x "test" --verbose

# Check function signature
swaig-test my_agent.py --list-tools --json | jq '.[] | select(.name=="my_function")'

# Test with raw_data simulation
swaig-test my_agent.py --exec my_function \
  --query "help" \
  --raw '{"call_id": "test-123", "meta_data": {"customer": "John"}}'

SWML Validation

# Validate SWML structure
swaig-test my_agent.py --dump-swml --validate

# Check specific sections
swaig-test my_agent.py --dump-swml | jq '.sections.main[0].ai'

# Verify functions are exposed
swaig-test my_agent.py --dump-swml | jq '.sections.main[0].ai.SWAIG.functions'
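
When jq is not available, the same inspection can be done in Python on the dumped JSON. The sketch below assumes the layout implied by the jq paths above (`sections.main[...]`, then `ai.SWAIG.functions`, with each entry named by a `function` key). It scans every step in `main` rather than only index 0, in case the `ai` step is not first. The sample document is hand-written for illustration, not real `swaig-test` output.

```python
def list_swaig_functions(swml: dict) -> list:
    """Return the function names found under any `ai` step in sections.main."""
    names = []
    for step in swml.get("sections", {}).get("main", []):
        # Steps may be plain strings or single-key dicts; only dicts can hold `ai`.
        if not isinstance(step, dict) or "ai" not in step:
            continue
        functions = step["ai"].get("SWAIG", {}).get("functions", [])
        names.extend(f.get("function", "<unnamed>") for f in functions)
    return names


# Hand-written sample standing in for the output of --dump-swml.
sample_swml = {
    "version": "1.0.0",
    "sections": {
        "main": [
            {"answer": {}},
            {"ai": {"SWAIG": {"functions": [
                {"function": "get_order"},
                {"function": "process_payment"},
            ]}}},
        ]
    },
}

print(list_swaig_functions(sample_swml))  # ['get_order', 'process_payment']
```

An empty result for a real dump suggests the function was never registered, which narrows a "function not called" problem to the agent code rather than the prompt.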

3. Log-Based Debugging (30 min)

Enable Debug Logging

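import logging
import os

# Assumes the signalwire_agents package; adjust the import to match your SDK version
from signalwire_agents import AgentBase, SwaigFunctionResult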

# Enable debug logging
logging.basicConfig(
    level=logging.DEBUG if os.getenv("DEBUG") else logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)


class DebuggableAgent(AgentBase):
    def __init__(self):
        super().__init__(name="debuggable-agent")
        logger.debug(f"Agent initialized: {self._name}")
        self._setup_functions()

    def _setup_functions(self):
        @self.tool(
            description="Debug function",
            parameters={
                "type": "object",
                "properties": {
                    "input": {"type": "string", "description": "Input data"}
                },
                "required": ["input"]
            }
        )
        def debug_function(args: dict, raw_data: dict = None) -> SwaigFunctionResult:
            # Avoid naming the variable `input`, which would shadow the builtin
            user_input = args.get("input", "")
            raw_data = raw_data or {}
            global_data = raw_data.get("global_data", {})
            logger.debug(f"Function called with input: {user_input}")
            logger.debug(f"raw_data: {raw_data}")
            logger.debug(f"Current global_data: {global_data}")

            # process() stands in for your own business logic
            result = process(user_input)
            logger.debug(f"Function result: {result}")

            return SwaigFunctionResult(result)

Run with Debug Mode

# Enable debug logging
DEBUG=1 python my_agent.py

# Or with a log-level variable (only takes effect if your agent reads LOGLEVEL;
# the logging setup above checks DEBUG only)
LOGLEVEL=DEBUG python my_agent.py

Analyzing Logs

# Filter for specific call
grep "call-123" agent.log

# Filter for errors
grep -E "(ERROR|CRITICAL)" agent.log

# Filter for function calls
grep "Function called" agent.log

# JSON log parsing (requires JSON-formatted log lines, not the plain-text format above)
jq 'select(.level=="ERROR")' agent.log

# Timeline analysis
jq -r '[.timestamp, .message] | @tsv' agent.log | head -20
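
The jq filters above only work if each log line is a JSON object with `level`, `timestamp`, and `message` keys, which the plain-text `format=` string shown earlier does not produce. A minimal sketch of a formatter that emits such lines follows. The `JsonFormatter` class and the `call_id` correlation field are assumptions made for illustration, not SDK features.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # call_id is an assumed correlation field supplied through `extra=`.
            "call_id": getattr(record, "call_id", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("agent.json")
json_logger.addHandler(handler)
json_logger.setLevel(logging.DEBUG)
json_logger.propagate = False  # avoid duplicate plain-text output from the root logger

# Passing call_id on every message makes grep "call-123" and jq filters line up.
json_logger.error("Order lookup failed", extra={"call_id": "call-123"})
```

Swapping `StreamHandler` for `logging.FileHandler("agent.log")` would write the same lines to the file the commands above read.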

4. Error Handling Strategies (25 min)

Graceful Degradation

class ResilientAgent(AgentBase):
    def __init__(self):
        super().__init__(name="resilient-agent")
        self._setup_functions()

    def _setup_functions(self):
        @self.tool(
            description="Get order status",
            parameters={
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "Order ID"}
                },
                "required": ["order_id"]
            }
        )
        def get_order_status(args: dict, raw_data: dict = None) -> SwaigFunctionResult:
            order_id = args.get("order_id", "")
            try:
                # Primary: Real-time API
                status = api.get_order(order_id)
                return SwaigFunctionResult(f"Order {order_id}: {status}")

            except ConnectionError:
                logger.warning("API unavailable, trying cache")
                try:
                    # Fallback: Cached data
                    status = cache.get(f"order:{order_id}")
                    if status:
                        return SwaigFunctionResult(
                            f"Order {order_id}: {status} "
                            "(Note: This may be slightly outdated)"
                        )
                except Exception as cache_error:
                    # Log the cache failure instead of silently discarding it
                    logger.warning(f"Cache lookup failed: {cache_error}")

                # Final fallback: Helpful response
                return SwaigFunctionResult(
                    "I'm having trouble checking order status right now. "
                    "Would you like me to have someone call you back?"
                )

            except ValueError:
                logger.error(f"Invalid order ID: {order_id}")
                return SwaigFunctionResult(
                    "That doesn't look like a valid order number. "
                    "Order numbers are 8 digits. Could you check and try again?"
                )

            except Exception as e:
                logger.error(f"Unexpected error: {e}", exc_info=True)
                return SwaigFunctionResult(
                    "Something went wrong. Let me connect you with support."
                )
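
The `ValueError` branch above only fires if something actually raises on a malformed ID. One way to guarantee that is to validate before calling the API. This is a sketch: the 8-digit rule is taken from the error message in the handler, and `validate_order_id` is a hypothetical helper.

```python
import re

# Exactly eight ASCII digits, per the message "Order numbers are 8 digits".
ORDER_ID_PATTERN = re.compile(r"[0-9]{8}")


def validate_order_id(order_id: str) -> str:
    """Return the trimmed order ID, or raise ValueError if it is not 8 digits."""
    candidate = (order_id or "").strip()
    if not ORDER_ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"Order ID must be 8 digits, got {candidate!r}")
    return candidate


print(validate_order_id(" 12345678 "))  # 12345678

try:
    validate_order_id("12345")
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Calling the validator as the first statement inside the `try` block keeps malformed input away from the API and routes it to the friendlier `ValueError` response.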

Retry Logic

import time
from functools import wraps


def retry(max_attempts=3, delay=1, backoff=2):
    """Retry decorator with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            current_delay = delay

            while attempts < max_attempts:
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError) as e:
                    attempts += 1
                    if attempts == max_attempts:
                        raise
                    logger.warning(
                        f"Attempt {attempts} failed: {e}. "
                        f"Retrying in {current_delay}s"
                    )
                    time.sleep(current_delay)
                    current_delay *= backoff

        return wrapper
    return decorator


class RetryAgent(AgentBase):
    @retry(max_attempts=3, delay=0.5)
    def _fetch_data(self, id: str):
        return api.get(id)

    @AgentBase.tool(
        description="Get data with retry",
        parameters={
            "type": "object",
            "properties": {
                "id": {"type": "string", "description": "Data ID"}
            },
            "required": ["id"]
        }
    )
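    def get_data(self, args: dict, raw_data: dict = None) -> SwaigFunctionResult:
        data_id = args.get("id", "")  # avoid shadowing the builtin id()
        try:
            data = self._fetch_data(data_id)
            return SwaigFunctionResult(f"Data: {data}")
        except Exception as e:
            # Record why the final attempt failed before answering the caller
            logger.error(f"get_data failed after retries: {e}", exc_info=True)
            return SwaigFunctionResult(
                "Unable to retrieve data after multiple attempts."
            )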
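
On a live call, the time spent sleeping between retries is silence the caller hears, so it helps to work out the worst case before choosing parameters. The helper below is a sketch that reproduces the sleep schedule of the `retry` decorator above. `backoff_schedule` is a hypothetical name, and the time spent inside each failed attempt is not included.

```python
def backoff_schedule(max_attempts: int = 3, delay: float = 1, backoff: float = 2) -> list:
    """Return the sleep durations between attempts (none after the final attempt)."""
    return [delay * backoff ** i for i in range(max(max_attempts - 1, 0))]


# Parameters used by RetryAgent._fetch_data: max_attempts=3, delay=0.5, backoff=2
schedule = backoff_schedule(max_attempts=3, delay=0.5)
print(schedule)       # [0.5, 1.0]
print(sum(schedule))  # 1.5 seconds of added waiting in the worst case
```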

Circuit Breaker

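from datetime import datetime, timedelta


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""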


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.last_failure = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if datetime.now() - self.last_failure > timedelta(seconds=self.reset_timeout):
                self.state = "half-open"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
            # Reset on every success so only consecutive failures trip the breaker
            self.failures = 0
            return result

        except Exception as e:
            self.failures += 1
            self.last_failure = datetime.now()

            if self.failures >= self.failure_threshold:
                self.state = "open"
                logger.error(f"Circuit breaker opened after {self.failures} failures")

            raise


class CircuitBreakerAgent(AgentBase):
    def __init__(self):
        super().__init__(name="circuit-agent")
        self.api_circuit = CircuitBreaker(failure_threshold=3, reset_timeout=30)
        self._setup_functions()

    def _setup_functions(self):
        @self.tool(
            description="Protected API call",
            parameters={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "API query"}
                },
                "required": ["query"]
            }
        )
        def protected_call(args: dict, raw_data: dict = None) -> SwaigFunctionResult:
            query = args.get("query", "")
            try:
                result = self.api_circuit.call(api.search, query)
                return SwaigFunctionResult(f"Found: {result}")

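            except CircuitOpenError:
                return SwaigFunctionResult(
                    "Our search system is temporarily unavailable. "
                    "Please try again in a few minutes."
                )

            except Exception as e:
                # The breaker re-raises the underlying failure; handle it here
                logger.error(f"Search failed: {e}", exc_info=True)
                return SwaigFunctionResult(
                    "I couldn't complete that search. Please try again."
                )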
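
The state transitions are hard to verify when they depend on the wall clock. The sketch below is a compact variant of the breaker with an injectable clock, so the closed, open, half-open, closed cycle can be exercised without waiting. It is an illustration under assumptions, not a drop-in replacement: `ClockedCircuitBreaker` and its `clock` parameter are hypothetical, and it reopens immediately if the half-open trial call fails.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class ClockedCircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow one trial call through
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.state = "closed"
        return result


def failing_backend():
    raise ConnectionError("backend down")


now = [0.0]  # fake clock, advanced by hand
breaker = ClockedCircuitBreaker(failure_threshold=2, reset_timeout=30, clock=lambda: now[0])

for _ in range(2):
    try:
        breaker.call(failing_backend)
    except ConnectionError:
        pass
print(breaker.state)  # open

now[0] = 31.0  # move past the reset timeout
print(breaker.call(lambda: "ok"))  # ok
print(breaker.state)  # closed
```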

5. Production Debugging (30 min)

Remote Debugging Setup

import os

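if os.getenv("REMOTE_DEBUG"):
    import debugpy
    # 0.0.0.0 accepts connections on every interface; restrict access to port 5678
    # (firewall or SSH tunnel) and never leave this enabled on a public host
    debugpy.listen(("0.0.0.0", 5678))
    print("Waiting for debugger attach...")
    # Blocks startup until a debugger connects
    debugpy.wait_for_client()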

Request Replay

import json
from datetime import datetime, timezone


class ReplayableAgent(AgentBase):
    def __init__(self):
        super().__init__(name="replayable-agent")
        self.request_log = []
        self._setup_functions()

    def log_request(self, function_name: str, args: dict, raw_data: dict):
        """Log request for replay."""
        entry = {
            # datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "function": function_name,
            "args": args,
            "raw_data": raw_data
        }
        self.request_log.append(entry)

        # Also write to file
        with open("requests.jsonl", "a") as f:
            f.write(json.dumps(entry) + "\n")

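    def replay_request(self, entry: dict):
        """Replay a logged request.

        Assumes the tool is reachable as an attribute on the agent and takes
        (args, raw_data), like the handlers earlier in this module.
        """
        func = getattr(self, entry["function"])
        # Handlers take the args dict positionally, not as keyword arguments
        return func(entry["args"], raw_data=entry["raw_data"])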


# Replay from file
def replay_from_log(agent, log_file: str):
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            print(f"Replaying: {entry['function']}")
            try:
                result = agent.replay_request(entry)
                print(f"Result: {result}")
            except Exception as e:
                print(f"Error: {e}")
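
Replay by `getattr` only finds handlers that are attributes of the agent, so tools defined as nested functions would be missed. One alternative is an explicit registry keyed by the logged function name. This is a sketch under assumptions: `HANDLERS`, `register`, and `replay_log` are hypothetical names, and the `get_order` handler is a stand-in. Replay re-executes side effects, so point handlers with external effects (payments, emails) at a test backend first.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry mapping logged function names to handlers.
HANDLERS = {}


def register(name):
    """Register a handler under the name used in the request log."""
    def decorator(func):
        HANDLERS[name] = func
        return func
    return decorator


@register("get_order")
def get_order(args: dict, raw_data: dict = None) -> str:
    # Stand-in handler; a real one would call the order API.
    return f"Order {args.get('order_id', '?')}: shipped"


def replay_log(log_path) -> list:
    """Replay each JSONL entry and return (function, outcome) pairs."""
    outcomes = []
    with open(log_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            entry = json.loads(line)
            handler = HANDLERS.get(entry["function"])
            if handler is None:
                outcomes.append((entry["function"], "skipped: no handler"))
                continue
            try:
                outcomes.append((entry["function"], handler(entry["args"], entry.get("raw_data"))))
            except Exception as exc:
                outcomes.append((entry["function"], f"error: {exc}"))
    return outcomes


# Write two sample entries in the same shape as requests.jsonl, then replay them.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "requests.jsonl"
    entries = [
        {"function": "get_order", "args": {"order_id": "12345678"}, "raw_data": {}},
        {"function": "unknown_tool", "args": {}, "raw_data": {}},
    ]
    log_file.write_text("".join(json.dumps(e) + "\n" for e in entries))
    replayed = replay_log(log_file)

print(replayed)
```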

Health Check Debugging

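# Assumes `app` is a FastAPI-style application and `agent` and `db` are defined elsewhere
import time
from datetime import datetime, timezone

import requests


@app.get("/debug/health")
async def debug_health():
    """Detailed health check for debugging."""
    results = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "checks": {}
    }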

    # Agent SWML generation
    start = time.perf_counter()
    try:
        swml = agent.get_swml()
        results["checks"]["swml_generation"] = {
            "status": "ok",
            "duration_ms": (time.perf_counter() - start) * 1000,
            "function_count": len(swml.get("functions", []))
        }
    except Exception as e:
        results["checks"]["swml_generation"] = {
            "status": "error",
            "error": str(e)
        }

    # Database connectivity
    start = time.perf_counter()
    try:
        db.execute("SELECT 1")
        results["checks"]["database"] = {
            "status": "ok",
            "duration_ms": (time.perf_counter() - start) * 1000
        }
    except Exception as e:
        results["checks"]["database"] = {
            "status": "error",
            "error": str(e)
        }

    # External API
    start = time.perf_counter()
    try:
        response = requests.get(
            "https://api.example.com/health",
            timeout=5
        )
        results["checks"]["external_api"] = {
            "status": "ok" if response.ok else "degraded",
            "status_code": response.status_code,
            "duration_ms": (time.perf_counter() - start) * 1000
        }
    except Exception as e:
        results["checks"]["external_api"] = {
            "status": "error",
            "error": str(e)
        }

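    # Memory usage
    import psutil
    process = psutil.Process()
    results["system"] = {
        "memory_mb": process.memory_info().rss / 1024 / 1024,
        # The first cpu_percent() call on a new Process returns 0.0 unless an
        # interval is given; this samples for 100 ms
        "cpu_percent": process.cpu_percent(interval=0.1)
    }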

    return results
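
The three checks above repeat the same start-timer, try, record pattern. A small helper can hold that pattern once, so each new check is a one-line addition. This is a sketch: `timed_check` is a hypothetical helper, and the two lambdas stand in for the real SWML, database, and API probes.

```python
import time
from typing import Any, Callable


def timed_check(check: Callable[[], Any]) -> dict:
    """Run one check and report its status plus elapsed milliseconds."""
    start = time.perf_counter()
    try:
        detail = check()
        result = {"status": "ok"}
        if isinstance(detail, dict):
            result.update(detail)  # let a check attach extra fields
    except Exception as exc:
        result = {"status": "error", "error": str(exc)}
    result["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
    return result


# Stand-in checks; replace with the real probes.
checks = {
    "always_ok": lambda: {"function_count": 3},
    "always_fails": lambda: 1 / 0,
}
report = {name: timed_check(fn) for name, fn in checks.items()}
print(report)
```

Unlike the handler above, this version records `duration_ms` for failed checks as well, which shows whether a failure was immediate or a slow timeout.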

Troubleshooting Quick Reference

Agent Won’t Start

# Check syntax
python -m py_compile my_agent.py

# Check imports
python -c "from my_agent import *"

# Check environment
env | grep SIGNALWIRE

# Verbose startup
python my_agent.py --verbose

Function Not Called

# Verify function registered
swaig-test my_agent.py --list-tools | grep function_name

# Check SWML includes function
swaig-test my_agent.py --dump-swml | jq '.sections.main[0].ai.SWAIG.functions[] | select(.function=="function_name")'

# Test function directly
swaig-test my_agent.py --exec function_name --param "value"

Slow Performance

# Profile function
python -m cProfile -s cumulative my_agent.py

# Time external calls
curl -w "@timing.txt" -o /dev/null -s https://api.example.com/

# Check memory
python -c "import tracemalloc; tracemalloc.start(); from my_agent import *; print(tracemalloc.get_traced_memory())"
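
`cProfile` profiles the whole process, which is often too coarse for the "function latency" check in the decision tree. A per-function timing decorator is a lighter alternative. The sketch below is an assumption-laden example: `timed` is a hypothetical decorator, and the 500 ms default threshold is an arbitrary starting point rather than a recommended value.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def timed(threshold_ms: float = 500.0):
    """Log how long the wrapped function takes; warn when it exceeds the threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                level = logging.WARNING if elapsed_ms > threshold_ms else logging.DEBUG
                logger.log(level, "%s took %.1f ms", func.__name__, elapsed_ms)
        return wrapper
    return decorator


@timed(threshold_ms=50)
def slow_lookup(order_id: str) -> str:
    time.sleep(0.06)  # simulate a slow external API
    return f"Order {order_id}: shipped"


logging.basicConfig(level=logging.DEBUG)
print(slow_lookup("12345678"))
```

Applied to a tool handler, the decorator goes beneath the tool registration decorator so that the timed wrapper is what gets registered.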

Key Takeaways

Systematic debugging - Follow the decision tree
Use the tools - swaig-test is your friend
Log thoughtfully - Context and correlation
Handle errors gracefully - Fallbacks and retries
Monitor continuously - Catch issues before users do

Preparation for Lab 3.6

Identify a problematic agent or function
Gather error logs
Note reproduction steps

Lab Preview

In Lab 3.6, you will:

Debug a broken agent
Fix common issues
Implement error handling
Add resilience patterns