Building a Sports AI Assistant: Architecture Walkthrough

This document provides a technical walkthrough of the Sports AI Assistant — a retrieval-augmented generation (RAG) system designed to handle complex, real-time sports queries using large language models (LLMs), semantic retrieval, and live data integrations.

Table of Contents

1. Overview
2. Technology Stack
3. API Layer
4. Retrieval-Augmented Generation (RAG)
5. Entity Extraction
6. Workflow Orchestration
7. Guardrails and Agent Routing
8. Memory and Context Management
9. Real-Time Data Integrations
10. Database Strategy
11. Streaming Architecture
12. Configuration Management
13. Observability
14. Key Insights
15. Summary

1. Overview

The assistant combines domain-specific knowledge with live sports data to produce contextual answers. It integrates structured retrieval, real-time APIs, and workflow orchestration to answer questions about games, players, betting lines, and playing conditions such as weather and injuries, keeping answers both precise and current.

System Diagram

User
  ↓
API Router (FastAPI) ──▶ Guardrails (NeMo)
  ↓                         │
  └─────────────▶ Agent Router (Sports/Betting/General)
                            │
                            ▼
                    Orchestrator (LlamaIndex Workflow)
                            │
         ┌─────────── Hybrid Retrieval ─────────────┐
         │      (Vector KB + Real-time Sources)     │
         │  pgvector  + odds/injury/weather APIs    │
         └──────────────────┬───────────────────────┘
                            ▼
                    Contextual Chat Engine (LLM)
                            │
                          Memory
                            │
                         Streaming
                            │
                           UI

2. Technology Stack

Layer	Technology	Purpose
API Framework	FastAPI	Async web server for request handling and streaming
Embeddings	HuggingFace (BAAI/bge-small-en-v1.5)	Fast, high-recall semantic understanding
Orchestration	LlamaIndex	Workflow and RAG coordination
Database	PostgreSQL + pgvector	Semantic vector storage and similarity search
LLM	OpenAI GPT-4	Reasoning and language generation engine
Guardrails	NeMo Guardrails	Conversation control and intent routing
Memory	LangChain ConversationSummaryBufferMemory	Context summarization and recall
Integrations	Custom async APIs	Live sports, odds, injury, and weather data

3. API Layer

The system exposes two primary endpoints — one for standard JSON responses and one for real-time streaming. FastAPI manages context injection, CORS, and async execution.

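from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

# ChatRequest, AssistantContext, and assistant are defined elsewhere in the application.
app = FastAPI()

@app.post("/api/chat/handle_chat/")
async def handle_chat(request: ChatRequest):
    # Standard endpoint: wait for the full answer and return it as JSON.
    context = AssistantContext.from_request(request)
    response = await assistant.chat(context)
    return JSONResponse({'response': response.content})

@app.post("/api/chat/stream_chat/")
async def stream_chat(request: ChatRequest):
    # Streaming endpoint: forward chunks to the client as server-sent events.
    context = AssistantContext.from_request(request)
    return StreamingResponse(
        assistant.stream_chat(context),
        media_type="text/event-stream"
    )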

Design Note: Embeddings and model configurations are initialized globally at startup to ensure consistency across all sessions.
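The handlers above depend on ChatRequest and AssistantContext.from_request, which this walkthrough does not show. The following is a minimal sketch of that mapping under stated assumptions: it uses standard-library dataclasses in place of the Pydantic models FastAPI would normally validate, and the field names (message, thread_id, stream, metadata) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class ChatRequest:
    """Hypothetical request body; the real schema is not shown in this document."""
    message: str
    thread_id: Optional[str] = None
    stream: bool = False


@dataclass
class AssistantContext:
    """Per-request state handed to the assistant (assumed shape)."""
    thread_id: str
    last_message: str
    stream: bool
    metadata: dict = field(default_factory=dict)

    @classmethod
    def from_request(cls, request: ChatRequest) -> "AssistantContext":
        # Start a new conversation thread when the client did not supply one.
        thread_id = request.thread_id or uuid.uuid4().hex
        return cls(
            thread_id=thread_id,
            last_message=request.message.strip(),
            stream=request.stream,
        )
```

Keeping the request schema separate from the internal context object lets the API contract change without touching the retrieval and workflow code that consumes the context.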

4. Retrieval-Augmented Generation (RAG)

The RAG pipeline combines stored sports knowledge with live data feeds to produce contextually grounded answers.

4.1 Vector Knowledge Base

Structured data such as team histories, player bios, and game metadata is stored semantically in PostgreSQL using pgvector.

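from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import VectorStoreIndex

def build_vector_index() -> VectorStoreIndex:
    vector_store = PGVectorStore.from_params(
        database="sports_knowledge",
        host="localhost",
        port="5432",  # default PostgreSQL port; needed to build the connection string
        user="assistant_user",
        password="secure_password",  # placeholder; load real credentials from configuration
        table_name="sports_embeddings_v2",
        embed_dim=384  # must match the output size of BAAI/bge-small-en-v1.5
    )
    # Wrap the existing table in an index without re-embedding its rows.
    return VectorStoreIndex.from_vector_store(vector_store)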
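To make the similarity search concrete, the sketch below reproduces the kind of ranking a vector store performs: score every stored embedding against the query embedding and keep the closest matches. It is an illustration only. The three-dimensional vectors and document ids are made up (the real embeddings have 384 dimensions), and pgvector does this inside the database rather than in Python.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], rows: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored documents most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in rows.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for rows of the embeddings table.
rows = {
    "lakers_history": [0.9, 0.1, 0.0],
    "warriors_roster": [0.1, 0.9, 0.0],
    "nfl_rules": [0.0, 0.1, 0.9],
}
results = top_k([0.8, 0.2, 0.0], rows, k=2)
```

Here the query vector points mostly along the first axis, so lakers_history ranks first and warriors_roster second.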

4.2 Hybrid Retriever

The retriever merges vector search results with live context from APIs (odds, weather, injury data).

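from typing import List

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle

class HybridSportsRetriever(BaseRetriever):
    def __init__(self, vector_retriever, live_data_sources):
        super().__init__()  # BaseRetriever sets up its own state here
        self.vector_retriever = vector_retriever
        self.live_data_sources = live_data_sources

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # Semantic matches from the vector knowledge base.
        vector_results = self.vector_retriever.retrieve(query_bundle)
        live_context = []

        # Query only the live sources that are relevant to this question.
        for source in self.live_data_sources:
            if source.is_relevant_for_query(query_bundle.query_str):
                data = source.fetch_current_data()
                # format_as_node (not shown) wraps the payload in a NodeWithScore.
                live_context.append(self.format_as_node(data))

        return vector_results + live_context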

This design allows the model to reference current sports data without retraining or re-indexing.
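The retriever above relies on each live source exposing is_relevant_for_query and fetch_current_data. As a rough sketch of what such a source might look like, the example below gates each source with simple keyword matching and flattens its payload into text. The class name KeywordLiveSource, the keyword lists, and the payloads are all hypothetical; a production system might use the extracted entities instead of substring checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class KeywordLiveSource:
    """Hypothetical live source gated by simple keyword matching."""
    name: str
    keywords: Tuple[str, ...]
    fetch: Callable[[], Dict[str, str]]  # stand-in for a real API call

    def is_relevant_for_query(self, query: str) -> bool:
        lowered = query.lower()
        return any(keyword in lowered for keyword in self.keywords)

    def fetch_current_data(self) -> Dict[str, str]:
        return self.fetch()


def format_as_text(source_name: str, data: Dict[str, str]) -> str:
    """Flatten a payload into one line of text that can be wrapped in a retrieval node."""
    fields = ", ".join(f"{key}={value}" for key, value in sorted(data.items()))
    return f"[live:{source_name}] {fields}"


def gather_live_context(query: str, sources: List[KeywordLiveSource]) -> List[str]:
    """Fetch and format data from every source that matches the query."""
    return [
        format_as_text(source.name, source.fetch_current_data())
        for source in sources
        if source.is_relevant_for_query(query)
    ]


sources = [
    KeywordLiveSource("odds", ("spread", "odds", "line"), lambda: {"spread": "LAL -3.5"}),
    KeywordLiveSource("weather", ("weather", "wind", "rain"), lambda: {"wind_mph": "12"}),
]
live = gather_live_context("What is the point spread tonight?", sources)
```

Only the odds source matches this query, so the weather API is never called.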

5. Entity Extraction

Every query is first parsed into structured entities (players, teams, leagues, markets, and date references). This enables precision filtering and routing downstream.

from pydantic import BaseModel
from typing import List, Optional

class SportsEntityData(BaseModel):
    players: List[str] = []
    teams: List[str] = []
    leagues: List[str] = []
    date_references: List[str] = []
    betting_markets: List[str] = []
    condensed_question: Optional[str] = None

Example: Question Processing

Input Question: "What are the current point spreads for tonight's Lakers vs Warriors game, and how has LeBron James performed against Golden State this season?"

Extracted Entities:

SportsEntityData(
    players=["LeBron James"],
    teams=["Lakers", "Warriors", "Los Angeles Lakers", "Golden State Warriors"],
    leagues=["NBA"],
    date_references=["tonight", "this season"],
    betting_markets=["point spread"],
    condensed_question="Current point spreads for Lakers vs Warriors tonight and LeBron's performance vs Warriors this season"
)

Context-Aware Follow-up Processing

The system also handles conversational context. Consider this follow-up question:

Follow-up Question: "How has he performed overall this season?"

Without Context: This question lacks essential information - who is "he"?

With Conversation History: The system references the previous question to create a complete query.

Condensed Question Output:

SportsEntityData(
    players=["LeBron James"],  # Resolved from conversation context
    teams=["Lakers"],  # LeBron's current team from context
    leagues=["NBA"],
    date_references=["this season"],
    betting_markets=[],
    condensed_question="How has LeBron James performed overall this season with the Lakers"
)

This context resolution happens before entity extraction, ensuring downstream processes receive complete, actionable queries regardless of conversational ambiguity.

The output of this extraction step drives data retrieval and workflow logic.
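One way the extracted entities could drive retrieval is sketched below: a small planner that decides which data sources are worth querying. This is an assumption about how the routing might work, not the system's actual logic. Entities is a dataclass stand-in for SportsEntityData, and the source names and the OUTDOOR_LEAGUES set are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entities:
    """Stand-in for SportsEntityData with the same list fields."""
    players: List[str] = field(default_factory=list)
    teams: List[str] = field(default_factory=list)
    leagues: List[str] = field(default_factory=list)
    date_references: List[str] = field(default_factory=list)
    betting_markets: List[str] = field(default_factory=list)


# Leagues played largely outdoors, where weather can matter (illustrative, not exhaustive).
OUTDOOR_LEAGUES = {"NFL", "MLB", "MLS"}


def plan_fetches(entities: Entities) -> List[str]:
    """Return the names of the data sources worth querying for these entities."""
    plan = []
    if entities.players:
        plan.append("player_stats")
    if entities.teams:
        plan.append("team_data")
    if entities.betting_markets:
        plan.append("odds")
    if OUTDOOR_LEAGUES.intersection(entities.leagues):
        plan.append("weather")
    return plan


lakers_plan = plan_fetches(Entities(
    players=["LeBron James"],
    teams=["Lakers", "Warriors"],
    leagues=["NBA"],
    betting_markets=["point spread"],
))
```

For the Lakers vs Warriors question, the plan covers player stats, team data, and odds, and skips weather because the NBA is played indoors.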

6. Workflow Orchestration

LlamaIndex's workflow engine coordinates entity extraction, data gathering, and response generation.

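from llama_index.core.workflow import Context, Event, StartEvent, StopEvent, Workflow, step
import asyncio

class EntitiesEvent(Event):
    question: str
    entities: SportsEntityData

class ContextDataEvent(Event):
    question: str
    context_data: list

class SportsAnalysisWorkflow(Workflow):
    # Steps are wired together by the event types in their signatures.
    # Assumes __init__ stores self.llm and self.vector_index, and that the
    # workflow is started with run(last_message=...).

    @step
    async def analyze_question(self, ctx: Context, ev: StartEvent) -> EntitiesEvent:
        # Async variant, so the LLM call does not block the event loop.
        entities = await self.llm.astructured_predict(
            SportsEntityData, entity_prompt, question=ev.last_message
        )
        return EntitiesEvent(question=ev.last_message, entities=entities)

    @step
    async def fetch_relevant_data(self, ctx: Context, ev: EntitiesEvent) -> ContextDataEvent:
        e = ev.entities
        tasks = []
        if e.players: tasks.append(self.fetch_player_data(e.players))
        if e.teams: tasks.append(self.fetch_team_data(e.teams))
        if self.needs_weather_data(e): tasks.append(self.fetch_weather_data(e))
        # Run the selected fetches concurrently; an empty task list yields [].
        results = await asyncio.gather(*tasks)
        return ContextDataEvent(question=ev.question, context_data=results)

    @step
    async def generate_response(self, ctx: Context, ev: ContextDataEvent) -> StopEvent:
        chat_engine = self.create_contextual_chat_engine(
            vector_retriever=self.vector_index.as_retriever(),
            live_context=ev.context_data
        )
        response = await chat_engine.achat(ev.question)
        return StopEvent(result=response)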

Workflow advantages:

Asynchronous data gathering
Conditional fetch logic
Clear step-level observability and error isolation
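The first and third advantages can be illustrated with plain asyncio, independent of any workflow engine. The sketch below builds only the tasks a query needs, runs them concurrently, and keeps a failing source from discarding the results of the others. The fetch functions are simulated stand-ins, and the weather fetch fails on purpose to show the isolation.

```python
import asyncio
from typing import Any, Dict, List


async def fetch_player_data(players: List[str]) -> Dict[str, Any]:
    await asyncio.sleep(0.01)  # simulated network latency
    return {"players": players}


async def fetch_weather_data(venue: str) -> Dict[str, Any]:
    await asyncio.sleep(0.01)
    raise TimeoutError(f"weather API timed out for {venue}")  # simulated failure


async def gather_context(players: List[str], venue: str = "") -> Dict[str, Any]:
    # Build only the tasks the query needs (conditional fetch logic).
    tasks = {}
    if players:
        tasks["players"] = fetch_player_data(players)
    if venue:
        tasks["weather"] = fetch_weather_data(venue)

    # return_exceptions=True returns failures as values instead of raising,
    # so one failing source does not discard the other results.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)

    context, errors = {}, {}
    for name, result in zip(tasks, results):
        if isinstance(result, Exception):
            errors[name] = str(result)
        else:
            context[name] = result
    return {"context": context, "errors": errors}


outcome = asyncio.run(gather_context(["LeBron James"], venue="Lambeau Field"))
```

The player data still arrives in outcome["context"], while the weather timeout is recorded in outcome["errors"] for logging or a degraded answer.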

7. Guardrails and Agent Routing

NeMo Guardrails controls which agent handles the query (sports, betting, or general).

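from nemoguardrails.actions import action

@action(is_system_action=True)
async def route_sports_query(agent_name: str, context, **kwargs):
    # Any agent name other than SPORTS or BETTING falls back to SportsAgent.
    agent_cls = {
        "SPORTS": SportsAgent,
        "BETTING": BettingAgent
    }.get(agent_name, SportsAgent)
    agent = agent_cls(context, **kwargs)

    # Return in both branches so the caller always receives the agent's output.
    if context.get("stream"):
        return await agent.stream_response()
    return await agent.generate_response()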

This layer enforces conversation discipline and prevents off-topic or unsafe responses.

8. Memory and Context Management

To maintain context without token overload, a summarized buffer memory tracks conversation history.

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import OpenAI

class SmartChatMemory:
    def __init__(self, db):
        self.db = db
        self.llm = OpenAI()
        self.max_tokens = 500  # token budget for unsummarized recent messages

    def load_context(self, thread_id):
        data = self.db.get_chat_memory(thread_id)
        return {'summary': data.get('summary', ''), 'history': data.get('history', [])}

    def update_context(self, thread_id, context):
        memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=self.max_tokens,
            moving_summary_buffer=context['summary'],
            return_messages=True
        )
        # Replay the stored history into the memory object.
        for msg in context['history']:
            if msg['role'] == 'human':
                memory.chat_memory.add_user_message(msg['content'])
            else:
                memory.chat_memory.add_ai_message(msg['content'])
        # Fold messages beyond the token budget into the running summary.
        memory.prune()
        self.db.save_chat_memory(thread_id, summary=memory.moving_summary_buffer, recent_history=memory.chat_memory.messages[-10:])

This allows continuity across sessions with minimal token cost.
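The idea behind a summary buffer can be shown without any library: keep the most recent messages verbatim while they fit a token budget, and fold older ones into a running summary. The sketch below is a simplified illustration, not LangChain's implementation. The whitespace token count and the naive_summarize function are placeholders for a real tokenizer and an LLM summarization call.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def count_tokens(text: str) -> int:
    # Crude whitespace count; a real system would use the model's tokenizer.
    return len(text.split())


def compact_history(
    summary: str,
    history: List[Message],
    max_tokens: int,
    summarize: Callable[[str, List[Message]], str],
) -> Tuple[str, List[Message]]:
    """Fold the oldest messages into the summary until the rest fits the budget."""
    recent = list(history)
    overflow: List[Message] = []
    while recent and sum(count_tokens(m["content"]) for m in recent) > max_tokens:
        overflow.append(recent.pop(0))
    if overflow:
        summary = summarize(summary, overflow)
    return summary, recent


def naive_summarize(summary: str, messages: List[Message]) -> str:
    # Placeholder for an LLM call: keep the first four words of each message.
    clipped = [" ".join(m["content"].split()[:4]) for m in messages]
    return " | ".join(filter(None, [summary] + clipped))


history = [
    {"role": "human", "content": "What are the spreads for Lakers vs Warriors tonight"},
    {"role": "ai", "content": "Lakers are favored by three and a half points"},
    {"role": "human", "content": "How has he performed overall this season"},
]
summary, recent = compact_history("", history, max_tokens=10, summarize=naive_summarize)
```

With a budget of 10 tokens, the two oldest messages are summarized and only the latest question is kept verbatim, which is the trade-off that bounds prompt size as a thread grows.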

9. Real-Time Data Integrations

The assistant integrates multiple async APIs (odds, injuries, and weather) for live context.

Example: Betting Lines

async def fetch_current_odds(self, teams, market_types):
    odds_data = []
    seen_game_ids = set()  # a game is returned once per team, so skip repeats
    for team in teams:
        # Upcoming games for this team within the next week.
        current_games = await self.odds_api.get_team_games(team, days_ahead=7)
        for game in current_games:
            if game.id in seen_game_ids:
                continue
            seen_game_ids.add(game.id)
            for market in market_types:
                line_data = await self.odds_api.get_market_odds(game.id, market)
                if line_data:
                    odds_data.append({
                        'game': game,
                        'market': market,
                        'current_line': line_data.current_line,
                        'movement': line_data.movement_trend
                    })
    return odds_data

Each integration returns normalized data objects for unified retrieval.
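A normalized data object could look like the sketch below: every integration maps its raw payload into one shared record type, so the retriever can treat odds, weather, and injury data uniformly. The LiveDataPoint class, its fields, and the raw payload keys are assumptions made for illustration; the document does not specify the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class LiveDataPoint:
    """Hypothetical common shape for every integration's output."""
    source: str           # "odds", "injury", "weather", ...
    subject: str          # team, player, or venue the record is about
    summary: str          # human-readable text handed to the LLM
    fetched_at: datetime  # lets downstream code judge freshness


def normalize_odds(raw: Dict[str, Any], fetched_at: datetime) -> LiveDataPoint:
    # Assumed raw payload keys: "team", "market", "line".
    return LiveDataPoint(
        source="odds",
        subject=raw["team"],
        summary=f"{raw['team']} {raw['market']}: {raw['line']:+.1f}",
        fetched_at=fetched_at,
    )


def normalize_weather(raw: Dict[str, Any], fetched_at: datetime) -> LiveDataPoint:
    # Assumed raw payload keys: "venue", "temp_f", "wind_mph".
    return LiveDataPoint(
        source="weather",
        subject=raw["venue"],
        summary=f"{raw['venue']}: {raw['temp_f']}F, wind {raw['wind_mph']} mph",
        fetched_at=fetched_at,
    )


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
odds_point = normalize_odds({"team": "Lakers", "market": "point spread", "line": -3.5}, now)
weather_point = normalize_weather({"venue": "Lambeau Field", "temp_f": 28, "wind_mph": 12}, now)
```

Carrying fetched_at on every record makes it possible to drop or flag stale data before it reaches the prompt.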

10. Database Strategy

Purpose	Database	Key Optimization
Vector search	PostgreSQL + pgvector	High-dimensional embedding search
Live sports data	PostgreSQL	Optimized for write-heavy real-time feeds
User memory	PostgreSQL	Fast summaries and history management
Caching	Redis	Short-TTL caching for API rate control

Each store is tuned for its own access pattern, and keeping them separate means a failure or slowdown in one does not take down the others.
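The caching row can be illustrated with a small time-to-live cache. The sketch below is an in-memory stand-in for Redis, written only to show the pattern: serve repeated lookups from the cache and call the upstream API again only after the entry expires. The class and function names are hypothetical, and the injectable clock exists purely to make expiry easy to test.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, value)


def cached_fetch(cache: TTLCache, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached value, calling fetch() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()  # only hit the upstream API on a miss
        cache.set(key, value)
    return value
```

A short TTL suits odds data, where a line that is 30 seconds old is usually acceptable but one that is 10 minutes old may not be.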

11. Streaming Architecture

Responses stream in real-time via server-sent events (SSE), enabling partial delivery during processing.

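async def stream_chat_response(self, context):
    # stream_events() is exposed by the handler that run() returns.
    # Passing context.last_message as the workflow input is an assumption.
    handler = self.sports_workflow.run(last_message=context.last_message)
    async for event in handler.stream_events():
        if isinstance(event, ChatOutputEvent) and event.has_content():
            yield f"data: {event.text}\n\n"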

Streaming improves perceived responsiveness: the user sees the answer as it is generated instead of waiting for the full response.
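One detail worth handling when framing SSE messages is multi-line text: each line of the payload needs its own "data:" field, and a blank line terminates the message. The sketch below shows that framing with a simulated chunk source. The format_sse helper, the fake_llm_chunks generator, and the "[DONE]" end-of-stream marker are assumptions for illustration, not part of the system described here.

```python
import asyncio
from typing import AsyncIterator, List


def format_sse(text: str, event: str = "") -> str:
    """Frame a text chunk as one SSE message.

    Each payload line gets its own "data:" field; a blank line ends the message.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.extend(f"data: {line}" for line in text.split("\n"))
    return "\n".join(lines) + "\n\n"


async def sse_stream(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    async for chunk in chunks:
        if chunk:  # skip empty chunks
            yield format_sse(chunk)
    yield format_sse("[DONE]", event="done")  # assumed end-of-stream convention


async def fake_llm_chunks() -> AsyncIterator[str]:
    # Simulated model output, including an empty chunk and a multi-line chunk.
    for chunk in ["Lakers -3.5", "", "Line moved\nfrom -2.5"]:
        yield chunk


async def collect() -> List[str]:
    return [message async for message in sse_stream(fake_llm_chunks())]


messages = asyncio.run(collect())
```

An explicit final event gives the client a reliable signal to close the connection rather than inferring completion from a dropped socket.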

12. Configuration Management

Configuration is loaded in layers, with each later source overriding the one before it, which keeps environment-specific settings and secrets out of the code:

Defaults (code-level)
config.json
AWS SSM parameters
Local .env overrides

class SmartConfig:
    def __init__(self):
        self.config = {}
        self.load_defaults()
        self.load_from_file("config.json")
        self.load_from_ssm()
        self.load_from_env()
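The load order above implies a merge in which later layers win. A minimal sketch of such a merge follows, assuming each layer has already been read into a plain dictionary. The merge_layers function and all keys and values are hypothetical, and nested dictionaries are merged key by key so that overriding one setting does not erase its siblings.

```python
from typing import Any, Dict, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge config layers in order; later layers win, nested dicts merge key by key."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                # Copy mappings so the input layers are never mutated.
                merged[key] = dict(value) if isinstance(value, Mapping) else value
    return merged


defaults = {"llm": {"model": "gpt-4", "temperature": 0.2}, "cache_ttl": 30}
file_config = {"llm": {"temperature": 0.0}}
ssm_params = {"odds_api_key": "from-ssm"}  # placeholder, not a real secret
env_overrides = {"cache_ttl": 5}

config = merge_layers(defaults, file_config, ssm_params, env_overrides)
```

In this example the file layer changes only the temperature, the SSM layer adds a secret, and the local override shortens the cache TTL, while the default model name is preserved.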

13. Observability

Telemetry is captured at each workflow stage:

Latency (p50/p95)
Retrieval hit rate
Token and cost usage
Error and retry counts

Structured logs are exported to a centralized monitoring stack.
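As an illustration of the latency metrics, the sketch below computes p50 and p95 from recorded stage timings using the nearest-rank method. The sample latencies are invented, and a real deployment would more likely rely on its monitoring stack's histogram support than on hand-rolled percentiles.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_stage(latencies_ms: List[float], errors: int) -> Dict[str, float]:
    """Roll up one workflow stage's timings into the metrics listed above."""
    return {
        "count": len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "error_rate": errors / len(latencies_ms),
    }


# Hypothetical latencies for the retrieval stage, in milliseconds.
retrieval_latencies = [120, 95, 110, 400, 105, 98, 130, 101, 99, 115]
stats = summarize_stage(retrieval_latencies, errors=1)
```

The single 400 ms outlier leaves p50 at 105 ms but sets p95 to 400 ms, which is why tail percentiles are tracked alongside the median.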

14. Key Insights

Hybrid RAG ensures responses combine historical context with live accuracy.
Entity extraction drives precise retrieval and data scoping.
Asynchronous workflows minimize latency during multi-source fetches.
Streaming enhances responsiveness and engagement.
Separate databases simplify scaling, tuning, and resilience.

15. Summary

The Sports AI Assistant integrates semantic retrieval, structured entity understanding, and real-time data orchestration into a single LLM-driven architecture. Its modular design — built on FastAPI, LlamaIndex, and PostgreSQL — supports low-latency, contextually accurate sports conversations that feel both intelligent and live.

Brian Wight