Building a Sports AI Assistant: Architecture Walkthrough

This document provides a technical walkthrough of the Sports AI Assistant — a retrieval-augmented generation (RAG) system designed to handle complex, real-time sports queries using large language models (LLMs), semantic retrieval, and live data integrations.

1. Overview

The assistant combines domain-specific knowledge with live sports data to produce contextual answers. It integrates structured retrieval, real-time APIs, and workflow orchestration to answer questions about games, players, betting lines, and conditions with both precision and freshness.

System Diagram

User
  ↓
API Router (FastAPI) ──▶ Guardrails (NeMo)
  ↓                         │
  └─────────────▶ Agent Router (Sports/Betting/General)
                            │
                            ▼
                    Orchestrator (LlamaIndex Workflow)
                            │
         ┌─────────── Hybrid Retrieval ─────────────┐
         │      (Vector KB + Real-time Sources)     │
         │  pgvector  + odds/injury/weather APIs    │
         └──────────────────┬───────────────────────┘
                            ▼
                    Contextual Chat Engine (LLM)
                            │
                          Memory
                            │
                         Streaming
                            │
                           UI

2. Technology Stack

Layer         | Technology                            | Purpose
--------------|---------------------------------------|----------------------------------------------------
API Framework | FastAPI                               | Async web server for request handling and streaming
Embeddings    | HuggingFace (BAAI/bge-small-en-v1.5)  | Fast, high-recall semantic understanding
Orchestration | LlamaIndex                            | Workflow and RAG coordination
Database      | PostgreSQL + pgvector                 | Semantic vector storage and similarity search
LLM           | OpenAI GPT-4                          | Reasoning and language generation engine
Guardrails    | NeMo Guardrails                       | Conversation control and intent routing
Memory        | LangChain Buffer Memory               | Context summarization and recall
Integrations  | Custom async APIs                     | Live sports, odds, injury, and weather data

3. API Layer

The system exposes two primary endpoints — one for standard JSON responses and one for real-time streaming. FastAPI manages context injection, CORS, and async execution.

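from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

# ChatRequest, AssistantContext, and assistant are defined elsewhere in the project
@app.post("/api/chat/handle_chat/")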
async def handle_chat(request: ChatRequest):
    context = AssistantContext.from_request(request)
    response = await assistant.chat(context)
    return JSONResponse({'response': response.content})

@app.post("/api/chat/stream_chat/")
async def stream_chat(request: ChatRequest):
    context = AssistantContext.from_request(request)
    return StreamingResponse(
        assistant.stream_chat(context),
        media_type="text/event-stream"
    )

Design Note: Embeddings and model configurations are initialized globally at startup to ensure consistency across all sessions.

4. Retrieval-Augmented Generation (RAG)

The RAG pipeline combines stored sports knowledge with live data feeds to produce contextually grounded answers.

4.1 Vector Knowledge Base

Structured data such as team histories, player bios, and game metadata is stored semantically in PostgreSQL using pgvector.

from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import VectorStoreIndex

def build_vector_index() -> VectorStoreIndex:
    vector_store = PGVectorStore.from_params(
        database="sports_knowledge",
        host="localhost",
        user="assistant_user",
        password="secure_password",
        table_name="sports_embeddings_v2",
        embed_dim=384
    )
    return VectorStoreIndex.from_vector_store(vector_store)
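
For intuition about what a similarity search over embeddings does, here is a minimal pure-Python sketch that ranks toy vectors by cosine similarity. It is not how pgvector is implemented (pgvector runs the search inside PostgreSQL), and the three-dimensional vectors, document ids, and the `top_k` helper are illustrative assumptions; the real embeddings in this system have 384 dimensions.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (hypothetical data, 3 dimensions instead of 384)
store = {
    "lakers_history": [0.9, 0.1, 0.0],
    "warriors_roster": [0.1, 0.9, 0.0],
    "nfl_rules": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], store, k=2)
```

A brute-force scan like this is linear in the number of stored vectors, which is why a database-side index matters once the knowledge base grows.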

4.2 Hybrid Retriever

The retriever merges vector search results with live context from APIs (odds, weather, injury data).

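from typing import List

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle

class HybridSportsRetriever(BaseRetriever):
    def __init__(self, vector_retriever, live_data_sources):
        super().__init__()  # set up BaseRetriever internals such as the callback manager
        self.vector_retriever = vector_retriever
        self.live_data_sources = live_data_sources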

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_results = self.vector_retriever.retrieve(query_bundle)
        live_context = []

        for source in self.live_data_sources:
            if source.is_relevant_for_query(query_bundle.query_str):
                data = source.fetch_current_data()
                live_context.append(self.format_as_node(data))

        return vector_results + live_context

This design allows the model to reference current sports data without retraining or re-indexing.
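
The retriever relies on each live source deciding whether it applies to a query. One simple way that check could work is keyword matching, sketched below. The `KeywordLiveSource` class, its trigger words, and the sample payloads are hypothetical; only the method names `is_relevant_for_query` and `fetch_current_data` come from the retriever code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

@dataclass
class KeywordLiveSource:
    """Hypothetical live source that opts in when the query mentions a trigger word."""
    name: str
    keywords: Set[str]
    fetcher: Callable[[], Dict[str, Any]]

    def is_relevant_for_query(self, query: str) -> bool:
        # Lowercase each word and strip surrounding punctuation before matching
        words = {w.strip(".,?!'\"").lower() for w in query.split()}
        return not self.keywords.isdisjoint(words)

    def fetch_current_data(self) -> Dict[str, Any]:
        # Tag the payload with its origin so it can be cited in the prompt
        return {"source": self.name, **self.fetcher()}

def gather_live_context(query: str, sources: List[KeywordLiveSource]) -> List[Dict[str, Any]]:
    """Fetch only from the sources that consider themselves relevant."""
    return [s.fetch_current_data() for s in sources if s.is_relevant_for_query(query)]

# Stand-in fetchers returning made-up sample values
sources = [
    KeywordLiveSource("odds", {"spread", "spreads", "odds", "line"}, lambda: {"spread": -3.5}),
    KeywordLiveSource("weather", {"weather", "rain", "wind"}, lambda: {"wind_mph": 12}),
]
live = gather_live_context("What are the current point spreads for tonight?", sources)
```

Keyword matching is cheap but brittle; a production system might instead use the extracted entities (for example `betting_markets`) to decide which sources to call.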

5. Entity Extraction

Every query is first parsed into structured entities (players, teams, leagues, markets, and date references). This enables precision filtering and routing downstream.

from pydantic import BaseModel
from typing import List, Optional

class SportsEntityData(BaseModel):
    players: List[str] = []
    teams: List[str] = []
    leagues: List[str] = []
    date_references: List[str] = []
    betting_markets: List[str] = []
    condensed_question: Optional[str] = None

Example: Question Processing

Input Question: "What are the current point spreads for tonight's Lakers vs Warriors game, and how has LeBron James performed against Golden State this season?"

Extracted Entities:

SportsEntityData(
    players=["LeBron James"],
    teams=["Lakers", "Warriors", "Los Angeles Lakers", "Golden State Warriors"],
    leagues=["NBA"],
    date_references=["tonight", "this season"],
    betting_markets=["point spread"],
    condensed_question="Current point spreads for Lakers vs Warriors tonight and LeBron's performance vs Warriors this season"
)

Context-Aware Follow-up Processing

The system also handles conversational context. Consider this follow-up question:

Follow-up Question: "How has he performed overall this season?"

Without Context: This question lacks essential information - who is "he"?

With Conversation History: The system references the previous question to create a complete query.

Condensed Question Output:

SportsEntityData(
    players=["LeBron James"],  # Resolved from conversation context
    teams=["Lakers"],  # LeBron's current team from context
    leagues=["NBA"],
    date_references=["this season"],
    betting_markets=[],
    condensed_question="How has LeBron James performed overall this season with the Lakers"
)

This context resolution happens before entity extraction, ensuring downstream processes receive complete, actionable queries regardless of conversational ambiguity.

The output of this extraction step drives data retrieval and workflow logic.
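
In the first example the extracted `teams` list contains both short and full names for the same franchise. Before using such a list as a filter, a downstream step would probably want one canonical name per team. The sketch below shows one way to do that; the alias table and `canonicalize_teams` helper are assumptions for illustration, since the original does not describe how aliases are handled.

```python
from typing import Dict, List

# Hypothetical alias table; a real system would likely load this from the knowledge base
TEAM_ALIASES: Dict[str, str] = {
    "lakers": "Los Angeles Lakers",
    "los angeles lakers": "Los Angeles Lakers",
    "warriors": "Golden State Warriors",
    "golden state": "Golden State Warriors",
    "golden state warriors": "Golden State Warriors",
}

def canonicalize_teams(teams: List[str]) -> List[str]:
    """Map aliases to canonical names, dropping duplicates but keeping first-seen order."""
    seen = set()
    canonical = []
    for team in teams:
        # Unknown names pass through unchanged rather than being dropped
        name = TEAM_ALIASES.get(team.strip().lower(), team)
        if name not in seen:
            seen.add(name)
            canonical.append(name)
    return canonical

teams = canonicalize_teams(
    ["Lakers", "Warriors", "Los Angeles Lakers", "Golden State Warriors"]
)
```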

6. Workflow Orchestration

LlamaIndex's workflow engine coordinates entity extraction, data gathering, and response generation.

from llama_index.core.workflow import Workflow, step, Context
import asyncio

class SportsAnalysisWorkflow(Workflow):

    @step
    async def analyze_question(self, ctx: Context, ev):
        llm = ctx.data["llm"]
        # Use the async variant so the LLM call does not block the event loop
        entities = await llm.astructured_predict(SportsEntityData, entity_prompt, question=ev.last_message)
        ctx.data["entities"] = entities
        return {"entities": entities}

    @step
    async def fetch_relevant_data(self, ctx: Context, ev):
        e = ev["entities"]
        tasks = []
        if e.players: tasks.append(self.fetch_player_data(e.players))
        if e.teams: tasks.append(self.fetch_team_data(e.teams))
        if self.needs_weather_data(e): tasks.append(self.fetch_weather_data(e))
        results = await asyncio.gather(*tasks)
        return {"context_data": results}

    @step
    async def generate_response(self, ctx: Context, ev):
        chat_engine = self.create_contextual_chat_engine(
            vector_retriever=ctx.data["vector_index"].as_retriever(),
            live_context=ev["context_data"]
        )
        response = await chat_engine.achat(ctx.data["user_question"])
        return {"result": response}

Workflow advantages:

  • Asynchronous data gathering
  • Conditional fetch logic
  • Clear step-level observability and error isolation
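
The first two advantages, plus error isolation, can be illustrated with plain `asyncio`. The sketch below is a simplified stand-in for the fetch step, not the workflow's actual code: the fetch functions are stubs, and one of them deliberately fails to show that the other result is still usable.

```python
import asyncio
from typing import Any, Dict, List

async def fetch_player_data(players: List[str]) -> Dict[str, Any]:
    await asyncio.sleep(0)  # stand-in for a network call
    return {"kind": "players", "items": players}

async def fetch_team_data(teams: List[str]) -> Dict[str, Any]:
    await asyncio.sleep(0)
    raise TimeoutError("team API timed out")  # simulate one failing source

async def gather_context(players: List[str], teams: List[str]) -> Dict[str, List[Any]]:
    # Conditional fetch logic: only schedule calls the entities justify
    tasks = []
    if players:
        tasks.append(fetch_player_data(players))
    if teams:
        tasks.append(fetch_team_data(teams))
    # return_exceptions=True hands failures back as values instead of raising
    # the first one, so the successful results remain available
    results = await asyncio.gather(*tasks, return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return {"ok": ok, "failed": failed}

outcome = asyncio.run(gather_context(["LeBron James"], ["Lakers"]))
```

With this pattern the response step can still answer from partial data and log the failed sources, rather than failing the whole request.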

7. Guardrails and Agent Routing

NeMo Guardrails controls which agent handles the query (sports, betting, or general).

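from nemoguardrails.actions import action

@action(is_system_action=True)
async def route_sports_query(agent_name: str, context, **kwargs):
    # Unknown agent names fall back to the sports agent
    agent_cls = {
        "SPORTS": SportsAgent,
        "BETTING": BettingAgent
    }.get(agent_name, SportsAgent)
    agent = agent_cls(context, **kwargs)

    # Return the agent's result in both modes so the caller always gets a value
    if context["stream"]:
        return await agent.stream_response()
    return await agent.generate_response()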

This layer enforces conversation discipline and prevents off-topic or unsafe responses.

8. Memory and Context Management

To maintain context without token overload, a summarized buffer memory tracks conversation history.

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import OpenAI

class SmartChatMemory:
    def __init__(self, db):
        self.db = db
        self.llm = OpenAI()
        self.max_tokens = 500

    def load_context(self, thread_id):
        data = self.db.get_chat_memory(thread_id)
        return {'summary': data.get('summary', ''), 'history': data.get('history', [])}

    def update_context(self, thread_id, context):
        memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=self.max_tokens,
            moving_summary_buffer=context['summary'],
            return_messages=True
        )
        for msg in context['history']:
            if msg['role'] == 'human':
                memory.chat_memory.add_user_message(msg['content'])
            else:
                memory.chat_memory.add_ai_message(msg['content'])
        # Fold messages beyond max_token_limit into the running summary
        memory.prune()
        self.db.save_chat_memory(
            thread_id,
            summary=memory.moving_summary_buffer,
            recent_history=memory.chat_memory.messages[-10:]
        )

This allows continuity across sessions with minimal token cost.
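
To make the summary-buffer idea concrete without any LLM dependency, here is a small sketch of the trimming logic: the oldest messages are moved into a summary until the remaining history fits a token budget. The whitespace token counter and the `fake_summarize` placeholder are assumptions for illustration; the real memory class uses the model's tokenizer and an LLM call for summarization.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]

def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real system would use the model's tokenizer."""
    return len(text.split())

def trim_to_budget(
    summary: str,
    history: List[Message],
    max_tokens: int,
    summarize: Callable[[str, List[Message]], str],
) -> Tuple[str, List[Message]]:
    """Move the oldest messages into the summary until the rest fits the budget."""
    kept = list(history)
    overflow: List[Message] = []
    while kept and sum(count_tokens(m["content"]) for m in kept) > max_tokens:
        overflow.append(kept.pop(0))
    if overflow:
        summary = summarize(summary, overflow)
    return summary, kept

def fake_summarize(summary: str, messages: List[Message]) -> str:
    # Placeholder for an LLM call: it only records who spoke
    roles = ", ".join(m["role"] for m in messages)
    prefix = summary + " " if summary else ""
    return prefix + f"[{len(messages)} earlier message(s): {roles}]"

history = [
    {"role": "human", "content": "What are the spreads for Lakers vs Warriors tonight"},
    {"role": "ai", "content": "Lakers are favored by three and a half points"},
    {"role": "human", "content": "How has he performed overall this season"},
]
summary, recent = trim_to_budget("", history, max_tokens=16, summarize=fake_summarize)
```

Here the first message is folded into the summary and the two most recent messages are kept verbatim, which is what lets a follow-up such as "he" still be resolved from recent history.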

9. Real-Time Data Integrations

The assistant integrates multiple async APIs for live context:

Example: Betting Lines

async def fetch_current_odds(self, teams, market_types):
    odds_data = []
    for team in teams:
        current_games = await self.odds_api.get_team_games(team, days_ahead=7)
        for game in current_games:
            for market in market_types:
                line_data = await self.odds_api.get_market_odds(game.id, market)
                if line_data:
                    odds_data.append({
                        'game': game,
                        'market': market,
                        'current_line': line_data.current_line,
                        'movement': line_data.movement_trend
                    })
    return odds_data

Each integration returns normalized data objects for unified retrieval.
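
What a "normalized data object" looks like is not specified in the original, so the following is only one plausible shape: a small dataclass that every integration could map its raw payload into. The `LiveDataRecord` class, its field names, and the sample odds values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict

@dataclass(frozen=True)
class LiveDataRecord:
    """Hypothetical common shape for odds, injury, and weather payloads."""
    source: str
    subject: str
    payload: Dict[str, Any]
    fetched_at: datetime

    def as_context_text(self) -> str:
        # Sort keys so the rendered text is stable across calls
        details = ", ".join(f"{k}={v}" for k, v in sorted(self.payload.items()))
        return (
            f"[{self.source}] {self.subject}: {details} "
            f"(as of {self.fetched_at:%Y-%m-%d %H:%M} UTC)"
        )

def normalize_odds(raw: Dict[str, Any], fetched_at: datetime) -> LiveDataRecord:
    """Map one raw odds entry into the shared record shape."""
    return LiveDataRecord(
        source="odds",
        subject=raw["game"],
        payload={
            "market": raw["market"],
            "line": raw["current_line"],
            "movement": raw["movement"],
        },
        fetched_at=fetched_at,
    )

# Made-up sample entry, shaped like the dicts built in fetch_current_odds
record = normalize_odds(
    {"game": "Lakers vs Warriors", "market": "point spread",
     "current_line": -3.5, "movement": "down"},
    fetched_at=datetime(2025, 1, 15, 19, 30, tzinfo=timezone.utc),
)
text = record.as_context_text()
```

Carrying a `fetched_at` timestamp on every record lets the prompt state how fresh each piece of live data is.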

10. Database Strategy

Purpose          | Database              | Key Optimization
-----------------|-----------------------|------------------------------------------
Vector search    | PostgreSQL + pgvector | High-dimensional embedding search
Live sports data | PostgreSQL            | Optimized for write-heavy real-time feeds
User memory      | PostgreSQL            | Fast summaries and history management
Caching          | Redis                 | Short-TTL caching for API rate control

Each store is tuned for its own access pattern, and keeping the stores separate isolates failures.
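
The short-TTL caching row can be illustrated with an in-memory stand-in. The sketch below is not Redis client code; `TTLCache` and `cached_fetch` are hypothetical helpers that mimic the set-with-expiry and get pattern, with an injectable clock so expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLCache:
    """In-memory stand-in for a short-TTL cache such as Redis."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)

def cached_fetch(cache: TTLCache, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached value if fresh, otherwise call the upstream API once."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value)
    return value

now = [0.0]   # fake clock, advanced manually
calls = []    # records each upstream call

def fetch_odds() -> Dict[str, float]:
    calls.append(1)
    return {"spread": -3.5}  # made-up sample value

cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])
cached_fetch(cache, "odds:lakers", fetch_odds)  # miss: calls upstream
cached_fetch(cache, "odds:lakers", fetch_odds)  # hit: served from cache
now[0] = 31.0
cached_fetch(cache, "odds:lakers", fetch_odds)  # expired: calls upstream again
```

Three requests result in two upstream calls, which is how a short TTL keeps the assistant under third-party rate limits while bounding staleness.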

11. Streaming Architecture

Responses stream in real-time via server-sent events (SSE), enabling partial delivery during processing.

async def stream_chat_response(self, context):
    async for event in self.sports_workflow.stream_events():
        if isinstance(event, ChatOutputEvent) and event.has_content():
            yield f"data: {event.text}\n\n"

Streaming improves perceived responsiveness: users see the first tokens while the rest of the answer is still being generated, instead of waiting on one long blocking request.
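
One detail of the SSE wire format is worth showing: a payload that contains newlines has to be sent with a `data:` prefix on every line, and a blank line ends the event. The sketch below demonstrates that framing with a stub token stream; `format_sse`, `fake_token_stream`, and the sample chunks are assumptions for illustration, not the assistant's actual code.

```python
import asyncio
from typing import AsyncIterator, List

def format_sse(text: str) -> str:
    """Frame text as one SSE event; each payload line gets its own 'data:' prefix."""
    lines = text.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"

async def fake_token_stream() -> AsyncIterator[str]:
    # Stand-in for workflow output events
    for chunk in ["Lakers -3.5", "line one\nline two"]:
        await asyncio.sleep(0)
        yield chunk

async def sse_frames() -> List[str]:
    """Collect the framed events that would be written to the response."""
    return [format_sse(chunk) async for chunk in fake_token_stream()]

frames = asyncio.run(sse_frames())
```

Without the per-line prefix, a chunk containing a blank line would be read by the client as the end of the event, truncating the message.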

12. Configuration Management

Configuration is layered to keep environment settings separate from secrets. Sources are loaded in this order:

  1. Defaults (code-level)
  2. config.json
  3. AWS SSM parameters
  4. Local .env overrides

class SmartConfig:
    def __init__(self):
        self.config = {}
        self.load_defaults()
        self.load_from_file("config.json")
        self.load_from_ssm()
        self.load_from_env()
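
The `load_*` methods are not shown, so the merge behavior is an assumption. A natural reading of the load order is that each later source overrides keys set by earlier ones. Under that assumption, the layering reduces to a left-to-right dictionary merge, sketched below with made-up keys and values.

```python
from typing import Any, Dict, Mapping

def merge_layers(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge config layers left to right; later layers override earlier ones."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Hypothetical values for each layer
defaults = {"llm_model": "gpt-4", "max_tokens": 500, "log_level": "INFO"}
file_config = {"log_level": "WARNING"}
ssm_params = {"db_host": "db.internal.example"}
env_overrides = {"log_level": "DEBUG"}

config = merge_layers(defaults, file_config, ssm_params, env_overrides)
```

In this example `log_level` ends up as `DEBUG` because the local override is applied last, while keys that only one layer defines pass through untouched.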

13. Observability

Telemetry is captured at each workflow stage:

  • Latency (p50/p95)
  • Retrieval hit rate
  • Token and cost usage
  • Error and retry counts

Structured logs are exported to a centralized monitoring stack.
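
For the latency metrics, a percentile can be computed from raw samples in a few lines. The sketch below uses the nearest-rank definition, which is one of several common conventions; the sample latencies are made up, and a monitoring stack would normally compute these figures itself.

```python
import math
from typing import List

def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds, with one slow outlier
latencies_ms = [120, 95, 310, 150, 101, 98, 2200, 130, 110, 140]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Tracking p95 alongside p50 matters here because a single slow upstream API shows up in the tail long before it moves the median.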

14. Key Insights

  • Hybrid RAG ensures responses combine historical context with live accuracy.
  • Entity extraction drives precise retrieval and data scoping.
  • Asynchronous workflows minimize latency during multi-source fetches.
  • Streaming enhances responsiveness and engagement.
  • Separate databases simplify scaling, tuning, and resilience.

15. Summary

The Sports AI Assistant integrates semantic retrieval, structured entity understanding, and real-time data orchestration into a single LLM-driven architecture. Its modular design — built on FastAPI, LlamaIndex, and PostgreSQL — supports low-latency, contextually accurate sports conversations that feel both intelligent and live.

Brian Wight

Technical leader and entrepreneur focused on building scalable systems and high-performing teams. Passionate about ownership culture, data-driven decision making, and turning complex problems into simple solutions.