Building a Sports AI Assistant: Architecture Walkthrough

This document provides a technical walkthrough of the Sports AI Assistant — a retrieval-augmented generation (RAG) system designed to handle complex, real-time sports queries using large language models (LLMs), semantic retrieval, and live data integrations.
Table of Contents
- Overview
- Technology Stack
- API Layer
- Retrieval-Augmented Generation (RAG)
- Entity Extraction
- Workflow Orchestration
- Guardrails and Agent Routing
- Memory and Context Management
- Real-Time Data Integrations
- Database Strategy
- Streaming Architecture
- Configuration Management
- Observability
- Key Insights
- Summary
1. Overview
The assistant combines domain-specific knowledge with live sports data to produce contextual answers. It integrates structured retrieval, real-time APIs, and workflow orchestration to answer questions about games, players, betting lines, and conditions with both precision and freshness.
System Diagram
User
  ↓
API Router (FastAPI) ──▶ Guardrails (NeMo)
  ↓                          │
  └─────────────▶ Agent Router (Sports/Betting/General)
                             │
                             ▼
            Orchestrator (LlamaIndex Workflow)
                             │
          ┌─────────── Hybrid Retrieval ─────────────┐
          │     (Vector KB + Real-time Sources)      │
          │   pgvector + odds/injury/weather APIs    │
          └──────────────────┬───────────────────────┘
                             ▼
               Contextual Chat Engine (LLM)
                             │
                          Memory
                             │
                         Streaming
                             │
                            UI
2. Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI | Async web server for request handling and streaming |
| Embeddings | HuggingFace (BAAI/bge-small-en-v1.5) | Compact 384-dimensional sentence embeddings for semantic search |
| Orchestration | LlamaIndex | Workflow and RAG coordination |
| Database | PostgreSQL + pgvector | Semantic vector storage and similarity search |
| LLM | OpenAI GPT-4 | Reasoning and language generation engine |
| Guardrails | NeMo Guardrails | Conversation control and intent routing |
| Memory | LangChain Buffer Memory | Context summarization and recall |
| Integrations | Custom async APIs | Live sports, odds, injury, and weather data |
3. API Layer
The system exposes two primary endpoints — one for standard JSON responses and one for real-time streaming. FastAPI manages context injection, CORS, and async execution.
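from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

# ChatRequest, AssistantContext, and assistant are defined elsewhere in the project.
@app.post("/api/chat/handle_chat/")
async def handle_chat(request: ChatRequest):
    context = AssistantContext.from_request(request)
    response = await assistant.chat(context)
    return JSONResponse({"response": response.content})


@app.post("/api/chat/stream_chat/")
async def stream_chat(request: ChatRequest):
    context = AssistantContext.from_request(request)
    # text/event-stream tells the client to treat the body as server-sent events.
    return StreamingResponse(
        assistant.stream_chat(context),
        media_type="text/event-stream",
    )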
Design Note: Embeddings and model configurations are initialized globally at startup to ensure consistency across all sessions.
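Both handlers begin by calling AssistantContext.from_request. The walkthrough does not show that class, so the following is only a minimal sketch of what such a conversion might look like. The field names (thread_id, message, stream, history) are assumptions made for illustration, and plain dataclasses stand in for the Pydantic models a FastAPI application would normally use.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ChatRequest:
    # Hypothetical request shape; the real model likely carries more fields.
    thread_id: str
    message: str
    stream: bool = False
    history: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class AssistantContext:
    thread_id: str
    last_message: str
    stream: bool
    history: List[Dict[str, str]]
    metadata: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_request(cls, request: ChatRequest) -> "AssistantContext":
        # Normalize once here so every downstream step sees a clean question.
        return cls(
            thread_id=request.thread_id,
            last_message=request.message.strip(),
            stream=request.stream,
            history=list(request.history),  # copy so handlers cannot mutate the request
        )
```

Centralizing the conversion in one classmethod keeps the two endpoints identical up to the point where they choose between a JSON response and a stream.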
4. Retrieval-Augmented Generation (RAG)
The RAG pipeline combines stored sports knowledge with live data feeds to produce contextually grounded answers.
4.1 Vector Knowledge Base
Structured data such as team histories, player bios, and game metadata is stored semantically in PostgreSQL using pgvector.
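from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore


def build_vector_index() -> VectorStoreIndex:
    vector_store = PGVectorStore.from_params(
        database="sports_knowledge",
        host="localhost",
        port="5432",  # default PostgreSQL port; adjust for your deployment
        user="assistant_user",
        password="secure_password",  # placeholder; load real credentials from configuration
        table_name="sports_embeddings_v2",
        embed_dim=384,  # must match the output size of BAAI/bge-small-en-v1.5
    )
    # Wrap the existing table as an index without re-embedding its rows.
    return VectorStoreIndex.from_vector_store(vector_store)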
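To make the similarity search concrete, here is a small pure-Python sketch of ranking stored embeddings by cosine similarity, one of the distance measures pgvector supports. It is an illustration of the idea only, not how pgvector is implemented: the real search runs inside PostgreSQL over 384-dimensional vectors, while the toy rows below use three dimensions and made-up labels.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    # Score every stored row against the query and keep the k best matches.
    scored = [(text, cosine_similarity(query, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: labels and vectors are invented for the example.
rows = [
    ("team history", [1.0, 0.0, 0.0]),
    ("player bio", [0.0, 1.0, 0.0]),
    ("game metadata", [1.0, 1.0, 0.0]),
]
matches = top_k([1.0, 0.0, 0.0], rows, k=2)
```

A real deployment would rely on a pgvector index rather than scoring every row, since a linear scan like this grows with the size of the table.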
4.2 Hybrid Retriever
The retriever merges vector search results with live context from APIs (odds, weather, injury data).
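from typing import List

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle


class HybridSportsRetriever(BaseRetriever):
    def __init__(self, vector_retriever, live_data_sources):
        super().__init__()  # initializes BaseRetriever's own state
        self.vector_retriever = vector_retriever
        self.live_data_sources = live_data_sources

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_results = self.vector_retriever.retrieve(query_bundle)
        # Live data is fetched only from sources that judge the query relevant.
        live_context = []
        for source in self.live_data_sources:
            if source.is_relevant_for_query(query_bundle.query_str):
                data = source.fetch_current_data()
                # format_as_node (defined elsewhere) wraps the payload as a NodeWithScore.
                live_context.append(self.format_as_node(data))
        return vector_results + live_context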
This design allows the model to reference current sports data without retraining or re-indexing.
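The retriever relies on each live source answering is_relevant_for_query, which the walkthrough does not define. One simple way to implement it is keyword matching, sketched below under that assumption. The KeywordLiveSource class, its keyword lists, and the stubbed fetch functions are all hypothetical; a production system might use the extracted entities or a classifier instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class KeywordLiveSource:
    name: str
    keywords: Tuple[str, ...]
    fetch: Callable[[], Dict[str, object]]

    def is_relevant_for_query(self, query: str) -> bool:
        # Case-insensitive substring match against the source's keywords.
        lowered = query.lower()
        return any(keyword in lowered for keyword in self.keywords)

    def fetch_current_data(self) -> Dict[str, object]:
        return self.fetch()


def collect_live_context(
    query: str, sources: List[KeywordLiveSource]
) -> List[Dict[str, object]]:
    # Only relevant sources are called, so irrelevant APIs cost nothing.
    context = []
    for source in sources:
        if source.is_relevant_for_query(query):
            context.append({"source": source.name, "data": source.fetch_current_data()})
    return context


# Stubbed sources; real ones would call the odds and weather APIs.
fetch_log: List[str] = []


def fetch_odds() -> Dict[str, object]:
    fetch_log.append("odds")
    return {"status": "stubbed odds payload"}


def fetch_weather() -> Dict[str, object]:
    fetch_log.append("weather")
    return {"status": "stubbed weather payload"}


sources = [
    KeywordLiveSource("odds", ("spread", "odds", "moneyline"), fetch_odds),
    KeywordLiveSource("weather", ("weather", "rain", "wind"), fetch_weather),
]
live = collect_live_context("What is the point spread tonight?", sources)
```

Substring matching is cheap but coarse, so treat it as a first filter rather than a final relevance decision.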
5. Entity Extraction
Every query is first parsed into structured entities (players, teams, leagues, markets, and date references). This enables precision filtering and routing downstream.
from typing import List, Optional

from pydantic import BaseModel


class SportsEntityData(BaseModel):
    players: List[str] = []
    teams: List[str] = []
    leagues: List[str] = []
    date_references: List[str] = []
    betting_markets: List[str] = []
    condensed_question: Optional[str] = None
Example: Question Processing
Input Question: "What are the current point spreads for tonight's Lakers vs Warriors game, and how has LeBron James performed against Golden State this season?"
Extracted Entities:
SportsEntityData(
    players=["LeBron James"],
    teams=["Lakers", "Warriors", "Los Angeles Lakers", "Golden State Warriors"],
    leagues=["NBA"],
    date_references=["tonight", "this season"],
    betting_markets=["point spread"],
    condensed_question="Current point spreads for Lakers vs Warriors tonight and LeBron's performance vs Warriors this season"
)
Context-Aware Follow-up Processing
The system also handles conversational context. Consider this follow-up question:
Follow-up Question: "How has he performed overall this season?"
Without Context: The question is incomplete on its own, because "he" has no referent.
With Conversation History: The system references the previous question to create a complete query.
Condensed Question Output:
SportsEntityData(
    players=["LeBron James"],  # Resolved from conversation context
    teams=["Lakers"],  # LeBron's current team from context
    leagues=["NBA"],
    date_references=["this season"],
    betting_markets=[],
    condensed_question="How has LeBron James performed overall this season with the Lakers"
)
This context resolution happens before entity extraction, ensuring downstream processes receive complete, actionable queries regardless of conversational ambiguity.
The output of this extraction step drives data retrieval and workflow logic.
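In the first example, the extracted teams list contains both short and full names for the same club ("Lakers" and "Los Angeles Lakers"). Before those values are used for filtering, it can help to collapse them to one canonical name each. The sketch below shows one way to do that. The alias table and the canonical_teams helper are hypothetical and not part of the system described here.

```python
from typing import Dict, Iterable, List

# Hypothetical alias table; a real system would load this from its knowledge base.
TEAM_ALIASES: Dict[str, str] = {
    "lakers": "Los Angeles Lakers",
    "los angeles lakers": "Los Angeles Lakers",
    "warriors": "Golden State Warriors",
    "golden state": "Golden State Warriors",
    "golden state warriors": "Golden State Warriors",
}


def canonical_teams(mentions: Iterable[str]) -> List[str]:
    # Map each mention to its canonical name, preserving first-seen order.
    seen = set()
    result = []
    for mention in mentions:
        cleaned = mention.strip()
        name = TEAM_ALIASES.get(cleaned.lower(), cleaned)  # unknown names pass through
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result


teams = canonical_teams(
    ["Lakers", "Warriors", "Los Angeles Lakers", "Golden State Warriors"]
)
```

With canonical names, a downstream fetch step issues one request per team instead of one per spelling.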
6. Workflow Orchestration
LlamaIndex's workflow engine coordinates entity extraction, data gathering, and response generation.
import asyncio

from llama_index.core.workflow import Context, Event, StartEvent, StopEvent, Workflow, step


class EntitiesEvent(Event):
    question: str
    entities: SportsEntityData


class ContextDataEvent(Event):
    question: str
    context_data: list


class SportsAnalysisWorkflow(Workflow):
    # self.llm and self.vector_index are assumed to be attached at construction time.

    @step
    async def analyze_question(self, ctx: Context, ev: StartEvent) -> EntitiesEvent:
        # run(last_message=...) exposes the user's message on the StartEvent.
        entities = await self.llm.astructured_predict(
            SportsEntityData, entity_prompt, question=ev.last_message
        )
        return EntitiesEvent(question=ev.last_message, entities=entities)

    @step
    async def fetch_relevant_data(self, ctx: Context, ev: EntitiesEvent) -> ContextDataEvent:
        e = ev.entities
        tasks = []
        if e.players:
            tasks.append(self.fetch_player_data(e.players))
        if e.teams:
            tasks.append(self.fetch_team_data(e.teams))
        if self.needs_weather_data(e):
            tasks.append(self.fetch_weather_data(e))
        results = await asyncio.gather(*tasks)  # independent fetches run concurrently
        return ContextDataEvent(question=ev.question, context_data=list(results))

    @step
    async def generate_response(self, ctx: Context, ev: ContextDataEvent) -> StopEvent:
        chat_engine = self.create_contextual_chat_engine(
            vector_retriever=self.vector_index.as_retriever(),
            live_context=ev.context_data,
        )
        response = await chat_engine.achat(ev.question)
        return StopEvent(result=response)
Workflow advantages:
- Asynchronous data gathering
- Conditional fetch logic
- Clear step-level observability and error isolation
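The error-isolation point can be illustrated with plain asyncio. In the sketch below, which is an assumption about how isolation could be achieved rather than a description of the actual workflow, asyncio.gather is called with return_exceptions=True so that one failing source produces an error entry instead of cancelling the whole fetch. The gather_named helper and the two fake fetchers are hypothetical, and the returned values are placeholders.

```python
import asyncio
from typing import Any, Awaitable, Dict


async def gather_named(tasks: Dict[str, Awaitable[Any]]) -> Dict[str, Any]:
    names = list(tasks)
    # return_exceptions=True returns failures as values instead of raising.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    context: Dict[str, Any] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            context[name] = {"error": str(result)}
        else:
            context[name] = result
    return context


async def fake_player_stats() -> dict:
    await asyncio.sleep(0)  # stands in for a network call
    return {"status": "stubbed player payload"}


async def fake_weather() -> dict:
    raise TimeoutError("weather API timed out")


async def main() -> Dict[str, Any]:
    return await gather_named(
        {"player": fake_player_stats(), "weather": fake_weather()}
    )
```

Calling asyncio.run(main()) yields the player payload alongside an error entry for weather, so the response step can still answer with partial context.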
7. Guardrails and Agent Routing
NeMo Guardrails controls which agent handles the query (sports, betting, or general).
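from nemoguardrails.actions import action


@action(is_system_action=True)
async def route_sports_query(agent_name: str, context, **kwargs):
    # Unrecognized agent names, including the general case, fall back to SportsAgent.
    agent_cls = {
        "SPORTS": SportsAgent,
        "BETTING": BettingAgent,
    }.get(agent_name, SportsAgent)
    agent = agent_cls(context, **kwargs)

    if context.get("stream"):
        return await agent.stream_response()
    return await agent.generate_response()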
This layer enforces conversation discipline and helps block off-topic or unsafe responses.
8. Memory and Context Management
To maintain context without token overload, a summarized buffer memory tracks conversation history.
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import OpenAI


class SmartChatMemory:
    def __init__(self, db):
        self.db = db
        self.llm = OpenAI()
        self.max_tokens = 500

    def load_context(self, thread_id):
        # Treat a missing record as an empty conversation.
        data = self.db.get_chat_memory(thread_id) or {}
        return {"summary": data.get("summary", ""), "history": data.get("history", [])}

    def update_context(self, thread_id, context):
        memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=self.max_tokens,
            moving_summary_buffer=context["summary"],
            return_messages=True,
        )
        for msg in context["history"]:
            if msg["role"] == "human":
                memory.chat_memory.add_user_message(msg["content"])
            else:
                memory.chat_memory.add_ai_message(msg["content"])
        # Fold messages beyond max_token_limit into the running summary.
        memory.prune()
        self.db.save_chat_memory(
            thread_id,
            summary=memory.moving_summary_buffer,
            recent_history=memory.chat_memory.messages[-10:],
        )
This allows continuity across sessions with minimal token cost.
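The idea behind a summarized buffer can be shown without any library: keep recent messages verbatim while they fit a token budget, and fold anything older into a running summary. The sketch below is a simplified stand-in for that behavior, not LangChain's implementation. The word-count tokenizer and the naive_summarize placeholder are assumptions; a real system would use the model's tokenizer and ask the LLM to write the summary.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def naive_summarize(summary: str, dropped: List[Message]) -> str:
    # Placeholder: a real system would ask the LLM to fold these turns into the summary.
    note = f"[{len(dropped)} earlier message(s) summarized]"
    return f"{summary} {note}".strip()


def trim_history(
    history: List[Message],
    summary: str,
    max_tokens: int,
    summarize: Callable[[str, List[Message]], str] = naive_summarize,
) -> Tuple[str, List[Message]]:
    recent = list(history)
    overflow: List[Message] = []
    # Drop the oldest messages until the remaining ones fit the budget.
    while recent and sum(count_tokens(m["content"]) for m in recent) > max_tokens:
        overflow.append(recent.pop(0))
    if overflow:
        summary = summarize(summary, overflow)
    return summary, recent


history = [
    {"role": "human", "content": "one two three"},
    {"role": "ai", "content": "four five"},
    {"role": "human", "content": "six"},
]
new_summary, kept = trim_history(history, summary="", max_tokens=3)
```

Here the first message is folded into the summary and the two most recent messages are kept verbatim, which keeps the prompt bounded no matter how long the thread runs.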
9. Real-Time Data Integrations
The assistant integrates multiple async APIs for live context:
Example: Betting Lines
async def fetch_current_odds(self, teams, market_types):
    odds_data = []
    seen_game_ids = set()  # a matchup is listed under both teams; report it once
    for team in teams:
        current_games = await self.odds_api.get_team_games(team, days_ahead=7)
        for game in current_games:
            if game.id in seen_game_ids:
                continue
            seen_game_ids.add(game.id)
            for market in market_types:
                line_data = await self.odds_api.get_market_odds(game.id, market)
                if line_data:
                    odds_data.append({
                        "game": game,
                        "market": market,
                        "current_line": line_data.current_line,
                        "movement": line_data.movement_trend,
                    })
    return odds_data
Each integration returns normalized data objects for unified retrieval.
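As an illustration of what normalization could look like, the sketch below maps two differently shaped provider payloads onto one OddsLine record. Everything in it is hypothetical: the providers, their field names (gameId, marketType, handicap, and so on), and the sample values are invented for the example and do not describe any real odds API.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class OddsLine:
    game_id: str
    market: str
    line: float
    source: str


def from_provider_a(payload: Dict[str, Any]) -> OddsLine:
    # Hypothetical provider A: flat payload with camelCase keys.
    return OddsLine(
        game_id=str(payload["gameId"]),
        market=payload["marketType"].lower(),
        line=float(payload["line"]),
        source="provider_a",
    )


def from_provider_b(payload: Dict[str, Any]) -> OddsLine:
    # Hypothetical provider B: nested event object and upper-snake-case markets.
    return OddsLine(
        game_id=str(payload["event"]["id"]),
        market=payload["market"].lower().replace("_", " "),
        line=float(payload["handicap"]),
        source="provider_b",
    )


# Sample payloads with made-up values.
line_a = from_provider_a({"gameId": 42, "marketType": "Point Spread", "line": "-3.5"})
line_b = from_provider_b({"event": {"id": "42"}, "market": "POINT_SPREAD", "handicap": -3.5})
```

Once both feeds produce the same record type, the retriever can format and compare them without knowing which provider supplied the data.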
10. Database Strategy
| Purpose | Database | Key Optimization |
|---|---|---|
| Vector search | PostgreSQL + pgvector | High-dimensional embedding search |
| Live sports data | PostgreSQL | Optimized for write-heavy real-time feeds |
| User memory | PostgreSQL | Fast summaries and history management |
| Caching | Redis | Short-TTL caching for API rate control |
Each store is tuned for its own access pattern, and keeping them separate isolates failures so that a problem in one does not cascade to the others.
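The short-TTL caching row can be sketched with an in-memory stand-in for Redis. The TTLCache class below is hypothetical and keeps entries in a local dictionary; in the architecture described here, Redis would hold the entries and handle expiry itself. The clock is injectable so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: skip the upstream API call
        value = fetch()
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock and a counter in place of a live API.
fake_now = [0.0]
api_calls = []


def fetch_odds() -> int:
    api_calls.append("call")
    return len(api_calls)


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
first = cache.get_or_fetch("odds:game-1", fetch_odds)
fake_now[0] = 10.0
second = cache.get_or_fetch("odds:game-1", fetch_odds)  # served from cache
fake_now[0] = 31.0
third = cache.get_or_fetch("odds:game-1", fetch_odds)  # expired, fetched again
```

A short TTL bounds how stale a betting line can be while still collapsing bursts of identical requests into one upstream call.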
11. Streaming Architecture
Responses stream in real-time via server-sent events (SSE), enabling partial delivery during processing.
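async def stream_chat_response(self, context):
    # Start the workflow, then relay its output events as they arrive.
    # Assumes the context exposes the latest user message as last_message.
    handler = self.sports_workflow.run(last_message=context.last_message)
    async for event in handler.stream_events():
        if isinstance(event, ChatOutputEvent) and event.has_content():
            # Each SSE frame is "data: <payload>" followed by a blank line.
            yield f"data: {event.text}\n\n"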
Streaming improves perceived responsiveness: users see the first tokens while the rest of the answer is still being generated, instead of waiting on one long blocking request.
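One detail of the SSE wire format is worth spelling out: an event ends at the first blank line, so a payload that itself contains newlines must be split across several data: fields. The helper below is a minimal sketch of that framing. The name format_sse and the optional event parameter are illustrative and do not come from the system described here.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each payload line gets its own "data:" field; a bare newline would end the event early.
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


single = format_sse("Lakers lead by 4")
multi = format_sse("line one\nline two", event="token")
```

On the client side, the browser's EventSource joins consecutive data: fields with newlines, so the multi-line payload arrives intact.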
12. Configuration Management
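Configuration is layered so that environment settings and secrets are managed separately. Layers load in the following order, with each later layer overriding the ones before it: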
- Defaults (code-level)
- config.json
- AWS SSM parameters
- Local .env overrides
class SmartConfig:
    def __init__(self):
        self.config = {}
        self.load_defaults()
        self.load_from_file("config.json")
        self.load_from_ssm()
        self.load_from_env()
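The loader methods themselves are not shown, so the following sketch only illustrates the merge rule the load order implies: later layers win, and nested sections are merged key by key rather than replaced wholesale. The merge_layers function and the sample keys and values are assumptions made for this example.

```python
from typing import Any, Dict, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, Mapping):
                # Merge nested sections recursively so one override keeps sibling keys.
                base = merged.get(key)
                merged[key] = merge_layers(base if isinstance(base, dict) else {}, value)
            else:
                merged[key] = value  # later layers override earlier ones
    return merged


# Hypothetical layers in load order: defaults, config.json, SSM, local .env.
defaults = {"llm": {"model": "gpt-4", "temperature": 0.2}, "debug": False}
from_file = {"llm": {"temperature": 0.0}}
from_ssm = {"db_password": "value-from-ssm"}
from_env = {"debug": True}

config = merge_layers(defaults, from_file, from_ssm, from_env)
```

Because each layer is merged into a fresh dictionary, the defaults are left untouched and can be reused, for example in tests.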
13. Observability
Telemetry is captured at each workflow stage:
- Latency (p50/p95)
- Retrieval hit rate
- Token and cost usage
- Error and retry counts
Structured logs are exported to a centralized monitoring stack.
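For the latency metrics, p50 and p95 can be computed from recorded samples with a nearest-rank percentile, sketched below. This is one common definition among several, and monitoring stacks often use interpolated or approximate percentiles instead. The sample latencies are invented for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    # Nearest-rank method: the smallest sample with at least pct% of values at or below it.
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds, including one slow outlier.
latencies_ms = [120, 80, 200, 95, 150, 110, 3000, 105, 90, 130]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

The gap between the two values is the useful signal: the median stays low while p95 exposes the slow outlier that an average would blur.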
14. Key Insights
- Hybrid RAG ensures responses combine historical context with live accuracy.
- Entity extraction drives precise retrieval and data scoping.
- Asynchronous workflows minimize latency during multi-source fetches.
- Streaming enhances responsiveness and engagement.
- Separate databases simplify scaling, tuning, and resilience.
15. Summary
The Sports AI Assistant integrates semantic retrieval, structured entity understanding, and real-time data orchestration into a single LLM-driven architecture. Its modular design — built on FastAPI, LlamaIndex, and PostgreSQL — supports low-latency, contextually accurate sports conversations that feel both intelligent and live.
Brian Wight
Technical leader and entrepreneur focused on building scalable systems and high-performing teams. Passionate about ownership culture, data-driven decision making, and turning complex problems into simple solutions.