> arize
You are an expert in Arize and its open-source Phoenix library for AI observability. You help developers monitor LLM applications with tracing, evaluation, embedding analysis, drift detection, and retrieval quality metrics — using Phoenix for local development (open-source, self-hosted) and Arize platform for production monitoring at scale.
Arize (Phoenix) — AI Observability Platform
Core Capabilities
Phoenix Local Setup
import phoenix as px
from phoenix.otel import register
# Launch Phoenix locally (browser UI on localhost:6006)
px.launch_app()
# Register as OpenTelemetry trace provider
tracer_provider = register(project_name="my-llm-app")
# Auto-instrument OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# Now all OpenAI calls are traced
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain CRDT to a junior dev"}],
)
# Open localhost:6006 — see traces, latency, tokens, cost
RAG Evaluation
from phoenix.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.evals.models import OpenAIModel
eval_model = OpenAIModel(model="gpt-4o")
# Evaluate RAG quality on your traces
hallucination_eval = HallucinationEvaluator(eval_model)
qa_eval = QAEvaluator(eval_model)
relevance_eval = RelevanceEvaluator(eval_model)
# Pull traces from Phoenix
traces_df = px.Client().get_spans_dataframe(
    filter_condition="span_kind == 'LLM'",
)
# Run evaluations
results = run_evals(
    dataframe=traces_df,
    evaluators=[hallucination_eval, qa_eval, relevance_eval],
    provide_explanation=True,
)
# Results: per-trace hallucination scores, QA accuracy, retrieval relevance
# All visible in Phoenix UI with explanations
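To make the scores actually show up next to each trace in the Phoenix UI, they need to be logged back via `log_evaluations`. A minimal sketch, assuming `results` is the list of DataFrames that `run_evals` returns (one per evaluator, in order); the `pair_evals` helper is ours for illustration, not part of the Phoenix API:

```python
def pair_evals(names, frames):
    # One display name per evaluator result, in the order passed to run_evals.
    if len(names) != len(frames):
        raise ValueError("expected one name per result DataFrame")
    return list(zip(names, frames))


def log_to_phoenix(named_results):
    # Deferred import keeps pair_evals usable without Phoenix installed.
    import phoenix as px
    from phoenix.trace import SpanEvaluations

    client = px.Client()
    for name, df in named_results:
        client.log_evaluations(SpanEvaluations(eval_name=name, dataframe=df))


# e.g.:
# log_to_phoenix(pair_evals(
#     ["Hallucination", "QA Correctness", "Relevance"], results))
```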
Embedding Analysis
import phoenix as px
import pandas as pd
# Analyze embedding drift and clustering
embeddings_df = pd.DataFrame({
    "text": documents,
    "embedding": embeddings,  # one numpy array per row
    "category": categories,
})
# Launch with embedding visualization
session = px.launch_app(
    primary=px.Inferences(
        embeddings_df,
        schema=px.Schema(
            embedding_feature_column_names={
                "text_embedding": px.EmbeddingColumnNames(
                    vector_column_name="embedding",
                    raw_data_column_name="text",
                ),
            },
            tag_column_names=["category"],
        ),
    ),
)
# UMAP visualization in browser — see clusters, outliers, drift
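Drift only becomes visible with a second dataset to compare against. A sketch under the assumption that you keep a baseline frame from a known-good period: passing it as `reference` alongside the `primary` production frame overlays both in the UMAP view, so points drifting away from the baseline cloud stand out. `make_frame` and `compare_in_phoenix` are illustrative helpers, not Phoenix APIs:

```python
import numpy as np
import pandas as pd


def make_frame(texts, vectors):
    # One row per document: raw text plus its embedding vector.
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must align")
    return pd.DataFrame(
        {"text": texts, "embedding": [np.asarray(v) for v in vectors]}
    )


def compare_in_phoenix(baseline_df, prod_df):
    # Deferred import keeps make_frame usable without Phoenix installed.
    import phoenix as px

    schema = px.Schema(
        embedding_feature_column_names={
            "text_embedding": px.EmbeddingColumnNames(
                vector_column_name="embedding",
                raw_data_column_name="text",
            ),
        },
    )
    # reference = known-good baseline; primary = current production data
    return px.launch_app(
        primary=px.Inferences(prod_df, schema=schema, name="production"),
        reference=px.Inferences(baseline_df, schema=schema, name="baseline"),
    )
```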
Production Monitoring (Arize Platform)
import os

from arize.pandas.logger import Client
from arize.utils.types import ModelTypes, Environments

arize_client = Client(
    space_key=os.environ["ARIZE_SPACE_KEY"],
    api_key=os.environ["ARIZE_API_KEY"],
)
# Log predictions for monitoring
arize_client.log(
    dataframe=predictions_df,
    model_id="support-chatbot-v2",
    model_version="2.1.0",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.PRODUCTION,
    schema=arize_schema,  # arize.utils.types.Schema mapping your columns
)
# Arize platform: drift detection, performance dashboards, alerting
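The `arize_schema` passed to log() maps DataFrame columns to the fields Arize expects. A hedged sketch for a generative model — the column names here (`prediction_id`, `prompt`, `response`) are assumptions about your frame, and `missing_columns` is our own pre-flight check, not part of the SDK:

```python
import pandas as pd

# Assumed column names in your predictions DataFrame.
REQUIRED_COLUMNS = ["prediction_id", "prompt", "response"]


def missing_columns(df):
    # Pre-flight check: which expected columns does the frame lack?
    return [c for c in REQUIRED_COLUMNS if c not in df.columns]


def build_schema():
    # Deferred import so the check above runs without the arize SDK.
    from arize.utils.types import Schema

    return Schema(
        prediction_id_column_name="prediction_id",
        prompt_column_names="prompt",
        response_column_names="response",
    )
```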
Installation
pip install arize-phoenix # Open-source local
pip install arize # Arize platform client
pip install openinference-instrumentation-openai # Auto-instrumentation
Best Practices
- Phoenix for dev — Run locally with px.launch_app(); free, open-source, no data leaves your machine
- Auto-instrumentation — Use OpenInference instrumentors for OpenAI, LangChain, LlamaIndex; zero code changes
- RAG evaluations — Run hallucination + relevance + QA evals on production traces; catch quality regressions
- Embedding viz — Use UMAP visualization to find clusters, outliers, and distribution drift in your data
- OpenTelemetry native — Phoenix is an OTLP collector; integrates with existing observability stacks
- Arize for production — Scale to millions of traces; automated drift detection and alerting
- LLM-as-judge — Built-in evaluators use a judge model (e.g., GPT-4o) to score hallucination and relevance, and can return explanations
- Trace filtering — Filter by span kind, model, latency, error; drill into problematic traces
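Because Phoenix accepts OTLP, a service already instrumented with the vanilla OpenTelemetry SDK can export to it with no Phoenix-specific code. A sketch assuming a local px.launch_app() instance listening on port 6006; adjust the endpoint for a self-hosted deployment:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Phoenix's OTLP/HTTP endpoint when launched via px.launch_app().
exporter = OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Spans emitted by any OTel-instrumented code now land in Phoenix.
```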