# Python
Anomstack supports Python as a data source for your metrics. This allows you to create custom data ingestion logic using Python's rich ecosystem of libraries.
## Configuration
Configure Python in your metric batch's `config.yaml`:
```yaml
metric_batch: "your_metric_batch_name"
table_key: "your_table_key"
ingest_cron_schedule: "45 6 * * *" # When to run the ingestion
ingest_fn: >
  {% include "./path/to/your/python/file.py" %}
```
## Default Configuration
Many configuration parameters can be set in `metrics/defaults/defaults.yaml` to apply across all metric batches. Key defaults include:
```yaml
db: "duckdb"                        # Default database type
table_key: "metrics"                # Default table name
ingest_cron_schedule: "*/3 * * * *" # Default ingestion schedule
model_path: "local://./models"      # Default model storage location
alert_methods: "email,slack"        # Default alert methods
```
You can override any of these defaults in your metric batch's configuration file.
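For example, a batch-level `config.yaml` could override just the database and schedule while inheriting everything else (the batch name and values here are illustrative):

```yaml
metric_batch: "my_custom_batch"      # hypothetical batch name
db: "bigquery"                       # overrides the "duckdb" default
ingest_cron_schedule: "0 * * * *"    # hourly, overriding "*/3 * * * *"
```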
## Customizing Default Templates
Anomstack uses several default templates for preprocessing, SQL queries, and other operations. You can customize these by modifying the files in:
- **Python Templates** (`metrics/defaults/python/`):
  - `preprocess.py`: Customize how metrics are preprocessed before anomaly detection (see the sketch after this section)
  - Add your own Python functions for custom processing
- **SQL Templates** (`metrics/defaults/sql/`):
  - `train.sql`: SQL for training data preparation
  - `score.sql`: SQL for scoring data preparation
  - `alerts.sql`: SQL for alert generation
  - `change.sql`: SQL for change detection
  - `plot.sql`: SQL for metric visualization
  - `llmalert.sql`: SQL for LLM-based alerts
  - `dashboard.sql`: SQL for dashboard data
  - `delete.sql`: SQL for data cleanup
  - `summary.sql`: SQL for summary reports
To use custom templates, modify the corresponding files in these directories. The changes will apply to all metric batches unless overridden in specific batch configurations.
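As an illustration, a customized `preprocess.py` might smooth each metric before anomaly detection. This is a minimal sketch that assumes the preprocess function takes and returns a pandas DataFrame with the standard metric columns; check the shipped `metrics/defaults/python/preprocess.py` for the exact signature Anomstack expects.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative customization: smooth each metric with a rolling mean.

    Sketch only -- mirror the signature of the shipped preprocess.py.
    """
    df = df.sort_values("metric_timestamp")
    # Replace each value with a 5-point rolling mean, computed per metric
    df["metric_value"] = df.groupby("metric_name")["metric_value"].transform(
        lambda s: s.rolling(window=5, min_periods=1).mean()
    )
    return df
```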
## Example: HackerNews Top Stories
Here's a complete example that fetches metrics from HackerNews top stories:
```python
import pandas as pd
import requests


def ingest(top_n=10) -> pd.DataFrame:
    # Hacker News API endpoint for top stories
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"

    # Get top story IDs
    response = requests.get(url)
    story_ids = response.json()[:top_n]

    # Calculate metrics
    min_score = float("inf")
    max_score = 0
    total_score = 0
    for story_id in story_ids:
        story_url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        story = requests.get(story_url).json()
        score = story.get("score", 0)
        min_score = min(min_score, score)
        max_score = max(max_score, score)
        total_score += score
    avg_score = total_score / len(story_ids)

    # Create DataFrame with metrics
    data = [
        [f"hn_top_{top_n}_min_score", min_score],
        [f"hn_top_{top_n}_max_score", max_score],
        [f"hn_top_{top_n}_avg_score", avg_score],
        [f"hn_top_{top_n}_total_score", total_score],
    ]
    df = pd.DataFrame(data, columns=["metric_name", "metric_value"])
    df["metric_timestamp"] = pd.Timestamp.utcnow()
    return df
```
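Before wiring the file into a metric batch, you can sanity-check it locally (assuming `pandas` and `requests` are installed):

```python
if __name__ == "__main__":
    # Quick local check: fetch 5 stories and print the resulting metrics
    print(ingest(top_n=5))
```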
## How It Works
1. **Configuration**:

   ```yaml
   metric_batch: "hn_top_stories_scores"
   table_key: "metrics_hackernews"
   ingest_cron_schedule: "45 6 * * *"
   ingest_fn: >
     {% include "./examples/hackernews/hn_top_stories_scores.py" %}
   ```

2. **Function Definition**: The `ingest()` function takes a `top_n` parameter to specify how many top stories to analyze.

3. **Data Collection**:
   - Fetches top story IDs from the HackerNews API
   - Retrieves details for each story
   - Calculates min, max, average, and total scores

4. **DataFrame Creation**:
   - Creates a DataFrame with the required columns: `metric_name`, `metric_value`, and `metric_timestamp`
   - Each metric is a separate row in the DataFrame
   - Timestamps are in UTC
## Best Practices
- Return a pandas DataFrame with required columns
- Include proper error handling
- Use type hints for better code clarity
- Document your functions
- Handle API rate limits and timeouts
- Use environment variables for sensitive data
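For instance, the per-story fetch in the example above could be hardened with a timeout and explicit error handling. This is a sketch of the pattern, not part of the shipped example; `fetch_story` is a hypothetical helper.

```python
import requests


def fetch_story(story_id: int, timeout: float = 10.0) -> dict:
    """Hypothetical helper: fetch one story, tolerating slow or failed calls."""
    url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    try:
        response = requests.get(url, timeout=timeout)  # avoid hanging forever
        response.raise_for_status()  # surface 4xx/5xx as exceptions
        return response.json() or {}
    except requests.RequestException:
        # Skip this story rather than failing the whole ingest run
        return {}
```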
## Required DataFrame Structure
Your Python function must return a pandas DataFrame with these columns:
- `metric_name`: String identifier for the metric
- `metric_value`: Numeric value of the metric
- `metric_timestamp`: UTC timestamp of when the metric was collected
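If you want to catch schema mistakes early, a small check at the end of your ingest function can help (a hypothetical helper, not part of Anomstack):

```python
import pandas as pd


def validate_metrics_df(df: pd.DataFrame) -> None:
    """Hypothetical check that a DataFrame matches the expected structure."""
    required = {"metric_name", "metric_value", "metric_timestamp"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if not pd.api.types.is_numeric_dtype(df["metric_value"]):
        raise ValueError("metric_value must be numeric")
    if not pd.api.types.is_datetime64_any_dtype(df["metric_timestamp"]):
        raise ValueError("metric_timestamp must be a datetime column")
```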
## Limitations
- The Python environment must have all required dependencies installed
- Function execution time limits apply, so keep ingestion fast
- Memory usage matters for large result sets
- External services may impose API rate limits