SAP HANA workload triage from noisy telemetry
Problem
HANA already logs sessions, statements, and waits in detail. Even so, operators drowned in alerts and took too long to find the root cause, and governance weakened whenever people acted on a half-formed theory.
Solution
A Python pipeline that extracts features from SQL and SQLScript, combines rules with confidence-ranked root-cause guesses, spells out impact, and runs remediations only from an allowlist, with human approval and rollback. If an LLM is plugged in, it only polishes the narrative text. It does not decide what is safe or what ranks first.
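The detect-and-rank step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the feature names, the two rules, the confidence values, and the function names `extract_features` and `rank_causes` are hypothetical and do not come from the project. The point it shows is that ranking comes from rule output alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cause: str
    confidence: float  # 0..1, assigned by a rule, never by an LLM


def extract_features(sql: str) -> dict:
    """Derive a few cheap lexical features from one statement (hypothetical set)."""
    # Normalise case and whitespace, then pad so keyword checks match whole words.
    text = " " + " ".join(sql.upper().split()) + " "
    return {
        "has_select_star": "SELECT *" in text,
        "join_count": text.count(" JOIN "),
        "has_where": " WHERE " in text,
    }


def rank_causes(features: dict) -> list:
    """Apply illustrative rules and return candidates, highest confidence first."""
    candidates = []
    if features["join_count"] >= 1 and not features["has_where"]:
        candidates.append(Candidate("unfiltered_join", 0.7))
    if features["has_select_star"]:
        candidates.append(Candidate("wide_projection", 0.4))
    # Ties break on the cause name so the order is deterministic.
    return sorted(candidates, key=lambda c: (-c.confidence, c.cause))
```

For example, `rank_causes(extract_features("select * from a join b on a.id = b.id"))` puts `unfiltered_join` ahead of `wide_projection`. A real extractor would need a proper SQL parser; substring checks are only enough to show the shape.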
Evaluation & metrics
- Offline eval: precision@k on root-cause ranking, false-alert checks, zero tolerance for allowlist violations, narrative completeness against a rubric
- scripts/eval_run.py writes baseline vs improved profiles (e.g. alert dedup) to reports/eval-baseline-vs-v2.md
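As a reference for the first metric, here is a minimal sketch of precision@k using the standard definition (hits in the top k divided by k). The function name and the example cause labels are hypothetical, and this is not necessarily how scripts/eval_run.py computes it.

```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k ranked causes that are labelled as true causes."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for cause in ranked[:k] if cause in relevant)
    # Dividing by k (not by len(ranked[:k])) penalises short rankings.
    return hits / k


ranked = ["lock_wait", "memory_pressure", "plan_regression"]
relevant = {"lock_wait", "plan_regression"}
score = precision_at_k(ranked, relevant, k=2)  # one hit in the top two
```

With these inputs, one of the top two guesses is correct, so `score` is 0.5.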
Security & reliability
No open-ended auto-fix. High-risk steps need a human. Decisions and actions append to an audit log. Secrets stay in .env or BTP destinations, not in git.
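A minimal sketch of the gate and the audit trail, under stated assumptions: the allowlist is shown as a Python dict standing in for the YAML file, the action names, risk levels, and rollback names are hypothetical, and only high-risk entries are assumed to require a named approver.

```python
import io
import json
from typing import Optional, TextIO

# Hypothetical allowlist entries; the real ones live in the allowlist YAML.
ALLOWLIST = {
    "collect_plan_trace": {"risk": "low", "rollback": "stop_plan_trace"},
    "cancel_session": {"risk": "high", "rollback": "none_manual_review"},
}


def decide(action: str, approved_by: Optional[str] = None) -> str:
    """Return 'rejected', 'needs_approval', or 'allowed' for a proposed action."""
    entry = ALLOWLIST.get(action)
    if entry is None:
        return "rejected"  # anything off the allowlist never runs
    if entry["risk"] == "high" and approved_by is None:
        return "needs_approval"
    return "allowed"


def record_decision(log: TextIO, action: str, approved_by: Optional[str]) -> str:
    """Decide, then append one JSON line to the audit log (append-only by convention)."""
    decision = decide(action, approved_by)
    record = {"action": action, "approved_by": approved_by, "decision": decision}
    log.write(json.dumps(record, sort_keys=True) + "\n")
    return decision


audit_log = io.StringIO()  # stands in for a file opened in append mode
record_decision(audit_log, "cancel_session", None)
record_decision(audit_log, "cancel_session", "dba_on_call")
```

Writing the record for every decision, including rejections, keeps the log useful for the zero-tolerance check on allowlist violations. A production log would likely also carry a timestamp and the finding that triggered the action.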
System architecture
Telemetry (HANA / staging / CSV fixtures)
→ Ingest → store
→ Detect → rank (rules + confidence) → impact
→ Plan actions (allowlist YAML)
→ Safety (approval gate, rollback)
→ Narrative (template-first) → audit log
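The template-first narrative stage might be sketched as follows. The template wording, the field names, and the `polish` hook are assumptions for illustration; the hook stands in for an optional LLM call and can only rephrase text that the template has already fixed.

```python
from string import Template
from typing import Callable, Optional

# Hypothetical template; facts are filled in before any polishing happens.
NARRATIVE = Template(
    "Root cause: $cause (confidence $confidence). "
    "Impact: $impact. Proposed action: $action."
)


def render_narrative(
    finding: dict, polish: Optional[Callable[[str], str]] = None
) -> str:
    """Fill the template, then optionally let a polish function rephrase it."""
    text = NARRATIVE.substitute(
        cause=finding["cause"],
        confidence=f"{finding['confidence']:.2f}",
        impact=finding["impact"],
        action=finding["action"],
    )
    if polish is None:
        return text
    polished = polish(text)
    # Fall back to the template if the polished text lost the cause or action.
    if finding["cause"] in polished and finding["action"] in polished:
        return polished
    return text


finding = {
    "cause": "unfiltered_join",
    "confidence": 0.7,
    "impact": "long-running statement on a shared schema",
    "action": "collect_plan_trace",
}
plain = render_narrative(finding)
```

Because ranking and the chosen action are already fixed in `finding` before `polish` runs, a bad rewrite can at worst be discarded in favour of the template text.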