GentBench

Benchmark and Evaluation for Gentopia.

A good ALM benchmark should target problems and tasks that are hard or unsolvable for plain LLMs; otherwise there is no point in paying the tool tax.

Tasks in GentBench are half public and half private. We open-source a demonstrative public benchmark to encourage agent tuning, but use a private benchmark (drawn from a similar distribution) for fair evaluation.


Eval Tasks

We are still expanding! Check ./gentbench/benchmark/public/ for some available public benchmarks.
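Since the public tasks ship with the repository as plain files, a quick way to see what is available is to list that directory. A minimal sketch using only the standard library; the exact layout of files under ./gentbench/benchmark/public/ is an assumption, not a documented contract:

from pathlib import Path

# Hypothetical helper: list the public benchmark files checked out locally under
# ./gentbench/benchmark/public/ (adjust the path to match the actual repo layout).
def list_public_benchmarks(root: str = "gentbench/benchmark/public") -> None:
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Print paths relative to the benchmark root, grouped by task category
            print(path.relative_to(root))

if __name__ == "__main__":
    list_public_benchmarks()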

Reasoning
 - Math 
 - Coding
 - Planning
 - Commonsense

Knowledge
 - World_Knowledge (Offline)
 - Domain_Specific_Knowledge (Offline)
 - Web_Retrieval (Online)

Safety
 - Integrity (Jailbreaks, Ethics, ..)
 - Harmlessness (Toxicity, Bias, ..)

Multilingual
 - Translation
 - Understanding

Robustness
 - Consistency
 - Resilience (self-correction when presented with false information)

Memory 
 - Context Length
 - Accuracy

Efficiency
 - Token usage
 - Run time
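To make the efficiency metrics concrete, here is a minimal sketch of how run time and token usage could be measured around a single agent call. The agent interface (agent.run(prompt) returning text) and the use of tiktoken to approximate token counts are assumptions for illustration, not the GentBench implementation:

import time
import tiktoken  # OpenAI tokenizer, used here only to approximate token counts

def measure_efficiency(agent, prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    # Hypothetical agent interface: agent.run(prompt) -> str (assumption)
    enc = tiktoken.encoding_for_model(model)

    start = time.perf_counter()
    output = agent.run(prompt)
    elapsed = time.perf_counter() - start

    # Approximate token usage as prompt tokens + completion tokens
    tokens = len(enc.encode(prompt)) + len(enc.encode(str(output)))
    return {"run_time_s": elapsed, "token_usage": tokens, "output": output}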

Eval pipeline run

from gentopia.assembler import AgentAssembler
from gentbench.eval import EvalPipeline
import dotenv

# Store OPENAI_API_KEY in a .env file and load it into the environment
dotenv.load_dotenv(".env")

# Build the eval pipeline from its config and assemble the agent under test
eval_pipeline = EvalPipeline(eval_config="gentbench/eval/config/eval_config.yaml")
agent = AgentAssembler(file="gentbench/eval/config/chatgpt.yaml").get_agent()

# Run the configured benchmark tasks against the agent
eval_pipeline.run_eval(agent)
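The same pipeline can be reused to compare several agents. A minimal sketch that reuses only the calls shown above; the second config path is a hypothetical placeholder for your own agent spec:

# Hypothetical list of agent configs to compare
agent_configs = [
    "gentbench/eval/config/chatgpt.yaml",
    "configs/my_custom_agent.yaml",  # assumption: your own agent config
]

for config in agent_configs:
    agent = AgentAssembler(file=config).get_agent()
    eval_pipeline.run_eval(agent)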
