Overview
CRAB (Cross-environment Agent Benchmark) is a framework for evaluating multimodal language model agents across multiple computational environments. Developed by a collaborative team of researchers, it provides a unified platform for defining tasks, running agents against live environments, and scoring their behavior.
Key Features
- Cross-environment support: A single agent can operate across multiple environments (for example, a desktop and a phone) within one task.
- Graph-based evaluator: Decomposes each task into a graph of sub-task checkpoints, giving fine-grained progress measurement instead of a single pass/fail verdict (see the sketch after this list).
- Automated task generation: Composes new tasks from reusable sub-tasks.
- Python-based configuration: Tasks and environments are defined in plain Python, simplifying setup and extension.
- Multiple communication and agent structures: Supports both single-agent and multi-agent settings.
- Includes 120 tasks: Spans Ubuntu and Android environments.
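
To make the graph-based evaluator concrete, here is a minimal Python sketch of the idea: a task is decomposed into checkpoint nodes with dependency edges, and progress is the fraction of satisfied nodes. All names here (`Checkpoint`, `GraphEvaluator`) are illustrative assumptions, not CRAB's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    """One verifiable sub-goal; `check` inspects live environment state."""
    name: str
    check: Callable[[], bool]
    deps: list[str] = field(default_factory=list)  # prerequisite checkpoint names

class GraphEvaluator:
    """Tracks which checkpoints in a task's dependency graph are satisfied."""

    def __init__(self, checkpoints: list[Checkpoint]):
        self.nodes = {c.name: c for c in checkpoints}
        self.completed: set[str] = set()

    def step(self) -> None:
        """After each agent action, re-check nodes whose prerequisites hold."""
        for node in self.nodes.values():
            if node.name in self.completed:
                continue
            if all(d in self.completed for d in node.deps) and node.check():
                self.completed.add(node.name)

    @property
    def completion_ratio(self) -> float:
        """Fraction of satisfied checkpoints; finer-grained than pass/fail."""
        return len(self.completed) / len(self.nodes)
```

In this sketch, a success-rate metric reduces to `completion_ratio == 1.0`, while the Completion Ratio listed under Technical Specifications corresponds to the fraction of satisfied nodes; composing sub-task graphs into larger graphs is also one way to picture the automated task generation feature above.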
Use Cases
- Evaluating multimodal AI agents' capabilities.
- Comparing performance of different language models.
- Testing AI agents' adaptability in realistic scenarios.
- Generating dynamic task sequences for AI testing.
- Benchmarking agent performance on diverse platforms.
Technical Specifications
- Environments: Ubuntu, Android.
- Supported Models: GPT-4o, Claude 3, Gemini 1.5 Pro, and open-source models.
- Evaluation Metrics: Completion Ratio, Success Rate, Termination Reason Analysis.
- Communication Settings: Single-agent and multi-agent.
- Visual Prompt Technique: Set-of-Mark (SoM).
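
As an illustration of how SoM-style visual prompting works (a sketch under the assumption that interactive-element bounding boxes have already been detected; `mark_screenshot` is a hypothetical helper, not CRAB's code), each element is overlaid with a numeric mark so the model can refer to on-screen elements by index rather than by pixel coordinates:

```python
from PIL import Image, ImageDraw

def mark_screenshot(img: Image.Image,
                    boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Overlay numbered marks on detected interactive elements (SoM-style)."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for idx, (x0, y0, x1, y1) in enumerate(boxes):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(idx), fill="red")  # the mark the agent cites
    return out
```

The agent can then respond with an action referencing a mark (e.g. tapping element 7) instead of emitting raw coordinates, which is the main benefit of mark-based prompting.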