
Cline Evaluation System

This directory contains the evaluation system for benchmarking Cline against various coding evaluation frameworks.

Overview

The Cline Evaluation System allows you to:

  1. Run Cline against standardized coding benchmarks
  2. Collect comprehensive metrics on performance
  3. Generate detailed reports on evaluation results
  4. Compare performance across different models and benchmarks

Architecture

The evaluation system consists of two main components:

  1. Test Server: An enhanced HTTP server in src/services/test/TestServer.ts that executes evaluation tasks and returns detailed results
  2. CLI Tool: Command-line interface in evals/cli/ for orchestrating evaluations
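
As a rough illustration of how the two components interact, the sketch below shows a hypothetical call from the CLI to the Test Server. The endpoint path, port, and payload shape are assumptions made for illustration only; the real contract is defined in src/services/test/TestServer.ts.

// Hypothetical CLI-side call to the Test Server (assumes Node 18+ for global
// fetch). Endpoint, port, and field names are illustrative, not the actual API.
interface TestServerTaskResult {
  success: boolean
  tokensIn: number
  tokensOut: number
  costUsd: number
  durationMs: number
  toolCalls: number
  toolFailures: number
}

async function runTask(workspaceDir: string, instructions: string): Promise<TestServerTaskResult> {
  const response = await fetch("http://localhost:9876/task", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ workspaceDir, instructions }),
  })
  if (!response.ok) {
    throw new Error(`Test server returned ${response.status}`)
  }
  return (await response.json()) as TestServerTaskResult
}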

Directory Structure

cline-repo/
├── src/
│   ├── services/
│   │   ├── test/
│   │   │   ├── TestServer.ts         # Enhanced HTTP server for task execution
│   │   │   ├── GitHelper.ts          # Git utilities for file tracking
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── evals/                            # Main directory for evaluation system
│   ├── cli/                          # CLI tool for orchestrating evaluations
│   │   ├── src/
│   │   │   ├── index.ts              # CLI entry point
│   │   │   ├── commands/             # CLI commands (setup, run, report)
│   │   │   ├── adapters/             # Benchmark adapters
│   │   │   ├── db/                   # Database management
│   │   │   └── utils/                # Utility functions
│   │   ├── package.json
│   │   └── tsconfig.json
│   ├── repositories/                 # Cloned benchmark repositories
│   │   ├── exercism/                 # Modified Exercism (from pashpashpash/evals)
│   │   ├── swe-bench/                # SWE-Bench repository
│   │   ├── swelancer/                # SWELancer repository
│   │   └── multi-swe/                # Multi-SWE-Bench repository
│   ├── results/                      # Evaluation results storage
│   │   ├── runs/                     # Individual run results
│   │   └── reports/                  # Generated reports
│   └── README.md                     # This file
└── ...

Getting Started

Prerequisites

  • Node.js 16+
  • VS Code with the Cline extension installed
  • Git

Activation Mechanism

The evaluation system uses an evals.env file to activate test mode in the Cline extension. When an evaluation is run:

  1. The CLI creates an evals.env file in the workspace directory
  2. The Cline extension activates due to the workspaceContains:evals.env activation event
  3. The extension detects this file and automatically enters test mode
  4. After evaluation completes, the file is automatically removed

This approach eliminates the need for environment variables during the build process and allows for targeted activation only when needed for evaluations. The extension remains dormant during normal use, only activating when an evals.env file is present. For more details, see Evals Env Activation.
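
A minimal sketch of this lifecycle in Node.js/TypeScript, assuming that the mere presence of the file is what triggers test mode (the real CLI may also write settings into it); the authoritative implementation is the evals-env command described under Usage below.

import * as fs from "fs"
import * as path from "path"

// Minimal sketch of the evals.env lifecycle. Assumes an empty file is enough
// to trigger the workspaceContains:evals.env activation event; the actual CLI
// may write additional content into the file.
function activateTestMode(workspaceDir: string): string {
  const envFile = path.join(workspaceDir, "evals.env")
  fs.writeFileSync(envFile, "") // extension activates and enters test mode
  return envFile
}

function deactivateTestMode(envFile: string): void {
  if (fs.existsSync(envFile)) {
    fs.unlinkSync(envFile) // extension returns to its dormant state
  }
}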

Installation

  1. Build the CLI tool:
cd evals/cli
npm install
npm run build

Usage

Setting Up Benchmarks

cd evals/cli
node dist/index.js setup

This clones and sets up all benchmark repositories. To set up only particular benchmarks, pass them explicitly:

node dist/index.js setup --benchmarks exercism

Running Evaluations

node dist/index.js run --model claude-3-opus-20240229 --benchmark exercism

Options:

  • --model: The model to evaluate (default: claude-3-opus-20240229)
  • --benchmark: Specific benchmark to run (default: all)
  • --count: Number of tasks to run (default: all)

Generating Reports

node dist/index.js report

Options:

  • --format: Report format, either json or markdown (default: markdown)
  • --output: Output path for the report

Managing Test Mode Activation

The CLI provides a command to manually manage the evals.env file for test mode activation:

node dist/index.js evals-env create  # Create evals.env file in current directory
node dist/index.js evals-env remove  # Remove evals.env file from current directory
node dist/index.js evals-env check   # Check if evals.env file exists in current directory

Options:

  • --directory: Specify a directory other than the current one

Benchmarks

Exercism

Modified Exercism exercises from the pashpashpash/evals repository. These are small, focused programming exercises in various languages.

SWE-Bench (Coming Soon)

Real-world software engineering tasks from the SWE-bench repository.

SWELancer (Coming Soon)

Freelance-style programming tasks from the SWELancer benchmark.

Multi-SWE-Bench (Coming Soon)

Multi-file software engineering tasks from the Multi-SWE-Bench repository.

Metrics

The evaluation system collects the following metrics:

  • Token Usage: Input and output tokens
  • Cost: Estimated cost of API calls
  • Duration: Time taken to complete tasks
  • Tool Usage: Number of tool calls and failures
  • Success Rate: Percentage of tasks completed successfully
  • Functional Correctness: Percentage of tests passed
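
For orientation, these metrics correspond roughly to a per-task record like the illustrative shape below. The field names are assumptions for this sketch, not the actual schema, which is defined in evals/cli/src/db/schema.ts.

// Illustrative per-task metrics record; field names are assumptions, not the
// actual schema from evals/cli/src/db/schema.ts.
interface TaskMetrics {
  taskId: string
  benchmark: string     // e.g. "exercism"
  model: string         // e.g. "claude-3-opus-20240229"
  tokensIn: number      // input tokens
  tokensOut: number     // output tokens
  costUsd: number       // estimated API cost
  durationMs: number    // wall-clock time to complete the task
  toolCalls: number     // total tool invocations
  toolFailures: number  // failed tool invocations
  succeeded: boolean    // task completed successfully
  testsPassed: number   // functional correctness: tests passed
  testsTotal: number    // functional correctness: tests run
}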

Reports

Reports are generated in Markdown or JSON format and include:

  • Overall summary
  • Benchmark-specific results
  • Model-specific results
  • Tool usage statistics
  • Charts and visualizations
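
As an illustration only, a JSON report might deserialize into a shape like the one below, mirroring the bullets above; the actual structure is whatever evals/cli/src/commands/report.ts emits.

// Illustrative shape of a JSON report; names are assumptions mirroring the
// bullets above, not the actual output of report.ts.
interface EvalReport {
  summary: { totalTasks: number; successRate: number; totalCostUsd: number }
  byBenchmark: Record<string, { tasks: number; successRate: number }>
  byModel: Record<string, { tasks: number; successRate: number; avgDurationMs: number }>
  toolUsage: Record<string, { calls: number; failures: number }>
}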

Development

Adding a New Benchmark

  1. Create a new adapter in evals/cli/src/adapters/
  2. Implement the BenchmarkAdapter interface
  3. Register the adapter in evals/cli/src/adapters/index.ts
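
The sketch below shows roughly what a new adapter might look like. The BenchmarkAdapter members shown here are assumptions based on what adapters need to do (prepare a repository and enumerate tasks); check the actual interface in evals/cli/src/adapters/ before implementing.

// Hypothetical adapter sketch; the real BenchmarkAdapter interface in
// evals/cli/src/adapters/ may differ in names and signatures.
interface EvalTask {
  id: string
  workspaceDir: string
  instructions: string
}

interface BenchmarkAdapter {
  name: string
  setup(): Promise<void>            // clone/prepare the benchmark repository
  listTasks(): Promise<EvalTask[]>  // enumerate runnable tasks
}

class MyBenchmarkAdapter implements BenchmarkAdapter {
  name = "my-benchmark"

  async setup(): Promise<void> {
    // e.g. clone the benchmark repository into evals/repositories/my-benchmark
  }

  async listTasks(): Promise<EvalTask[]> {
    // e.g. walk the repository and return one EvalTask per exercise
    return []
  }
}

Once registered in evals/cli/src/adapters/index.ts, the new benchmark should become selectable via the --benchmark flag.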

Extending Metrics

To add new metrics:

  1. Update the database schema in evals/cli/src/db/schema.ts
  2. Add collection logic in evals/cli/src/utils/results.ts
  3. Update report generation in evals/cli/src/commands/report.ts
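
As a purely illustrative sketch of the idea (the field name and report logic here are assumptions, not the real schema or report code), adding a hypothetical cacheReads metric means carrying one extra field from collection through to aggregation:

// Hypothetical sketch of a new metric flowing from collection to reporting.
// "cacheReads" is an invented example, not an existing field.
interface ExtendedTaskMetrics {
  taskId: string
  cacheReads: number // new metric added to the schema and collected per task
}

// Report-side aggregation of the new metric across all tasks in a run.
function totalCacheReads(tasks: ExtendedTaskMetrics[]): number {
  return tasks.reduce((sum, t) => sum + t.cacheReads, 0)
}

console.log(totalCacheReads([{ taskId: "a", cacheReads: 3 }, { taskId: "b", cacheReads: 5 }])) // 8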