Live benchmark • Updated daily

Which AI Plays Piano Best?

The open benchmark that ranks LLMs on musical intelligence, from simple melodies to complex compositions.

Open Source Bridge
100% Transparent Scoring
Community Driven
Live Rankings

Current Leaderboard

Last updated: Today at 09:42 UTC
Rank  Model             Overall Score
🥇    Claude Sonnet 4   94.2
🥈    GPT-4o            91.8
🥉    Gemini 2.0        89.5
4     Claude Haiku 3.5  85.3
5     Llama 3.3 70B     82.1

127 Models Tested
15.4K Evaluations Run
89 Contributors
2.1K GitHub Stars

How It Works

A transparent, reproducible process for evaluating AI musical capabilities

Step 1

Input

We provide a song (MIDI/sheet music)

Step 2

Challenge

The LLM creates a piano version using our open-source bridge

Step 3

Evaluation

Automated scoring: accuracy, timing, completeness

Step 4

Ranking

Models ranked on our public leaderboard
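
The Evaluation step above can be pictured with a small sketch. This is not LMPiano's actual scoring code: the `Note` type, the `score_notes` function, the 0.1-second tolerance, and the exact definitions of accuracy, timing, and completeness are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: int    # MIDI note number (60 = middle C)
    onset: float  # start time in seconds


def score_notes(reference, performance, tolerance=0.1):
    """Hypothetical scorer: compare played notes against a reference.

    A played note matches a reference note when the pitch is identical
    and the onsets differ by at most `tolerance` seconds.
    All three scores are returned on a 0-100 scale.
    """
    unmatched = list(performance)
    deviations = []
    for ref in reference:
        candidates = [
            n for n in unmatched
            if n.pitch == ref.pitch and abs(n.onset - ref.onset) <= tolerance
        ]
        if candidates:
            # Pair the reference note with the closest candidate in time.
            best = min(candidates, key=lambda n: abs(n.onset - ref.onset))
            unmatched.remove(best)
            deviations.append(abs(best.onset - ref.onset))

    matched = len(deviations)
    # Completeness: share of reference notes that were played.
    completeness = matched / len(reference) if reference else 0.0
    # Accuracy: share of played notes that belong to the piece.
    accuracy = matched / len(performance) if performance else 0.0
    # Timing: 1.0 for perfect onsets, 0.0 when the mean error hits the tolerance.
    timing = 1.0 - (sum(deviations) / matched) / tolerance if matched else 0.0

    return {
        "accuracy": round(100 * accuracy, 1),
        "timing": round(100 * timing, 1),
        "completeness": round(100 * completeness, 1),
    }


if __name__ == "__main__":
    reference = [Note(60, 0.0), Note(62, 0.5), Note(64, 1.0), Note(65, 1.5)]
    performance = [Note(60, 0.0), Note(62, 0.55), Note(64, 1.0)]
    print(score_notes(reference, performance))
```

In this toy run the last reference note is never played, so completeness drops while accuracy stays perfect. A real scorer would likely also consider note durations, dynamics, and pedalling.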

🎹

Diverse Test Set

From nursery rhymes to Liszt études—200+ songs across 10 difficulty levels

Real-time Scoring

Results published within minutes of evaluation completion

🔄

Reproducible

Every evaluation uses versioned prompts and deterministic scoring
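
One way to picture versioned prompts is to give every evaluation a fingerprint, so that a rerun with the same inputs can be checked against the published result. The sketch below is an assumption about how that could be done, not a description of the LMPiano codebase: `evaluation_fingerprint` and its field names are hypothetical.

```python
import hashlib
import json


def evaluation_fingerprint(model, piece_id, prompt_version, prompt_text):
    """Return a stable SHA-256 hex digest for one evaluation setup.

    Hypothetical helper: identical inputs always produce the same digest,
    and any change to the prompt or its version produces a different one.
    """
    payload = json.dumps(
        {
            "model": model,
            "piece_id": piece_id,
            "prompt_version": prompt_version,
            "prompt_text": prompt_text,
        },
        sort_keys=True,      # fixed key order keeps the serialization stable
        ensure_ascii=False,  # keep titles such as "Für Elise" readable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(evaluation_fingerprint(
        "example-model", "fur_elise", "v1", "Transcribe this piece for piano."
    ))
```

Storing the fingerprint next to each score lets anyone confirm that two results came from the same model, piece, and prompt before comparing them.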

Why This Matters

Music is more than notes: it is structure, emotion, and creativity. Measuring how an AI handles music offers a view of its abilities that text-only benchmarks miss.

For Researchers

Benchmark musical intelligence alongside text and reasoning. Explore how different architectures handle structured musical output.

For Developers

Test your model's multimodal capabilities with structured output. Integrate our scoring API into your evaluation pipelines.

For Music Tech

Advance AI's understanding of music theory and performance. Help define what musical intelligence means for machines.

Test Suite

Sample Test Cases

From nursery rhymes to concert masterpieces—see how AI models perform across difficulty levels

Beginner

Twinkle Twinkle Little Star

30 sec • Simple melody, single hand

Tests:

  • Basic note accuracy
  • Rhythm consistency
  • Tempo stability

Best Score: 98.7

Top Model: Claude Sonnet 4

Intermediate

Für Elise

Beethoven

3 min • Right-hand melody with left-hand accompaniment

Tests:

  • Two-hand coordination
  • Dynamics
  • Pedal markings

Best Score: 94.2

Top Model: GPT-4o

Advanced

La Campanella

Liszt

5 min • Complex virtuosic piece with rapid passages

Tests:

  • Speed accuracy
  • Wide interval jumps
  • Musical expression

Best Score: 78.4

Top Model: Claude Sonnet 4

200+ test cases across 10 difficulty levels
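
The "Best Score" and "Top Model" figures on each card, and a leaderboard-style ordering, can be derived from a table of per-piece results. The sketch below assumes a plain dictionary keyed by `(model, piece)` and an unweighted mean across pieces. Both are illustrative choices, not the benchmark's documented aggregation, and the model names and numbers are placeholders rather than real results.

```python
def best_per_piece(results):
    """Map each piece to its (top_model, best_score) pair."""
    best = {}
    for (model, piece), score in results.items():
        if piece not in best or score > best[piece][1]:
            best[piece] = (model, score)
    return best


def rank_models(results):
    """Order models by mean score across pieces, highest first.

    Assumption: every piece counts equally. A real leaderboard might
    weight pieces by difficulty level instead.
    """
    per_model = {}
    for (model, _piece), score in results.items():
        per_model.setdefault(model, []).append(score)
    means = {m: sum(s) / len(s) for m, s in per_model.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder data, invented for this example only.
    results = {
        ("model-a", "piece-1"): 90.0,
        ("model-b", "piece-1"): 80.0,
        ("model-a", "piece-2"): 50.0,
        ("model-b", "piece-2"): 70.0,
    }
    print(best_per_piece(results))
    print(rank_models(results))
```

Note that a model can hold the best score on one piece and still rank lower overall, which is why per-piece cards and the overall leaderboard can name different top models.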

Browse All Test Cases
100% Open Source

Built on Open Source

The entire LMPiano benchmark—from the scoring algorithms to the datasets—is open source. Inspect our methods, contribute improvements, or build your own evaluations.

Transparent scoring algorithms
Public datasets and benchmarks
Community contributions welcome
Use our bridge in your own research
2.1K Stars
342 Forks
89 Contributors
View on GitHub
evaluate.py
import os

from lmpiano import Bridge, Scorer

# Initialize the evaluation bridge
bridge = Bridge(
    model="claude-sonnet-4",
    api_key=os.environ["API_KEY"]
)

# Load a test piece
piece = bridge.load_piece("fur_elise")

# Generate piano transcription
result = bridge.transcribe(piece)

# Score the output
scorer = Scorer()
scores = scorer.evaluate(
    original=piece,
    transcription=result,
    metrics=["accuracy", "timing", "expression"]
)

print(scores)
# Output: Score(overall=94.2, accuracy=96, timing=92)
pip install lmpiano

Get Involved

Join the community building the future of AI music evaluation

Submit a Model

Add your LLM to the benchmark. Get scored on our full test suite and join the leaderboard.

  • API integration
  • Automated testing
  • Public ranking
Submit Now

Contribute to Bridge

Help improve our evaluation framework. Fix bugs, add features, or expand test coverage.

  • 23 open issues
  • Good first issues
  • Active maintainers
Start Contributing

Explore Data

Download our datasets, review methodology, and use our benchmarks in your research.

  • 200+ test pieces
  • Detailed metrics
  • Citation ready
Browse Datasets