Which AI Plays Piano Best?
Current Leaderboard
How It Works
A transparent, reproducible process for evaluating AI musical capabilities
Input
We provide a song (MIDI/sheet music)
Challenge
LLM creates a piano version using our open bridge
Evaluation
Automated scoring: accuracy, timing, completeness
Ranking
Models ranked on our public leaderboard
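The evaluation step above can be illustrated with a toy scorer. This is a minimal sketch, not LMPiano's actual algorithm: it matches each transcribed note to the nearest reference note of the same pitch, then derives accuracy (matched notes over transcribed notes), timing (matches within a tolerance window), and completeness (matched notes over reference notes). All names and the tolerance value are illustrative assumptions.

```python
def score_transcription(reference, transcription, timing_tolerance=0.05):
    """Toy scoring sketch. Each note is a (pitch, onset_seconds) tuple.

    NOTE: illustrative only -- not the real LMPiano scoring algorithm.
    """
    # Index reference onsets by pitch so each note can be matched at most once.
    ref_by_pitch = {}
    for pitch, onset in reference:
        ref_by_pitch.setdefault(pitch, []).append(onset)

    matched = 0
    timing_hits = 0
    for pitch, onset in transcription:
        onsets = ref_by_pitch.get(pitch, [])
        if onsets:
            # Greedily match against the nearest remaining reference onset.
            nearest = min(onsets, key=lambda t: abs(t - onset))
            onsets.remove(nearest)
            matched += 1
            if abs(nearest - onset) <= timing_tolerance:
                timing_hits += 1

    accuracy = matched / len(transcription) if transcription else 0.0
    timing = timing_hits / matched if matched else 0.0
    completeness = matched / len(reference) if reference else 0.0
    return {
        "accuracy": round(accuracy, 3),
        "timing": round(timing, 3),
        "completeness": round(completeness, 3),
    }
```

A perfect-pitch transcription with one late note would score full accuracy and completeness but lose timing credit for the offset note.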
Diverse Test Set
From nursery rhymes to Liszt études—200+ songs across 10 difficulty levels
Real-time Scoring
Results published within minutes of evaluation completion
Reproducible
Every evaluation uses versioned prompts and deterministic scoring
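One way to make "versioned prompts and deterministic scoring" auditable is to fingerprint each run's configuration. The sketch below is a hypothetical illustration (the function and field names are not LMPiano's real schema): it serializes the prompt version, prompt text, and scoring parameters deterministically, then hashes the record so a published result can be traced to the exact configuration that produced it.

```python
import hashlib
import json


def evaluation_fingerprint(prompt_version, prompt_text, scoring_params):
    """Hash an evaluation configuration for reproducibility.

    Hypothetical sketch -- field names are illustrative, not the
    real LMPiano schema.
    """
    record = {
        "prompt_version": prompt_version,
        "prompt_text": prompt_text,
        "scoring_params": scoring_params,
    }
    # sort_keys makes the JSON serialization order-independent,
    # so identical configurations always hash identically.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two runs with the same prompt version, text, and parameters produce the same fingerprint; bumping the prompt version changes it, so leaderboard entries can be pinned to a configuration hash.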
Why This Matters
Music is more than notes—it's structure, emotion, and creativity. Measuring how AI handles music probes abilities that text benchmarks miss: structured generation, precise timing, and long-range coherence.
For Researchers
Benchmark musical intelligence alongside text and reasoning. Explore how different architectures handle structured musical output.
For Developers
Test your model's multimodal capabilities with structured output. Integrate our scoring API into your evaluation pipelines.
For Music Tech
Advance AI's understanding of music theory and performance. Help define what musical intelligence means for machines.
Sample Test Cases
From nursery rhymes to concert masterpieces—see how AI models perform across difficulty levels
Twinkle Twinkle Little Star
Tests:
Best Score:
Top Model: Claude Sonnet 4

Für Elise (Beethoven)
Tests:
Best Score:
Top Model: GPT-4o

La Campanella (Liszt)
Tests:
Best Score:
Top Model: Claude Sonnet 4
200+ test cases across 10 difficulty levels
Browse All Test Cases
Built on Open Source
The entire LMPiano benchmark—from the scoring algorithms to the datasets—is open source. Inspect our methods, contribute improvements, or build your own evaluations.
import os

from lmpiano import Bridge, Scorer

# Initialize the evaluation bridge
bridge = Bridge(
    model="claude-sonnet-4",
    api_key=os.environ["API_KEY"],
)

# Load a test piece
piece = bridge.load_piece("fur_elise")

# Generate piano transcription
result = bridge.transcribe(piece)

# Score the output
scorer = Scorer()
scores = scorer.evaluate(
    original=piece,
    transcription=result,
    metrics=["accuracy", "timing", "expression"],
)

print(scores)
# Output: Score(overall=94.2, accuracy=96, timing=92)

Get Involved
Join the community building the future of AI music evaluation
Submit a Model
Add your LLM to the benchmark. Get scored on our full test suite and join the leaderboard.
- API integration
- Automated testing
- Public ranking
Contribute to Bridge
Help improve our evaluation framework. Fix bugs, add features, or expand test coverage.
- 23 open issues
- Good first issues
- Active maintainers
Explore Data
Download our datasets, review methodology, and use our benchmarks in your research.
- 200+ test pieces
- Detailed metrics
- Citation ready