Which AI Plays Piano Best?
Current Leaderboard
How It Works
A transparent, reproducible process for evaluating AI musical capabilities
Input
We provide a song (MIDI/sheet music)
Challenge
LLM creates a piano version using our open bridge
Evaluation
Automated scoring: accuracy, timing, completeness
Ranking
Models ranked on our public leaderboard
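The evaluation step above can be illustrated with a toy scorer. This is a minimal sketch, not LMPiano's actual algorithm: it matches each transcribed note to the nearest reference note of the same pitch, then derives accuracy (matched notes over transcribed notes), timing (matches within a tolerance window), and completeness (matched notes over reference notes). All names and the tolerance value are illustrative assumptions.

```python
def score_transcription(reference, transcription, timing_tolerance=0.05):
    """Toy scoring sketch. Each note is a (pitch, onset_seconds) tuple.

    NOTE: illustrative only -- not the real LMPiano scoring algorithm.
    """
    # Index reference onsets by pitch so each note can be matched at most once.
    ref_by_pitch = {}
    for pitch, onset in reference:
        ref_by_pitch.setdefault(pitch, []).append(onset)

    matched = 0
    timing_hits = 0
    for pitch, onset in transcription:
        onsets = ref_by_pitch.get(pitch, [])
        if onsets:
            # Greedily match against the nearest remaining reference onset.
            nearest = min(onsets, key=lambda t: abs(t - onset))
            onsets.remove(nearest)
            matched += 1
            if abs(nearest - onset) <= timing_tolerance:
                timing_hits += 1

    accuracy = matched / len(transcription) if transcription else 0.0
    timing = timing_hits / matched if matched else 0.0
    completeness = matched / len(reference) if reference else 0.0
    return {
        "accuracy": round(accuracy, 3),
        "timing": round(timing, 3),
        "completeness": round(completeness, 3),
    }
```

A perfect-pitch transcription with one late note would score full accuracy and completeness but lose timing credit for the offset note.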
Diverse Test Set
From nursery rhymes to Liszt études—200+ songs across 10 difficulty levels
Real-time Scoring
Results published within minutes of evaluation completion
Reproducible
Every evaluation uses versioned prompts and deterministic scoring
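One way to make "versioned prompts and deterministic scoring" auditable is to fingerprint each run's configuration. The sketch below is a hypothetical illustration (the function and field names are not LMPiano's real schema): it serializes the prompt version, prompt text, and scoring parameters deterministically, then hashes the record so a published result can be traced to the exact configuration that produced it.

```python
import hashlib
import json


def evaluation_fingerprint(prompt_version, prompt_text, scoring_params):
    """Hash an evaluation configuration for reproducibility.

    Hypothetical sketch -- field names are illustrative, not the
    real LMPiano schema.
    """
    record = {
        "prompt_version": prompt_version,
        "prompt_text": prompt_text,
        "scoring_params": scoring_params,
    }
    # sort_keys makes the JSON serialization order-independent,
    # so identical configurations always hash identically.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two runs with the same prompt version, text, and parameters produce the same fingerprint; bumping the prompt version changes it, so leaderboard entries can be pinned to a configuration hash.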
Why This Matters
Music is more than notes—it's structure, emotion, and creativity. Measuring how AI handles music probes abilities that text benchmarks miss: structured generation, precise timing, and long-range coherence.
For Researchers
Benchmark musical intelligence alongside text and reasoning. Explore how different architectures handle structured musical output.
For Developers
Test your model's multimodal capabilities with structured output. Integrate our scoring API into your evaluation pipelines.
For Music Tech
Advance AI's understanding of music theory and performance. Help define what musical intelligence means for machines.
Sample Test Cases
From nursery rhymes to concert masterpieces—see how AI models perform across difficulty levels
Twinkle Twinkle Little Star
Tests:
Best Score:
Top Model: Claude Sonnet 4

Für Elise (Beethoven)
Tests:
Best Score:
Top Model: GPT-4o

La Campanella (Liszt)
Tests:
Best Score:
Top Model: Claude Sonnet 4
200+ test cases across 10 difficulty levels
Browse All Test Cases
Built on Open Source
The entire LMPiano benchmark—from the scoring algorithms to the datasets—is open source. Inspect our methods, contribute improvements, or build your own evaluations.
import os

from lmpiano import Bridge, Scorer

# Initialize the evaluation bridge
bridge = Bridge(
    model="claude-sonnet-4",
    api_key=os.environ["API_KEY"],
)

# Load a test piece
piece = bridge.load_piece("fur_elise")

# Generate piano transcription
result = bridge.transcribe(piece)

# Score the output
scorer = Scorer()
scores = scorer.evaluate(
    original=piece,
    transcription=result,
    metrics=["accuracy", "timing", "expression"],
)

print(scores)
# Output: Score(overall=94.2, accuracy=96, timing=92)

Get Involved
Join the community building the future of AI music evaluation
Submit a Model
Add your LLM to the benchmark. Get scored on our full test suite and join the leaderboard.
- API integration
- Automated testing
- Public ranking
Contribute to Bridge
Help improve our evaluation framework. Fix bugs, add features, or expand test coverage.
- 23 open issues
- Good first issues
- Active maintainers
Explore Data
Download our datasets, review methodology, and use our benchmarks in your research.
- 200+ test pieces
- Detailed metrics
- Citation ready