BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

View PDF HTML (experimental)

Abstract:We introduce BenGER (Benchmark for German Law), a benchmark and dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The dataset combines 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. It includes a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions. We evaluate 12 contemporary LLM systems - closed flagship, efficiency-oriented, and open-weight - with a rubric-aligned LLM-as-a-Judge cross-validated against a multi-rater human-grading layer (three blind reviews per solution, six judge families benchmarked against the human pool). Closed-flagship systems lead the leaderboard across all three corpora, human-AI co-creation measurably improves on unaided human work, and the LLM judge tracks human grading at Pearson r=0.76 and Cohen's \k{appa}=0.60. System rankings are stable across judge families and two judges from independent providers clear the Calderon single-reviewer replacement bar on human-authored solutions.

Submission history

From: Sebastian Nagl [view email]
[v1] Wed, 27 May 2026 09:03:59 UTC (205 KB)
[v2] Mon, 1 Jun 2026 15:01:48 UTC (376 KB)
[v3] Tue, 30 Jun 2026 13:35:37 UTC (1,309 KB)

View PDF HTML (experimental)

Abstract:We introduce BenGER (Benchmark for German Law), a benchmark and dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The dataset combines 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. It includes a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions. We evaluate 12 contemporary LLM systems - closed flagship, efficiency-oriented, and open-weight - with a rubric-aligned LLM-as-a-Judge cross-validated against a multi-rater human-grading layer (three blind reviews per solution, six judge families benchmarked against the human pool). Closed-flagship systems lead the leaderboard across all three corpora, human-AI co-creation measurably improves on unaided human work, and the LLM judge tracks human grading at Pearson r=0.76 and Cohen's \k{appa}=0.60. System rankings are stable across judge families and two judges from independent providers clear the Calderon single-reviewer replacement bar on human-authored solutions.

Submission history

From: Sebastian Nagl [view email]
[v1] Wed, 27 May 2026 09:03:59 UTC (205 KB)
[v2] Mon, 1 Jun 2026 15:01:48 UTC (376 KB)
[v3] Tue, 30 Jun 2026 13:35:37 UTC (1,309 KB)