Why the four criteria actually matter
Candidates who fail the Goethe-Zertifikat Schreiben rarely fail at “grammar in general”. They fail because one of four named criteria slips under the 60% threshold. That is exactly the thing generic AI tools miss — they correct grammar and style, but they do not tell you which of the four criteria is at risk right now.
The official criteria live in the Goethe-Institut Modellsatz and Prüfungsordnung: Aufgabenerfüllung, Kohärenz, Wortschatz, Strukturen. Examiners apply these four labels at every level from A1 to C2. They are the only yardstick that counts toward your score.
If you have already read our article on the hybrid feedback model (AI scoring plus human validation), you are a step ahead. This article goes one layer deeper: what each criterion actually tests, where candidates typically lose points, and how AI feedback has to be built so it targets each criterion independently. For the broader tool framing, see our analysis AI vs Human, the definitive guide to Goethe writing feedback.
Criterion 1 — Aufgabenerfüllung: did you fulfil the brief?
Aufgabenerfüllung is the criterion that most often fails candidates without their noticing. It checks three things: Leitpunkte coverage, text type, and word count plus format. If you forget a Leitpunkt, you immediately lose a large share of the points on this criterion, and often the whole exam with it.
Leitpunkte. Every writing task lists three to five bullet points you must address explicitly, for example: “Describe your experience, justify your opinion, propose improvements, respond to a counter-position.” If you cover only three out of four, you lose measurable points. This is not a matter of taste; it is how points are allocated.
Text type. Forumsbeitrag, formal Brief, informal email, Stellungnahme, Erörterung, analytical essay: each has its own conventions. “Sehr geehrte Damen und Herren” in an informal forum post costs points. For in-depth B2 practice, see the B2 Forumsbeitrag step-by-step guide. For B1 letters, see B1 letter writing.
Word count. Under the minimum: points off. Significantly over the maximum: points off. Examiners apply this with surprising strictness — the number in the Modellsatz is a corridor, not a guideline.
How should AI feedback target this? An exam-tuned AI first checks whether every Leitpunkt appears in the text — not by keyword search but semantically (was the point actually answered). It warns you that you skipped Leitpunkt 3 before correcting a single comma. Generic AI tools almost never do this.
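Of the three Aufgabenerfüllung checks, the word-count corridor is the easiest to automate. The following is a minimal Python sketch, not GoetheCoach's implementation. The function names and the counting rule (a hyphenated compound counts as one word) are assumptions for illustration, and the minimum and maximum must come from the task sheet you are practising with.

```python
import re


def count_words(text: str) -> int:
    """Count word tokens. Assumption: a hyphenated compound is one word."""
    return len(re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text))


def check_word_corridor(text: str, minimum: int, maximum: int) -> str:
    """Report whether the text sits inside the corridor given by the task."""
    n = count_words(text)
    if n < minimum:
        return f"under minimum: {n} words, {minimum - n} short"
    if n > maximum:
        return f"over maximum: {n} words, {n - maximum} too many"
    return f"within corridor: {n} words"
```

Called with the limits printed on a task, `check_word_corridor(text, minimum, maximum)` gives the warning before you hand in the text rather than after.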
Criterion 2 — Kohärenz: does your text hold together?
Kohärenz scores whether your text functions as a single train of thought. The examiner watches three layers: sentence connections, paragraph organisation, and the argumentative arc. Generic AI mostly scores only the first layer (are there Konnektoren?) and ignores the other two.
Sentence connections. Are Konnektoren used functionally? Does “weil” actually introduce a cause, or is it there to satisfy a Konnektor quota? Are “aber”, “jedoch”, “dennoch” varied with intent or repeated mechanically?
Paragraph organisation. Each paragraph should address one clear aspect, with opening, development, and transition. Random line breaks give you formal paragraphs but not coherent structure.
Argumentative arc. At B2 and C1, examiners expect a recognisable arc: thesis, argument, counter-argument, conclusion. A pleasant collection of observations without a through-line loses points here. For the full Konnektoren reference for these levels, see Redemittel and Konnektoren for B2/C1.
How should AI feedback target this? An exam-tuned AI scores not just the presence of Konnektoren but whether they function. It flags “weil” clauses that do not introduce causes and “aber” clauses that do not mark contrast. That is a different requirement from generic grammar checking.
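Whether a Konnektor actually functions is a semantic judgement that a few lines of code cannot make. What can be sketched is the cheaper signal of mechanical repetition. The Python sketch below uses a hypothetical Konnektoren list and an arbitrary repetition limit; it only counts occurrences and flags overuse, so treat it as an illustration of the weaker, presence-level check rather than an exam-grade one.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list; a real tool would need far more.
KONNEKTOREN = {"weil", "aber", "jedoch", "dennoch", "obwohl", "während", "sodass"}


def konnektor_counts(text: str) -> Counter:
    """Count how often each listed Konnektor occurs (word-level match only)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t in KONNEKTOREN)


def overused(text: str, limit: int = 2) -> list[str]:
    """Return Konnektoren used more than `limit` times, sorted alphabetically."""
    return sorted(k for k, n in konnektor_counts(text).items() if n > limit)
```

A word-level match cannot tell whether “aber” marks a real contrast or whether “während” is used as a conjunction or a preposition. That gap is exactly the difference between checking presence and checking function.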
Criterion 3 — Wortschatz: is your vocabulary at level?
Wortschatz scores whether you use vocabulary at the expected level. The gap between B1, B2 and C1 expectations is brutal here, and this is exactly where candidates lose points by writing at too simple a lexical level.
| Level | Wortschatz expectation | Example phrasings |
|---|---|---|
| A2 | core vocabulary, simple adjectives | „gut“, „schlecht“, „wichtig“ |
| B1 | extended, functional, early abstraction | „meiner Meinung nach“, „im Allgemeinen“, „aus diesem Grund“ |
| B2 | differentiated, with Redemittel and abstractions | „in Bezug auf“, „im Hinblick darauf“, „vor diesem Hintergrund“ |
| C1 | argumentative, with nuance and idiomatic depth | „demgegenüber“, „infolgedessen“, „unter Berücksichtigung von“ |
| C2 | stylistic variety, register-aware, idiomatic | „im Lichte von“, „nicht zuletzt deshalb“, „mit Blick auf“ |
Common pitfalls. At B2, you should be confident with “in Bezug auf”, not just “über”. At C1, you should reach for argumentative phrasings like “unter Berücksichtigung von” or “demgegenüber” rather than staying at B2 standards.
How should AI feedback target this? An exam-tuned AI grades with the target level in mind: B1 vocabulary in a B2 text is flagged as a deficit. It does not suggest “more correct” words but level-appropriate ones. That is the most important distinction between grammar correction and exam-grade feedback.
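One rough way to picture a level-aware check is a phrase inventory per level. The Python sketch below uses only the example phrasings from the table above as its inventory. The function names are hypothetical, and a real tool would need a much larger, validated word list per level, so read this as a sketch of the idea rather than a working audit.

```python
# Inventory taken from the example phrasings in the table above (lowercased).
LEVEL_PHRASES = {
    "B1": ["meiner meinung nach", "im allgemeinen", "aus diesem grund"],
    "B2": ["in bezug auf", "im hinblick darauf", "vor diesem hintergrund"],
    "C1": ["demgegenüber", "infolgedessen", "unter berücksichtigung von"],
}


def phrases_by_level(text: str) -> dict[str, list[str]]:
    """List which inventory phrases of each level occur in the text."""
    lowered = text.lower()
    return {
        level: [p for p in phrases if p in lowered]
        for level, phrases in LEVEL_PHRASES.items()
    }


def lacks_target_level(text: str, target: str) -> bool:
    """True if no phrase from the target level's inventory appears."""
    return not phrases_by_level(text)[target]
```

A text that only triggers the B1 row while `lacks_target_level(text, "B2")` is true is the case described above: correct German, but a lexical level below what is being tested.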
Criterion 4 — Strukturen: do grammar and complexity meet the bar?
Strukturen rates the breadth of grammatical means you handle confidently. It is not just about avoiding errors — it is about showing exam-appropriate complexity. A B2 candidate who only writes simple main clauses loses points, even if those clauses are error-free.
| Structure layer | B1 expectation | B2 expectation | C1 expectation |
|---|---|---|---|
| Subordinate-clause word order | correct in simple dass-/weil- clauses | correct in multiply-nested subordination | stylistically secure with complex subordination |
| Konjunktiv II | in fixed politeness phrasings | active in hypotheses and polite proposals | in indirect speech, argumentative turns |
| Separable verbs | correct in main clauses | correct in subordinate clauses too | stylistically secure in complex sentences |
| Passive / passive paraphrase | rudimentary | actively deployed | differentiated with passive variants |
| Konnektoren complexity | und/aber/weil/wenn | obwohl/während/sodass | demgegenüber/infolgedessen/sofern |
How should AI feedback target this? An exam-tuned AI also checks for what is missing: did you use Konjunktiv II at all? Does your sentence complexity stay below B2 throughout? Generic grammar checkers only flag what is wrong, whereas exam-grade AI also flags what is too simple.
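Checking for what is absent can be sketched as a simple profile of the text: how many sentences contain any subordination, and how often Konjunktiv II shows up. The Python sketch below is a heuristic under stated assumptions. Both word lists are short and hypothetical, sentence splitting is naive, and matching single word forms is no substitute for real grammatical analysis.

```python
import re

# Hypothetical, incomplete lists used only for illustration.
SUBORDINATORS = {"dass", "weil", "wenn", "obwohl", "während", "sodass", "sofern"}
KONJUNKTIV_II = {"würde", "würden", "wäre", "wären",
                 "hätte", "hätten", "könnte", "könnten"}


def structure_profile(text: str) -> dict[str, int]:
    """Count sentences, sentences with a subordinator, and Konjunktiv II forms."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    with_subordination = 0
    konjunktiv_forms = 0
    for sentence in sentences:
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())
        if any(t in SUBORDINATORS for t in tokens):
            with_subordination += 1
        konjunktiv_forms += sum(t in KONJUNKTIV_II for t in tokens)
    return {
        "sentences": len(sentences),
        "with_subordination": with_subordination,
        "konjunktiv_ii": konjunktiv_forms,
    }
```

A profile with `with_subordination` near zero or `konjunktiv_ii` at zero does not prove the text is below level, but it is a cheap prompt to look at whether the complexity expected in the table above is visible at all.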
How the four criteria interact — weighting and thresholds
The four criteria are scored independently. A particularly strong Aufgabenerfüllung does not rescue weak Strukturen. Candidates who fall under 60% on any one of the four can fail the writing module, even if the other three are brilliant.
That has a direct tactical consequence: do not optimise your strongest criterion; lift your weakest. An exam-tuned AI shows you a per-criterion score after every practice text, so you know where to drill. For a compressed framework, see our 14-day final-prep plan for the Goethe-Zertifikat B2.
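The "lift your weakest" rule is simple enough to write down. The Python sketch below assumes per-criterion percentages are already available from some scoring step. The function name and the sample values are illustrative only, and the 60% figure is the threshold this article works with.

```python
PASS_THRESHOLD = 60  # percent, the per-criterion threshold used in this article


def drill_focus(scores: dict[str, int]) -> tuple[str, list[str]]:
    """Return the weakest criterion and every criterion below the threshold."""
    at_risk = [name for name, pct in scores.items() if pct < PASS_THRESHOLD]
    weakest = min(scores, key=scores.get)
    return weakest, at_risk


# Illustrative scores, not real exam data.
example = {
    "Aufgabenerfüllung": 80,
    "Kohärenz": 65,
    "Wortschatz": 55,
    "Strukturen": 70,
}
weakest, at_risk = drill_focus(example)
```

Note that no average is computed anywhere: a strong score in one criterion never offsets a weak one, which mirrors the independent scoring described above.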
How AI feedback should target all four criteria
An exam-tuned AI delivers four separate ratings. It does not just say “grammar OK” — it says “Aufgabenerfüllung 80%, Kohärenz 65%, Wortschatz 55%, Strukturen 70%” and flags what is pulling the Wortschatz score down. Here is the minimum bar a tool needs to clear to call itself exam-grade:
- Leitpunkte check: per practice text, explicit verification that each bullet from the brief was semantically answered.
- Text-type check: register, salutation, sign-off, format compared against the required text type.
- Word-count check: minimum and maximum boundaries verified, with an advance warning of likely point deductions.
- Konnektoren functionality test: not just presence, but whether the connection holds semantically.
- Level-appropriate Wortschatz audit: vocabulary compared against B1/B2/C1 expectation, with level-matched suggestions.
- Strukturen complexity audit: distribution of main vs subordinate clauses, Konjunktiv II appearances, passive ratio.
- Per-criterion score: four separate values with a flag on the weakest — that is your drill focus.
Candidates using a tool that does not meet this minimum bar are training themselves on generalities. That is the core diagnosis from our analysis AI vs Human.
What the 2026 format changes for the criteria
The modernised 2026 Modellsatz shifts the weights: digital writing tasks with tighter word counts, more Forumsbeiträge, more frequent semi-formal emails. The four criteria stay the same, but Aufgabenerfüllung (Leitpunkte fidelity) and Kohärenz (argument arc in less space) are tested harder. Because a criterion-based AI scores against those same four criteria, its feedback carries over to the new task formats. For the full breakdown, see our guide Goethe Exam 2026: what changed.
Key Takeaways
- Goethe-Zertifikat Schreiben is scored against exactly four criteria: Aufgabenerfüllung, Kohärenz, Wortschatz, Strukturen.
- Each criterion is rated independently — a weak one alone can sink the writing module.
- Aufgabenerfüllung most often fails on a forgotten Leitpunkt — not a grammar error but a content gap.
- Kohärenz scores whether Konnektoren are functional, not whether they are present — generic AI only checks presence.
- Wortschatz is level-aware: B2 expects phrases like “in Bezug auf”, C1 expects “demgegenüber” or “infolgedessen”.
- Strukturen also penalises the absence of complexity: consistently simple sentences cost points at B2 and C1.
- Exam-grade AI delivers four separate scores; generic AI delivers a fuzzy overall impression.
Frequently Asked Questions
What are the four criteria scored on Goethe writing?
Aufgabenerfüllung (Leitpunkte and text type), Kohärenz (logical structure and functional Konnektoren), Wortschatz (level-appropriate vocabulary), and Strukturen (breadth and complexity of grammar). Each is scored independently, and all four appear in the official Goethe-Institut Modellsatz and Prüfungsordnung.
What does Aufgabenerfüllung mean exactly?
Aufgabenerfüllung checks three things: whether every Leitpunkt in the brief is covered, whether you used the correct text type (Forumsbeitrag, Brief, Stellungnahme, Erörterung), and whether word count and format are right. A missed Leitpunkt costs the most points on this criterion.
Can a high score on one criterion rescue a weak one elsewhere?
No. The four criteria are scored independently. Slipping below 60% on any one of them risks losing the writing module — even with brilliant scores on the other three. Lift the weakest criterion, not the strongest.
How does Wortschatz differ across B1, B2 and C1?
B1 expects functional vocabulary with emerging abstraction; B2 requires Redemittel like “in Bezug auf” and “im Hinblick darauf”; C1 expects argumentative turns like “demgegenüber” and “infolgedessen”. Writing at B2 with B1 vocabulary loses points even when the grammar is flawless.
How is Kohärenz different from grammar?
Kohärenz scores whether sentences and paragraphs are functionally connected — whether a “weil” actually introduces a cause, whether paragraphs treat one clear aspect, whether the text has an argumentative arc. Grammar scores form; Kohärenz scores logic.
How should AI feedback target all four criteria separately?
An exam-tuned AI returns four separate scores plus per-criterion rationale. It checks Leitpunkte coverage semantically, scores Konnektoren by function, audits Wortschatz against level expectation, and measures Strukturen complexity. GoetheCoach is explicitly built this way.
Where do I find the official scoring criteria?
In the Goethe-Institut Modellsätze and Prüfungsordnung (goethe.de). We recommend reading one full Modellsatz per level — it turns “I am practising writing” into “I am practising what is tested”.
What does the 2026 format change about the criteria?
The four criteria stay the same, but the 2026 Modellsatz sharpens demands on Aufgabenerfüllung (digital text types, stricter Leitpunkte fidelity) and Kohärenz (argument arc in less space). More in our guide to the 2026 format.
How many practice texts should I write per criterion?
If you have spotted a weakness — say Wortschatz at B2 — write at least five targeted practice texts where you deliberately deploy level-appropriate Redemittel. With hybrid AI scoring, that is two to three weeks of focused work.
Cited Sources
- Goethe-Institut: official Modellsätze A1–C2 — goethe.de
- Goethe-Institut: Prüfungsordnung Goethe-Zertifikat — goethe.de
- Goethe-Institut: Bewertungskriterien Schreiben (Modellsatz appendix) — goethe.de
- GoetheCoach: four-criteria scoring methodology — goethecoach.de
- GoetheCoach: AI vs Human — the definitive guide — goethecoach.de
- Goethe-Institut: 2026 Modellsatz and format changes — goethe.de