# Lecture 7 - CYK Algorithm and Chomsky Normal Form on Context Free Grammars

This lecture considers the CYK algorithm for parsing context-free grammars, which requires the grammar to be in Chomsky Normal Form.

## Context Free Grammars

A **context free grammar** consists of:

- A finite set $V$ of variables or non-terminals
- A finite set $T$ of terminals, $V \cap T = \empty$
- A finite set $P$ of productions or rules. Each production is of the form $H \rightarrow \beta$ where the head $H$ is a variable and the body $\beta$ is a string in $(V \cup T)^*$
- A start symbol $S \in V$
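As a rough illustration, the four components can be held in a small Python structure. The class name `CFG`, its field names, and the tiny example grammar (for strings of the form $a^n b^n$) are assumptions made for this sketch and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """Hypothetical container for a context free grammar (V, T, P, S)."""
    variables: frozenset    # V
    terminals: frozenset    # T
    productions: frozenset  # P: pairs (head, body), body is a tuple of symbols
    start: str              # S

    def is_well_formed(self) -> bool:
        symbols = self.variables | self.terminals
        return (
            self.variables.isdisjoint(self.terminals)   # V and T are disjoint
            and self.start in self.variables            # S is a variable
            and all(
                head in self.variables and all(sym in symbols for sym in body)
                for head, body in self.productions
            )
        )


# Assumed example grammar: S -> a S b | (empty body), generating a^n b^n.
anbn = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    productions=frozenset({("S", ("a", "S", "b")), ("S", ())}),
    start="S",
)
print(anbn.is_well_formed())
```

Storing each body as a tuple of symbols keeps the empty body representable as `()`.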

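We define a binary relation $\Rightarrow_G$ on $(V \cup T)^*$ by: $\alpha H \gamma \Rightarrow_G \alpha \beta \gamma$ whenever $H \rightarrow \beta$ is a production in $P$ and $\alpha, \gamma \in (V \cup T)^*$. Its reflexive transitive closure is written $\Rightarrow^*$.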

### Example: CFG for generating arithmetic expressions

**Variables:**

- $E$ expression, also the start symbol
- $T$ term
- $F$ factor

**Terminals:**

- $+, *$ arithmetic operators
- $()$ brackets
- $n,i$ numbers / identifiers (not to be confused with the variables in the CFG)

**Productions:**

Using these productions, we can build the parse tree of an expression:

## Chomsky Normal Form

A CFG is in Chomsky Normal Form if every production in $P$ is of the form $A \rightarrow BC$ or $A \rightarrow a$, where $A,B,C \in V$ and $a \in T$.

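For every CFG $G$ there is a CFG $G'$ in Chomsky Normal Form (CNF) such that $L(G') = L(G) \setminus \{\varepsilon\}$, so any CFG can be converted to Chomsky Normal Form without changing the non-empty strings it generates. The empty string is excluded because no production of the form above can derive it.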
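The shape condition in the definition can be checked mechanically. The sketch below assumes productions are stored as `(head, body)` pairs with the body as a tuple of symbols; the function name `is_cnf` and both example grammars are illustrative only.

```python
def is_cnf(variables, terminals, productions):
    """Return True if every production has the form A -> BC or A -> a."""
    for head, body in productions:
        if head not in variables:
            return False
        # A -> BC: exactly two symbols, both variables.
        binary = len(body) == 2 and all(sym in variables for sym in body)
        # A -> a: exactly one symbol, a terminal.
        single_terminal = len(body) == 1 and body[0] in terminals
        if not (binary or single_terminal):
            return False
    return True


V = {"S", "A", "B"}
T = {"a", "b"}

# Hypothetical grammar already in CNF: S -> AB, A -> a, B -> b.
in_cnf = {("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))}
# Hypothetical grammar not in CNF: the body mixes terminals and variables.
not_in_cnf = {("S", ("a", "S", "b")), ("S", ("a", "b"))}

print(is_cnf(V, T, in_cnf), is_cnf(V, T, not_in_cnf))
```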

Example productions for a general CFG:

After converting to Chomsky Normal Form:

This is a more restricted form, but every body is now either two variables or a single terminal, and this uniform shape is what the CYK algorithm relies on.

## CYK Algorithm

CYK stands for **Cocke-Younger-Kasami**. It is a parsing algorithm for context free grammars in Chomsky Normal Form.

### Membership Problem

For a fixed CFG in CNF, if we are given a string $s$ consisting of $n$ terminals, is there a derivation $S \Rightarrow^* s$?

We could solve this by exhaustively enumerating derivations, but the number of candidates grows exponentially with $n$, so this is very inefficient.

### Recurrence Relation

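Write $s = x_1 x_2 \ldots x_n$. For $i$ and $k$ with $1 \leq i \leq k \leq n$ we consider the set $V(i,k) \subseteq V$ defined by:

$$V(i,k) = \{A \in V \mid A \Rightarrow^* x_i x_{i+1} \ldots x_k\}$$

We have:

$$V(i,i) = \{A \in V \mid (A \rightarrow x_i) \in P\}$$

$$V(i,k) = \bigcup_{j=i}^{k-1} \{A \in V \mid (A \rightarrow BC) \in P,\ B \in V(i,j),\ C \in V(j+1,k)\} \quad \text{for } i \lt k$$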

For $V(i,i)$, we go through all productions to check which generate $x_i$, adding the heads of these productions to $V(i,i)$.

If $i \lt k$, we try every split point $j$ with $i \leq j \lt k$ and every production $A \rightarrow BC$. If $B \in V(i,j)$, so that $B$ derives $x_i \ldots x_j$, and $C \in V(j+1,k)$, so that $C$ derives $x_{j+1} \ldots x_k$, then we add $A$ to $V(i,k)$.

The string $s$ can be derived if and only if $S \in V(1,n)$.

### Pseudocode

```
begin
    for i <- 1 to n do
        V(i,i) <- { A in V | (A -> x_i) in P }
    for b <- 1 to n - 1 do
        for i <- 1 to n - b do
            k <- i + b
            V(i,k) <- empty
            for j <- i to k - 1 do
                for (A -> BC) in P do
                    if B in V(i, j) and C in V(j + 1, k) then
                        V(i, k) <- V(i, k) union { A }
    if S in V(1, n) then accept else reject
end
```
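A minimal Python sketch of the pseudocode follows. It assumes the grammar is passed as two sets, one of `(A, a)` pairs for productions $A \rightarrow a$ and one of `(A, B, C)` triples for productions $A \rightarrow BC$. The function name, parameter names, and the small test grammar are choices made for this example.

```python
def cyk(s, start, terminal_rules, binary_rules):
    """Decide whether `s` is derivable from `start` for a grammar in CNF.

    terminal_rules: set of (A, a) pairs for productions A -> a
    binary_rules:   set of (A, B, C) triples for productions A -> BC
    """
    n = len(s)
    if n == 0:
        return False  # productions of the form A -> BC or A -> a never derive the empty string
    # V[(i, k)] holds the variables deriving x_i ... x_k (1-indexed, inclusive).
    V = {}
    for i in range(1, n + 1):
        V[(i, i)] = {A for (A, a) in terminal_rules if a == s[i - 1]}
    for b in range(1, n):                  # b = k - i
        for i in range(1, n - b + 1):
            k = i + b
            V[(i, k)] = set()
            for j in range(i, k):          # split point between x_j and x_(j+1)
                for (A, B, C) in binary_rules:
                    if B in V[(i, j)] and C in V[(j + 1, k)]:
                        V[(i, k)].add(A)
    return start in V[(1, n)]


# Hypothetical CNF grammar for a^n b^n with n >= 1:
# S -> AB | AX, X -> SB, A -> a, B -> b
TERMINAL_RULES = {("A", "a"), ("B", "b")}
BINARY_RULES = {("S", "A", "B"), ("S", "A", "X"), ("X", "S", "B")}

for word in ["ab", "aabb", "aab", "abab"]:
    print(word, cyk(word, "S", TERMINAL_RULES, BINARY_RULES))
```

The three nested loops over $b$, $i$ and $j$ each run at most $n$ times, so for a fixed grammar the running time is $O(n^3)$.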

### Example

We have a CFG with $T = \{a,b\}$, $V = \{S,A,B,C\}$ and the productions:

To decide whether $s = baaba$ can be derived, we compute the values $V(i,k)$. We can trace this with a table, starting with the diagonal ($i = k$):

Next we look at $i = k-1$:

$V(1,3) = \empty$ since $BB$ (combining the black boxes) is not in the productions, and neither is any combination of blue boxes ($SA, AA, SC, AC$).
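The productions and the table for this example are not reproduced in these notes. As a sketch, the trace can be checked against the recurrence under the assumption that the grammar is the common textbook one, $S \rightarrow AB \mid BC$, $A \rightarrow BA \mid a$, $B \rightarrow CC \mid b$, $C \rightarrow AB \mid a$. This choice is consistent with the $V(1,3) = \empty$ argument above, but it is a guess. The code evaluates the recurrence top-down with memoisation rather than filling the table bottom-up, and its names are illustrative.

```python
from functools import lru_cache

# Assumed productions (not given in the notes): a common textbook CNF grammar.
TERMINAL_RULES = {("A", "a"), ("B", "b"), ("C", "a")}
BINARY_RULES = {
    ("S", "A", "B"), ("S", "B", "C"),
    ("A", "B", "A"),
    ("B", "C", "C"),
    ("C", "A", "B"),
}

s = "baaba"


@lru_cache(maxsize=None)
def V(i, k):
    """Variables deriving x_i ... x_k (1-indexed), taken directly from the recurrence."""
    if i == k:
        return frozenset(A for (A, a) in TERMINAL_RULES if a == s[i - 1])
    found = set()
    for j in range(i, k):  # split point
        left, right = V(i, j), V(j + 1, k)
        found |= {A for (A, B, C) in BINARY_RULES if B in left and C in right}
    return frozenset(found)


n = len(s)
# One row per value of b = k - i, starting with the diagonal.
for b in range(n):
    print([sorted(V(i, i + b)) for i in range(1, n - b + 1)])
print("accept" if "S" in V(1, n) else "reject")
```

Under this assumed grammar, $V(1,3)$ comes out empty and $S \in V(1,5)$, so $baaba$ is accepted.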

## Next time, on Algorithms II

Travelling salesman lmao