Declare-Before-Use Is Non-Context-Free

Albert Y. C. Lai, trebla [at] vex [dot] net

February 1, 2008

This article is mathematical and theoretical; its content is not an opinion, but a mathematical proof I have always wanted to do.

Most programming languages and formal languages require variables to be declared before use. This requirement makes the languages non-context-free. To get a glimpse of why, I will define a dumb language that has the declare-before-use requirement (and pretty much nothing else) and prove it non-context-free by the pumping lemma. This of course just gives a glimpse, since I just treat one dumb language. To treat other languages, one must repeat a similar argument per language or prove a reduction theorem from them to my dumb language.

Then, why do we define programming languages by context-free grammars? The answer is we never really do. We use a context-free grammar to check algebraic expressions, parentheses, braces, etc. This context-free grammar cannot possibly check declare-before-use; for that, we specify a side condition and run an extra checker. This means we really define a programming language by a context-free grammar plus the declare-before-use condition plus this condition plus that condition... We do this because it is more well-structured and easier to implement piecemeal. We could try a non-context-free grammar that captures them all, but it would be monolithic and hard to manage.

My dumb language goes like this. The alphabet consists of four characters ; , a b . The grammar goes as

S	::=	Decl `;` Use
Decl	::=	( Var ( `,` Var )* )?
Use	::=	( Var ( `,` Var )* )?
Var	::=	( `a` \| `b` )+

with the side condition that Var strings occurring in the Use part should also occur in the Decl part.

In words: you declare zero or more variables (delimited by commas), then use zero or more of them (delimited by commas); semicolon separates the declaration section and the use section; variable names consist of a's and b's.

This language is kept dumb to ease the proof later and expose the essence. Therefore it lacks keywords and punctuations that make the language more ergonomic, or operators and commands that make the use section more useful, etc.

Note that without the side condition, the language starting at S is context-free, in fact regular. The side condition will cut it down to a non-context-free language.

I now recall the pumping lemma for context-free languages:

premise:

L is context-free

conclusion:

exist p≥1, q≥1 :

forall z with z in L, |z|≥p :

exist u,v,w,x,y with z=uvwxy, |vwx|≤q, |vx|≥1 :

forall i≥0 :

uvⁱwxⁱy in L

Of course I want to prove my dumb language not context-free. So I will use the lemma in contraposition:

conclusion:

L is not context-free

premise:

forall p≥1, q≥1 :

exist z with z in L, |z|≥p :

forall u,v,w,x,y with z=uvwxy, |vwx|≤q, |vx|≥1 :

exist i≥0 :

uvⁱwxⁱy not in L

Either way, the pumping lemma is a daunting tool to understand and use. A great deal of its intimidation comes from the nesting of four alternating quantifiers (forall, exist, forall, exist). It goes even further than your nightmarish calculus class: the definition of limits has just three (forall, exist, forall). But here is how I will fulfill my proof obligation:

I'm given p, q. I'm told they're positive, but otherwise I don't know their actual values. But I can use them, e.g., later when it's my turn to choose things, my choice can depend on their values.
I have to choose a string z. It has to be in L, and its length has to be at least p. Hopefully I will choose a useful one. If necessary the string can depend on p and q, e.g., repeat something p times, repeat something else q times.
Now I'm given a decomposition of my string z into five parts u, v, w, x, y. (Some of them may be empty.) I'm told the middle parts vwx is no longer than q, and that v is non-empty or x is non-empty; but otherwise I don't know the actual decomposition. Most likely I will have to consider all possible decompositions (hopefully they all fall under just a few cases because my language L is dumb).
I have to choose a value for i (my choice can depend on all of the above) and show that uvⁱwxⁱy is not in L. I will talk of uvⁱwxⁱy as some edit of uvwxy and how it messes up.

Whew! Quite a bit of discussion of my duties and rights. I hope it benefits some reader. In general, I wish everyone were taught formal logic, and the above discussion were common knowledge.

So here is my proof:

I'm given p, q.
I let m=max(p,q) and choose z to be a^mb^m;a^mb^m, i.e., I declare just one variable, then just use it once. The variable name contains this twice: m times of a followed by m times of b.
The choice is motivated by the following foresight. It involves just one variable to keep things simple. It needs to be longer than p (though out of laziness I make it much longer than necessary). The variable name is at least twice as long as q and has two distinguishable parts, since I want to restrict possible decompositions later (recall the decomposition has a middle part no longer than q, which will be useful). (It is possible to choose a shorter variable, but the proof will be harder.)
My string is decomposed into uvwxy. Because |vwx|≤q≤m, that middle part vwx cannot receive more than half of the variable name. This motivates much of my choice of z. Also keep in mind that at least one of v,x is non-empty, and so uvⁱwxⁱy with i≠1 means adding or deleting at least one character. These factoids will be useful.
Finally I have to choose i and show that the new string uvⁱwxⁱy is not in L. I will do it with a case analysis over the decomposition uvwxy.
- vwx is entirely inside the variable before semicolon. Using i=0, the new string deletes characters from the variable declared without adjusting the variable used!
- vwx is entirely inside the variable after semicolon. Similar to the above, the new string edits the variable used without adjusted the variable declared!
- vwx sits over the semicolon, and v or x receives the semicolon. Using i=0, the new string lacks the compulsory semicolon!
- vwx sits over the semicolon, and w receives the semicolon. v is part of the variable declared, and x is part of the variable used. This is where I need |vwx| shorter than half of the variable name. So v is in the b half of the variable declared, and x is in the a half of the variable used. Using i=0, some b's are missing in the declaration, or some a's are missing in the use, or both!
(It so happens that I choose i=0 throughout. I do not have to. If you want, you can certainly use a different i for a different decomposition.)

This proof makes do with just having one variable, but it also relies on a bit of variety of characters allowed in variable names.

As a matter of curiosity, I wonder what if variable names can contain only a's, i.e., only lengths matter. (Not to say that it happens in practice!) The above proof will fail — the last decomposition case will be inconclusive, since the edit may luck out and synchronize lengths. But I can design z differently. So here is a second proof. This time you and definitely I are familiar with the drill, and I can go much faster.

Let m=max(p,q) and choose z = a^m;a^m,a^m. I declare a variable and then use it twice. This time by having |vwx|≤m, vwx and the edits vⁱwxⁱ will fail to cover the declaration or the second use. There are also some minor cases of missing or spurious semicolons or commas.

This idea is inspired by the famous theorem and proof that { aⁿbⁿcⁿ | n≥0 } is non-context-free. If there are three parts to be synchronized, most likely a context-free grammar can only manage two of them.

Here is yet another what-if, and the last important one in this article. What if variable names can contain only a's, and each variable can be used at most once? This time declare two variables and use them in the same order as declared. By choosing their lengths long enough and different enough, local edits will miss a declaration or a use.

Further what-if's seem to lead to context-free languages, e.g., only a's and only one variable allowed in the use section. So each proof uses one of several essential features: variable names are interesting, or variables can be used several times, or several variables can be used. All proofs use this essential feature: variables are arbitrarily long. If the variable supply is bounded, you can just use finitely many states to remember which variables have been declared, and the language may become regular or context-free.

I do not know what happens if variables are fixed length but the alphabet is infinite (so as to still provide an infinite supply of variables). The whole theory of formal languages assumes a finite alphabet. How to define context-free for infinite alphabets is unclear.

Go to Blog of Albert Y. C. Lai