This article is mathematical and theoretical; its content is not an opinion, but a mathematical proof I have always wanted to do.
Most programming languages and formal languages require variables to be declared before use. This requirement makes the languages non-context-free. To get a glimpse of why, I will define a dumb language that has the declare-before-use requirement (and pretty much nothing else) and prove it non-context-free by the pumping lemma. This of course just gives a glimpse, since I just treat one dumb language. To treat other languages, one must repeat a similar argument per language or prove a reduction theorem from them to my dumb language.
Then, why do we define programming languages by context-free grammars? The answer is we never really do. We use a context-free grammar to check algebraic expressions, parentheses, braces, etc. This context-free grammar cannot possibly check declare-before-use; for that, we specify a side condition and run an extra checker. This means we really define a programming language by a context-free grammar plus the declare-before-use condition plus this condition plus that condition... We do this because it is more well-structured and easier to implement piecemeal. We could try a non-context-free grammar that captures them all, but it would be monolithic and hard to manage.
My dumb language goes like this. The alphabet consists of four
characters ;
,
a
b
. The grammar goes as
S | ::= | Decl ; Use |
Decl | ::= | ( Var ( , Var )* )? |
Use | ::= | ( Var ( , Var )* )? |
Var | ::= | ( a | b )+ |
with the side condition that Var strings occurring in the Use part should also occur in the Decl part.
In words: you declare zero or more variables (delimited by commas),
then use zero or more of them (delimited by commas); semicolon
separates the declaration section and the use section; variable names
consist of a
's and b
's.
This language is kept dumb to ease the proof later and expose the essence. Therefore it lacks keywords and punctuations that make the language more ergonomic, or operators and commands that make the use section more useful, etc.
Note that without the side condition, the language starting at S is context-free, in fact regular. The side condition will cut it down to a non-context-free language.
I now recall the pumping lemma for context-free languages:
Of course I want to prove my dumb language not context-free. So I will use the lemma in contraposition:
Either way, the pumping lemma is a daunting tool to understand and use. A great deal of its intimidation comes from the nesting of four alternating quantifiers (forall, exist, forall, exist). It goes even further than your nightmarish calculus class: the definition of limits has just three (forall, exist, forall). But here is how I will fulfill my proof obligation:
Whew! Quite a bit of discussion of my duties and rights. I hope it benefits some reader. In general, I wish everyone were taught formal logic, and the above discussion were common knowledge.
So here is my proof:
a
mb
m;
a
mb
m,
i.e., I declare just one variable, then just use it once. The
variable name contains this twice: m times of a
followed
by m times of b
.
The choice is motivated by the following foresight. It involves just one variable to keep things simple. It needs to be longer than p (though out of laziness I make it much longer than necessary). The variable name is at least twice as long as q and has two distinguishable parts, since I want to restrict possible decompositions later (recall the decomposition has a middle part no longer than q, which will be useful). (It is possible to choose a shorter variable, but the proof will be harder.)
b
half of the variable declared, and x
is in the a
half of the variable used. Using i=0, some
b
's are missing in the declaration, or some
a
's are missing in the use, or both!(It so happens that I choose i=0 throughout. I do not have to. If you want, you can certainly use a different i for a different decomposition.)
This proof makes do with just having one variable, but it also relies on a bit of variety of characters allowed in variable names.
As a matter of curiosity, I wonder what if variable names can
contain only a
's, i.e., only lengths matter. (Not to say
that it happens in practice!) The above proof will fail — the
last decomposition case will be inconclusive, since the edit may luck
out and synchronize lengths. But I can design z differently. So here
is a second proof. This time you and definitely I are familiar with
the drill, and I can go much faster.
Let m=max(p,q) and choose z =
a
m;
a
m,
a
m.
I declare a variable and then use it twice. This time by having
|vwx|≤m, vwx and the edits viwxi will fail to
cover the declaration or the second use. There are also some minor
cases of missing or spurious semicolons or commas.
This idea is inspired by the famous theorem and proof that
{ a
nb
nc
n | n≥0 }
is non-context-free. If there are three parts to be synchronized,
most likely a context-free grammar can only manage two of them.
Here is yet another what-if, and the last important one in this
article. What if variable names can contain only a
's,
and each variable can be used at most once? This time declare two
variables and use them in the same order as declared. By choosing
their lengths long enough and different enough, local edits will miss
a declaration or a use.
Further what-if's seem to lead to context-free languages, e.g.,
only a
's and only one variable allowed in the use
section. So each proof uses one of several essential features:
variable names are interesting, or variables can be used several
times, or several variables can be used. All proofs use this
essential feature: variables are arbitrarily long. If the variable
supply is bounded, you can just use finitely many states to remember
which variables have been declared, and the language may become
regular or context-free.
I do not know what happens if variables are fixed length but the alphabet is infinite (so as to still provide an infinite supply of variables). The whole theory of formal languages assumes a finite alphabet. How to define context-free for infinite alphabets is unclear.