Monday 21 April 2014

What is wrong with Unix?

The problem with Unix is that "everything is a text file." Unix systems typically use linear text to represent structured data. This is a mistake, because there are many different linear sequences of characters that represent any particular piece of structured data, so the interpretation of data files is complex and therefore slow.

A parser is a program which takes a linear sequence of lexical tokens, which is a concrete representation, and produces a data structure, which is an abstract representation. Because there are many different concrete representations of any given abstract representation, the parser effectively quotients, or partitions, the space of possible concrete representations into equivalence classes, each identified by the abstract representation shared by all the equivalent concrete representations in that class.
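
As an illustration, here is a toy sketch of my own (nothing taken from Unix itself): a tiny tokenizer which maps many different concrete character sequences onto the same abstract token sequence, discarding the white space that distinguishes them.

/* Toy tokenizer: "1 + 2", "1+2" and " 1  +  2 " all yield NUM(1) PLUS NUM(2). */
#include <stdio.h>
#include <ctype.h>

void tokenize (const char *s) {
    while (*s) {
        if (isspace((unsigned char)*s)) { s++; continue; }   /* white space is discarded */
        if (isdigit((unsigned char)*s)) {
            int n = 0;
            while (isdigit((unsigned char)*s)) n = n * 10 + (*s++ - '0');
            printf ("NUM(%d) ", n);
        } else if (*s == '+') { printf ("PLUS "); s++; }
        else { printf ("?(%c) ", *s); s++; }
    }
    printf ("\n");
}

int main (void) {
    tokenize ("1 + 2");       /* three different concrete representations ... */
    tokenize ("1+2");
    tokenize (" 1  +  2 ");   /* ... of the same abstract token sequence */
    return 0;
}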

The problem is noise: a parser converting a text file into abstract syntax is in fact doing the same thing as an error-correcting decoder on a communications channel. We can think of the parser as a decoder, receiving a message on the input, and interpreting it to produce an abstract representation of the semantics of the message. Just as in the case of a Hamming code, in which there are many redundant forms of transmission which lead to the same message, the parser translates each of the many different possible concrete representations of the abstract syntax into the same message.

The translation process involves thermodynamic work, because the parser must reduce the information in the machine state: there is always more information in the input textual representation of the source code than there is in the output abstract syntax representation, so in parsing the input the system reduces the information in the machine's state, and must therefore produce entropy in the machine's environment. And any mechanical process which produces entropy takes time and consumes energy.

In the C programming language, for example, the term 'white space' denotes any sequence consisting entirely of space, tab or newline characters. As far as the parser of the C compiler is concerned, all sequences of white space are equivalent. White space therefore makes the character-sequence representation of programs redundant: we can replace any white space sequence in a program file with a different white space sequence, without changing the abstract syntax that program represents. Another source of redundancy in C source code is the operator precedence and associativity rules, which mean that in certain circumstances parentheses are redundant: any expression enclosed in well-balanced parentheses can be replaced by the same expression with one extra pair of parentheses. The following four files all define the same abstract syntax:
#include <stdio.h>
int main(int argc, char **argv)
 {
   printf ("Hello, World.\n");
 }

#include <stdio.h>
int main(int argc,
         char **argv)
   {printf ("Hello, World.\n");}

#include <stdio.h>
int main(int argc, char **argv)
{printf (("Hello, World.\n"));}

#include <stdio.h>
#define S "Hello, World.\n"
int main(int argc, char **argv) {
   printf (S);
}

Textual representation seems like a good idea, because it offers a very simple and elegant solution to the problem of concrete representation, which is, in some sense, the essence of the problem that an operating system is an attempt to solve.

For example, a C compiler may represent the number 365 as a sixteen-bit binary integer 0000000101101101. On an Intel i386 system, this number would be stored in two consecutive bytes, with the least significant byte first: 01101101 00000001. But on a Motorola 68000 system, the same number would be stored with the most significant byte first: 00000001 01101101. So, if these systems were to exchange the blocks of memory representing the number 365, over a network connection for example, then each would receive the number 27,905 from the other. If instead numbers were represented as null-terminated ASCII strings, then the number 365 would be represented as the four consecutive bytes 00110011 00110110 00110101 00000000 on both systems.
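
One way to see this on a real machine is the following minimal sketch (my own illustration, not part of the original argument), which prints the two bytes of the sixteen-bit value 365 in the order in which they are stored in memory:

/* Prints "6d 01" on a little-endian machine (e.g. Intel i386),
   and "01 6d" on a big-endian machine (e.g. Motorola 68000). */
#include <stdio.h>
#include <stdint.h>

int main (void) {
    uint16_t n = 365;                        /* 0x016d */
    unsigned char *p = (unsigned char *) &n;
    printf ("%02x %02x\n", p[0], p[1]);      /* bytes as laid out in memory */
    return 0;
}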

Unix systems have a line editor, ed, which can be used to create and edit text files. For example, the following session shows the creation of a text file hello, containing just one line, the text 'Hello, world.':

$ ed
a
Hello, world.
.
w hello
14
q

The first line invokes the editor program ed from the shell command prompt '$'. The editor is invoked with an empty text buffer. The second line is the command 'a', to append lines at the current point. The third line is the line to append, and the fourth line, a single full stop, signals the end of the append operation. The next line, 'w hello', writes the buffer to the file hello, and ed responds with 14, the number of characters written (thirteen characters plus the terminating newline). Finally the 'q' command terminates the editor and returns to the shell command prompt.

On any Unix system, the C compiler, cc, takes as an argument the name of a text file representing a C program, and compiles it to machine code, writing the output to another file. For example, the following program parses a list of numbers from the command line, adds them, and then prints the result. We can use ed to create a text file add.c containing the lines:
#include <stdio.h>

/* Convert a decimal string to an integer, ignoring any non-digit characters. */
int stoi (char *s) {
    int r = 0;
    char c;
    while ((c = *s++)) {
       if (c >= '0' && c <= '9') {
          r = r * 10 + (c - '0');
       }
    }
    return r;
}

/* Add up the decimal numbers given as command-line arguments and print the sum. */
int main (int argc, char **argv) {
    int r = 0;
    while (--argc) {
        r += stoi (*++argv);
    }
    printf ("%d\n", r);
    return 0;
}
We can compile this program by typing the following command at the '$' shell prompt:
$ cc -o add add.c
The resulting executable machine-code program, in the file add, can then be run from a shell script, which is a text file containing a sequence of shell commands. For example, we could use ed to create the shell script file test.sh containing just one line:
./add 300 65
And then we could run the shell script test.sh from the shell prompt, thus:
$ . ./test.sh
365
Now everything here was done using text-file representations of data and programs, and so this sequence of representations will work on any Unix machine, regardless of the convention for storing binary integers in memory, and regardless of the encoding used to represent characters. And this shows that the underlying machine representations of the data are arbitrary.

In Unix systems, the kernel, the shell sh, the editor ed, the C compiler, cc, and all the other system commands and applications are each just C programs, represented by some text files. So 'the Unix philosophy' of representing all data as text files and using system programming languages which interpret these text files, achieves independence from the underlying machine representation of data in memory. That is how Ken Thompson and Dennis Ritchie, the designers of Unix, abstracted the operating system from the particular accidents of the physical hardware, and it is a clever idea.

But as Thompson himself points out in his famous ACM Turing Award lecture "Reflections on Trusting Trust," this very fact means that the correspondence between the textual representation of a program's source code, and the actual operations carried out by the Unix system as it executes that program, is entirely arbitrary. In other words, the actions performed by the machine are completely independent of the concrete syntactic representation of the system. In yet other words, in case you still don't get it: the sequences of characters one sees in the source code text files are in fact utterly meaningless, and Unix is therefore, in principle, an insecure Operating System. And this was presumably the substance of the United States Air Force report on the security of an early implementation of the Multics operating system, to which Thompson attributes this idea.

As Thompson says in the lecture, the problem is a profound one, and it stems from the fact that the source code of any particular Unix system has a unique abstract syntax representation. It is essentially this uniqueness of the abstract syntax that enables the parser of the C compiler to identify the source of any system command, and thereby to choose an arbitrary abstract syntax representation for it, which could represent any operation whatsoever: Trojan semantics.

The solution turns the problem upside down: if we use an abstract syntax representation of the program source, we don't need the parser, and therefore there is no opportunity for the parser to choose an arbitrary operational semantics for the source code. The process of compilation then need not involve thermodynamic work: in converting the abstract syntax into machine code, there is not necessarily any reduction in information, because it is possible that the particular abstract syntax representation is the smallest information (i.e. the shortest message) which could produce the intended operational semantics. So the system source code is not the source code of any particular implementation; it is the source code of the general implementation.
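
As a purely illustrative sketch of my own (a toy, with hypothetical names, not the author's system), here is what a program represented directly as abstract syntax might look like in C. The program is constructed as a data structure, so there is no concrete textual representation and therefore nothing to parse.

/* A toy abstract syntax for arithmetic, built directly as data. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { NUM, ADD, MUL } Tag;

typedef struct Expr {
    Tag tag;
    int value;                      /* used when tag == NUM */
    struct Expr *left, *right;      /* used when tag == ADD or MUL */
} Expr;

Expr *num (int v) {
    Expr *e = malloc (sizeof *e);
    e->tag = NUM; e->value = v;
    return e;
}

Expr *node (Tag t, Expr *l, Expr *r) {
    Expr *e = malloc (sizeof *e);
    e->tag = t; e->left = l; e->right = r;
    return e;
}

/* Give the abstract syntax its operational semantics directly. */
int eval (Expr *e) {
    switch (e->tag) {
    case NUM: return e->value;
    case ADD: return eval (e->left) + eval (e->right);
    case MUL: return eval (e->left) * eval (e->right);
    }
    return 0;
}

int main (void) {
    /* The abstract syntax of (300 + 65), given directly, with no source text. */
    Expr *program = node (ADD, num (300), num (65));
    printf ("%d\n", eval (program));   /* prints 365 */
    return 0;
}

Here the operational semantics is given by a trivial interpreter rather than a code generator, but the point is the same: the program is never a string of characters.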

Why is this turning the problem upside down? Because by doing this, we can choose an arbitrary abstract syntax representation for any given operational semantics. So the tables are turned: the bad guys can no longer choose an arbitrary operational semantics for a given abstract syntax representation, because the good guys can choose an arbitrary abstract syntax representation of a given operational semantics. Rather than reflecting on trusting trust, we are reflecting on reflecting on trusting trust, which is the same as simply trusting trust.

But how can this be done in practice? We can use metaprogramming. As Thompson points out, the problem with Unix is that you cannot trust any program that you did not write yourself. But using metaprogramming, we don't write the programs we use, we write programs that write the programs we use. So all we need to do is make sure that we wrote the top-level program. But the top-level program is just the program which interprets an arbitrary abstract syntax representation according to an arbitrary abstract representation of the semantics. Then when that abstract syntax represents arbitrary abstract syntax, and that semantics is also arbitrary abstract semantics, this top-level program will produce itself, or a program which, when fed the same abstract syntax and semantics, will produce itself. This will then be the greatest fixed point, and we will know the system is sound and complete.

Having produced one such system in some specific target language, Scheme, say, it would be easier to define a system for another language, such as Standard ML, because we could use the first system to compile the second. We could then construct the greatest fixed point again, by crossing over the abstract syntax compilation steps, so that each system compiled the input of the other. We would then have a pair of fixed points which we knew were as reliable as each other.

We could then add semantics for another language, JavaScript, say, and construct a third fixed point, which would be each of the three systems crossed with each of the other two. Even though the encodings of the same abstract syntax in the three cases may be entirely different, these three systems could share abstract syntax representations, and we could be certain that all three representations, though different as representations, would have identical operational semantics.

The crucial step is the use of abstract syntax using an arbitrary encoding. Because of this, there is no possibility whatsoever of a Trojan recognising which program is being compiled, so there is no possibility of it reproducing itself in the encoded representation.

This is the foundation from which one could proceed to define abstract types representing the full set of instructions for each and every processor, and the associated encoding specifications which would encode those instructions with the given arguments. From this abstract specification, one could produce tools such as Intel's XED, automatically, for all processors, and in any language. And using that, one could specify an assembler. The next stage would be to specify a code generator for a generic bytecode language, and from there one could formally specify other languages such as C.
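
To give a flavour of what such an abstract instruction specification might look like, here is a toy sketch of my own (nothing like the real XED, and covering only three instructions): an abstract instruction type together with an encoder that maps it to x86 machine-code bytes.

/* Instructions as abstract data, with a separate encoding function. */
#include <stdio.h>
#include <stdint.h>

typedef enum { OP_NOP, OP_RET, OP_MOV_EAX_IMM32 } Opcode;

typedef struct {
    Opcode op;
    uint32_t imm;            /* used only by OP_MOV_EAX_IMM32 */
} Insn;

/* Encode one abstract instruction into buf; return the number of bytes written. */
int encode (Insn i, uint8_t *buf) {
    switch (i.op) {
    case OP_NOP: buf[0] = 0x90; return 1;
    case OP_RET: buf[0] = 0xC3; return 1;
    case OP_MOV_EAX_IMM32: {
        int k;
        buf[0] = 0xB8;                        /* mov eax, imm32 */
        for (k = 0; k < 4; k++)               /* little-endian immediate */
            buf[1 + k] = (i.imm >> (8 * k)) & 0xFF;
        return 5;
    }
    }
    return 0;
}

int main (void) {
    uint8_t buf[8];
    Insn program[2] = { { OP_MOV_EAX_IMM32, 365 }, { OP_RET, 0 } };
    int j, k, n;
    for (j = 0; j < 2; j++) {
        n = encode (program[j], buf);
        for (k = 0; k < n; k++) printf ("%02x ", buf[k]);
    }
    printf ("\n");    /* prints: b8 6d 01 00 00 c3 */
    return 0;
}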

The purpose of this higher-level development is not to generate a better C compiler: it is unlikely that a C compiler specified this way would be as good as, say, GCC. The purpose of adding metaprogrammed higher-level languages would be to enable more fixed points to be established. If the encoding representations can be produced in C, then the C compiler would be included in the fixed point, and this would be the foundation for generating GCC binaries that would be certain not to have any Trojan semantics. This would then extend to the entire GCC toolchain, and to any operating system built with the GNU tools.

The aim, however, is not to secure operating systems: that is only a temporary measure. The aim is to metaprogram all applications and algorithms, so that we don't need any operating systems at all; just the one global commons, the Mother of all programs: the Foundation.
