As we use the machine, we necessarily interact with it, which means that we adjust the machine state according to our intentions. We are typically choosing from lists of available options in the given input state: in other words, through our conscious intentions, we are choosing or determining the actual state transitions from amongst the possible state transitions. So the user interface must allow us to know what the machine state is, and to alter it according to our wishes. The machine state has three parts: the structure being deconstructed, which is the input; the program that is doing the deconstruction, which is the control; and the structure it is constructing in the process, which is the output. Therefore the requirements of a user interface are these: we need to see, or at least be aware of, the possible state transitions at any given instant; we need to be able to choose a new state from amongst those possible; and we then need to see, or at least to be aware of, the subsequent actual state, because this will be the start of the next interaction with the machine. This is the principle of a feedback loop, and it is fundamental to any cybernetic system.
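To make the loop concrete, here is a minimal sketch in OCaml, with illustrative types of our own invention (none of these names come from the text): the interface enumerates the possible transitions out of the current state, the user chooses one, and the resulting actual state starts the next iteration.

```ocaml
(* A sketch of the interaction feedback loop: see the possible transitions,
   choose one, observe the subsequent actual state, repeat. *)

type state = { input : string; control : string; output : string }

type transition = { label : string; apply : state -> state }

let rec interact
    (possible : state -> transition list)    (* what transitions are possible *)
    (choose : transition list -> transition) (* the user's conscious choice   *)
    (s : state) : unit =
  let options = possible s in
  let t = choose options in
  let s' = t.apply s in       (* the subsequent actual state...            *)
  interact possible choose s' (* ...is the start of the next interaction   *)
```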
Of course it is not only the user who interacts with the machine; the environment does too, through various connections: when the ambient temperature changes, it interacts with the machine via internal thermal sensors; rapid changes in air pressure interact with it via a microphone; and visible electromagnetic radiation interacts with it via a camera. Other machines interact with it too. For example, a phone interacts with it via a short-range radio connection such as WiFi or Bluetooth, another computer interacts with it via a network connection, and a satellite interacts with it via a GPS radio receiver connected to a USB interface. These interactions are all composed of synchronous events, so-called because they are represented simultaneously on both sides of the interface. There are also asynchronous events, when the machine state changes spontaneously, marking the passage of internal time according to some cyclical frequency standard: in these cases, the new actual state is one from which a transition to the subsequent state in the cycle is possible. Asynchronous events are not interactions, they are just actions. There are other types of asynchronous event, such as the generation of hardware random numbers, which is typically done by counting the number of synchronous events that occur between consecutive events in an asynchronous cycle.
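The random-number technique just mentioned can be sketched in a few lines. In this OCaml fragment, await_tick and sync_event_count are hypothetical primitives standing in for the asynchronous clock and a free-running counter of synchronous events; they are our assumptions, not part of any real API.

```ocaml
(* Generate one random bit by counting synchronous events between two
   consecutive ticks of an independent asynchronous cycle: the relative
   jitter of the two clocks makes the low-order bit unpredictable. *)
let random_bit ~(await_tick : unit -> unit)
               ~(sync_event_count : unit -> int) : int =
  await_tick ();
  let start = sync_event_count () in
  await_tick ();
  let stop = sync_event_count () in
  (stop - start) land 1
```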
The phase space of a system is the set of all its possible internal states, and the system state at any moment is one single point in that space. Now in order to use a machine, we need to determine what it is supposed to be doing. So we need to be able to interpret the information we have representing the internal state. The act of interpretation of that information is what we call the meaning or significance of the computation or communication. The term significance shares a common root with words like sign, and in an automatic system, it is the significance of the state, combined with our intentions, which determines the subsequent state. This whole process of interpreting information and intentionally determining the subsequent state transitions is called operational semantics.
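The phrase operational semantics has a precise programming counterpart: a set of rules saying which transition follows from the current state. A minimal small-step sketch in OCaml, for a toy expression language of our own:

```ocaml
type expr = Num of int | Add of expr * expr

(* One step of the transition relation: [step e] is the next state,
   or [None] when no further transition is possible. *)
let rec step : expr -> expr option = function
  | Num _ -> None
  | Add (Num a, Num b) -> Some (Num (a + b))
  | Add (a, b) ->
      (match step a with
       | Some a' -> Some (Add (a', b))
       | None ->
           match step b with
           | Some b' -> Some (Add (a, b'))
           | None -> None)
```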
The amount of information we need to interpret to determine the meaning of a single point in the system phase space is effectively infinite, because that state includes not only the internal state of the machine's memory and storage, but also the state of the environment with which the system interacts. This interaction could be via waste heat and temperature sensors, via network interfaces, via stray electromagnetic emissions and radio transmitters and receivers, via loudspeakers and microphones, or via video displays and cameras, etc., etc.
Therefore, to do useful work with a computer, we need to partition the phase space into finitely many classes, such that all of the system states within any one class have the same significance. But how we choose to make these partitions depends crucially on our intentions: i.e. what we are expecting the system to do. In yet other words, it depends on what the system is supposed to do. One way to make such a partition is to conceptually divide the whole into subsystems. A subsystem is a projection of the phase space of the whole system onto a lower-dimensional space. Then we can choose to make a measurement of the state of a subsystem by determining a point of this lower-dimensional space. Each single point of the subspace then corresponds to a region (a multi-dimensional volume consisting of many points) of the phase space of the whole system.
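Measurement as projection can be written down directly. In this OCaml sketch the field names are illustrative assumptions; the point is that every whole-system state mapping to the same point of the subspace belongs to the same equivalence class.

```ocaml
type whole = { keyboard : char list; audio : int array; temperature : float }
type sub   = { temperature_reading : float }   (* the subsystem we measure *)

(* Projection of the whole phase space onto a lower-dimensional subspace. *)
let project (w : whole) : sub = { temperature_reading = w.temperature }

(* Two whole-system states have the same significance for this measurement
   exactly when they project to the same point of the subspace. *)
let equivalent (w1 : whole) (w2 : whole) : bool = project w1 = project w2
```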
Some subsystem divisions are called orthogonal, because they effect a partition of the phase space of the whole system into disjoint equivalence classes, which are independent in the sense that a change of sub-state in one subspace does not affect the sub-state in any other subspace. In such a partitioning of a system, learning about the sub-state of one part tells you nothing about that of another part. For example, one typically does not expect to be able to learn anything about the keys pressed on the keyboard from observations of the state of the computer's audio input. In this sense, then, the two subsystems could be said to be orthogonal. But this is an abstraction we make, and it's not always the case, because if there is a microphone connected to the audio input then these two particular subsystems are coupled through the environment, thermally as well as electromagnetically. It is the entanglement of the discrete, finite, internal states of the system with the whole (non-determinable) environmental state that makes the actual phase space of the whole infinite.
This is not a problem, however, because it is only the infinite information in the environment which makes the observable behaviour of computers interesting: we can use them to measure things. And the environment, don't forget, includes the mind of the user of the machine, who can influence the state transitions of the system according to what they actually know. So the system state can tell us things that are consequences of our knowledge, but of which we were not previously aware: for example, it may formally prove a new mathematical theorem.
In any act of measurement, which includes any computation or communication, we interpret the information input from the environment, and that tells us about temperature, intensity of laser light, stock market prices, news headlines, mathematical theorems, and so on and so forth. The act of interpretation is crucial to the whole enterprise, but, presumably because it is so natural and instinctive, it is all too often forgotten by theorists. Quantum physicists seem particularly prone to this omission. It is very rare to see the question of interpretation of physical measurements mentioned in introductory texts on quantum mechanics: almost without exception, they seem to take it for granted. But without this possibility of interpreting the environmental entanglement of the system state, computers and clocks would be deterministic and, like Turing's a-machines, utterly incapable of telling us anything we didn't already know.
The process of measurement is always a restriction to a finite amount of information representing some part of the whole environmental state of the system. For example, a microphone input measures the differences in air pressure at 48 kHz, within some particular range, and interprets them as tiny voltage fluctuations, which an ADC measures and interprets as binary values on a 16-bit digital input port register. Note that the finite fidelity of the input information is important: without this, we would not be able to assign any significance to a measurement, because, continuing this example, we wouldn't know whether we were measuring the aging of the microphone diaphragm, the barometric pressure change caused by an incoming weather-front, an audible sound, or an asteroid impact.
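The finite range and fidelity can be made concrete in a few lines. This OCaml sketch assumes, purely for illustration, a full-scale input of plus or minus one volt and a signed 16-bit output:

```ocaml
(* Quantise one sample: clamp to the finite range, then map to one of
   2^16 values. Everything outside the range, and every difference finer
   than one step, is deliberately indistinguishable. *)
let quantise (v : float) : int =
  let clamped = max (-1.0) (min 1.0 v) in
  int_of_float (Float.round (clamped *. 32767.0))
```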
Therefore the user interface must allow us to determine the programmed responses of the system to the limited information representing the subsystems we have abstracted from its environment, because it is only by changing these programmed responses that we can determine the subsystems that we are measuring, and by which we attach significance to the information we receive.
For example, in order to receive a written message by e-mail, we typically restrict our attention to a limited part of a screen, in which the strings of character glyphs representing the written words of the message are represented by, say, black pixels on a white background. In determining the significance of the message, we ignore the rest of the screen, and all the rest of the environment with which the system state is entangled. So, for example, the intended significance of the message is not typically affected by the sound that happens to be playing through the speaker at the time, nor by the contents of messages subsequently received, but as yet unread. Of course the significance of the message may be affected by these things, but we say not typically to emphasise the importance of the intention of the user in choosing the representation of the channel this way. Had she chosen to have some of her e-mail messages read out aloud by speech synthesis, then the audio output would typically affect the significance of the messages.
Now let us give an extended example of computer use. Imagine Alice is supposed to produce a series of satellite images, overlaid with graphics representing a road network developing over a period of fifteen years or so. In her home directory, she has an ASCII text data file, giving sectors of roads of different types and different construction dates, represented as sequences of points given as eastings and northings from a certain false origin on a transverse Mercator grid with a datum on the Airy spheroid. The grid coordinates are in units of kilometres, given to 3 decimal places. The road types are just the numbers 1, 2 and 3, and the dates are numeric, in the form DD/MM/YY. She also has a URL giving access to a USGS database of satellite images via a Java applet which allows her to select the mission, the image sensor, one square of the 30 arc-second mosaic defined in the WGS 84 coordinate system, the year, the month, and the minimum percentage cloud cover. The applet allows her to view thumbnail GIF images, and to request the full resolution images, which are JPEG data embedded in GEOTIFF files, and sent by e-mail within a few hours.
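To fix ideas, here is a sketch in OCaml of parsing one record of such a file, under an assumed concrete layout of one sector per line, e.g. "2 03/11/87 451.202,387.019 451.530,386.884"; the text fixes the fields but not the layout, so the format here is our invention.

```ocaml
type sector = {
  road_type : int;                   (* 1, 2 or 3                       *)
  built     : int * int * int;       (* day, month, year from DD/MM/YY  *)
  points    : (float * float) list;  (* eastings, northings in km       *)
}

let parse_sector (line : string) : sector =
  match String.split_on_char ' ' line with
  | ty :: date :: pts ->
      { road_type = int_of_string ty;
        built = Scanf.sscanf date "%d/%d/%d" (fun d m y -> (d, m, y));
        points =
          List.map (fun p -> Scanf.sscanf p "%f,%f" (fun e n -> (e, n))) pts }
  | _ -> failwith "malformed sector line"
```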
What she needs to do is parse the data file defining the roads, order the sectors of roads by increasing construction date, then find matching satellite images around the same time for the different tiles of the region of interest. Then she needs to find those images with the least cloud cover, request the files from the server, and, when the e-mails arrive, extract the data files from the mail messages. She must then tile these images together using a spherical projection, according to the geographical coordinate metadata in the TIFF headers.
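One subtlety in the first step deserves a line of code: sorting by a DD/MM/YY date means reordering the fields to (year, month, day) before comparing. A sketch, building on the sector type above:

```ocaml
(* Order sectors by increasing construction date, assuming (for the
   sketch) that all two-digit years fall within one century. *)
let by_date (sectors : sector list) : sector list =
  let key { built = (d, m, y); _ } = (y, m, d) in
  List.sort (fun a b -> compare (key a) (key b)) sectors
```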
Now she needs to find all the sectors of roads that were constructed before the date of the most recent tile in that composite image. Then she needs to convert the grid points of the roads to WGS 84 coordinates using the same spherical projection, to give page coordinates which are used to overlay SVG paths representing the roads, according to a line-style and width determined by road type, and a colour key indicating the approximate age of the sectors. Then she needs to project the UTM grid, draw the border of the WGS 84 graticule, and overlay a title and a key showing the construction dates of the roads. And she needs to do this at approximately two-year intervals, covering a period of fifteen years. These images will form part of a written presentation containing other text and images, some part of which will be presented on a television programme via a Quantel video graphics system which will broadcast directly from digital representations of the images in a proprietary format, reading them from an ISO-9660 format CD-ROM and inserting them into a video stream according to synchronous cues provided by the presenter pressing a button.
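Returning to the overlay step for a moment: generating the SVG paths is mechanical once the points are in page coordinates. A sketch, with an invented mapping from road type to stroke width:

```ocaml
(* Emit one SVG path for a sector whose points are in page coordinates. *)
let svg_path (pts : (float * float) list) (road_type : int) : string =
  let width = [| "0.5"; "1"; "2" |].(road_type - 1) in   (* illustrative *)
  let d =
    String.concat " "
      (List.mapi
         (fun i (x, y) ->
            Printf.sprintf "%s%.3f %.3f" (if i = 0 then "M" else "L") x y)
         pts)
  in
  Printf.sprintf "<path d=\"%s\" stroke-width=\"%s\" fill=\"none\"/>" d width
```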
While she's doing this, her phone rings. She turns down the volume of the music to answer the call. Afterwards she sends an e-mail to someone else, after looking up their e-mail address on a web page. Finally she makes a note of a scheduled meeting in her diary. That done, she turns up the volume of the music a little, and carries on with her work.
It turns out, then, that the user interface is primarily concerned with switching information channels from one source to another. Now by assumption, a channel is some kind of recursive data structure, and a source is a process which deconstructs one such recursive structure, the input, and constructs another, which is the output. This general form of process is called interpretation.
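The general form of process just named can be written down directly. A minimal OCaml sketch, with a deliberately simple recursive channel type of our own:

```ocaml
(* A channel as a recursive data structure. *)
type 'a channel = End | Msg of 'a * 'a channel

(* A source: deconstructs one recursive structure (the input channel) and
   constructs another (the output channel) -- i.e. an interpretation. *)
let rec interpret (f : 'a -> 'b) : 'a channel -> 'b channel = function
  | End -> End
  | Msg (x, rest) -> Msg (f x, interpret f rest)
```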
Now we can't yet answer our original question, but we can say this much:
A user interface is a means to create new information channels by plugging sources into existing channels.
So that's what a user interface actually does. It remains to say what it is supposed to do. And the answer is in that word supposed, because what it is supposed to do is what we intend it to do. This means that an effective user interface must be intensional: it must allow us to express the operation of the system intensionally. This is because it is only intensional representations which can be effectively interpreted. An intensional representation is one that is essential, in so far as it is free of any and all accidents of representation, such as superfluous parentheses, white space, and arbitrary variable names. So the ideal user interface should allow one to express the intended operation of the system using abstract syntax, rather than the concrete syntax one typically works with in a text editor.
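What "free of accidents of representation" looks like in practice: an abstract-syntax term carries no parentheses, no white space, and, with de Bruijn indices, not even arbitrary variable names. A sketch in OCaml:

```ocaml
type term =
  | Var of int           (* a de Bruijn index, not a chosen name *)
  | Lam of term
  | App of term * term

(* The strings "fun x -> x y" and "fun  w ->(w  y)" differ in every
   accidental respect, yet both denote this one abstract term
   (index 0 is the bound variable, index 1 the free variable y): *)
let example = Lam (App (Var 0, Var 1))
```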
Let us make this very clear: the notion of using text representation to express semantics is fundamentally wrong. Anyone who doesn't believe this need only take a brief look at the two examples on lists given on page 91 of Girard, Lafont and Taylor's Proofs and Types to see just what kind of a mess one inevitably gets into if one tries to deal with the semantics of concrete syntax, such as operator precedence and associativity. It is not a simple kind of mess! If you like clearing up that sort of problem, then you're welcome, but you will only ever be clearing up one particular instance, and you will never be bored again, because there will always be plenty more such problems for you to solve. The only general solution is to avoid concrete syntax altogether.
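To see the shape of the problem in miniature, consider how much convention a single concrete string drags in. This OCaml fragment is our own illustration, not the example from the book:

```ocaml
type arith = Num of int | Sub of arith * arith

(* The concrete string "1 - 2 - 3" corresponds to two different abstract
   terms; only an associativity convention buried in the grammar chooses
   between them. In abstract syntax there is nothing to choose: *)
let left_assoc  = Sub (Sub (Num 1, Num 2), Num 3)  (* evaluates to -4 *)
let right_assoc = Sub (Num 1, Sub (Num 2, Num 3))  (* evaluates to  2 *)
```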
This would have the not insignificant side-benefit of avoiding so-called 'religious wars' such as that between some adherents of OCaml and Standard ML. This is a problem only because programmers insist on using linear text representations of program source code. If they switched to using abstract syntax representation, then programs in either of the two languages could be translated immediately, one into the other (notwithstanding semantic insanities such as the typing of OCaml's optional arguments, which not even the OCaml compiler can interpret).
Using abstract syntax would have other advantages too, because editors could easily rename variables, merge change sets, and even translate program keywords into other languages. For example, someone in Pakistan could use Standard ML with Urdu syntax, written right to left, and with meaningful Urdu names for keywords and program variables, and these could be automatically translated, via a dictionary, into Chinese OCaml, written from top to bottom, and with variable names that mean the same things in Chinese. Then someone in China could edit the same source code in Chinese OCaml and send change sets to Pakistan, where they would appear to have been made in the Urdu dialect of Standard ML. This would be an example of religion, which means joining back together.
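On abstract syntax, the translation described is nothing more than a lookup applied at identifier nodes; the keywords are constructors of the tree and need no translating at all. A hypothetical sketch:

```ocaml
type ast = Id of string | Let of string * ast * ast

(* Rename identifiers via a dictionary; unmatched names pass through.
   The tree structure -- the "keywords" -- is untouched. *)
let rec translate (dict : (string * string) list) : ast -> ast = function
  | Id x -> Id (Option.value ~default:x (List.assoc_opt x dict))
  | Let (x, e, body) ->
      Let (Option.value ~default:x (List.assoc_opt x dict),
           translate dict e, translate dict body)
```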
Our provisional answer has evolved a little:
We need user interfaces to allow us to express the intensional semantics of plugging information sources into channels.

It is the presence of that word intensional that is most crucial. The problem Alice has to solve in the extended example of a user interface we gave above would be trivial if systems were specified using only abstract syntax. But as it is now, well, just try to do what she has to do, and tell us how long it takes you. You can use any languages, libraries and tools you like. And then tell us how long you think it would take you to solve the same problem, with the same data, but after we've changed the concrete representation: i.e. the character sets, the file formats, the transmission medium, the communications protocols, the programming languages and tools, the measurement units and the coordinate systems. The answer would be "about the same time, or rather longer," I expect. But these two instances are essentially the same problem. If only we had an intensional abstract syntax representation of the algorithms, the application, the programming languages, the file formats and the protocols, we could change any of these things with half a dozen key-strokes. Then either problem could be solved in ten minutes, even allowing for the time she has to take out to deal with the phone call and to schedule the subsequent meeting. And once given the solution to one problem, she could solve the other effortlessly in ten seconds, because it would be just a matter of a few trivial identifier substitutions.
The answer to the question "What is a user interface supposed to do?" is:
A user interface is supposed to allow us to compose abstract syntax representations of intensional descriptions of operational semantics of arbitrary languages.

Or, if we put it in less abstract terms:
A user interface is supposed to allow us to compose abstract syntax representations of the semantics of arbitrary languages.

And because it is supposed to do this, and because arbitrary languages are just arbitrary abstract syntax:
A user interface uses abstract syntax to specify abstract syntax representing the semantics of arbitrary abstract syntaxes.

So a user interface allows one to create and edit abstract syntax according to arbitrary grammars. Now if one of those grammars were the grammar of a language for expressing context-free grammars, and if that were itself a context-free grammar, then it could be expressed in its own language, and it would be the only language one would need, because all others could be defined in that language.
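The closing idea can be made concrete: a datatype for context-free grammars, and, schematically, a grammar whose language is the notation for grammars itself, so that it can describe itself. A sketch in OCaml, abridged in that the lexical rules for names and literals are omitted:

```ocaml
type symbol  = T of string | N of string         (* terminal / non-terminal *)
type grammar = (string * symbol list list) list  (* name -> alternatives    *)

(* A toy grammar of grammars, expressed in its own terms. *)
let grammar_of_grammars : grammar = [
  "grammar", [ [ N "rule" ]; [ N "rule"; N "grammar" ] ];
  "rule",    [ [ N "name"; T "->"; N "alts" ] ];
  "alts",    [ [ N "symbols" ]; [ N "symbols"; T "|"; N "alts" ] ];
  "symbols", [ [ N "symbol" ]; [ N "symbol"; N "symbols" ] ];
  "symbol",  [ [ N "name" ]; [ T "\""; N "literal"; T "\"" ] ];
]
```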