The Elements of Computing Systems: Building a Modern Computer from First Principles
by Noam Nisan
Notes on Machine LanguageA computer can be described constructively, by laying out its hardware platform and explaining how it is built from low-level chips. A computer can also be described abstractly, by specifying and demonstrating its machine language capabilities.
A machine language is an agreed-upon formalism, designed to code low-level programs as series of machine instructions. Using these instructions, the programmer can command the processor to perform arithmetic and logic operations, fetch and store values from and to the memory, move values from one register to another, test Boolean conditions, and so on. As opposed to high-level languages, whose basic design goals are generality and power of expression, the goal of machine language’s design is direct execution in, and total control of, a given hardware platform.
Machine language is the most profound interface in the overall computer enterprise—the fine line where hardware and software meet.
In fact, just as we say that the machine language is designed to exploit a given hardware platform, we can say that the hardware platform is designed to fetch, interpret, and execute instructions written in the given machine language.
To give a general description of machine languages, it is sufficient to focus on three main abstractions only: a processor, a memory, and a set of registers.
The term memory refers loosely to the collection of hardware devices that store data and instructions in a computer. From the programmer’s standpoint, all memories have the same structure: A continuous array of cells of some fixed width, also called words or locations, each having a unique address. Hence, an individual word (representing either a data item or an instruction) is specified by supplying its address.
The processor, normally called Central Processing Unit or CPU, is a device capable of performing a fixed set of elementary operations. These typically include arithmetic and logic operations, memory access operations, and control (also called branching) operations. The operands of these operations are binary values that come from registers and selected memory locations. Likewise, the results of the operations (the processor’s output) can be stored either in registers or in selected memory locations.
Memory access is a relatively slow operation, requiring long instruction formats (an address may require 32 bits). For this reason, most processors are equipped with several registers, each capable of holding a single value. Located in the processor’s immediate proximity, the registers serve as a high-speed local memory, allowing the processor to manipulate data and instructions quickly.
LanguagesA machine language program is a series of coded instructions. For example, a typical instruction in a 16-bit computer may be 1010 0011 0001 1001. In order to figure out what this instruction means, we have to know the instruction set of the underlying hardware platform. For example, the language may be such that each instruction consists of four 4-bit fields: The left-most field codes a CPU operation, and the remaining three fields represent the operation’s operands. Thus the previous command may code the operation set R3 to R1 + R9, depending of course on the hardware specification and the machine language syntax.
Since binary codes are rather cryptic, machine languages are normally specified using both binary codes and symbolic mnemonics (a mnemonic is a symbolic label whose name hints at what it stands for - in our case hardware elements and binary operations). For example, the language designer can decide that the operation code 1010 will be represented by the mnemonic add and that the registers of the machine will be symbolically referred to using the symbols R0, R1, R2, and so forth. Using these conventions, one can specify machine language instructions either directly, as 1010001100011001, or symbolically, as, say, ADD R3,R1,R9.
We can use a text processing program to parse the symbolic commands into their underlying fields (mnemonics and operands), translate each field into its equivalent binary representation, and assemble the resulting codes into binary machine instructions. The symbolic notation is called assembly language, or simply assembly, and the program that translates from assembly to binary is called assembler.
Since different computers vary in terms of CPU operations, number and type of registers, and assembly syntax rules, there is a Tower of Babel of machine languages, each with its own obscure syntax. Yet irrespective of this variety, all machine languages support similar sets of generic commands.
Hack Machine Language SpecificationThe Hack computer is a 16-bit machine, consisting of a CPU, two separate memory modules serving as instruction memory and data memory, and two memory-mapped I/O devices: a screen and a keyboard.
Memory Address Spaces
The Hack programmer is aware of two distinct address spaces: an instruction memory and a data memory. Both memories are 16-bit wide and have a 15-bit address space, meaning that the maximum addressable size of each memory is 32K 16-bit words.
The CPU can only execute programs that reside in the instruction memory. The instruction memory is a read-only device. The instruction memory can be implemented in a ROM chip that is pre-burned with the required program. Loading a new program is done by replacing the entire ROM chip, similar to replacing a cartridge in a game console.
The Hack programmer is aware of two 16-bit registers called D and A. These registers can be manipulated explicitly by arithmetic and logical instructions like A=D-1 or D=!A (where "!" means a 16-bit Not operation). While D is used solely to store data values, A doubles as both a data register and an address register. That is to say, depending on the instruction context, the contents of A can be interpreted either as a data value, or as an address in the data memory, or as an address in the instruction memory.
First, the A register can be used to facilitate direct access to the data memory. Since Hack instructions are 16-bit wide, and since addresses are specified using 15 bits, it is impossible to pack both an operation code and an address in one instruction. Thus, the syntax of the Hack language mandates that memory access instructions operate on an implicit memory location labeled "M", for example, D=M+1. In order to resolve this address, the convention is that M always refers to the memory word whose address is the current value of the A register.
For example, if we want to effect the operation D = Memory-1, we have to use one instruction to set the A register to 516, and a subsequent instruction to specify D=M-1.
In addition, the hardworking A register is also used to facilitate direct access to the instruction memory. Similar to the memory access convention, Hack jump instructions do not specify a particular address. Instead, the convention is that any jump operation always effects a jump to the instruction located in the memory word addressed by A. Thus, if we want to effect the operation goto 35, we use one in- struction to set A to 35, and a second instruction to code a goto command, without specifying an address. This sequence causes the computer to fetch the instruction located in InstructionMemory in the next clock cycle.
The only non-obvious command in the language is @value, where value is either a number or a symbol representing a number. This command simply stores the specified value in the A register. For example, if sum refers to memory location 17, then both @17 and @sum will have the same effect: A<--17.
In particular, note that every operation involving a memory location requires two Hack commands: One for selecting the address on which we want to operate, and one for specifying the desired operation.
The Hack language consists of two generic instructions: an address instruction, also called A-instruction, and a compute instruction, also called C-instruction. Each instruction has a binary representation, a symbolic representation, and an effect on the computer.
The A-instruction is used to set the A register to a 15-bit value:
A-instruction: @value // Where value is either a non-negative decimal number or a symbol referring to such number.
This instruction causes the computer to store the specified value in the A register. For example, the instruction @5, which is equivalent to 0000000000000101, causes the computer to store the binary representation of 5 in the A register.
The A-instruction is used for three different purposes.
First, it provides the only way to enter a constant into the computer under program control.
Second, it sets the stage for a subsequent C-instruction designed to manipulate a certain data memory location, by first setting A to the address of that location.
Third, it sets the stage for a subsequent C-instruction that specifies a jump, by first loading the address of the jump destination to the A register.
The instruction code is a specification that answers three questions:
(a) what to compute,
(b) where to store the computed value,
(c) what to do next?
Along with the A-instruction, these specifications determine all the possible operations of the computer.
C-instruction: dest = comp;jump // Either the dest or jump fields may be empty. If dest is empty, the "=" is omitted; If jump is empty, the ";" is omitted.
The leftmost bit is the C-instruction code, which is 1. The next two bits are not used.
The dest field instructs where to store the computed value (ALU output).
The comp field instructs the ALU what to compute.
The jump field specifies a jump condition, namely, which command to fetch and execute next.
The Computation Specification
The Hack ALU is designed to compute a fixed set of functions on the D, A, and M registers (where M stands for Memory[A]).
This pattern can potentially code 128 different functions, of which only the 28 listed in the language specification.
The Destination Specification
The value computed by the comp part of the C-instruction can be stored in several destinations, as specified by the instruction’s 3-bit. The first and second d-bits code whether to store the computed value in the A register and in the D register, respectively. The third d-bit codes whether to store the computed value in M (i.e., in Memory[A]). One, more than one, or none of these bits may be asserted.
The Jump Specification
The jump field of the C-instruction tells the computer what to do next.
There are two possibilities: The computer should either fetch and execute the next instruction in the program, which is the default, or it should fetch and exe- cute an instruction located elsewhere in the program. In the latter case, we assume that the A register has been previously set to the address to which we have to jump.
The last instruction (0;JMP) effects an unconditional jump. Since the C-instruction syntax requires that we always effect some computation, we instruct the ALU to compute 0 (an arbitrary choice), which is ignored.
Assembly commands can refer to memory locations (addresses) using either constants or symbols.
Predefined symbols: A special subset of RAM addresses can be referred to by any assembly program using the following predefined symbols:
- Virtual registers: To simplify assembly programming, the symbols R0 to R15 are predefined to refer to RAM addresses 0 to 15, respectively.
- Predefined pointers: The symbols SP, LCL, ARG, THIS, and THAT are predefined to refer to RAM addresses 0 to 4, respectively. Note that each of these memory locations has two labels. For example, address 2 can be referred to using either R2 or ARG.
- I/O pointers: The symbols SCREEN and KBD are predefined to refer to RAM addresses 16384 (0x4000) and 24576 (0x6000), respectively, which are the base addresses of the screen and keyboard memory maps.
Variable symbols: Any user-defined symbol XXX appearing in an assembly program that is not predefined and is not defined elsewhere using the "(XXX)" com-mand is treated as a variable, and is assigned a unique memory address by the assembler, starting at RAM address 16 (0x0010).
The Hack platform can be connected to two peripheral devices: a screen and a keyboard. Both devices interact with the computer platform through memory maps. This means that drawing pixels on the screen is achieved by writing binary values into a memory segment associated with the screen. Likewise, listening to the keyboard is done by reading a memory location associated with the keyboard. The physical I/O devices and their memory maps are synchronized via continuous refresh loops.
The Hack computer includes a black-and-white screen organized as 256 rows of 512 pixels per row. The screen’s contents are represented by an 8K memory map that starts at RAM address 16384 (0x4000). Each row in the physical screen, starting at the screen’s top left corner, is represented in the RAM by 32 consecutive 16-bit words. To write or read a pixel of the physical screen, one reads or writes the corresponding bit in the RAM-resident memory map (1 1⁄4 black, 0 1⁄4 white).
The Hack computer interfaces with the physical keyboard via a single-word memory map located in RAM address 24576 (0x6000). Whenever a key is pressed on the physical keyboard, its 16-bit ASCII code appears in RAM. When no key is pressed, the code 0 appears in this location. In addition to the usual ASCII codes, the Hack keyboard recognizes the keys shown: