A lot of technologies, business choices, and public policies gave us the internet we have today—a tremendous boost to the spread of education, culture, and commerce, despite its well-documented flaws. But few people credit two deeply buried technologies for making the internet possible: hashing and cryptography.

If more people understood the role these technologies play, more money and expertise would go toward uncovering and repairing security flaws. For instance, we probably would have fixed the Heartbleed programming error much earlier and avoided widespread vulnerabilities in encrypted traffic.

This article briefly explains where hashing and cryptography come from, how they accomplish what they do, and their indelible effect on the modern internet.

Hashing

Hashing was invented in the 1950s at the world's pre-eminent computer firm of that era, IBM, by Hans Peter Luhn. What concerned him at the time was not security—how many computer scientists thought about that?—but saving disk space and memory, the most costly parts of computing back then.

A hash is a way of reducing each item of data to a small, nearly unique, semi-random string of bits. For instance, if you are storing people's names, you could turn each name into the numerical value of the characters and run a set of adds, multiplies, and shift instructions to produce a 16-bit value. If the hash is good, there will be very few names that produce the same 16-bit value—very few collisions, as that situation is called.
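Here is a minimal sketch of that idea in Python. The function name and the particular constants (31 and the shift by 7) are my own illustrative choices, not a historical algorithm and not something you should use for security.

```python
def name_hash16(name: str) -> int:
    """Reduce a name to a 16-bit value using adds, multiplies, and shifts."""
    h = 0
    for ch in name:
        # Multiply the running value, add the character's numerical value,
        # and keep only the low 16 bits.
        h = (h * 31 + ord(ch)) & 0xFFFF
        # Shift and XOR so that the high bits influence the low bits.
        h ^= h >> 7
    return h


for name in ["Ada Lovelace", "Alan Turing", "Grace Hopper"]:
    print(f"{name:14} -> {name_hash16(name):#06x}")
```

With only 65,536 possible values, collisions become likely once you store more than a few hundred names, which is one reason real hash functions produce much longer outputs.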

Now suppose you want to index a database for faster searching. Instead of indexing the names directly, it's much simpler and more efficient to make the index out of 16-bit values. That was one of the original uses for hashes. But carefully designed hash functions with much longer outputs, known as cryptographic hashes, turned out to have two properties that make them valuable for security: No one can produce the original value from the hash, and no one can substitute a different value that produces the same hash. (It is theoretically possible to do either of those things, but doing so would be computationally infeasible, so they're impossible in practice.)
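To make the indexing use concrete, here is a rough sketch of a lookup table keyed by 16-bit hashes. The helper names (hash16, build_index, lookup) and the sample data are hypothetical, and a real database index is far more elaborate.

```python
from collections import defaultdict


def hash16(text: str) -> int:
    """Toy 16-bit hash (illustrative only)."""
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) & 0xFFFF
    return h


def build_index(names):
    """Map each 16-bit hash to the positions of the names that produce it."""
    index = defaultdict(list)
    for position, name in enumerate(names):
        index[hash16(name)].append(position)
    return index


def lookup(names, index, wanted):
    # Two different names can collide, so compare the real names
    # inside the bucket before reporting a match.
    return [i for i in index.get(hash16(wanted), []) if names[i] == wanted]


names = ["Grace Hopper", "Alan Turing", "Ada Lovelace", "Alan Turing"]
index = build_index(names)
print(lookup(names, index, "Alan Turing"))  # [1, 3]
```

The search touches only the names that share a hash with the one you want, instead of scanning the whole list.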

Early Unix systems made use of this property to preserve password security. You created a password along with your user account and gave it to the computer, but the operating system never stored the password itself; it stored only a hash. Every time you entered your password after that, the operating system ran the hash function and let you log in if the resulting hash matched the one in the system. If the password file were snatched up by a malicious intruder, all they would get is a collection of useless hashes. (This clever use of plain hashes eventually turned out not to be secure enough, because an attacker can hash long lists of likely passwords and compare the results. Modern systems therefore add a random salt to each password and use deliberately slow hash functions.)
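The login check described above can be sketched with Python's standard library. This is an illustration under my own assumptions, not how any particular Unix system did it: the salt, the PBKDF2 function, and the iteration count are choices I made for the example, and the function names are hypothetical.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune for real deployments


def make_record(password: str):
    """Return (salt, digest). The password itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def check_password(attempt: str, record) -> bool:
    """Hash the attempt the same way and compare it with the stored digest."""
    salt, stored = record
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode(), salt, ITERATIONS)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(candidate, stored)


record = make_record("correct horse battery staple")
print(check_password("correct horse battery staple", record))  # True
print(check_password("a wrong guess", record))                 # False
```

Anyone who steals the record gets a salt and a digest, neither of which reveals the password directly.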

Hashes are also good for ensuring that no one has tampered with a document or software program. Injecting malware into free software on popular repositories is not just a theoretical possibility—it can actually happen. Therefore, every time a free software project releases code, the team runs it through a hash function. Every user who downloads the software can run it through the same function to make sure nobody has intercepted the code and inserted malware. If someone changed even one bit and ran the hash function, the resulting hash would be totally different.
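A small sketch shows the effect. I'm assuming SHA-256 as the hash function and using a one-line program as a stand-in for a real release; projects differ in which function they publish.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


release = b"print('hello, world')\n"   # stand-in for a released file
published = sha256_hex(release)        # the hash the project would publish

# Simulate tampering by flipping a single bit in the first byte.
tampered = bytearray(release)
tampered[0] ^= 0x01
tampered = bytes(tampered)

print("published:", published)
print("tampered: ", sha256_hex(tampered))
print("genuine download matches: ", sha256_hex(release) == published)
print("tampered download matches:", sha256_hex(tampered) == published)
```

The two printed hashes have nothing visibly in common, even though the inputs differ by one bit.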

Git is another of the myriad tools that use hashes to ensure the integrity of the repository, as well as to enable quick checks on changes to the repository. You can see a hash (a string of seemingly random hexadecimal characters) each time you issue a push or log command:

commit 2de089ad3f397e735a45dda3d52d51ca56d8f19a
Author: Andy Oram <andyo@example.com>
Date:   Sat Sep 3 16:28:41 2022 -0400

    New material related to commercialization of cryptography.

commit f39e7c87873a22e3bb81884c8b0eeeea07fdab48
Author: Andy Oram <andyo@example.com>
Date:   Fri Sep 2 07:47:42 2022 -0400

    Fixed typos.
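Those identifiers are not random at all. In a repository using Git's default SHA-1 object format, each one is the SHA-1 hash of a short header followed by the object's content. The sketch below reproduces the ID Git assigns to a file (a "blob"); the function name is my own, and commit IDs are computed the same way over the text of the commit object.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives a file's content in a SHA-1 repository."""
    header = b"blob %d\x00" % len(content)   # object type, size, NUL byte
    return hashlib.sha1(header + content).hexdigest()


# Same value that `echo hello | git hash-object --stdin` prints.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the ID depends on every byte of the content, two repositories that show the same hash hold the same object.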

Hash functions can be broken, as MD5 and SHA-1 were, so new ones are constantly being invented to replace the functions that are no longer safe.

Cryptography

Mathematically speaking, the goal of cryptography has always been to produce output in which each bit or character is equally likely to stand for any possible value. If someone intercepted a message and saw the string "xkowpvi," the "x" would have an equal chance of representing an A, a B, a C, and so on.

In digital terms, every bit in an encrypted message has a 50% chance of representing a 0 and a 50% chance of representing a 1.

This goal is related to hashing, and there is a lot of overlap between the fields. Security experts came up with several good ways to create encrypted messages that couldn't be broken—that is, where the decryption process would be computationally infeasible without knowing the secret key used to encrypt the message. But for a long time these methods suffered from an "initial exchange" problem: The person receiving the message needed to somehow also learn what that secret encryption key was, and learn it in a way that didn't reveal the key to anybody else. Whether you're a spy in World War II Berlin trying to communicate with your U.S. buddies, or a modern retail site trying to confirm a customer's credit card online, getting the shared secret securely is a headache.
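To see both the shared-secret idea and the initial exchange problem in a few lines, here is a sketch of a one-time pad, the simplest shared-key cipher. I picked it for brevity; it is not one of the methods the internet actually uses, and the message and function name are made up for the example.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each message byte with the matching key byte."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"Meet at the usual place at noon."
key = secrets.token_bytes(len(message))   # the shared secret

ciphertext = xor_bytes(message, key)      # the sender encrypts
recovered = xor_bytes(ciphertext, key)    # the receiver decrypts with the same key

ones = sum(bin(b).count("1") for b in ciphertext)
print(ciphertext.hex())
print(f"{ones} of {len(ciphertext) * 8} ciphertext bits are 1")
print(recovered == message)  # True
```

The count of 1 bits hovers around half, as described above. The catch is the key: it is as long as the message, and the receiver has to obtain it secretly before anything can be read.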

The solution is by now fairly familiar: You create a pair of keys, one of which you keep private and the other of which you can share freely. Like a hash, the public key is opaque, and no one can determine your private key from it. (The number of bits in the key has to be doubled every decade or so as computers get more powerful.) This solution is generally attributed to Whitfield Diffie, Martin Hellman, and Ralph Merkle, although a British intelligence agent thought of the solution earlier and kept it secret.
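The textbook Diffie-Hellman key exchange shows how a private value and a public value work together. The sketch below uses toy numbers (a 23-element group) so the arithmetic is easy to follow; real systems use numbers hundreds of digits long, and the names Alice and Bob and the helper functions are the usual illustrative conventions rather than anything from a specific library.

```python
import secrets

P = 23   # public prime modulus (toy size; real ones are 2048 bits or more)
G = 5    # public generator


def make_keypair():
    """Pick a private number and derive the public value to share."""
    private = secrets.randbelow(P - 3) + 2   # a value from 2 to P - 2
    public = pow(G, private, P)
    return private, public


def shared_secret(my_private: int, their_public: int) -> int:
    """Combine my private key with the other party's public key."""
    return pow(their_public, my_private, P)


alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

# Only the public values cross the network.
alice_view = shared_secret(alice_private, bob_public)
bob_view = shared_secret(bob_private, alice_public)
print(alice_view == bob_view)  # True
```

Both sides arrive at the same secret without ever sending it. An eavesdropper who sees only the public values would have to solve a discrete logarithm problem, which is infeasible at realistic key sizes.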

Diffie in particular was acutely conscious of social and political reasons for developing public key encryption. In the 1970s, I think that few people thought of doing online retail sales or services using encryption. It was considered a tool of spies and criminals—but also of political dissidents and muckraking journalists. These associations explain why the U.S. government tried to suppress it, or at least keep it from being exported, for decades.

Diffie is still quite active in the field. The most recent article I've seen with him listed as an author was published on July 18, 2022.

The linchpin of internet cryptography came shortly afterward with RSA encryption, invented by Ron Rivest, Adi Shamir, and Len Adleman. RSA encryption lets two parties communicate without first agreeing on a secret key: The sender needs only the recipient's public key. (The inventors were kept from reaping much profit from this historic discovery because the U.S. government restricted the export of RSA technology during most of the life of their patent.)
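The arithmetic behind RSA fits in a few lines when the numbers are tiny. This is a classroom-sized sketch, with primes I chose for readability and no padding, so it is nowhere near secure; the variable and function names are my own. It needs Python 3.8 or later for the modular inverse.

```python
# Key generation with toy primes (real keys use primes of 1024 bits or more).
p, q = 61, 53
n = p * q                  # 3233, part of both keys
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent, 2753


def encrypt(m: int) -> int:
    """Anyone holding the public key (e, n) can do this."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Only the holder of the private exponent d can do this."""
    return pow(c, d, n)


message = 65
ciphertext = encrypt(message)
print(ciphertext)
print(decrypt(ciphertext))  # 65
```

The security rests on the difficulty of recovering p and q from n. With a four-digit n that takes no effort; with a 2048-bit n, nobody knows how to do it in a reasonable time.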

A big problem in key exchange remains: If someone contacts you and says they are Andy Oram, proffering what they claim to be Andy Oram's public key, how do you know they're really me? The two main solutions (web of trust and certificate authorities) are beyond the scope of this article, and each has vulnerabilities and a lot of overhead. Nevertheless, the internet seems to work well enough with certificate authorities.

The internet runs on hashes and cryptography

The internet essentially consists of huge computer farms in data centers, to which administrators and other users have to log in. For many years, the universal way to log into another system was Telnet, now abandoned almost completely because it's insecure. If you use Telnet, someone down the hall can watch your password cross the local network and steal it. Anyone else who can monitor the network could do the same.

Nowadays, nearly all remote logins and command-line administration go over the secure shell protocol (SSH), which was invented as recently as 1995. All the cloud computing and other data center administration done nowadays depends on it.

Interestingly, 1995 also saw the advent of the secure sockets layer (SSL) protocol, which marks the beginning of web security. Now upgraded to Transport Layer Security (TLS), this protocol is used whenever you enter a URL beginning with HTTPS instead of HTTP. The protocol is so important that Google penalizes web sites that use unencrypted HTTP.
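From a programmer's point of view, TLS is mostly a wrapper around an ordinary network connection. The sketch below uses Python's standard ssl module; the function name is hypothetical, and I define it without calling it so that the example runs without network access.

```python
import socket
import ssl

# The default context verifies the server's certificate and host name.
context = ssl.create_default_context()


def tls_version(host: str, port: int = 443) -> str:
    """Open a TLS connection and report the negotiated protocol version."""
    with socket.create_connection((host, port), timeout=10) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

Calling tls_version with the name of any HTTPS site should return a string such as "TLSv1.3", depending on what the server supports.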

Because most APIs now use web protocols, TLS also protects distributed applications. In addition to SSH and TLS, encryption can be found everywhere modern computer systems or devices communicate. That's because the modern internet is beset with attackers, and we use hashes and encryption to minimize their harm.

Some observers think that quantum computing will soon have the power to break encryption as we know it. That could leave us in a scary world: Everything we send over the wire would be available to governments or large companies possessing quantum computers, which are hulking beasts that need to be refrigerated to within a few degrees of absolute zero. We may soon need a new army of Luhns, Diffies, and other security experts to find a way to save the internet as we know it.