The Guide to Hashing I Wish I Had When I Started


Introduction

Hashing is interesting. At first glance, it sounds complicated, but it is actually quite simple. At least the principle of it. If you break it down into smaller pieces, all it really is is a one-way function that takes an input and produces an output that always has the same size.

Of course, that is not ALL it is. There is a ton of stuff that goes into it. But the principle is simple. I recently took a deep dive into the subject to learn, and I figured I would write down a guide with all the things I learned - a guide that I wish I had when I started.

What is hashing?

I briefly mentioned it in the previous section, but let’s go a bit deeper into it. What does it mean to have a function that takes an input and produces a fixed-size output?

First of all, the input can be any data. A string, number, file, you name it. Why is that? Because all of them, on a lower level, are represented as binary data. 1s and 0s. What the function does is then pretty simple - it runs mathematical functions and produces another binary output. 1s and 0s in another order. However, the output is always the same size. Simple as that.

Hashing

Above is a very simple visualisation. We try to hash the string “hello” (which is just binary data). The function takes the input, and produces a fixed-size output. In this case, only 4 bits, a nibble. The hashing algorithms usually produce a much larger output, but this is just for the sake of simplicity.

This means that we could basically send in any data and still get a 4-bit output from our super simple hash algorithm.

Hashing with image

Even though an image, as in this example, is usually much larger than the string “hello”, and certainly larger than 4 bits, the hashing algorithm will still produce a fixed-size output. In this case, 4 bits.
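To make the 4-bit idea concrete, here is a minimal sketch of such a toy hash. The function name toyHash4 and the mixing formula are made up for this illustration and have nothing to do with how real hashing algorithms work internally. The only point is that the output is 4 bits no matter how big the input is.

```javascript
// Toy 4-bit "hash": folds every byte of the input into a single nibble.
// Hypothetical helper for illustration only, NOT a real or secure hash function.
function toyHash4(input) {
  const bytes = Buffer.from(input); // strings become UTF-8 bytes
  let state = 0;
  for (const byte of bytes) {
    // Mix in the high and low nibble of each byte, then keep only the lowest 4 bits
    state = (state * 7 + (byte >> 4) + (byte & 0x0f)) & 0x0f;
  }
  return state.toString(2).padStart(4, '0');
}

const small = toyHash4('hello');
// One megabyte of filler bytes, standing in for an image
const large = toyHash4(Buffer.alloc(1_000_000, 0xab));

console.log(small.length, large.length); // 4 4
```

With only 16 possible outputs, this toy collides all the time, which is exactly why real algorithms use much larger outputs.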

Why hashing?

Before diving deeper into the world of hashing, it is probably best to understand why we need hashing (this article is more focused on the technical part, so this will just be a short overview).

There are several reasons why we need hashing, and the most common ones are:

Data integrity

Data gets sent over the internet all the time. Hashing is used to verify that data has not been altered during transmission. A sort of safeguard that you get the correct data. We will look briefly at this further down.

Password storage

When you create an account on a website, your password should be hashed before it is stored in the database. This way, even if the database is compromised, the attacker gets the hashes rather than your actual password. This part will also be mentioned further down.

Digital signatures

Digital signatures are used to verify the authenticity of a message or document. Typically, the message is hashed, and the hash is then signed with a private key so the signature can be verified later. This is more part of cryptography, which I might or might not cover in a future article.

The principles of a hash function

Although the concept of hashing is somewhat simple, as illustrated above, it is important to understand that there are some principles that a hash function must follow. For example, it cannot just output 4 bits like the example above. I just didn’t want to create a diagram with too many bits.

Speed

Like most other things, speed is important for hashing. A hash function should reliably produce a hash in a short time, whether the input is large or small. For practical applications, this is important. If the hashing algorithm is slow, it will slow down the entire system. (Password hashing is the exception, as we will see further down.)

Determinism

The output of a hashing algorithm must always be the same for the same input. If I want to hash the same image 1000 times, the output should always be the same. This is what makes hashing useful.
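A quick way to see determinism for yourself is to hash the same input many times and collect the results in a Set. This is just a small sketch using Node's built-in crypto module. The helper name sha256Hex is my own.

```javascript
import { createHash } from 'crypto';

// Hypothetical helper: SHA-256 of the input as a hex string
const sha256Hex = (data) => createHash('sha256').update(data).digest('hex');

// Hash the same input 1000 times and keep only the distinct outputs
const outputs = new Set();
for (let i = 0; i < 1000; i++) {
  outputs.add(sha256Hex('hello'));
}

console.log(outputs.size); // 1
```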

Collision resistance

We have a lot of data we can hash. Actually more data than we have possible hashes. This means that there is a risk of a collision: two different inputs that are hashed to the same output. A good hashing function should minimize this risk, but it is impossible to remove it completely. We will talk more about the pigeonhole principle and the birthday paradox further down, which are related to this.

More technical stuff about hashing

Let’s dive a little deeper into the technical part. Things we should know about hashing. For example, hashing is a one-way function.

One-way function

It does exactly what it sounds like - it only works one way. Something that is hashed can never be unhashed (which is not even a word). There is no way to compute the original input from the hashed output.

One way hashing

This is good. It means that if someone gets access to the hashed output, they cannot simply reverse it to get the original input. This is what makes hashing useful for passwords, for example.
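One caveat is worth a small sketch. One-way means the function cannot be run backwards, but nothing stops someone from hashing a list of guesses and comparing the results. The word list below is made up for the example, and sha256Hex is my own helper name.

```javascript
import { createHash } from 'crypto';

// Hypothetical helper: SHA-256 of the input as a hex string
const sha256Hex = (data) => createHash('sha256').update(data).digest('hex');

// Pretend this hash leaked from somewhere (it is the SHA-256 of "hello")
const leakedHash = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824';

// Made-up list of guesses. Nothing is reversed, each guess is simply hashed and compared.
const guesses = ['password', 'letmein', 'hello', 'qwerty'];
const match = guesses.find((guess) => sha256Hex(guess) === leakedHash);

console.log(match); // hello
```

This only works for inputs that are easy to guess, which is one reason plain fast hashes are a poor fit for passwords.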

Avalanche effect

Another interesting property of hashing is the avalanche effect. It just means that a small change in the input will produce a big change in the output. That’s all it means.

For the sake of it, let’s compare the hashing of the very similar words hello and Hello. The only difference is the capital H. We’ll use the SHA-256 hashing algorithm for this.

Hello

185f8db32271fe25f561a6fc938b2e264306ec304eda518007d1764826381969

hello

2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

This is the avalanche effect. The hashed output is not at all similar. And there are good reasons for this.

Similar output would make it easier to “crack” the hash. The avalanche effect prevents predictability. Similar inputs should not produce similar hashes. This also enhances security, as it is harder to find patterns in the relation between the input and output.
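If you want to put a number on the avalanche effect, you can count how many of the 256 output bits differ between the two hashes. The sketch below does that with Node's crypto module. The helper countDifferingBits is my own and not part of any library. For a good hash function you would expect roughly half of the bits to differ.

```javascript
import { createHash } from 'crypto';

// Raw 32-byte SHA-256 digest as a Buffer
const sha256 = (data) => createHash('sha256').update(data).digest();

// Hypothetical helper: counts the bit positions where two equal-length buffers differ
function countDifferingBits(a, b) {
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i]; // XOR leaves a 1 wherever the two bytes disagree
    while (diff) {
      count += diff & 1;
      diff >>= 1;
    }
  }
  return count;
}

const differing = countDifferingBits(sha256('hello'), sha256('Hello'));
console.log(`${differing} of 256 bits differ`);
```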

Hash output length

The output of a hash is always the same size, and which size depends on the hashing algorithm. In my example above, I only used 4 bits as the output. Not very secure. But in reality, the output is much larger. The more common sizes are 128, 160, 256, and 512 bits. The most common one being 256 bits. 256 1s and 0s.

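A higher bit length generally means a more secure hash, as there are more possible outputs and collisions become less likely. But it can come at a cost: longer hashes take more space to store, and depending on the algorithm and hardware, they can be slower to compute.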
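The output sizes are easy to check yourself. The sketch below asks Node's crypto module for a few digests and prints their length in bits. It assumes your Node build has all four algorithms enabled, which may not be the case in locked-down environments.

```javascript
import { createHash } from 'crypto';

const algorithms = ['md5', 'sha1', 'sha256', 'sha512'];
const bitLengths = {};

for (const name of algorithms) {
  // digest() without an encoding returns a Buffer: bytes * 8 = bits
  bitLengths[name] = createHash(name).update('hello').digest().length * 8;
}

console.log(bitLengths);
// { md5: 128, sha1: 160, sha256: 256, sha512: 512 }
```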

Representation of the hash

As you probably saw in the example above, we don’t use 1s and 0s to represent the hash. Instead, we use hexadecimal or base64. The main reason for this is that it is easier to read. A 256-bit hash would be 256 1s and 0s in binary, and in hexadecimal, it is only 64 characters. In base64, it is even shorter: 44 characters, including padding.

Both formats are used, but the hexadecimal format is probably the most common one. It is also what most hashing tools print by default.

A short comparison of the SHA-256 hash of Hello in hexadecimal and base64:

Hexadecimal

185f8db32271fe25f561a6fc938b2e264306ec304eda518007d1764826381969

Base64

GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
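Both strings are just different spellings of the same 32 bytes. Here is a small sketch with Node's crypto module that produces both from one digest:

```javascript
import { createHash } from 'crypto';

// digest() without an encoding returns the raw bytes as a Buffer
const digest = createHash('sha256').update('Hello').digest();

const hex = digest.toString('hex');
const base64 = digest.toString('base64');

console.log(digest.length, hex.length, base64.length); // 32 64 44
```

Decoding either string gives back the same bytes, so which one you pick is mostly a matter of taste and what the receiving side expects.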

Pigeonhole principle and the birthday paradox

The pigeonhole principle is a simple concept. If you have 10 pigeons and 9 holes, at least one hole will have two or more pigeons. This is the same with hashing. If you have more inputs than outputs, there will be a collision. And theoretically, a collision is always possible as long as there is more than one input. We have an infinite amount of data to input, and a finite amount of hashes.

The birthday paradox is a similar concept, which explains why collisions might be more common than one might think. It states that if you have a room of 23 people, there is about a 50% chance that two of them share a birthday.

It sounds a bit strange that only 23 people in 365 days will lead to a 50% collision chance, but it is true. The reason for this is that the number of pairs increases rapidly with the number of people. With 23 people, there are already 253 different pairs that could share a birthday.
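The pigeonhole principle is easy to demonstrate if we shrink the output. The sketch below keeps only the first byte of a SHA-256 hash, which leaves 256 possible outputs, and then hashes 257 different inputs. The names tinyHash and findCollision are my own. With 257 pigeons and 256 holes, a collision has to show up.

```javascript
import { createHash } from 'crypto';

// Hypothetical tiny hash: only the first byte of SHA-256, so 256 possible outputs
const tinyHash = (data) => createHash('sha256').update(data).digest()[0];

// Hashes distinct inputs until two of them land on the same output
function findCollision(inputCount) {
  const seen = new Map(); // output -> first input that produced it
  for (let i = 0; i < inputCount; i++) {
    const input = `input-${i}`;
    const output = tinyHash(input);
    if (seen.has(output)) {
      return { first: seen.get(output), second: input, output };
    }
    seen.set(output, input);
  }
  return null;
}

// 257 distinct inputs cannot fit into 256 outputs without a collision
const collision = findCollision(257);
console.log(collision);
```

In practice the collision shows up long before input number 257.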

It is also worth mentioning that we are not talking about the chance that one person shares a birthday with another person in the room. We are talking about the chance that anyone in the room shares a birthday with someone else.

  • There is a low chance that one person shares a birthday with another person in the room.
  • There is a higher chance that anyone in the room shares a birthday with someone else.
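Both numbers can be worked out with a few lines of code. This is a sketch that assumes 365 equally likely birthdays and ignores leap years. The function name is my own.

```javascript
// Chance that at least two people in the room share a birthday
function sharedBirthdayProbability(people, days = 365) {
  let allDifferent = 1;
  for (let i = 0; i < people; i++) {
    // Person i must avoid the i birthdays that are already taken
    allDifferent *= (days - i) / days;
  }
  return 1 - allDifferent;
}

// Chance that at least one of the other 22 people shares YOUR birthday
const specificPerson = 1 - Math.pow(364 / 365, 22);

console.log(sharedBirthdayProbability(23).toFixed(3)); // 0.507
console.log(specificPerson.toFixed(3)); // about 0.059
```

The same math applies to hashes if you swap the 365 days for the number of possible hash values.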

Hashing vs. encryption

Hashing is not encryption. They are two different things, even though they are often confused.

Encryption is a two-way function. You can encrypt something, and you can decrypt it. Hashing is one-way, as mentioned earlier. Once hashed, there is no going back. Just wanted to clarify this, as it is a common misconception.
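A small sketch might make the difference clearer. It encrypts and decrypts a string with AES-256-GCM from Node's crypto module, and then hashes the same string. The choice of AES-256-GCM is mine, and the key and nonce are generated on the spot just for the example.

```javascript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from 'crypto';

// Encryption needs a key (and here a nonce), and whoever has the key can go back
const key = randomBytes(32); // 256-bit key for AES-256
const iv = randomBytes(12);  // 96-bit nonce for GCM

const cipher = createCipheriv('aes-256-gcm', key, iv);
const encrypted = Buffer.concat([cipher.update('hello', 'utf8'), cipher.final()]);
const authTag = cipher.getAuthTag();

const decipher = createDecipheriv('aes-256-gcm', key, iv);
decipher.setAuthTag(authTag);
const decrypted = Buffer.concat([decipher.update(encrypted), decipher.final()]).toString('utf8');

console.log(decrypted); // hello

// Hashing takes no key, and there is no matching function to go back
const digest = createHash('sha256').update('hello').digest('hex');
console.log(digest);
```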

Common hashing algorithms

There are a lot of hashing algorithms out there, and it is very likely that you have seen some of their names before. Let’s go through some of the most common ones.

MD5

MD5 is one of the oldest hashing algorithms, and it was widely used for a long time. It produces a 128-bit hash. Fast, but not secure. It is broken, and should not be used for anything that requires security. It is practical to craft two different inputs that produce the same output.

SHA-1

SHA-1 is a bit newer than MD5, and it produces a 160-bit hash. It is also broken, as collisions have been demonstrated in practice, and it should not be used for anything that requires security. It is still used in some places, but it is not recommended.

SHA-2

First one in the list that is considered secure!

Basically, it is a family of hashing algorithms, and the most common ones are SHA-256 and SHA-512. They produce 256-bit and 512-bit hashes, as their names suggest.

It is commonly used for digital signatures and data integrity checks.

Hashing in Node.js

Hashing can be done in most programming languages. Node.js uses the crypto module for hashing. It is built-in, so no need to install anything extra.

Hashing a string

Maybe the simplest example.

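import { createHash } from 'crypto';

// Create a SHA-256 hash object, feed it the input, and read the result as hex
const hash = createHash('sha256');
hash.update('hello');
const hashedString = hash.digest('hex');

console.log(hashedString);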

Hashing a file

Let’s take it a step further. Let’s hash a file instead.

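import { createHash } from 'crypto';
import { createReadStream } from 'fs';

const hash = createHash('sha256');
const fileStream = createReadStream('path/to/file.txt');

// Feed the file into the hash chunk by chunk, so it never has to fit in memory
fileStream.on('data', (chunk) => {
  hash.update(chunk);
});

// All chunks have been read: finalize and print the hex digest
fileStream.on('end', () => {
  const hashedFile = hash.digest('hex');
  console.log(hashedFile);
});

// A missing or unreadable file emits 'error', which would otherwise crash the process
fileStream.on('error', (error) => {
  console.error('Could not read file:', error.message);
});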

I don’t even know why I included this example, as it is very similar to the string example. But yeah, it is there now.

Hashing a password

Hashing a password is a bit different. You should always use a salt when doing it. We will also use another library, bcrypt, for this. It is a popular library for hashing passwords, and it is very easy to use.

It uses another algorithm we haven’t talked about yet, called bcrypt (surprise). It is deliberately much slower than SHA-256, and for passwords that is the point: every guess an attacker makes costs more time. The module also salts the password for you without you having to think about it.

Salting is the process of adding random data to the input before hashing it. The salt is not secret, but it is unique for each password. This makes it harder to crack the hashes, as the same password will produce different hashes with different salts, so two users with the same password do not end up with the same hash.

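import bcrypt from 'bcrypt';

// Despite the name, this is the cost factor: each increment doubles the work
const saltRounds = 10;
const password = 'password123';

// bcrypt generates a random salt and embeds it in the returned hash string
const hashedPassword = await bcrypt.hash(password, saltRounds);

// To check a login attempt later, compare the candidate against the stored hash
const isMatch = await bcrypt.compare(password, hashedPassword);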
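If you want to see what the salt actually does, here is a sketch that does the salting by hand. To keep it runnable without installing anything, it swaps bcrypt for scrypt, another password hashing algorithm that ships with Node's crypto module. The function names, the 16-byte salt, and the 32-byte output are my own choices for the example.

```javascript
import { randomBytes, scryptSync, timingSafeEqual } from 'crypto';

// Hypothetical helper: hashes a password with a fresh random salt
function hashPassword(password) {
  const salt = randomBytes(16); // not secret, but unique per password
  const hash = scryptSync(password, salt, 32);
  // The salt is stored next to the hash, since it is needed again for verification
  return { salt: salt.toString('hex'), hash: hash.toString('hex') };
}

// Hypothetical helper: re-hashes the candidate with the stored salt and compares
function verifyPassword(password, stored) {
  const candidate = scryptSync(password, Buffer.from(stored.salt, 'hex'), 32);
  return timingSafeEqual(candidate, Buffer.from(stored.hash, 'hex'));
}

const first = hashPassword('password123');
const second = hashPassword('password123');

console.log(first.hash === second.hash);           // false, the salts differ
console.log(verifyPassword('password123', first)); // true
console.log(verifyPassword('wrong', first));       // false
```

bcrypt does the same bookkeeping for you and packs the salt into the hash string it returns.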

File integrity check

Hashing is also used for file integrity checks. You can hash a file, and then compare the hash with the original hash to see if the file has been altered.

Whenever you hash a file, the hash is often called a checksum. It is used to verify that the file has not been altered during transmission.

import { createHash } from 'crypto';
import { readFile } from 'fs/promises';

const hash = createHash('sha256');
// For the example, let's read the whole file into memory
const fileBuffer = await readFile('path/to/file.txt');

hash.update(fileBuffer);
const checksum = hash.digest('hex');

When it is time to verify the file, you can just hash it again and compare the hashes.

import { createHash } from 'crypto';
import { readFile } from 'fs/promises';

const checksum = 'expected-checksum';

const hash = createHash('sha256');
const fileReceived = await readFile('path/to/file.txt');

hash.update(fileReceived);
const checksumReceived = hash.digest('hex');

if (checksum === checksumReceived) {
  console.log('File is intact');
} else {
  console.log('File has been altered');
}
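For larger files, reading everything into memory is not ideal. The two ideas from earlier can be combined into a small helper that streams the file and compares the result with the expected checksum. This is only a sketch, and the function names sha256File and verifyFile are my own.

```javascript
import { createHash } from 'crypto';
import { createReadStream } from 'fs';

// Hypothetical helper: streams a file through SHA-256 and resolves to the hex digest
async function sha256File(path) {
  const hash = createHash('sha256');
  // Read streams are async iterable, so the file is processed chunk by chunk
  for await (const chunk of createReadStream(path)) {
    hash.update(chunk);
  }
  return hash.digest('hex');
}

// Hypothetical helper: true if the file matches the expected hex checksum
async function verifyFile(path, expectedChecksum) {
  const actual = await sha256File(path);
  // Hex checksums are sometimes published in uppercase, so compare case-insensitively
  return actual === expectedChecksum.toLowerCase();
}

// Usage, with the same placeholder values as above:
// const intact = await verifyFile('path/to/file.txt', 'expected-checksum');
```

If the file cannot be read, the promise rejects, so wrap the call in try/catch where that matters.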

Conclusion

Interesting that I started this article stating that hashing is simple, and then spent a bit over 2000 words explaining a TINY bit of it.

Hopefully, you learned something new. I learned a lot while writing this article. I am by no means a hashing expert, so feel free to correct me if I got something wrong.

I am thinking about writing a follow-up article about cryptography, as it is a related subject. If you are interested in that, let me know on Twitter.

Thanks for reading!