# Colliding Secure Hashes

*By David Buchanan, 18th December 2023*

At this point, it's well known that hash functions like MD5 and SHA1 have cryptographic weaknesses, allowing collisions (a pair of distinct inputs that produce identical outputs) to be engineered using much less computation than the designers intended.

But what if we take a nominally secure hash like SHA-256? Can we attack that, without exploiting any structural weaknesses in the hash function itself?

The short and boring answer is "no, SHA-256 is secure." But what if we consider a weakened version? How much weaker do we have to make it before we *can* generate collisions? There's more than one way to weaken a hash, but in this article we'll consider truncation: throwing away some portion of the hash digits.

### Truncation

Hash truncation is common in practice, and usually done for the sake of brevity. SHA256 is a 256-bit hash, i.e., 32 bytes, or 64 hexadecimal digits, which is quite long. Here are some real-world examples of hash truncation:

- Checking the hash of a downloaded executable "by eye" - if you're in a hurry, you might only check the first or last handful of digits.
- Referencing a SHA-1 git commit by its first 7 hex digits (hopefully not in any security-critical contexts though!).
- PGP public keys are often referenced by the last 64 bits of their fingerprint, or even just 32.
- The HOTP algorithm, aka RFC 4226.
- Some TLS cipher suites used truncated hashes improperly, leading to SLOTH attacks: "Security Losses from Obsolete and Truncated Transcript Hashes."
- Bluesky user identifiers, aka did:plc.
- Many, many more applications and protocols.

By the way, truncation can *increase* the security of a hash in certain contexts, by making it less vulnerable to length extension attacks.

### Upper and Lower Bounds

Going back to the question: what's the smallest safe hash length?

It's hard to answer that question concretely, but anything at least 224 bits long is probably fine. 224 bits is the output length of the shortest NIST-approved hash function, SHA-224, which we can reasonably conclude to be safe.

Conversely, I can say that anything less than 96 bits is *definitely not fine*.

I know this because I produced a 96-bit truncated-SHA256 collision for myself, using only my desktop computer. Take a look:

```python
import hashlib

def sha256_96(data: bytes) -> bytes:
    h = hashlib.sha256(data).digest()
    return h[:6] + h[-6:]  # throw away some of the middle bits

print(sha256_96(b"retr0id_662d970782071aa7a038dce6").hex())
print(sha256_96(b"retr0id_430d19a6c51814d895666635").hex())
# 307e0e71a4099d0564e6ea9a
# 307e0e71a4099d0564e6ea9a
```

Or to visually compare the full-length hash outputs:

In the rest of this article I'll be explaining how I did that, and finally, using my findings to estimate the cost of breaking longer hashes.

You might be wondering, "hey, some of those examples you gave earlier use fewer than 96 bits, does that mean they're unsafe?" It Depends™. Some aren't security-sensitive in the first place, some don't actually care about collisions, some have mitigating factors that make it alright overall, and some are indeed unsafe.

### Naive Bruteforce

We need to walk before we can run. Let's break a 32-bit hash function.

```python
import hashlib

def sha256_32(data: bytes) -> bytes:
    h = hashlib.sha256(data).digest()
    return h[:4]  # keep only the first 32 bits (4 bytes)

target = sha256_32(b"hello")
print(f'sha256_32(b"hello") => {target.hex()}')

i = 0
while True:
    attempt = b"hello " + str(i).encode()
    if sha256_32(attempt) == target:
        print(f'sha256_32({attempt}) => {sha256_32(attempt).hex()}')
        break
    i += 1
```

This is the simplest possible approach. We just keep trying new inputs in a loop, until one of them matches our target hash. The expected number of iterations is ${2}^{32}$ (this is not strictly correct; the precise statistics involve an expression with $e$ in it, but I won't bore you or myself with the specifics). We can generalise this to ${2}^{n}$ for an arbitrary $n$-bit hash.

Here's the result:

```
$ time python3 naive_brute.py
sha256_32(b'hello') => 2cf24dba
sha256_32(b'hello 2972475304') => 2cf24dba

real    23m43.392s
user    23m30.633s
sys     0m0.691s
```

That took a while! It took almost 3 billion iterations to get there, which is fairly close to the expected number. If we were unlucky, it could've taken even longer.

We could make this code more efficient by writing it in C, parallelising it, running it on a GPU, etc. etc., but it still fundamentally requires work in the order of ${2}^{n}$ operations. This would not scale well for attacking longer hashes.
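To feel that ${2}^{n}$ scaling from the other direction, we can shrink the truncation instead. With a hypothetical `sha256_16` (the same shape as `sha256_32` above, keeping only 16 bits), the identical brute-force loop finishes in a fraction of a second:

```python
import hashlib

def sha256_16(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:2]  # keep only the first 16 bits

target = sha256_16(b"hello")
i = 0
while sha256_16(b"hello " + str(i).encode()) != target:
    i += 1

# expect on the order of 2^16 (~65k) iterations, versus ~2^32 for 32 bits
print(f"second preimage found after {i} iterations")
```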

This is an example of a "Second Preimage" attack, and without trying to exploit internal mathematical features of the hash function, or using quantum computers that may or may not exist, ${2}^{n}$ is the best we can do.

A second preimage attack is when you're trying to find a new message that gives the same hash as that of a certain previous message, in this case the string "hello".

This is a special type of collision, but it's not the only type of collision. If we drop the constraint of matching a specific previous message and instead look for *any pair* of messages that collide, we can conduct our search more efficiently.

### Naive Birthday Bruteforce

Due to the Birthday Paradox, we can perform a Birthday Attack, which means we can expect to find a collision in only $O({2}^{n/2})$ steps.

```python
lookup = {}
i = 0
while True:
    attempt = b'hello ' + str(i).encode()
    h = sha256_32(attempt)
    if h in lookup:
        print(f"sha256_32({attempt}) => {h.hex()}")
        print(f"sha256_32({lookup[h]}) => {sha256_32(lookup[h]).hex()}")
        break
    lookup[h] = attempt
    i += 1
```

We compute hashes in a loop and record the (input, hash) pairs in a lookup table. Our table is actually a Python dict, indexed by hash, giving us O(1) insertions and O(1) membership checks. Before we add a new hash to the table, we check if it already exists. If it does, we've just found a collision!

With our previous approach, each hash we calculated had a $1$-in-${2}^{32}$ chance of colliding. Now that we're keeping a lookup table, we can collide with any previous result, so the $n$th hash has an $n$-in-${2}^{32}$ chance of colliding. Perhaps you can intuit how that leads to an overall complexity of $O({2}^{n/2})$. There's a constant factor (about $\sqrt{\pi /2}$), but it can be ignored.

If that *isn't* intuitive, don't worry too much, as long as you can see that it should be faster than before.
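For the curious, that constant factor is small: the expected number of hashes before the first collision among uniformly random $n$-bit outputs is approximately $\sqrt{(\pi /2)\cdot {2}^{n}}$. A quick sanity check of that formula (the helper name is mine):

```python
import math

def expected_draws(n_bits: int) -> float:
    # birthday-bound approximation: sqrt(pi/2 * 2^n) hashes on average
    # before the first collision among random n-bit values
    return math.sqrt(math.pi / 2 * 2 ** n_bits)

print(round(expected_draws(32)))  # ~82,000 for a 32-bit hash
```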

```
$ time python3 birthday_brute.py
sha256_32(b'hello 66453') => 13b18fe7
sha256_32(b'hello 31232') => 13b18fe7

real    0m0.066s
user    0m0.059s
sys     0m0.007s
```

That's pretty quick! Clearly, there's headroom here for breaking longer hashes. However, this approach also consumes $O({2}^{n/2})$ *memory*, for the lookup table. For our goal of breaking a 96-bit hash, that would require ${2}^{48}\times 12$ bytes of memory just to store the hashes, let alone data structure overheads. That's 3 petabytes!

I don't have anywhere near that much storage, but even if I did, those "$O(1)$" hash table lookups and insertions have overheads, and those overheads would start to matter.

For those reasons, this won't be a viable approach to scale up, so we're going to need a better algorithm.

### Pollard's Rho

Pollard's Rho Algorithm is originally an algorithm for integer factorisation, but we can apply its principles to searching for hash collisions.

*Via B. Weber and X. Zhang, 2018 (the idea is much older, but I like their diagram)*

Each node in the above graph represents a hash value. An arrow represents applying the hash function to generate a new hash. When there's a collision, a loop in the graph forms. If we can detect and locate the start of that loop, then we can find the colliding hash values without having to store them all in memory.

To detect these loops, aka cycles, we can apply Floyd's Tortoise and Hare algorithm. I'm going to be light on the details here because it ends up not being too relevant to our final approach, but the important part is that it runs in time linear in the length of the sequence, using O(1) space. I lifted my implementation directly from the Wikipedia article.
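For reference, the `floyd` function that the next listing imports looks roughly like this, transcribed from the Wikipedia pseudocode. It returns $(\lambda ,\mu )$: the cycle length and the index of the first element of the cycle.

```python
def floyd(f, x0):
    # Phase 1: the hare moves twice as fast as the tortoise;
    # they're guaranteed to meet somewhere inside the cycle
    tortoise = f(x0)
    hare = f(f(x0))
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(f(hare))

    # Phase 2: restart the tortoise from x0; moving both at the same
    # speed, they meet at the start of the cycle, at index mu
    mu = 0
    tortoise = x0
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(hare)
        mu += 1

    # Phase 3: walk the hare around the cycle once to measure lam
    lam = 1
    hare = f(tortoise)
    while tortoise != hare:
        hare = f(hare)
        lam += 1

    return lam, mu
```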

```python
import hashlib
from wikipedia import floyd # https://en.wikipedia.org/wiki/Cycle_detection

def sha256_32(data: bytes) -> bytes:
    h = hashlib.sha256(data).digest()
    return h[:4]  # keep only the first 32 bits (4 bytes)

def hash_to_msg(h: bytes) -> bytes:
    return f"hello {h.hex()}".encode()

def pollard_next(h: bytes) -> bytes:
    return sha256_32(hash_to_msg(h))

seed = sha256_32(b"seed")
lam, mu = floyd(pollard_next, seed)

# floyd's algorithm just gives us numbers, we need to recompute the hash values
# (you could probably rewrite it to avoid this, but I'm lazy)
h = seed
for _ in range(mu - 1):
    h = pollard_next(h)
msg_a = hash_to_msg(h)
for _ in range(lam):
    h = pollard_next(h)
msg_b = hash_to_msg(h)

print(f"sha256_32({msg_a}) => {sha256_32(msg_a).hex()}")
print(f"sha256_32({msg_b}) => {sha256_32(msg_b).hex()}")
```

And here's the timing results:

```
$ time python3 pollard.py
sha256_32(b'hello 41573a5d') => 263b95e8
sha256_32(b'hello e1299054') => 263b95e8

real    0m0.350s
user    0m0.342s
sys     0m0.006s
```

This is actually slower than our previous approach, by a factor of about 5 (presumably due to constant-factor overheads), but it has the benefit of only using O(1) memory, making it more practical to scale up for longer hashes.

### Parallel Pollard's Rho

The next step to speeding this up is to introduce parallelisation. However, Floyd's algorithm is inherently sequential. We could run multiple instances of it in parallel, starting from different seeds, and wait for one to finish first, but that doesn't help very much, since it scales sub-linearly with the number of threads.

Fortunately, there *is* a way to efficiently parallelise it. I first learned of this method through the paper Parallel Hash Collision Search by Rho Method with Distinguished Points, IEEE LISAT 2018, using methodology introduced in Parallel Collision Search with Cryptanalytic Applications by Van Oorschot and Wiener in 1996.

The gist of the method is embodied in this diagram:

*Via van Oorschot, P.C. and Wiener, M.J.*

Like the sequential version, each node represents a hash. A "distinguished point" is one that meets some arbitrary condition that's easy to check. Since we're dealing with hashes, a sensible distinguishing property is "has at least n leading zeroes."

Each worker thread is responsible for building these "trails," starting from a random starting point, and iterating until a distinguished point is found. On completing a trail, the start and end points can be recorded in a lookup table. This is more space-efficient than recording *all* the points, by a factor of whatever your average trail length is, which can be controlled by adjusting the distinguisher function.
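As a concrete illustration of that knob (bit-granular, rather than the byte-granular check used in the listing below, and with a parameter name of my choosing): a distinguisher requiring $k$ leading zero bits fires with probability ${2}^{-k}$ on a random hash, so the average trail length, and hence the table-size saving, is about ${2}^{k}$.

```python
def is_distinguished(h: bytes, zero_bits: int = 20) -> bool:
    # a point is "distinguished" if its first zero_bits bits are zero;
    # raising zero_bits makes trails longer and the lookup table smaller
    nbytes = (zero_bits + 7) // 8
    value = int.from_bytes(h[:nbytes], "big")
    return value >> (nbytes * 8 - zero_bits) == 0
```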

If you find two different trails that end at the same point, like the Y-shaped structure in the diagram, you've found a collision.

Detecting colliding trails and figuring out the actual collision points is most conveniently done from a single central thread, but the main work of constructing the trails can be as parallel as you like, and is suitable for GPU acceleration.

Here's a CPU-only Python implementation of this idea:

```python
import hashlib
import os
from multiprocessing import Queue, Process
from typing import Tuple

NUM_WORKERS = 8
HASH_LENGTH_BITS = 32
HASH_LENGTH_BYTES = HASH_LENGTH_BITS // 8

def sha256_trunc(data: bytes) -> bytes:
    h = hashlib.sha256(data).digest()
    return h[:HASH_LENGTH_BYTES]

def hash_to_msg(h: bytes) -> bytes:
    return f"hello {h.hex()}".encode()

def pollard_next(h: bytes) -> bytes:
    return sha256_trunc(hash_to_msg(h))

def is_distinguished(h: bytes) -> bool:
    return h.startswith(b"\x00\x00")

def build_trail() -> Tuple[bytes, bytes]:
    trail_start = sha256_trunc(os.urandom(16))
    point = trail_start
    while not is_distinguished(point):
        point = pollard_next(point)
    return trail_start, point

def trail_worker(q: Queue):
    while True:
        q.put(build_trail())

q = Queue()
lookup = {}
workers = [Process(target=trail_worker, args=(q,)) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()  # start the workers

while True:
    start, end = q.get()
    if end in lookup:
        print(f"Found colliding trails! ({len(lookup)} trails total)")
        break
    lookup[end] = start

for w in workers:
    w.kill()  # we're done with them now!

# find the point where the two trails meet
# (naively - time/space tradeoffs are possible here)
lookup2 = {}
point = lookup[end]
while not is_distinguished(point):
    point, prev = pollard_next(point), point
    lookup2[point] = prev

point = start
while point not in lookup2:
    point, prev = pollard_next(point), point

msg_a = hash_to_msg(prev)
msg_b = hash_to_msg(lookup2[point])

print(f"sha256_{HASH_LENGTH_BITS}({msg_a}) => {sha256_trunc(msg_a).hex()}")
print(f"sha256_{HASH_LENGTH_BITS}({msg_b}) => {sha256_trunc(msg_b).hex()}")
```

It performs pretty similarly to the previous two implementations, really:

```
$ time python3 parallel_pollard.py
Found colliding trails! (5 trails total)
sha256_32(b'hello 1b770e33') => 7c10e518
sha256_32(b'hello 634fb7a4') => 7c10e518

real    0m0.082s
user    0m0.215s
sys     0m0.029s
```

But at 32 bits, it's over so fast that half the time is probably spent spinning up and shutting down the worker processes rather than actually doing work. Bumping the hash length up to 64 bits lets it stretch its legs some more:

```
$ time python3 parallel_pollard.py
Found colliding trails! (77107 trails total)
sha256_64(b'hello 1b27ff0f9ae85364') => 419d1f8c3d759269
sha256_64(b'hello 4aa8d28d2e0d6676') => 419d1f8c3d759269

real    5m12.246s
user    40m52.093s
sys     0m3.107s
```

Compared to our original naive implementation, we're colliding a hash with twice as many bits in a quarter of the time. Recall that our original algorithm had a complexity of ${2}^{32}$ for a 32-bit hash, whereas our new one is ${2}^{64/2}={2}^{32}$ for a 64-bit hash. i.e. we're doing about the same amount of work, but now we're working smarter, and distributed over multiple threads.

But wait, there's more! This implementation was really just a proof-of-concept to illustrate the idea. We're still running in Python, on a CPU. If our goal is to find a 96-bit collision, that's ${2}^{48}$ hash computations' worth of work: ${2}^{16}$ times more than the 64-bit collision that took us 5 minutes. Extrapolating, it'd take 227 days. That's too long!
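That extrapolation is just the $O({2}^{n/2})$ scaling applied to the measured runtime:

```python
minutes_for_64 = 5                    # the ~5 minute wall-clock time measured above
extra_work = 2 ** ((96 - 64) // 2)    # 2^16 times more hashes for a 96-bit target
days_for_96 = minutes_for_64 * extra_work / (60 * 24)
print(days_for_96)  # roughly 227 days
```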

### GPU Acceleration

My desktop has an AMD 6700 XT GPU. According to hashcat benchmarks, it can calculate SHA-256 hashes at a rate of 5367.9 MH/s. At this rate, we can calculate ${2}^{48}$ hashes in only 14.5 hours.
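That figure falls straight out of the benchmark numbers:

```python
hashes_needed = 2 ** 48       # birthday bound for a 96-bit hash
rate_per_second = 5367.9e6    # hashcat's SHA-256 benchmark for the 6700 XT
hours = hashes_needed / rate_per_second / 3600
print(hours)  # just over 14.5 hours of pure hashing
```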

So, I ported the above algorithm to run on my GPU, using pyopencl, and the result is the collision I presented near the start of this article. You might notice I chose to truncate the middle portion of the hash rather than the end. This means it's more likely to fool someone who is visually comparing the starts and ends of a full-length hash.

Unfortunately, my code wasn't as well optimised as hashcat is, so the computation ended up taking about 64 hours. I would share the code, but it's in a messy non-working condition. If I do share it, I'll update this to add a link. 64 hours still isn't bad, representing an ~85x speedup over CPU-only.

Amusingly, I think my GPU beats the university HPC cluster used in the LISAT paper I cited earlier.

### Cost Estimates

How much would it cost to collide a 128-bit truncated SHA-256?

The cheapest source of GPU power I'm aware of is vast.ai, which lets users buy and sell GPU hours. Right now, an RTX 4090 costs about $0.40/hour to rent. According to another hashcat benchmark, it can do SHA-256 at 21975.5 MH/s (assuming well-optimised code). We need ~${2}^{64}$ hashes, which will take 26.6 GPU-years, costing $93,000 (realistically, you'd want to rent a few hundred GPUs for a month or so). That's *well* within the realm of possibility, although slightly outside of my personal budget.
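The arithmetic behind those numbers:

```python
hashes_needed = 2 ** 64         # birthday bound for a 128-bit hash
rate_per_second = 21975.5e6     # hashcat's SHA-256 benchmark for the RTX 4090
price_per_hour = 0.40           # approximate vast.ai rental rate

gpu_hours = hashes_needed / rate_per_second / 3600
gpu_years = gpu_hours / (24 * 365.25)
estimated_cost = gpu_hours * price_per_hour

print(f"{gpu_years:.1f} GPU-years, ${estimated_cost:,.0f}")  # ≈26.6 GPU-years, ≈$93,000
```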

Since we've come this far, let's find out how much it'd cost to collide all 256 bits of SHA-256, just for fun. We just need to multiply that price by... ${2}^{64}$.

$1,715,547,198,854,988,300,288,000

Yup, full-length SHA-256 is still safe!

### Further Cost Reductions

Using FPGAs, you could perhaps gain an order of magnitude improvement in cost-effectiveness. And with ASICs, a couple more on top of that, minus the up-front development costs.

The whole Bitcoin network is currently running at a hash rate of 540.67 EH/s, almost exclusively on ASICs. That's getting close to ${2}^{69}$ hashes per second. Nice. I'm not sure if that number accounts for the fact that BTC uses a double-SHA256 construction or not, but either way, these numbers are so huge it doesn't really matter.

If you could somehow divert the entire Bitcoin network to work on colliding 128-bit SHA-256, it'd take a fraction of a second. But if you tried to use it to collide full-length SHA-256 (${2}^{128}$ hashes), it'd take around 20 billion years.

### Conclusion

So to update the answer to the original question, 128-bit SHA-256 is also too short, assuming you care about collision resistance and that your adversary has even slightly deep pockets. The cost will likely continue to decrease year-on-year.

Disclaimer: What follows might be bad advice. I'm not a Cryptographer; I'm only qualified to say how *not* to do cryptography.

If in doubt, use at least a 256-bit hash, unless you can justify the reduction in security.

If you only care about preimage attacks and not generic collisions, you can perhaps get away with a 128-bit hash, but you need to be *very certain* that collisions don't matter. The TLS SLOTH attacks happened in part because protocol designers assumed collisions wouldn't affect the security of the protocol as a whole, but they were wrong.

It is also possible to mitigate the risks of shorter hashes by using a more expensive hash function. For example, Argon2id (nominally a Key Derivation Function) can be repurposed as a hash function. It's deliberately designed to be expensive to compute, and to be resistant to GPU acceleration.
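Argon2id needs a third-party library, but the standard library's scrypt (another deliberately memory-hard KDF) illustrates the same trick. The fixed salt and cost parameters here are arbitrary choices of mine, not a recommendation:

```python
import hashlib

def expensive_hash_64(data: bytes) -> bytes:
    # a deliberately slow 64-bit hash built from a memory-hard KDF;
    # the fixed, public salt turns the KDF into a plain hash function.
    # each evaluation costs ~16 MiB of memory and real CPU time, so
    # every doubling of the cost parameters buys back ~1 bit of
    # effective brute-force resistance
    return hashlib.scrypt(data, salt=b"fixed-public-salt",
                          n=2**14, r=8, p=1, maxmem=64 * 1024**2, dklen=8)
```

At tens of milliseconds per call, a ${2}^{32}$ search that took minutes earlier in this article would take on the order of a year here.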

If you only have room for a small hash, or maybe you want something that's easy for humans to manually compare, making the hash as expensive as possible is the way to go.