New Post

2025-10-09 06:21:13 +00:00 · 2024-12-30 19:32:23 -05:00 · 2024-12-30 19:32:23 -05:00 · 1880b3ff56
commit 1880b3ff56
parent 4d2b895b72
1 changed files with 182 additions and 0 deletions
--- a/content/blog/hashing-based-on-word-emoji-lists.md
+++ b/content/blog/hashing-based-on-word-emoji-lists.md
@@ -0,0 +1,182 @@
 ---
 title: "Hashing Based on Word (Emoji?) Lists"
 date: 2024-12-30T19:13:35-05:00
 draft: false
 tags: []
 math: true
 medium_enabled: false
 ---
 When I go to download Fedora Workstation 41, it gives me an option to verify the ISO download with a SHA-256 checksum.
 ```
 a2dd3caf3224b8f3a640d9e31b1016d2a4e98a6d7cb435a1e2030235976d6da2
 ```
 The idea is that when I download the ISO, I run
 ```bash
 sha256sum Fedora-Workstation-Live-x86_64-41-1.4.iso
 ```
 The output of the command should match the given checksum. If it doesn't, I should assume that the ISO is corrupted or fraudulent.[^1]
 [^1]: You might ask, how do we know that the checksum hasn't been tampered with? Fedora's solution is PGP signatures, but that's outside the scope of this post.
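The same comparison can be scripted. Here's a minimal sketch, assuming the ISO sits in the current directory; `sha256_of_file` and `matches_checksum` are hypothetical helper names of my own, not part of any Fedora tooling:

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so a multi-gigabyte ISO never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published checksum, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_checksum("Fedora-Workstation-Live-x86_64-41-1.4.iso", "a2dd3caf3224b8f3a640d9e31b1016d2a4e98a6d7cb435a1e2030235976d6da2")` should return `True` for an intact download.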
 SHA-256 is one particular [*Cryptographic hash function*](https://en.wikipedia.org/wiki/Cryptographic_hash_function). There are many others, and these algorithms are typically used for some form of verification. These algorithms have the following properties:
 - **Deterministic**: When you run it on the same input, you get the same output.
 - **Fixed-Length**: Inputs can be as long as you want, but the output has a fixed length.
 - **Pre-image Resistant**: Given a hash value, it's infeasible to find an input that produces it.
 - **Second Pre-image Resistant**: Given an input, it's infeasible to find a different input that has the same hash value.
 - **Collision Resistant**: It's infeasible to find any two different inputs that have the same hash value.
 I won't get into these properties in this post. Feel free to read the Wikipedia article linked above to learn more.
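The first two properties are easy to observe directly. A small sketch using Python's `hashlib` (the `sha256_hex` helper is just a name chosen for this example):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

short = sha256_hex(b"test")
large = sha256_hex(b"test" * 100_000)

# Deterministic: hashing the same bytes twice gives the same digest.
assert short == sha256_hex(b"test")

# Fixed length: 256 bits is always 64 hexadecimal characters,
# whether the input is 4 bytes or 400,000 bytes.
assert len(short) == len(large) == 64

# Changing a single character gives an unrelated-looking digest.
print(short)
print(sha256_hex(b"tesu"))
```

The resistance properties can't be demonstrated this way, since they are claims about what is infeasible to compute.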
 Another example of a verification task is password checking. The idea is that (hopefully most) websites don't actually store your password on their servers, only a hash value of it. When you send in your password, they hash it and make sure the hash values match.
 **Please don't use SHA-256 for password checking.** SHA-256 is designed to be fast, and unfortunately many people use short/simple passwords and re-use them across multiple websites. If an attacker gets a hold of a password hash database, they can run the hash function on a bunch of common passwords and see if any of those hashes match what's inside the database. Best practices evolve over time, so please do your research when implementing authentication.
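To give a flavor of what the alternatives look like, here's a hedged sketch using PBKDF2 from Python's standard library. PBKDF2 still uses SHA-256 as a building block, but it adds a per-user salt and repeats the computation many times so that guessing is slow. The function names and the iteration count are assumptions for illustration, not a recommendation; check current guidance before relying on any of this.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # illustrative assumption; look up current guidance

def hash_password(password: str, iterations: int = ITERATIONS):
    """Return (salt, key) to store in place of the password itself."""
    salt = os.urandom(16)  # random per user, so equal passwords get different keys
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key

def check_password(password: str, salt: bytes, key: bytes,
                   iterations: int = ITERATIONS) -> bool:
    """Re-derive the key from the submitted password and compare it."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    # compare_digest takes the same time wherever the first mismatch occurs
    return hmac.compare_digest(candidate, key)
```

Because of the salt, an attacker can't hash a list of common passwords once and match it against every row of a leaked database.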
 Krishna wrote a [blog post](https://chittur.dev/cs/2020/04/11/readable-hashes.html) a few years ago advocating for the use of *human-readable* hash values. Instead of using a random sequence of characters, why don't we use a random sequence of words instead?
 As a community, we have agreed on what possible characters a SHA-256 checksum can consist of. That is, the hexadecimal digits 0-9 and a-f. However, *we don't agree on what words we should use*.
 Lifting from Krishna's blog post, there are [multiple](https://github.com/singpolyma/mnemonicode) [readable](https://tools.ietf.org/html/rfc1751) [hash](https://github.com/fpgaminer/hash-phrase) ideas out there. They each use different wordlists. Even Krishna's approach uses a different one!
 I'll go over a general approach for implementing a hash function on top of any arbitrary wordlist. Then I'll argue that we should consider using emojis for our wordlist.
 ### A general approach to hashing with word lists
 Instead of directly worrying about how to create a hash function, we're going to piggy-back off an existing one.  For example, consider the SHA-256 algorithm. This takes an input and produces a 256-bit number.
 Given a wordlist of $N$ words, we want to produce a sequence of words that captures at least the amount of information within a 256-bit number. The number of bits that a word represents in our wordlist is:
 $$
 \text{wbits} = \lfloor \log_2(N) \rfloor
 $$
 From this, we can derive how many words we need to capture a 256-bit number:
 $$
 \text{outlen} = \lceil 256 / \text{wbits} \rceil
 $$
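To make the two formulas concrete, here's a small sketch that evaluates them for a few list sizes. The sizes 7776 and 5042 are the two lists used later in this post; 2048 is a hypothetical size I picked because it gives exactly 11 bits per word. `words_needed` is a made-up helper name.

```python
import math

def words_needed(n_words: int, hash_bits: int = 256):
    """Return (wbits, outlen) for a word list with n_words entries."""
    if n_words < 2:
        raise ValueError("need at least two words to encode any bits")
    # For a positive integer, bit_length() - 1 equals floor(log2(n)).
    wbits = n_words.bit_length() - 1
    outlen = math.ceil(hash_bits / wbits)
    return wbits, outlen

for n in (2048, 5042, 7776):
    print(n, words_needed(n))
# 2048 (11, 24)
# 5042 (12, 22)
# 7776 (12, 22)
```

Note that 5042 and 7776 both land on 12 bits per word: anything between 4096 and 8191 entries rounds down to the same power of two.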
 Consider an arbitrary input `x`, given as bytes, and hash it to create the integer `h`:
 ```python
 import hashlib
 h = int(hashlib.sha256(x).hexdigest(), 16)
 ```
 For the UTF-8 encoded string `"test"`, we'll get the following binary representation:
 ```
 1001111110000110110100001000000110001000010011000111110101100101100110100010111111101010101000001100010101011010110100000001010110100011101111110100111100011011001010110000101110000010001011001101000101011101011011000001010110110000111100000000101000001000
 ```
 To determine which word from the word list to use, we'll consider binary sequences of length `wbits`. Since 256 might not be divisible by `wbits`, we might need to pad a certain number of zeros at the end.
 ```python
 bits_needed = wbits * outlen - 256
 h = h << bits_needed
 ```
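 For the sake of example, let's say that `wbits` is equal to 11. Then the leading binary sequence is `10011111100`, which is 1276 in decimal, so it selects the word at index 1276 (counting from zero) in our word list.
 Iterate through all these subsequences to build a list of indices. Note that the loop below starts from the least significant bits, so the leading sequence is the last one it produces: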
 ```python
 indices = []
 for i in range(outlen):
    num = (h >> (wbits * i)) & (2**wbits - 1)
    indices.append(num)
 ```
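The order in which the chunks come out is worth checking once. The shift-and-mask loop reads from the low end of the number, so the chunk built from the leading bits `10011111100` ends up last in `indices`. A quick check, assuming `wbits = 11` as in the example and the input `"test"`:

```python
import hashlib
import math

wbits = 11                         # example value from the text
outlen = math.ceil(256 / wbits)    # 24 chunks of 11 bits
pad = wbits * outlen - 256         # 8 zero bits appended at the low end

h = int(hashlib.sha256(b"test").hexdigest(), 16) << pad
mask = (1 << wbits) - 1            # eleven 1-bits
indices = [(h >> (wbits * i)) & mask for i in range(outlen)]

print(indices[0])    # lowest 11 bits, including the 8 padding zeros
print(indices[-1])   # highest 11 bits: 0b10011111100 == 1276
```

Either order works as a hash, as long as the person producing the words and the person checking them use the same one.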
 Given the word list `wordlist`, we can use the indices to print out the hashed version of our input:
 ```python
 words = [wordlist[i] for i in indices]
 print(" ".join(words))
 ```
 The full script is located at the bottom of this post.
 ### Which wordlist should I use?
 The smaller our word list length $N$ is, the more words we'll need to output for the hash. Ideally, we use a large word list. As discussed before, there are many different opinions on what properties a word list should have. Some believe that a word list should only contain [easy-to-pronounce](https://github.com/singpolyma/mnemonicode) words. Many don't want profanity in their word lists.
 Joseph Bonneau wrote a [blog post](https://www.eff.org/deeplinks/2016/07/new-wordlists-random-passphrases) describing all the criteria he used when crafting his choice of 7776 words. The EFF endorses this word list for use in password generation, and the [diceware](https://github.com/ulif/diceware) Python package also uses this list.
 What's important, though, is that both parties agree on which word list to use.
 When designing a word list, one issue to look out for is the [Prefix Code Problem](https://en.wikipedia.org/wiki/Prefix_code). Some words are concatenations of other words, for example: rainbow, today, sunflower, etc. If you don't use a delimiter between words, then you'll lose information since we can't distinguish whether it's multiple words or just one word.
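To see the ambiguity in action, here's a sketch that counts how many ways a delimiter-free string can be cut into words from a list. The six-word `toy_list` and the function name `count_splits` are made up for this example.

```python
from functools import lru_cache

def count_splits(text: str, wordlist) -> int:
    """Count the ways text can be cut into consecutive words from wordlist."""
    words = frozenset(wordlist)

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        # Reaching the end of the string means one complete segmentation.
        if start == len(text):
            return 1
        return sum(
            ways(end)
            for end in range(start + 1, len(text) + 1)
            if text[start:end] in words
        )

    return ways(0)

toy_list = ["rain", "bow", "rainbow", "sun", "flower", "sunflower"]
print(count_splits("rainbowsunflower", toy_list))   # 4
print(count_splits("rainbow", ["rain", "snow"]))    # 0
```

Four readings for two "words" means the reader can't tell which indices were intended. Joining with a space, as the script in this post does, sidesteps the issue.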
 Returning to the EFF word list, here's how we can obtain the word list as of the time of writing:
 ```bash
 curl https://www.eff.org/files/2016/07/18/eff_large_wordlist.txt | cut -f2 > eff_wordlist.txt
 ```
 Since $\lfloor \log_2(7776) \rfloor = 12$, each word carries 12 bits, so we need $\lceil 256 / 12 \rceil = 22$ words from the EFF word list to capture the 256-bit hash. The hash for the UTF-8 encoded version of "test" is:
 ```
 duress amiss antler atop item illicitly blimp anchor gigahertz consoling chance atonable frugality hardhead freeing bust crowd drool editor earful detective fiction
 ```
 Words are nice and all. But what about Emojis? The Unicode Consortium maintains a list of [thousands of emojis](https://unicode.org/emoji/charts/full-emoji-list.html). Like words, emojis are very easy for our eyes to parse.
 As of the time of writing, we're at version 16 of the Emoji character list.  Here's a crude parser for getting a list of emojis.
 ```bash
 curl https://unicode.org/Public/emoji/16.0/emoji-test.txt | grep -v "^#" | grep -oP '#\s*\K.*?(?=\s*E)' > emojis.txt
 ```
 This produces a list of 5042 emojis. Since $\lfloor \log_2(5042) \rfloor = 12$, the same as for the EFF list, we'll again need 22 emojis to capture a 256-bit hash. For the UTF-8 encoded version of "test", the hash is:
 ```
 🧍🏼‍♂️ 💜 🫷 🫵🏽 👩🏿‍❤‍👩🏾 👨🏼‍❤️‍👨🏼 👨🏾 💤 🤹 🧑🏿‍🎨 🤦🏼‍♂ 🫵🏼 🤸🏾 👨🏽‍❤️‍💋‍👨🏼 🚴🏿‍♂️ 🙎🏿‍♀️ 💂‍♀ 🚶🏾‍♀️‍➡️ 🧎🏽‍♀️‍➡ 🧎🏾 🧙🏿‍♂ 🏄🏻‍♂️
 ```
 Does your browser show all the emojis?
 While I'm advocating for the use of emojis, there are two problems that I currently recognize:
 - New emojis get introduced regularly. Luckily the dataset is versioned, so you can specify that a hash should be verified against the version 16 set.
 - Not all applications and fonts support newer emojis. I use Konsole with Noto Sans and "🫩" is not recognized.
 Still, I feel that emojis are richer than a standard English-based word list. They're also friendly to those who don't speak English!
 The full script from before:
 ```python
 import argparse
 import hashlib
 import math
 parser = argparse.ArgumentParser(description="Create a hash from a wordlist")
 parser.add_argument("wordlist", type=str, help="Path to wordlist")
 args = vars(parser.parse_args())
 wordlist = []
 # File must have one word per line
 with open(args['wordlist']) as f:
    wordlist = f.read().splitlines()
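 # Need at least two words so that each word carries at least one bit
 assert len(wordlist) >= 2
 # Number of bits we can use to index the wordlist: floor(log2(len(wordlist)))
 wlbits = len(wordlist).bit_length() - 1
 # Number of words needed to cover all 256 bits of the hash
 outlen = math.ceil(256 / wlbits)
 # Zero bits to append so that the total is a multiple of wlbits
 bits_needed = wlbits * outlen - 256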
 data = input("")
 sha256_data = hashlib.sha256(data.encode("utf-8"))
 encoded_data = int(sha256_data.hexdigest(), 16) << bits_needed
 indices = []
 for i in range(outlen):
    num = (encoded_data >> (wlbits * i)) & (2**wlbits - 1)
    indices.append(num)
 words = [wordlist[i] for i in indices]
 print(" ".join(words))
 ```
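Since most checksums are still published in hexadecimal, it can be handy to go the other way and turn a word sequence back into the hex digest. Here's a sketch of that inverse, mirroring the chunk order of the script above. `encode`, `decode`, and the generated `toy_list` are hypothetical stand-ins for this example; a real list would be read from a file.

```python
import hashlib
import math

def encode(digest_hex: str, wordlist):
    """Turn a SHA-256 hex digest into words, lowest bits first."""
    wbits = len(wordlist).bit_length() - 1
    outlen = math.ceil(256 / wbits)
    pad = wbits * outlen - 256
    value = int(digest_hex, 16) << pad
    mask = (1 << wbits) - 1
    return [wordlist[(value >> (wbits * i)) & mask] for i in range(outlen)]

def decode(words, wordlist) -> str:
    """Recover the hex digest from the output of encode()."""
    wbits = len(wordlist).bit_length() - 1
    pad = wbits * len(words) - 256
    position = {word: i for i, word in enumerate(wordlist)}
    value = 0
    for i, word in enumerate(words):
        # Put each index back at the bit offset it was read from.
        value |= position[word] << (wbits * i)
    # Drop the padding zeros and print as 64 hex digits.
    return format(value >> pad, "064x")

toy_list = [f"w{i:04d}" for i in range(2048)]  # hypothetical 2048-entry list
digest = hashlib.sha256(b"test").hexdigest()
assert decode(encode(digest, toy_list), toy_list) == digest
```

The round trip only works if the list has no duplicate entries and both sides use the same list in the same order.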