In a recent conversation with a coworker, fuzzy hashing came up, along with how often he uses it in his malware analysis work. That inspired me to write a blog post on what it is and how it can help when analyzing malware samples.
What is fuzzy hashing?
Fuzzy hashing is a term for a hash function designed so that similar files produce similar hashes, which lets us measure how alike two files are. Normally, in the context of IOCs and malware analysis, we gather cryptographic hashes and compare them to databases to see if a hash matches a known-bad sample. The unfortunate loophole is that it’s very simple to make a small change to the file and thereby generate a completely different SHA256 hash.
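To make that loophole concrete, here is a minimal sketch using Python's standard hashlib module. The byte strings are made-up stand-ins for a file's contents, not real samples; the point is only that changing one byte leaves the two SHA256 digests with essentially nothing in common.

```python
import hashlib

# Made-up stand-in for a file's contents (not a real sample).
original = b"MZ" + b"\x90" * 64 + b"payload-v1"
# Change only the final byte; everything else is identical.
modified = original[:-1] + b"2"

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(modified).hexdigest()

# Count hex positions where the two digests happen to agree.
matching = sum(a == b for a, b in zip(h1, h2))

print(h1)
print(h2)
print(f"{matching} of {len(h1)} hex characters match")
```

A lookup against a known-bad hash database would treat these as two unrelated files, even though they differ by a single byte.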
This is where fuzzy hashing can save the day. A fuzzy hashing algorithm hashes a file in pieces and then compares those pieces against the pieces of another file. The result is a similarity score, typically from 0 to 100, between two or more files.
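The idea of hashing in pieces and comparing the pieces can be sketched in a few lines of Python. This is a deliberately simplified toy, not the algorithm or scoring that any real tool uses: the names toy_fuzzy_hash and toy_similarity, the WINDOW and MODULUS values, and the test data are all assumptions made up for illustration. Chunk boundaries are chosen from the content itself (a sum over the last few bytes), each chunk is reduced to one character, and the two resulting signatures are compared with difflib.

```python
import hashlib
import random
from difflib import SequenceMatcher

WINDOW = 7    # bytes of context the boundary trigger looks at (assumed value)
MODULUS = 64  # gives chunks of roughly 64 bytes on random data (assumed value)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def chunk_char(chunk: bytes) -> str:
    """Reduce one chunk to a single signature character."""
    return ALPHABET[hashlib.sha256(chunk).digest()[0] % len(ALPHABET)]


def toy_fuzzy_hash(data: bytes) -> str:
    """Split data at content-defined boundaries, one character per chunk."""
    signature = []
    chunk_start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        # The trigger depends only on the last few bytes, so boundaries
        # line up again shortly after an insertion or deletion.
        if sum(window) % MODULUS == MODULUS - 1:
            signature.append(chunk_char(data[chunk_start: i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):
        signature.append(chunk_char(data[chunk_start:]))
    return "".join(signature)


def toy_similarity(sig_a: str, sig_b: str) -> int:
    """Score two signatures from 0 (nothing shared) to 100 (identical)."""
    ratio = SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
    return round(ratio * 100)


# Made-up test data: a pseudo-random "file", a lightly edited copy,
# and an unrelated pseudo-random "file".
rng = random.Random(0)
sample = bytes(rng.randrange(256) for _ in range(4096))
variant = sample[:2048] + b"INSERTED" + sample[2048:]
other_rng = random.Random(1)
unrelated = bytes(other_rng.randrange(256) for _ in range(4096))

variant_score = toy_similarity(toy_fuzzy_hash(sample), toy_fuzzy_hash(variant))
unrelated_score = toy_similarity(toy_fuzzy_hash(sample), toy_fuzzy_hash(unrelated))
print("edited copy:", variant_score)
print("unrelated:  ", unrelated_score)
```

Because only the chunk around the inserted bytes changes, the edited copy keeps almost all of its signature and scores far higher than the unrelated data, which is the behavior a cryptographic hash cannot give you.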
In the image above I use ssdeep, which implements the methodology I’ve described, also known as context-triggered piecewise hashing (CTPH).
Fuzzy hashing is an analyst’s lifesaver at times. Threat actors will attempt to circumvent signature-based detection engines by making slight modifications to their malware samples. The same applies to polymorphic malware, which is self-mutating malware that maintains its original functionality. A fuzzy hash comparison shows how much content two files share, which lets analysts and researchers group similar specimens together. This is a win for the cybersecurity and threat intel community in that it also assists in malware attribution and identifying overlaps.
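As a rough illustration of grouping specimens by similarity, the sketch below clusters a handful of signatures with a simple greedy pass. Everything here is hypothetical: the sample names and signature strings are invented, the score function is a difflib-based stand-in for a real fuzzy hash comparison, and the threshold of 60 is an arbitrary choice rather than a recommended value.

```python
from difflib import SequenceMatcher


def score(sig_a: str, sig_b: str) -> int:
    """Stand-in similarity score from 0 to 100 (not a real tool's scoring)."""
    ratio = SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
    return round(ratio * 100)


def group_by_similarity(signatures: dict, threshold: int = 60) -> list:
    """Greedy grouping: a sample joins the first group whose first member
    scores at or above the threshold, otherwise it starts a new group."""
    groups = []
    for name, sig in signatures.items():
        for group in groups:
            representative = signatures[group[0]]
            if score(sig, representative) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Invented signatures for illustration only.
samples = {
    "sample_a": "kJ3fQz9LmPq2XvBn7TgY",
    "sample_b": "kJ3fQz9LmPq2XvBn7TgZ",
    "sample_c": "0aWcE4rU8iOpS5dF6hHj",
    "sample_d": "kJ3fQz9LmRq2XvBn7TgY",
}

groups = group_by_similarity(samples)
for group in groups:
    print(group)
```

In practice you would swap in the comparison function of whichever fuzzy hashing tool you use and tune the threshold against samples you already know are related.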