Hash matching inside out: Comparing hash types for the detection of CSAM
Updated: Feb 15
Hash technology is a powerful tool for identifying known visual content, and is used by hotlines, law enforcement, industry, and other types of organizations in the removal of illegal visual content. To help readers better understand how exactly hash technologies can help identify visual content, we’ve provided an in-depth look into how the various types of hash technologies work, as well as an explanation of their strengths and weaknesses.
Included in this article:
Why not all hashes are created equal: Types of hash technology
Using hash technologies to stop the spread of CSAM
Hash matching is the process of comparing one hash value with another to determine whether two files are identical, and is the central method for tracking illegal content online. Hash matching allows organizations such as hotlines and law enforcement to effectively trace the distribution of known illegal content in cases where it gets re-uploaded and recycled after being reported and removed from its originating domain.
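To make the idea concrete, here is a minimal sketch of hash matching against a list of known files. It assumes SHA-256 as the hash and uses a small in-memory set as a stand-in for a real hash database; the names `file_hash`, `known_hashes`, and `is_known` are hypothetical and chosen only for illustration.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database: hashes of files that were already reported.
known_hashes = {file_hash(b"contents of a previously reported file")}


def is_known(data: bytes) -> bool:
    """True if a file with exactly these bytes has been seen before."""
    return file_hash(data) in known_hashes


print(is_known(b"contents of a previously reported file"))  # True
print(is_known(b"contents of a file nobody has reported"))  # False
```

Because only hash values are stored and compared, a re-uploaded copy can be recognized without anyone having to open the file itself.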
A light in the darkest corners of the web
Hash technology is used against many forms of illegal and harmful content, such as terrorist and violent extremist content (TVEC) and non-consensual intimate imagery (NCII), but stopping the distribution of child sexual abuse material (CSAM) online is likely its most important use case today.
Because hashing can accurately identify both identical and similar content, it is quickly becoming a vital tool for hotline analysts, such as those at the National Center for Missing and Exploited Children (NCMEC), who spend their days sorting through large databases of reported abusive imagery. Hashing automates the identification of known CSAM, giving analysts more time to address newly reported content without having to sift through duplicates of files that have already been reported. This makes content review faster and more efficient, and it reduces the workload and emotional stress on analysts by sparing them the often traumatizing task of repeatedly reviewing known CSAM.
However, while hash technology as a whole has certainly helped in the fight against CSAM, some hash types are better suited than others to identifying illegal content. With that in mind, let’s take a moment to go over the different hash types in depth and highlight their varied capabilities and applications.
Detailed description of hash types: Not all hashes are created equal
Strict hashes, also known as file or cryptographic hashes, are very good at just one thing: identifying exact content matches. Algorithms such as MD5, SHA1, and SHA256 exhibit what’s commonly referred to as the ‘avalanche effect,’ which, in layman's terms, goes something like this: the slightest change to a file (whether it be text in a document or a pixel in an image) results in a hash value that bears no resemblance to that of the original piece of content.
The central pitfall of strict hashes follows directly from this: the slightest alteration to a piece of content generates a new hash value. This means that multiple images can look the same to the human eye yet have entirely different hashes, as seen in the image below.
This is not the same image!
As the images above show, a change to just one pixel gives the image an entirely different hash value, so a strict hash cannot identify content that has been altered in any way from the original.
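The avalanche effect is easy to reproduce. The sketch below assumes SHA-256 and uses a made-up byte string as a stand-in for an image file; it flips a single bit and then counts how many of the 256 output bits differ between the two hashes. For a well-designed cryptographic hash, that count typically lands near half of the bits.

```python
import hashlib

# Stand-in for the bytes of an image file (hypothetical data).
original = bytes(range(256)) * 4

# Copy the data and flip one single bit in the first byte.
altered = bytearray(original)
altered[0] ^= 0x01
altered = bytes(altered)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many bits differ between the SHA-256 hashes of a and b."""
    ha = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hb = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(ha ^ hb).count("1")


print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(altered).hexdigest())
print(differing_bits(original, altered), "of 256 hash bits differ")
```

The two printed hashes share no useful structure, which is why the distance between two strict hashes says nothing about how similar the underlying files are.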
Still, there are some benefits to strict hashes:
Lightweight, quick and easy to generate. Built for speed.
Since each file gets an effectively unique hash value, you can say with high certainty that two files with the same hash value are indeed the same file.
The hash values are short enough to be read by humans, meaning you can compare the hashes just by looking at them.
Perceptual hashes can identify images that have undergone minor modifications such as resizing, compression, and contrast changes. Probably the best-known example of this technology is PhotoDNA (developed by Microsoft), but PDQ (developed by Meta) and pHash are commonly used as well.
Simply put, perceptual hashes work by assigning similar hashes to similar files. For example, if one of two identical images is resized, its hash value will be numerically close to that of the original. The measured distance between these two hash values then indicates the level of similarity between the two images.
However, modifications that extensively alter the image can cause problems for this hash type. These include, but are not limited to, adding borders, cropping, and embedding text, watermarks, or icons in the image. Additionally, this hash type has mostly been used for matching static images and does not scale well to video.
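To illustrate "similar hashes for similar files," here is a minimal sketch using a simple average hash (aHash). This is a deliberately basic stand-in and not how PhotoDNA, PDQ, or pHash work internally. It assumes a tiny 4x4 grayscale image given as rows of brightness values; the images and function names are made up for the example.

```python
def average_hash(pixels):
    """Simple average hash: one bit per pixel, set if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of bit positions in which two hashes differ."""
    return bin(h1 ^ h2).count("1")


# Toy 4x4 grayscale image: dark on the left, bright on the right.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]

# Altered copy: every pixel brightened, plus one pixel overwritten.
altered = [[p + 30 for p in row] for row in original]
altered[0][1] = 250

# Unrelated image: a checkerboard pattern.
unrelated = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

h = average_hash(original)
print(hamming_distance(h, average_hash(altered)))    # 1 (very similar)
print(hamming_distance(h, average_hash(unrelated)))  # 8 (half of the 16 bits)
```

In practice, a match is declared when the distance falls below a chosen threshold, and picking that threshold is a trade-off between missed matches and false positives.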
Local descriptor hashes
Local descriptor hashes store information about images using visual interest points, sometimes several hundred points per image, allowing for highly in-depth comparison between drastically altered files. Like perceptual hashes, local descriptor hashes measure similarity as the distance between numerical values, but in a much more complex fashion, because hundreds of visual interest points are compared for each image (see below for more). Popular local descriptor algorithms include SIFT, SURF, and ORB.
While local descriptor hashes are not quite as lightweight as perceptual and strict hashes (sometimes resulting in scalability issues, though this issue has been resolved by Videntifier), they are extremely powerful when used to hash match against content that has been modified. Where perceptual hashes can only identify slight modifications (resizing, compression, contrast changes), local descriptor hashes go above and beyond by identifying duplicated content that has been extensively altered to avoid detection.
Local descriptor hashes can ‘see through’ extensive alterations such as:
Content with borders
Cropped content with overlays
And much more…
Due to the intricacy of their algorithm, local descriptor hashes are considered the premier technology for identifying videos. With local descriptor hashes, analysts are able to efficiently navigate reported videos without having to review known CSAM.
Shortcomings of strict and perceptual hashes
Now that we’ve outlined the differences between the three main hash types, it’s important to highlight how strict and perceptual hashes fall short when used as the central approach for what we believe is hash technology’s most important use case: identifying and stopping the spread of CSAM online.
As noted above, strict and perceptual hash algorithms tend to be lightweight, fast, and in some cases highly scalable. But the plain fact of the matter is that those distributing CSAM on the internet will often use one or more extensive alteration techniques to avoid detection by law enforcement, and these are exactly the alterations that these hash types have trouble detecting. And so, while strict and perceptual hashes do have their benefits, it’s necessary to measure these hash types against the more advanced algorithms hitting the scene, especially when in the market for a content identification tool.
Mending the shortcomings with local descriptor hashes
Local descriptor hashes explained
Recently, local descriptor hashes have been used to pick up the slack left by strict (file, or cryptographic) and perceptual hashes, and are in some spaces considered to be the most efficacious method of identifying and stopping the distribution of CSAM.
Additionally, using local descriptors to accurately identify images and videos regardless of the extent of alteration is quickly becoming the most effective way for digital domains to keep track of the content posted to their platforms, and so it is worth taking a closer look at how exactly local descriptor hashes operate.
How local descriptor hashing helps hotline analysts identify altered content
Key takeaways to understanding local descriptor hashing:
Each descriptor uses multiple pixels to calculate a distinctive row of numbers (hash value).
Similar content produces similar descriptors.
Each image and video frame can contain hundreds of descriptors for a total of thousands of numbers describing each image.
1. A piece of previously unreported CSAM is submitted to a hotline. The image’s hash value is determined by a set of unique visual interest points and labeled as known content in the hotline’s database. Let’s call this image #1.
2. Days later, the same image is reported to the hotline, except it has been altered to avoid detection. Just as done above, its hash value is determined according to a set of visual interest points. Let’s call this image #2.
3. Between image #1 and image #2, hundreds of unrelated images are reported. Their hash values are determined according to their own sets of unique visual interest points. Let’s call these Unrelated Image(s).
Assessing the data
When the descriptors of image #1 are compared to themselves, the distance between the numerical values is zero, indicating a perfect match.
When the descriptors of image #1 are compared to the descriptors extracted from image #2 (altered), a small difference/distance is found, indicating that the two images are very similar. Without having to manually review the content, the analyst now knows that the newly reported image #2 is simply a modified duplicate of image #1. This indicates that this particular CSAM case has already been reported to the authorities, so the analyst can put image #2 aside and focus on identifying new cases.
When the descriptors of image #1 are compared to descriptors from a file depicting unrelated content (one of the Unrelated Images from step 3), the descriptors will hardly match at all. In fact, there will be very large differences/distances between every pair of descriptors, indicating that the two files don’t share visually similar content.
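The three comparisons above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production matcher: each descriptor is a made-up row of just four numbers (real descriptors such as SIFT use 128), each image has only three descriptors, and the distance threshold of 10.0 is an arbitrary value picked for the example. The function name `count_matches` is hypothetical.

```python
import math


def count_matches(query, reference, max_distance=10.0):
    """Count query descriptors whose nearest reference descriptor is close.

    Each descriptor is a row of numbers; closeness is Euclidean distance.
    """
    matches = 0
    for q in query:
        nearest = min(math.dist(q, r) for r in reference)
        if nearest <= max_distance:
            matches += 1
    return matches


# Hypothetical descriptors extracted from image #1 (known content).
image_1 = [[10, 200, 35, 90], [120, 15, 60, 240], [75, 75, 180, 20]]

# Image #2: altered copy. Two descriptors shifted slightly; the third
# belongs to something new, such as an added overlay.
image_2 = [[12, 198, 37, 91], [118, 17, 58, 238], [5, 5, 5, 5]]

# Descriptors from an unrelated image.
unrelated = [[250, 10, 250, 10], [0, 128, 255, 64], [200, 200, 0, 130]]

print(count_matches(image_1, image_1))    # 3: every descriptor at distance 0
print(count_matches(image_2, image_1))    # 2: most descriptors still match
print(count_matches(unrelated, image_1))  # 0: no descriptor is close
```

Because a match is decided descriptor by descriptor, an image can lose some of its descriptors to cropping or overlays and still be recognized from the ones that survive. The brute-force nearest-neighbor search used here would be far too slow for a large database, where an indexed search would be needed instead.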
The ideal hash type for stopping the spread of CSAM
When we look back at how each of these hash types functions, it's clear that local descriptor hashes have the right capabilities for accurately and efficiently identifying, and therefore quelling, the spread of CSAM online. While local descriptor hashes have yet to make the kind of splash that perceptual hashes such as PhotoDNA or PDQ have, they are quickly becoming recognized as the most powerful of the three hash types, with much promise for further advancement as innovations in hash technology continue to emerge.
If you're interested in learning more about local descriptor hashes, check out this article on how they are helping CSAM hotline analysts in their initiatives to stop ongoing abuse.