Introduction to hash technology: A brief overview
Hash technology is a powerful tool for identifying known visual content. Hotlines, law enforcement, industry, and other organizations use it to remove illegal and harmful content. This article provides a brief introduction to hash technology and its main types.
What is hash technology?
Hashing is an umbrella term for techniques that algorithmically assign a digital fingerprint to a file. A hash function takes an input of any size (a file, a set of files, or specific characteristics of a file) and reduces it to a short, fixed-size numerical value, known as a hash value or hash code. The result is far smaller than the original data, yet for practical purposes it is distinctive enough to identify the file it came from.
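As a minimal sketch of this idea, the Python snippet below uses SHA-256 from the standard `hashlib` module to fingerprint two inputs of very different sizes. The function name `fingerprint` and the sample byte strings are illustrative assumptions, not part of any particular tool.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hash value of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative inputs: a tiny "file" and one of about 1 MB.
small = b"a tiny file"
large = b"x" * 1_000_000

# Both hash values have the same length, regardless of input size.
print(len(fingerprint(small)))  # 64 hexadecimal characters (256 bits)
print(len(fingerprint(large)))  # 64 hexadecimal characters (256 bits)
```

Hashing the same bytes again always yields the same value, which is what makes the value usable as a fingerprint.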
Use cases for hash technology
Hash technologies are used to accurately identify duplicated content (images and videos) on the internet. Doing so requires a database of hash values computed from known content. The hash value of each new image or video is then automatically compared against the database to determine whether it matches a known item. This process is known as hash matching.
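Hash matching can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the "database" is an in-memory set, the byte strings stand in for real files, and the names `file_hash`, `known_hashes`, and `is_known` are hypothetical.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Hash value used as the lookup key (SHA-256, chosen for this sketch)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database: only the hash values of known content are stored.
known_content = [b"known image bytes A", b"known image bytes B"]
known_hashes = {file_hash(item) for item in known_content}


def is_known(upload: bytes) -> bool:
    """Hash the incoming file and check it against the database."""
    return file_hash(upload) in known_hashes


print(is_known(b"known image bytes A"))  # True: exact duplicate of known content
print(is_known(b"new image bytes"))      # False: not in the database
```

Because only hash values are stored and compared, the lookup does not require anyone to view or redistribute the original content.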
In recent years, hash matching has earned a reputation for enabling online platforms, law enforcement, and CSAM hotlines to address the persistent distribution of illegal content online effectively and efficiently. By hashing known illegal content and storing the resulting values in a database, these organizations can quickly identify that content if it is reposted to an online space. With a constant stream of media of all types uploaded to the internet every day, hashing offers a safe and secure way to automate the identification of illegal content.
Types of hash technology
While all are designed according to fundamental hashing principles, hash types go by a variety of names, rely on different algorithms, and vary in their capabilities. For the sake of simplicity and brevity, we’ve grouped them into three categories: basic hashes, intermediate hashes, and advanced hashes.
Basic hashes, also known as strict, file, or cryptographic hashes, are excellent at identifying exact content matches. Popular basic hash algorithms include MD5, SHA-1, and SHA-256. They work by assigning dramatically dissimilar hash values to files that differ in any way, even by a single bit. However, this also means that an alteration undetectable to the human eye produces a hash value entirely different from that of the original image, so a hash comparison treats the altered image as unrelated content.
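The sensitivity of basic hashes can be seen in a short Python sketch. The byte string below is only a stand-in for real image data, and the helper name `bit_difference` is an assumption made for this example.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits that differ between the SHA-256 hash values of a and b."""
    hash_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hash_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(hash_a ^ hash_b).count("1")


# Stand-in for image bytes (illustrative data, not a real image file).
original = bytes(range(256)) * 4

# Flip a single bit, a change far too small to notice visually.
modified = bytearray(original)
modified[100] ^= 0x01
altered = bytes(modified)

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(altered).hexdigest())
print(bit_difference(original, altered), "of 256 bits differ")
```

With a cryptographic hash, roughly half of the output bits are expected to change for any change in the input, so the two hash values give no hint that the files are nearly identical.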
Intermediate hashes, also known as perceptual hashes, work by assigning similar hash values to visually similar files, and are often used to assess the similarity between two pieces of content. PhotoDNA (developed by Microsoft), PDQ (developed by Meta), and pHash are perceptual hashing algorithms that can identify content that has undergone moderate alterations such as resizing, compression, and contrast changes. While intermediate hashes can be fairly effective at identifying similar content, they are generally not robust to alterations such as added borders, cropping, and embedded text, watermarks, or icons.
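To illustrate the principle, here is a minimal Python sketch of a simplified "average hash". It is not PhotoDNA, PDQ, or pHash, which are considerably more sophisticated. It assumes the image has already been reduced to an 8x8 grid of grayscale values, and the pixel grids and function names are made up for the example.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Simplified average hash: one bit per pixel, set when the pixel
    is brighter than the mean brightness of the whole grid."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


# Illustrative 8x8 grayscale image: a left-to-right gradient.
image = [[x * 32 for x in range(8)] for _ in range(8)]
# The same image after a moderate brightness and contrast change.
adjusted = [[int(p * 0.9 + 10) for p in row] for row in image]
# A different image: a top-to-bottom gradient.
other = [[y * 32 for _ in range(8)] for y in range(8)]

print(hamming(average_hash(image), average_hash(adjusted)))  # 0
print(hamming(average_hash(image), average_hash(other)))     # 32
```

Similar images produce hash values that differ in only a few bits, so a match is declared when the distance falls below a chosen threshold. The appropriate threshold depends on the algorithm and use case.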
Advanced hashes, or local descriptor hashes, use hundreds of visual interest points per image to enable a highly comprehensive comparison between drastically altered files. Where intermediate hashes tend to fall short, approaches built on local feature algorithms such as SIFT, SURF, and ORB are able to identify content that has undergone drastic changes such as added borders, overlays, mirroring, arbitrary rotation, and picture-in-picture.
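The matching step behind this approach can be sketched in Python as a toy example. It assumes each interest point has already been turned into a short binary descriptor (ORB, for instance, produces binary descriptors that are compared by Hamming distance). The 16-bit values, the thresholds, and the function names below are invented for illustration and do not come from a real detector.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two descriptors differ."""
    return bin(a ^ b).count("1")


def count_matches(query: list[int], reference: list[int],
                  max_distance: int = 2) -> int:
    """Count query descriptors that have a close counterpart in the reference."""
    return sum(
        1 for q in query
        if min(hamming(q, r) for r in reference) <= max_distance
    )


def is_match(query: list[int], reference: list[int],
             min_matches: int = 3) -> bool:
    """Declare a match when enough interest points agree (assumed threshold)."""
    return count_matches(query, reference) >= min_matches


# Made-up descriptors for a known image (one per interest point).
reference = [0xAAAA, 0xF0F0, 0x0F0F, 0xCCCC, 0x3333]
# Altered copy: three points survive with slight noise, two are lost,
# and two new points come from an added border or overlay.
altered = [0xAAAB, 0xF0F0, 0xCCCE, 0xFFFF, 0x0000]
# Unrelated image: none of its points resemble the reference.
unrelated = [0xFF00, 0x00FF, 0xFFFF]

print(count_matches(altered, reference), is_match(altered, reference))      # 3 True
print(count_matches(unrelated, reference), is_match(unrelated, reference))  # 0 False
```

Because each interest point is compared independently, the altered copy still matches even though part of the image was removed and new material was added. Real systems work with hundreds of descriptors per image and often add a geometric consistency check on top of this counting step.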
Want to learn more about hash technologies?
Check out this article to learn more about how these different hash types can be used to stop the spread of illegal content online.