A hash function is any algorithm that maps data of a variable length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
Hash functions are primarily used to generate fixed-length output data that acts as a shortened reference to the original data. This is useful when the original data is too cumbersome to use in its entirety.
One practical use is a data structure called a hash table where the data is stored associatively. Searching for a person's name in a list is slow, but the hashed value can be used to store a reference to the original data and retrieve constant time (barring collisions). Another use is in cryptography, the science of encoding and safeguarding data. It is easy to generate hash values from input data and easy to verify that the data matches the hash, but hard to 'fake' a hash value to hide malicious data. This is the principle behind the Pretty Good Privacy algorithm for data validation.
Hash functions are also used to accelerate table lookup or data comparison tasks such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.
A hash function should be deterministic: when it is invoked twice on pieces of data that should be considered equal (e.g., two strings containing exactly the same characters), the function should produce the same value. This is crucial to the correctness of virtually all algorithms based on hashing. In the case of a hash table, the lookup operation should look at the slot where the insertion algorithm actually stored the data that is being sought for, so it needs the same hash value.
Hash functions are typically not invertible, meaning that it is not possible to reconstruct the input datum x from its hash value h(x) alone. In many applications, it is common that several values hash to the same value, a condition called a hash collision. Since collisions cause "confusion" of objects, which can make exact hash-based algorithm slower approximate ones less precise, hash functions are designed to minimize the probability of collisions. For cryptographic uses, hash functions are engineered in such a way that is impossible to reconstruct any input from the hash alone without expending great amounts of computing time (see also One-way function).
Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes, and cryptographic. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The Hash Keeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.