A mystery using Python hashlib to calculate checksums

Recently I was hashing files to compute checksums, so that a web server endpoint could detect and reject corrupted uploads. It sounds easy: receive the file, calculate the hash, and emit the response. I only needed something like hash = calculate_hash(file).
Five minutes of research on DuckDuckGo led me to write the following code,

import hashlib

def calculate_hash(filepath, chunk_size=4096, hash_method=hashlib.md5()):
    hasher = hash_method

    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)

    return hasher.hexdigest()

Open a file in binary mode and read it chunk-wise, updating a hash with each chunk until EOF. Easy, right? WRONG!

The issue was that every time a new request arrived with the same file, the hasher calculated the wrong hash for it from the second request onwards. I could not find what was wrong with the code; it looked like the hasher state was somehow kept between function calls, which is why it produced a different digest even when each request carried the very same file. With little time at hand I decided to refactor the code, and even though the Python docs encourage the named constructors – like hashlib.md5() – over hashlib.new('md5'), I found that the following implementation worked really well and, most importantly, worked as expected,

import hashlib

def calculate_hash(filepath, chunk_size=4096, hash_method='md5'):
    hasher = hashlib.new(hash_method)

    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)

    return hasher.hexdigest()
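To see the difference between the two versions side by side, here is a small sketch (the temporary file and its contents are my own illustration, not from the original server code): calling each version twice on the same file, the hashlib.new variant returns a stable digest, while the original default-argument variant does not.

```python
import hashlib
import os
import tempfile

def hash_shared(filepath, chunk_size=4096, hash_method=hashlib.md5()):
    # Original version: the default hasher is reused across calls.
    hasher = hash_method
    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)
    return hasher.hexdigest()

def hash_fresh(filepath, chunk_size=4096, hash_method='md5'):
    # Refactored version: a new hasher is built on every call.
    hasher = hashlib.new(hash_method)
    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)
    return hasher.hexdigest()

# Write some bytes to a temporary file and hash it twice with each version.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'the very same file')
    path = tmp.name

print(hash_fresh(path) == hash_fresh(path))    # True: stable digest
print(hash_shared(path) == hash_shared(path))  # False: state leaked between calls
os.remove(path)
```

The second call to hash_shared feeds the file into a hasher that already contains the bytes from the first call, so it effectively digests the file twice.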

Whether hashlib or the hasher instance was preserved between function calls is something I still need to explore further; I just wanted to record this anecdote here for future reference.
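A likely place to start that exploration: Python evaluates a function's default parameter values once, when the def statement runs, not on each call. Under that behavior, a default of hashlib.md5() would build a single hasher object shared by every call, while a string default like 'md5' is harmless because hashlib.new() constructs a fresh hasher inside the function body. A quick sketch of the behavior:

```python
import hashlib

def get_hasher(hash_method=hashlib.md5()):
    # The default hashlib.md5() object is created once, when the
    # def statement executes, and handed to every call that omits
    # the argument.
    return hash_method

def get_fresh_hasher(hash_method='md5'):
    # The default here is just a string; a new hasher object is
    # constructed inside the body on every call.
    return hashlib.new(hash_method)

print(get_hasher() is get_hasher())              # True: the same object each time
print(get_fresh_hasher() is get_fresh_hasher())  # False: fresh objects
```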
