A mystery using Python's hashlib to calculate checksums
Recently I was hashing some files to compute checksums, so that data-corrupted files received at a web server endpoint could be detected and rejected.
It sounds easy: receive the file, calculate the hash, and emit the response.
I only needed something like hash = calculate_hash(file).
Five minutes of research on DuckDuckGo led me to write the following code,
import hashlib

def calculate_hash(filepath, chunk_size=4096, hash_method=hashlib.md5()):
    hasher = hash_method
    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)
    return hasher.hexdigest()
Open a file in binary mode and read it chunk-wise, updating a hash with each chunk until EOF. Easy, right? WRONG!
The issue was that every time a new request arrived with the same file, the hasher
calculated the wrong hash for it from the second request onwards.
I could not find what was wrong with the code; it looked as if the hasher
state was somehow kept between function calls, which is why it showed different
hash digests even when each request used the very same file.
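For illustration, this is roughly what the symptom looked like, using the buggy function above on a hypothetical sample file (the name upload.bin is made up here):

with open('upload.bin', 'wb') as fp:   # create a made-up sample file
    fp.write(b'some payload')

print(calculate_hash('upload.bin'))  # first call: the correct MD5
print(calculate_hash('upload.bin'))  # second call: a different digest!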
With little time
at hand I decided to refactor the code, and even though the Python docs encourage
using named constructors – like hashlib.md5() – over hashlib.new('md5'),
I found that the following implementation worked really well
and, most importantly, worked as expected,
import hashlib

def calculate_hash(filepath, chunk_size=4096, hash_method='md5'):
    hasher = hashlib.new(hash_method)
    with open(filepath, 'rb') as fp:
        buffer = fp.read(chunk_size)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = fp.read(chunk_size)
    return hasher.hexdigest()
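As a quick sanity check against the same hypothetical sample file, repeated calls of this version agree:

print(calculate_hash('upload.bin'))  # some MD5 digest
print(calculate_hash('upload.bin'))  # the same digest again, as expected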
Whether it was hashlib itself or the hasher instance that was preserved between function calls is something I still need to explore; for now, I just wanted to keep this anecdote here for future reference.
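One quick experiment for that future exploration – a minimal sketch assuming the default-argument hasher is the culprit, with no file I/O involved – would be:

import hashlib

def buggy(hasher=hashlib.md5()):
    # Python evaluates default argument values once, at function
    # definition time; if that single hasher object is reused, its
    # state carries over between calls and the digest changes on
    # every invocation.
    hasher.update(b'same input')
    return hasher.hexdigest()

print(buggy())  # MD5 of b'same input'
print(buggy())  # if this differs, the hasher was indeed shared between calls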