andykais•2y ago

how to hash very large files

We have deprecated the old std/hash code in favor of the crypto module, but I do not know what the standard approach to hashing something piecemeal looks like. The advice around the crypto digest generally looks like this:

async function crypto_hash(filepath: string) {
  const file_buffer = await Deno.readFile(filepath)
  const hash_buffer = await crypto.subtle.digest('SHA-256', file_buffer)
  const hash_array = Array.from(new Uint8Array(hash_buffer))
  const hash_hex = hash_array.map((b) => b.toString(16).padStart(2, '0')).join('')
  return hash_hex
}

This works fine on smaller files, but I occasionally need to hash a big file, around 2GB. That is a lot of data to hold in memory, so ideally I could leverage readable streams. If I need to do that, should I simply feed the hash_buffer back into itself, concatenating new data as I go?

8 Replies

ioB•2y ago

See the conversation here: https://discord.com/channels/684898665143206084/1064614652081885394/1064614652081885394 I had the same issue, and the conclusion was that I'm not quite sure how to do this with current web specs.

andykais•2y ago

Hmm, well, it's a question for everyone using the web, not just Deno, so there will definitely be an answer eventually. I'll keep researching. IMO a readable stream interface in std/crypto would add a lot of value, though.

ioB•2y ago

I believe webcrypto allows streaming of some sort but I forget

andykais•2y ago

I found this, which seems to say it can't: https://github.com/w3c/webcrypto/issues/73 I think here's the crux of it. I can make a very simple function that hashes my file chunkwise, appending data as I go. I will have consistency within my app, because I will always hash it the same way. However, the result will not be the same standard SHA-256 that I might get out of the Linux command line or any number of other tools. To match those, I would need to reimplement the standard SHA-256 algorithm to work on chunks, which is going to take a bit of doing.
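To make the mismatch concrete, here is a minimal sketch comparing a "feed the hash back into itself" scheme against a plain SHA-256 of the whole input. It assumes the runtime provides the `node:crypto` module (used here only to keep the example synchronous and short); `chainedDigest` and `wholeDigest` are hypothetical helper names, not part of any library.

```typescript
import { createHash } from "node:crypto";

// Hex-encode raw digest bytes.
const toHex = (bytes: Uint8Array): string =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

// Hypothetical chained scheme: state = SHA-256(previous state || chunk).
// Consistent within one app, but NOT the standard SHA-256 of the file.
export function chainedDigest(chunks: Uint8Array[]): string {
  let state: Uint8Array = new Uint8Array(0);
  for (const chunk of chunks) {
    state = createHash("sha256").update(state).update(chunk).digest();
  }
  return toHex(state);
}

// Standard SHA-256 over all chunks joined into one buffer.
export function wholeDigest(chunks: Uint8Array[]): string {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const all = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    all.set(c, offset);
    offset += c.length;
  }
  return createHash("sha256").update(all).digest("hex");
}
```

With a single chunk the two functions agree, but as soon as the input is split into two or more chunks the chained value depends on where the splits fall, so it cannot be compared against the output of tools like `sha256sum`.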

ioB•2y ago

That seems to be the case, yeah. Unfortunate when you just want to verify a checksum, though.

ioB•2y ago

looks like https://github.com/wintercg/proposal-webcrypto-streams could be the solution to this problem long-term

ioB•2y ago

Unfortunate that std/hash was removed before all use cases were fully considered.

andykais•2y ago

Well, I suppose there is nothing broken in the older hashing code. It's just wasm vs native, so a tad slower: https://deno.land/std@0.160.0/hash/mod.ts I'll probably just use this for now. FWIW, web devs outside of Deno have been feeling this pain as well: https://lists.w3.org/Archives/Public/public-webcrypto/2016Nov/0000.html
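For anyone landing here later, the pattern that makes large files workable is an incremental hasher: create it once, call update for each chunk, and read the digest at the end. Below is a minimal sketch of that pattern. It swaps in the `node:crypto` module rather than the std/hash code linked above, on the assumption that the runtime provides it, and `sha256Stream` is a hypothetical helper name.

```typescript
import { createHash } from "node:crypto";

// Hash an async stream of chunks incrementally, so only one chunk
// needs to be held in memory at a time.
export async function sha256Stream(
  chunks: AsyncIterable<Uint8Array>,
): Promise<string> {
  const hasher = createHash("sha256");
  for await (const chunk of chunks) {
    hasher.update(chunk); // fold this chunk into the running hash state
  }
  return hasher.digest("hex"); // finalize and hex-encode
}

// Possible usage in Deno (assumes file.readable is async-iterable there):
//   const file = await Deno.open(filepath);
//   const hex = await sha256Stream(file.readable);
```

Because the hasher carries the SHA-256 internal state across update calls, the result should be the standard digest of the full byte sequence regardless of chunk boundaries, which is what makes it comparable to command-line tools.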