ioB
ioB2y ago

Updating digest with web crypto

For some context, I'm updating some code that was importing from std/hash (specifically std/hash/sha256) to use std/crypto since the hash submodule is long gone. I've run up against a case where I'm not really sure what I should do to move forward. The code looks something like this:
const digest = new Sha256()

for await (const chunk of generate.chunks()) {
digest.update(chunk)
}

if(some_condition) {
digest.update(bonus_chunk)
}

return digest.hex()
const digest = new Sha256()

for await (const chunk of generate.chunks()) {
digest.update(chunk)
}

if(some_condition) {
digest.update(bonus_chunk)
}

return digest.hex()
Is this possible to handle in the web crypto standard? I'm pretty sure I should just be able to aggregate the chunks into one massive uint8array or something but I feel like there is no reason to buffer the whole thing into memory. Any ideas?
13 Replies
ioB
ioB2y ago
@iuioiua as the main contributor to the removal of std/hash, do you have any ideas? I tried buffering the input but this resulted in a roughly ~10x slowdown so not really ideal.
iuioiua
iuioiua2y ago
What code are you currently using that’s causing the slowdown?
ioB
ioB2y ago
let buffer = new Uint8Array(0)

for await (const chunk of generate.chunks()) {
const appendedBuffer = new Uint8Array(buffer.length + buf.length)
appendedBuffer.set(buffer, 0)
appendedBuffer.set(buf, buffer.length)
buffer = appendedBuffer
}

if(some_condition) {
// same as above
}

return toHashString(await crypto.subtle.digest("SHA-256", buffer))
let buffer = new Uint8Array(0)

for await (const chunk of generate.chunks()) {
const appendedBuffer = new Uint8Array(buffer.length + buf.length)
appendedBuffer.set(buffer, 0)
appendedBuffer.set(buf, buffer.length)
buffer = appendedBuffer
}

if(some_condition) {
// same as above
}

return toHashString(await crypto.subtle.digest("SHA-256", buffer))
iuioiua
iuioiua2y ago
It looks like the bulk of the work lies within that for await loop. If that second conditional wasn't there, you could just do the following because .digest accepts AsyncIterables:
return toHashString(await crypto.subtle.digest("SHA-256", generate.chunks()));
return toHashString(await crypto.subtle.digest("SHA-256", generate.chunks()));
If that conditional is required, I'd: 1. Create one Uint8Array from the AsyncIterable using BytesList from std/bytes (https://deno.land/std/bytes/mod.ts?s=BytesList), specifically the .add() and .concat(). 2. Pass the resulting data into crypt.subtle.digest().
ioB
ioB2y ago
I haven't looked into the impl of BytesList. Would this be more efficient than the code I presented? The conditional is neccessary, also this wouldn't work for other reasons because the example I gave is a drastic oversimplification of the current implementation I am replacing
iuioiua
iuioiua2y ago
AFAIK, using digest.update() in your initial snippet loads the data into memory behind the scenes anyway, so having a Uint8Array might be fine to do. Based on if having a single Uint8Array is fine, performance-wise, yeah, using either BytesList or concat() from std/bytes would be efficient.
ioB
ioB2y ago
afaik, it definitely did not. Here was the std implementation:
/** Update hash
*
* @param message The message you want to hash.
*/
update(message: Message): this {
if (this.#finalized) {
return this;
}

let msg: string | number[] | Uint8Array | undefined;
if (message instanceof ArrayBuffer) {
msg = new Uint8Array(message);
} else {
msg = message;
}

let index = 0;
const length = msg.length;
const blocks = this.#blocks;

while (index < length) {
let i: number;
if (this.#hashed) {
this.#hashed = false;
blocks[0] = this.#block;
blocks[16] =
blocks[1] =
blocks[2] =
blocks[3] =
blocks[4] =
blocks[5] =
blocks[6] =
blocks[7] =
blocks[8] =
blocks[9] =
blocks[10] =
blocks[11] =
blocks[12] =
blocks[13] =
blocks[14] =
blocks[15] =
0;
}

if (typeof msg !== "string") {
for (i = this.#start; index < length && i < 64; ++index) {
blocks[i >> 2] |= msg[index] << SHIFT[i++ & 3];
}
} else {
for (i = this.#start; index < length && i < 64; ++index) {
let code = msg.charCodeAt(index);
if (code < 0x80) {
blocks[i >> 2] |= code << SHIFT[i++ & 3];
} else if (code < 0x800) {
blocks[i >> 2] |= (0xc0 | (code >> 6)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
} else if (code < 0xd800 || code >= 0xe000) {
blocks[i >> 2] |= (0xe0 | (code >> 12)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 6) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
} else {
code = 0x10000 +
(((code & 0x3ff) << 10) | (msg.charCodeAt(++index) & 0x3ff));
blocks[i >> 2] |= (0xf0 | (code >> 18)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 12) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 6) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
}
}
}

this.#lastByteIndex = i;
this.#bytes += i - this.#start;
if (i >= 64) {
this.#block = blocks[16];
this.#start = i - 64;
this.hash();
this.#hashed = true;
} else {
this.#start = i;
}
}
if (this.#bytes > 4294967295) {
this.#hBytes += (this.#bytes / 4294967296) << 0;
this.#bytes = this.#bytes % 4294967296;
}
return this;
}
/** Update hash
*
* @param message The message you want to hash.
*/
update(message: Message): this {
if (this.#finalized) {
return this;
}

let msg: string | number[] | Uint8Array | undefined;
if (message instanceof ArrayBuffer) {
msg = new Uint8Array(message);
} else {
msg = message;
}

let index = 0;
const length = msg.length;
const blocks = this.#blocks;

while (index < length) {
let i: number;
if (this.#hashed) {
this.#hashed = false;
blocks[0] = this.#block;
blocks[16] =
blocks[1] =
blocks[2] =
blocks[3] =
blocks[4] =
blocks[5] =
blocks[6] =
blocks[7] =
blocks[8] =
blocks[9] =
blocks[10] =
blocks[11] =
blocks[12] =
blocks[13] =
blocks[14] =
blocks[15] =
0;
}

if (typeof msg !== "string") {
for (i = this.#start; index < length && i < 64; ++index) {
blocks[i >> 2] |= msg[index] << SHIFT[i++ & 3];
}
} else {
for (i = this.#start; index < length && i < 64; ++index) {
let code = msg.charCodeAt(index);
if (code < 0x80) {
blocks[i >> 2] |= code << SHIFT[i++ & 3];
} else if (code < 0x800) {
blocks[i >> 2] |= (0xc0 | (code >> 6)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
} else if (code < 0xd800 || code >= 0xe000) {
blocks[i >> 2] |= (0xe0 | (code >> 12)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 6) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
} else {
code = 0x10000 +
(((code & 0x3ff) << 10) | (msg.charCodeAt(++index) & 0x3ff));
blocks[i >> 2] |= (0xf0 | (code >> 18)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 12) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | ((code >> 6) & 0x3f)) << SHIFT[i++ & 3];
blocks[i >> 2] |= (0x80 | (code & 0x3f)) << SHIFT[i++ & 3];
}
}
}

this.#lastByteIndex = i;
this.#bytes += i - this.#start;
if (i >= 64) {
this.#block = blocks[16];
this.#start = i - 64;
this.hash();
this.#hashed = true;
} else {
this.#start = i;
}
}
if (this.#bytes > 4294967295) {
this.#hBytes += (this.#bytes / 4294967296) << 0;
this.#bytes = this.#bytes % 4294967296;
}
return this;
}
I'm really surprised no one as far as I can tell has run into the issue of hashing a large file. That seems like a pretty obvious use case? To clarify on my statement that it did not load more data into memory, sha256 works by having some blocks behind the scenes. These blocks hold a certain amount of bytes each and then some magic bit shifting allows you to continuously add on to the hash. It's a fixed sized hash function so even without looking at the implementation this is how I would imagine it working I will see if the perf of this is any good. I will return with my findings.
iuioiua
iuioiua2y ago
std/crypto wouldn't have any issues hashing a large file if simply passed as an AsyncIterator, which would be the usual case. This case becomes more interesting because of the conditional. Ok, I see what you might mean. If you're code is publicly available, I'd love to take a look.
ioB
ioB2y ago
It's an OSS project. Here's the PR I'm working on https://github.com/teaxyz/cli/pull/312 (probably should have linked this earlier 😛). https://github.com/teaxyz/cli/blob/c908653b66df4a8f89b654fd357b5fb2bbfe977b/src/hooks/useDownload.ts#L112 specifically it's the download_with_sha function in useDownload.ts
iuioiua
iuioiua2y ago
Ok, what's the simplest version of that function that demonstrates the performance regression? I'd like to replicate it better on my end. Can you provide a pure function?
ioB
ioB2y ago
Let me try to get something hold on Hm, so from my messing around with it, it looks like web crypto is obviously much faster than the std implementation. That being said, if the file is bigger than a certain size (in the gigabytes), the std implementation continues working while the web crypto implementation crashes because v8 restricts the memory usage. I think in this case we'll just have to pin the hash to an older std version and maybe come back to fixing this more elegantly another day
iuioiua
iuioiua2y ago
If std/crypto currently doesn’t handle large streams well, we should open an issue WDYT, @iobdas?
ioB
ioB2y ago
Will attempt a proper implementation @tea soon and report back for posterity’s sake please check out https://discord.com/channels/684898665143206084/684898665151594506/1087454360243540062 also I did implement it in tea using this method