LFCavalcanti
Deno
Created by LFCavalcanti on 3/19/2024 in #help
Is there a way to read big files using something like .seek but with ending position as well?
The performance is not good. There are improvements I need to make in the "processChunk" function, but watching the resource usage, it seems the workers can't read in parallel. I know that at the OS level each thread locks the file while reading, but it seems the lock remains held the whole time the stream hasn't reached its final position.
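For reference, the kind of bounded read I'm after, as a minimal sketch (the readRange name and 64 KiB chunk size are mine; each worker opens its own file handle, so no cursor is shared):
// Sketch only: read exactly the bytes [start, end] of a file, capping the
// last read so we stop at `end`. Every caller gets its own FsFile, so two
// workers reading different ranges never contend for a cursor.
async function readRange(path, start, end, onChunk) {
  const file = await Deno.open(path, { read: true });
  try {
    await file.seek(start, Deno.SeekMode.Start);
    const buf = new Uint8Array(64 * 1024);
    let remaining = end - start + 1;
    while (remaining > 0) {
      const view = buf.subarray(0, Math.min(buf.length, remaining));
      const n = await file.read(view);
      if (n === null) break; // hit EOF early
      await onChunk(view.subarray(0, n));
      remaining -= n;
    }
  } finally {
    file.close();
  }
}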
LFCavalcanti:
Inside the worker, I create a read stream like this:
import { ByteSliceStream } from "https://deno.land/std/streams/byte_slice_stream.ts";
import { readerFromStreamReader } from "https://deno.land/std/io/mod.ts";

// Reusable read buffer; processChunk must be done with it before the next read.
let readBuffer = new Uint8Array(26500);

self.onmessage = async (messageData) => {
  const file = await Deno.open(messageData.data.filePath);
  // Seek to this worker's segment, then slice relative to the seeked position.
  await file.seek(messageData.data.start, Deno.SeekMode.Start);
  const sliceStart = 0;
  // ByteSliceStream's end index is inclusive, so the last byte of a segment
  // of length (end - start + 1) sits at index (end - start).
  const sliceEnd = messageData.data.end - messageData.data.start;
  const slice = file.readable.pipeThrough(
    new ByteSliceStream(sliceStart, sliceEnd),
  );
  const fileReader = readerFromStreamReader(slice.getReader());

  // read() resolves to the number of bytes read, or null at end of stream.
  let numberRead;
  while ((numberRead = await fileReader.read(readBuffer)) !== null) {
    await processChunk(readBuffer, numberRead);
  }
  // processedLines is accumulated by processChunk (both defined elsewhere).
  self.postMessage(processedLines);
  self.close();
};
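The same loop can also be written against the stream reader directly, without std/io (my variant, not what I measured above):
// Variant sketch: consume the slice's own chunks instead of copying into a
// reusable buffer through readerFromStreamReader.
const reader = slice.getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  await processChunk(value, value.byteLength); // same processChunk as above
}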
LFCavalcanti:
Then, for each worker I call:
const lineWorker = new Worker(import.meta.resolve("./workerLines.js"), {
  type: "module",
});
lineWorker.postMessage({
  filePath,
  start: workerNum === 0 ? 0 : offsets[workerNum - 1],
  end: offsets[workerNum] - 1,
});
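To gather the results back on the main thread, something like this should work (my sketch; a `workers` array holding every spawned worker is my assumption):
// Sketch: resolve each worker's postMessage(processedLines), then merge.
const results = await Promise.all(
  workers.map((w) =>
    new Promise((resolve, reject) => {
      w.onmessage = (event) => resolve(event.data);
      w.onerror = reject;
    })
  ),
);
// results[i] is the processedLines array posted by worker i.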
LFCavalcanti:
What I did was find the offsets in the file, like so:
// `mod` is the OS module, presumably imported as: import * as mod from "node:os";
const filePath = Deno.args[0];
const MAX_LINE_LENGTH = 106;
const file = await Deno.open(filePath);
const fileStats = await Deno.stat(filePath);
const FILE_SIZE = fileStats.size;
const MAX_WORKERS = mod.cpus().length;
const SEGMENT_SIZE = Math.floor(FILE_SIZE / MAX_WORKERS);
const offsets = [];
const bufferToFindOffsets = new Uint8Array(MAX_LINE_LENGTH);
let offset = 0;
while (true) {
  offset += SEGMENT_SIZE;

  if (offset >= FILE_SIZE) {
    offsets.push(FILE_SIZE);
    break;
  }
  // Read up to one max-length line at the tentative boundary, then push the
  // boundary forward to the next newline so no segment splits a line.
  await file.seek(offset, Deno.SeekMode.Start);
  const bytesRead = (await file.read(bufferToFindOffsets)) ?? 0;

  // Only search the bytes just read; a short read could otherwise leave
  // stale data from the previous iteration in the buffer. 10 is "\n".
  const lineEndPos = bufferToFindOffsets.subarray(0, bytesRead).indexOf(10);
  if (lineEndPos === -1) {
    offsets.push(FILE_SIZE); // was `chunkOffsets`, which is undefined here
    break;
  } else {
    offset += lineEndPos + 1;
    offsets.push(offset);
  }
}
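A quick sanity check on those boundaries (my addition, not part of the solution): consecutive offsets should tile the file exactly.
// Sketch: print each segment and assert the last one ends at EOF.
let previousEnd = 0;
for (const end of offsets) {
  console.log(`segment [${previousEnd}, ${end})`);
  previousEnd = end;
}
console.assert(previousEnd === FILE_SIZE, "segments should cover the whole file");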
LFCavalcanti:
@raunioroo, the challenge calls for using only standard APIs; that's why I'm trying my best not to use any modules.
LFCavalcanti:
So, I'm reviving this now... I had such a crazy streak of work that this sat on the back burner for more than a month.
LFCavalcanti:
The concepts of streaming content from files or HTTP connections I grasped well enough... I think... but Deno has a way of doing things that is different from Node. I'm not versed enough in Deno to give opinions yet; this challenge seemed like a good opportunity to test and learn.
LFCavalcanti:
Thanks for now!
LFCavalcanti:
I'll try this out tomorrow
LFCavalcanti:
oh okay... I think it's time to start coding and breaking things to understand better
LFCavalcanti:
ohhh so the slice object in your example has a .read method that I can pass a buffer to be read into?
LFCavalcanti:
So...
1 - Find the offsets I want to process
2 - Using those offsets, calculate the segment of the file each worker thread will process
3 - Open the file and use the ByteSliceStream, updating the slice and parsing it in a loop until sliceEnd >= workerData.end (rough skeleton below)
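Putting the three steps together, roughly (skeleton only; findOffsets is a hypothetical wrapper around the offset-finding code above):
// Skeleton of the plan: offsets -> per-worker segments -> worker threads.
const offsets = await findOffsets(filePath); // step 1, hypothetical wrapper
const workers = offsets.map((end, workerNum) => { // step 2: one segment each
  const w = new Worker(import.meta.resolve("./workerLines.js"), {
    type: "module",
  });
  w.postMessage({
    filePath,
    start: workerNum === 0 ? 0 : offsets[workerNum - 1],
    end: end - 1,
  });
  return w; // step 3 (slice + parse loop) runs inside the worker
});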
LFCavalcanti:
It seems I need to put the call to ByteSliceStream inside a loop, so if there is a range of bytes I want to read, I calculate start and end for each iteration, moving start and end along with the buffer size.
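Roughly what I mean (untested sketch; WINDOW is an arbitrary size I picked, and workerData holding the segment's start/end is assumed as in the messages above):
import { ByteSliceStream } from "https://deno.land/std/streams/byte_slice_stream.ts";

// Untested sketch of the windowed-slice idea. ByteSliceStream positions are
// absolute here (no seek), so each window re-reads and discards the prefix;
// fine to illustrate the loop, costly on a 13GB file.
const WINDOW = 64 * 1024;
for (let start = workerData.start; start <= workerData.end; start += WINDOW) {
  const end = Math.min(start + WINDOW - 1, workerData.end);
  const file = await Deno.open(workerData.filePath); // re-opened per window
  const window = file.readable.pipeThrough(new ByteSliceStream(start, end));
  for await (const chunk of window) {
    await processChunk(chunk, chunk.byteLength);
  }
}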
LFCavalcanti:
I do have 32GB of RAM, but the goal is to use around 4GB as Node does
LFCavalcanti:
Even if I slice that into 16 slices, it's too much to hold in memory
LFCavalcanti:
I'm asking because the challenge in question has a file with 1 billion lines, each line can have max 105 bytes... we are talking something around 13GB of data
LFCavalcanti:
Hi @raunioroo, thanks for the tip about ByteSliceStream. Is the "slice" a stream that can be read into a buffer? In the sense that, as parts of the file are read up to the buffer size, I can parse that buffer, empty it, then the next part of the stream is read into the buffer, and so on...
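For reference, the pattern I'm describing, as I understand it (sketch; the path and parseBuffer are placeholders):
import { ByteSliceStream } from "https://deno.land/std/streams/byte_slice_stream.ts";

// Sketch: the slice is a ReadableStream<Uint8Array>, so chunks arrive one
// buffer at a time; parse each one, then let it be garbage-collected.
const file = await Deno.open("data.txt"); // placeholder path
const slice = file.readable.pipeThrough(new ByteSliceStream(0, 1023)); // first 1 KiB
for await (const chunk of slice) {
  parseBuffer(chunk); // placeholder parser
}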