Is there a way to read big files using something like .seek but with ending position as well?

So, I'm doing a version of the billion line challenge in JS. I've already implemented a version in Node, basically calculating offsets to "slice" the file into a number of segments matching the available parallelism on the system. To achieve that I'm calling Node's "fs" library like so:
const fileReadStream = fs.createReadStream(workerData.filePath, {
  start: workerData.start,
  end: workerData.end,
});
I'm now trying to do the same in Deno, both for the challenge and to learn Deno. In the docs there's a file.seek(offset, whence) method that accepts an offset measured from the start, the current position, or the end of the file. I'm not sure how to use the same logic with different threads/workers reading different byte ranges of the same file.
Deno AI Helper • 34d ago
In Deno, you can achieve functionality similar to Node's fs.createReadStream with specific start and end positions by using a combination of Deno.open, file.seek, and file.read. The documentation provides examples of how to read files in chunks and seek to specific positions within a file. Here's a step-by-step approach to read a specific range of a file in Deno:
1. Open the file using Deno.open to get a Deno.FsFile object.
2. Seek to the starting position using file.seek(startPosition, Deno.SeekMode.Start).
3. Read the desired number of bytes from the file.
However, the documentation does not provide a direct example of reading up to an end position. To read a specific range, you need to calculate the number of bytes you want to read from the start and end positions and then read that many bytes from the file. Here's an example based on the documentation:
const startPosition = workerData.start; // Your calculated start position
const endPosition = workerData.end; // Your calculated end position
const readLength = endPosition - startPosition + 1; // Calculate the number of bytes to read

const file = await Deno.open(workerData.filePath);
await file.seek(startPosition, Deno.SeekMode.Start); // Seek to the start position

const buffer = new Uint8Array(readLength);
const bytesRead = await file.read(buffer); // Read the calculated length
console.log(`Read ${bytesRead} bytes`);

file.close(); // Make sure to close the file when you are done
Remember that you need to handle the case where the actual number of bytes read (bytesRead) might be less than the requested readLength, especially if the end of the file is reached before reading the full range. For parallel processing, you can use Deno's worker threads to read different parts of the file simultaneously. Each worker would perform the steps above for its respective range. Please note that you need to handle permissions with --allow-read, and also consider the file system's limitations and the potential for race conditions when multiple workers are reading from the same file concurrently. 🦕 If you need more granular control over file operations, or if you're dealing with very large files, you might want to look into streaming file operations as described in the documentation, which allow for more efficient memory usage when processing large files.
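For instance, here is a minimal sketch of that partial-read handling, reusing the workerData fields assumed in the snippet above:
// Sketch: read exactly the requested range, looping on partial reads.
// Assumes workerData.filePath/start/end as in the snippet above.
const wanted = workerData.end - workerData.start + 1;
const file = await Deno.open(workerData.filePath, { read: true });
await file.seek(workerData.start, Deno.SeekMode.Start);

const buffer = new Uint8Array(wanted);
let filled = 0;
while (filled < wanted) {
  // file.read resolves to the number of bytes read, or null at EOF.
  const n = await file.read(buffer.subarray(filled));
  if (n === null) break; // hit end of file before the full range
  filled += n;
}
console.log(`Read ${filled} of ${wanted} requested bytes`);
file.close();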
raunioroo • 34d ago
This is one approach which has worked for me to reply to HTTP range requests. I struggled a bit to arrive at this; it was a lot of trial and error. That is to say, I don't have the most intimate knowledge of these stream stuffs or whether this is the neatest or most performant approach, but it does appear to work reliably in a true streaming manner. Here start and end is the byte range. Can't remember whether that's an inclusive range, as in: is it everything between start and end, or does it include the bytes at the start and end positions too? Probably however the range is defined in the HTTP spec/browsers, as that was what I was testing this against.
const file = await Deno.open(localpath, { read: true });
await file.seek(start, Deno.SeekMode.Start);
const slicestart = 0;
const sliceend = end - start + 1;
const slice = file.readable.pipeThrough(
  new ByteSliceStream(slicestart, sliceend),
);
return slice;

What the AI response suggests looks like it holds the whole returned range in memory, which can be RAM-heavy depending on your range sizes. My suggestion above should be properly streaming, I think, so it should be able to handle very big files and ranges while using very little RAM. Oh, and the ByteSliceStream comes from the Deno std lib: import { ByteSliceStream } from "https://deno.land/std@0.219.0/streams/byte_slice_stream.ts";
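For example, a minimal sketch of consuming that slice chunk by chunk (same assumed localpath/start/end as above, with the same inclusive/exclusive off-by-one caveat):
import { ByteSliceStream } from "https://deno.land/std@0.219.0/streams/byte_slice_stream.ts";

const file = await Deno.open(localpath, { read: true });
await file.seek(start, Deno.SeekMode.Start);
// Adjust the +1 to match however your range is defined.
const slice = file.readable.pipeThrough(
  new ByteSliceStream(0, end - start + 1),
);

// ReadableStreams are async-iterable in Deno; each iteration yields one
// small Uint8Array chunk, so only that chunk is held in memory. Consuming
// file.readable to completion should also close the file handle.
for await (const chunk of slice) {
  console.log(`got ${chunk.length} bytes`);
}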
Esente • 34d ago
Just curious, why can't you use ByteSliceStream(start, end)?
raunioroo • 34d ago
Can't remember testing it, but this is based on the assumption that ByteSliceStream alone would "fast forward" the source stream by actually reading bytes and discarding the results up to the starting position. I can't imagine it knows to call file.seek first. Calling seek on the file first makes sure it can jump right away to the correct start position with a simple, fast filesystem call.
Esente • 34d ago
That's my belief too. But I didn't think that .seek would do better by skipping. TIL 🙂
LFCavalcanti • 33d ago
Hi @raunioroo, thanks for the tip with the ByteSliceStream. Is the "slice" a stream that can be read into a buffer? In the sense that, as parts of the file are read up to a buffer size, I can parse that buffer, empty it, then a new chunk is read into the buffer, and so on... I'm asking because the challenge in question has a file with 1 billion lines, and each line can have max 105 bytes, so we are talking something around 13GB of data. Even if I slice that into 16 slices, it's too much to hold in memory. I do have 32GB of RAM, but the goal is to use around 4GB, as Node does.
LFCavalcanti • 33d ago
It seems I need to put the call for the ByteSliceStream inside a loop, so if there is a range of bytes I want to read, I calculate start and end for each iteration, moving them along with the buffer size. So...
1 - Find the offsets I want to process
2 - Using those offsets, calculate the segments of the file for each worker thread to process
3 - Open the file and use the ByteSliceStream, updating the slice and parsing it in a loop until sliceEnd >= workerData.end
raunioroo • 33d ago
So, a stream is like a pipeline that you can consume (=read) at your own pace, and yes, the stream holds some kind of small buffer to get more data ready as you consume it. The buffer is probably a couple of kilobytes or so, so very small; it won't need to keep much in memory at all, as it tries to fetch the file at about the same pace as you are consuming the stream.
Then you can have transform streams that take a raw byte stream and, on the fly, transform it into a more useful form. Like, you could have a transform stream that allows you to consume a stream line by line instead of in byte chunks. ByteSliceStream is one transform stream like that. It basically just lets you specify a range you want to read from another stream. Like an image crop function, but it crops a stream. ...but we don't use the start parameter of ByteSliceStream and instead use zero for start, since calling file.seek does the same thing but more efficiently.
If I get this right, choosing some easy numbers for demo purposes: let's say you have a 1GB file, and you want to split the work of processing it between 4 workers.
1) Find the offsets for each worker: 0-250MB for worker one, 250MB-500MB for worker two, etc.
2) Create only ONE ByteSliceStream for that worker's offset. For worker two, it would be 250-500MB.
3) In your actual processing loop, repeatedly call the ByteSliceStream's read method to get a new small chunk, process the chunk, ask for more by calling read again on the same stream object, etc., until the stream is exhausted.
If the data is in a newline-separated format, you can wrap the stream in yet another transform stream, like TextLineStream available in std. That makes it so that every read() call automatically returns a whole line instead of some arbitrary number of bytes. That's much easier to process.
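Putting those steps together, a rough sketch (assuming the same workerData fields as in the earlier snippets; the +1 depends on whether your ranges are inclusive):
import { ByteSliceStream } from "https://deno.land/std@0.219.0/streams/byte_slice_stream.ts";
import { TextLineStream } from "https://deno.land/std@0.219.0/streams/text_line_stream.ts";

const file = await Deno.open(workerData.filePath, { read: true });
await file.seek(workerData.start, Deno.SeekMode.Start);

// One ByteSliceStream per worker; start is 0 because seek already jumped
// to the worker's range. Then decode bytes to text and split into lines.
const lines = file.readable
  .pipeThrough(new ByteSliceStream(0, workerData.end - workerData.start + 1))
  .pipeThrough(new TextDecoderStream())
  .pipeThrough(new TextLineStream());

let count = 0;
for await (const line of lines) {
  // parse one line of the challenge file here
  count++;
}
console.log(`worker processed ${count} lines`);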
LFCavalcanti • 33d ago
ohhh so the slice object in your example has a .read method that I can pass a buffer to be read into?
raunioroo • 33d ago
Yeah, something like that! So the file is the cake. ByteSliceStream is a cake slice; you give one slice to each worker. The stream.read() method is the spoon. Because you only consume and hold a spoonful at a time, the cake can be infinitely big and you'll never run out of memory.
LFCavalcanti • 33d ago
oh okay... I think it's time to start coding and breaking things to understand better
raunioroo • 33d ago
I think the read() method returns a buffer, though, that holds some smallish number of bytes. You can use some other stream helpers that do the conversion from a byte buffer to something that is even easier to process. Like TextLineStream, which makes the read() method return strings, one line at a time.
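In explicit reader form it would look something like this (a sketch; lines is assumed to be a stream piped through TextLineStream as in the earlier sketch):
const reader = lines.getReader();
while (true) {
  // Each read() resolves to { value, done }; with TextLineStream in the
  // pipeline, value is one whole line as a string.
  const { value, done } = await reader.read();
  if (done) break;
  console.log(value);
}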
LFCavalcanti • 33d ago
I'll try this out tomorrow. Thanks for now!
raunioroo • 33d ago
No worries. The streaming stuff can be hard to wrap your head around; the related APIs can be a bit confusing, with many similarly named but different things, and since there are so many different ways to accomplish the same tasks, it can easily be overwhelming to google and understand it all. I myself still struggle with that stuff. But it's worth learning, super useful stuff! Deno uses/offers streaming APIs for so many things that it's good to take advantage of that.
LFCavalcanti • 33d ago
The concepts of streaming content from files or HTTP connections I grasped well enough... I think... but Deno has a way of doing things that is different from Node. I'm not versed enough in Deno to give opinions yet; this challenge seemed like a good opportunity to test and learn.
raunioroo • 33d ago
Slightly annoyingly, in the past Deno used to have its own non-standard streams API, just like Node has its own. Both are kinda similar, but still different. But the old Deno API has been deprecated, or iirc mostly removed. Deno has now moved to the standards-based Web Streams API (https://developer.mozilla.org/en-US/docs/Web/API/Streams_API). A standard API is nice, but it's a bit more cumbersome in some ways than the simpler old Node or Deno APIs. The standard Web Streams API is kinda new and not so widely used yet on the backend, so there's not that much information and there aren't many guides out there on it. That'll improve for sure in the future. But also, when you try to google stuff, you now get a mix of results: some use the Node API, some the old Deno API, some the new Web API. All have similar-sounding methods etc., which makes it a bit tedious to research.