read big file line by line #4007

Closed
saostad opened this issue Feb 15, 2020 · 5 comments

@saostad

saostad commented Feb 15, 2020

I am trying to read a big file (13,147,026 lines of text) line by line with Deno, but it fails with this error:

BufferFullError: Buffer full
    at BufReader.readSlice (bufio.ts:339:15)
    at async BufReader.readString (bufio.ts:217:20)
    at async stream_file (buffer.ts:9:18)

Here is my code:

import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readString("\n")) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}

versions:

deno 0.33.0
v8 8.1.108
typescript 3.7.2

I am trying to find an equivalent of Node streams.
Please advise!

@artisonian

artisonian commented Feb 15, 2020

In Deno, the I/O model is different. The Reader interface is what you want. The idea of a reader is to give the caller a way to request data while getting back pressure (which you can detect by checking whether the Reader gave you the number of bytes you expected). If you aren't familiar with Go, this might seem a bit odd.

But to give an example of your code above:

import { BufReader } from "https://deno.land/std@v0.33.0/io/mod.ts";
import { TextProtoReader } from "https://deno.land/std@v0.33.0/textproto/mod.ts";
import { parse } from "https://deno.land/std@v0.33.0/flags/mod.ts";
import { basename } from "https://deno.land/std@v0.33.0/path/mod.ts";

export async function read(r: Deno.Reader) {
  const reader = new TextProtoReader(BufReader.create(r));
  console.log("Reading data...");

  let lineCount = 0;
  while (true) {
    let line = await reader.readLine();
    if (line === Deno.EOF) break;
    // do something with `line`
    lineCount += 1;
  }

  console.log(`${lineCount} lines read.`);
}

if (import.meta.main) {
  const args = parse(Deno.args, {
    boolean: ["h"],
    alias: {
      h: ["help"]
    }
  });

  if (args.h) {
    printUsage();
    Deno.exit(0);
  }

  const [filename] = args._;
  if (!filename) {
    printUsage();
    Deno.exit(1);
  }

  const file = filename === "-" ? Deno.stdin : await Deno.open(filename);
  await read(file);
  file.close();

  function printUsage() {
    console.error(
      `Usage: deno --allow-read ${basename(import.meta.url)} <filename>`
    );
  }
}

The thing to note is that I'm using TextProtoReader as a convenience because it has logic for reading lines when the underlying BufReader's buffer is full (you can check out the source…it's pretty straightforward).
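To make the "buffer is full" case concrete, here is a minimal sketch of the general idea: a buffered reader hands back a line in fragments when the line is longer than its buffer, and a wrapper keeps reading until the fragment is marked complete. This is not the TextProtoReader source. `FragmentReader`, `LineFragment`, and `readFullLine` are hypothetical names invented for this illustration, and the input is an in-memory byte array rather than a file.

```typescript
// Hypothetical, simplified stand-in for a buffered reader (not the std API).
interface LineFragment {
  bytes: Uint8Array; // part of a line, without the trailing "\n"
  more: boolean; // true when the line continues in the next fragment
}

class FragmentReader {
  private pos = 0;

  constructor(private data: Uint8Array, private bufSize: number) {}

  // Returns at most `bufSize` bytes of the current line, or null at end of input.
  readFragment(): LineFragment | null {
    if (this.pos >= this.data.length) return null;
    const limit = Math.min(this.pos + this.bufSize, this.data.length);
    const nl = this.data.subarray(this.pos, limit).indexOf(10); // 10 is "\n"
    if (nl >= 0) {
      const bytes = this.data.subarray(this.pos, this.pos + nl);
      this.pos += nl + 1; // skip the newline itself
      return { bytes, more: false };
    }
    // No newline within the buffer: the line continues unless the input ended.
    const bytes = this.data.subarray(this.pos, limit);
    this.pos = limit;
    return { bytes, more: this.pos < this.data.length };
  }
}

// Collects fragments until one is marked complete, then decodes the joined bytes.
// Decoding after joining avoids splitting a multi-byte UTF-8 character.
function readFullLine(r: FragmentReader): string | null {
  const first = r.readFragment();
  if (first === null) return null;
  const parts: Uint8Array[] = [first.bytes];
  let more = first.more;
  while (more) {
    const next = r.readFragment();
    if (next === null) break;
    parts.push(next.bytes);
    more = next.more;
  }
  const total = parts.reduce((n, p) => n + p.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    joined.set(p, offset);
    offset += p.length;
  }
  return new TextDecoder().decode(joined);
}
```

One consequence of this design: if a reader returns fragments and the caller counts every returned value, the count reflects fragments rather than lines whenever a line exceeds the buffer size.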

@saostad
Author

saostad commented Feb 17, 2020

I tried three approaches to read a 307 MB test file:

  • Approach 1 (failed on the big file; only worked on a medium-size file with 128,457 lines):
import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readString("\n")) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
  • Approach 2:
    • CPU utilization: 18%
    • time to read data: 1 min 24 s
import { BufReader } from "https://deno.land/std@v0.33.0/io/mod.ts";
import { TextProtoReader } from "https://deno.land/std@v0.33.0/textproto/mod.ts";

export async function textProtoReader(filename:string) {
  const r: Deno.Reader = await Deno.open(filename)
  const reader = new TextProtoReader(BufReader.create(r));
  console.log("Reading data...");

  let lineCount = 0;
  while (true) {
    let line = await reader.readLine();
    if (line === Deno.EOF) break;
    // do something with `line`
    lineCount += 1;
  }

  console.log(`${lineCount} lines read.`);
}
  • Approach 3:
    • CPU: 23%
    • time: 27 s
import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function readLine(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readLine()) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}

Rust itself does it in:

  • CPU: 19%
  • time: 13 s
use std::fs::File;
use std::io::{self, stdin, stdout, BufRead, Read, Write};
use std::path::Path;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut counter = 0;
    // The file ./enwik9 must exist in the current directory before this produces output
    if let Ok(lines) = read_lines("./enwik9") {
        // Consumes the iterator; each item is an io::Result<String>
        for _line in lines {
            counter += 1;
        }
        println!("{}", counter);
    }
    let duration = start.elapsed();
    println!("Time elapsed reading lines: {:?}", duration);
    pause()
}

// Returns an iterator over the lines of the file, backed by a buffered reader.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

// Keeps the console window open until a key is pressed.
fn pause() {
    let mut stdout = stdout();
    stdout.write_all(b"Press Enter to continue...").unwrap();
    stdout.flush().unwrap();
    stdin().read(&mut [0]).unwrap();
}

Node does it in:

  • CPU: 23%
  • time: 10 s
const fs = require("fs");
const readline = require("readline");

async function processLineByLine() {
  console.log(Date());
  const fileStream = fs.createReadStream("./enwik9");

  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  let counter = 0;
  for await (const line of rl) {
    counter++;
  }
  console.log(Date());
  return counter;
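}

// Run the benchmark and print the line count.
processLineByLine().then((count) => console.log(count));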

Can anybody explain why the timings differ?
I guess the best way to read a big file in Deno is approach 3.

@kitsonk
Contributor

kitsonk commented Feb 17, 2020

Side note, feels like something we need to add to the benchmarks.

@stale

stale bot commented Jan 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 6, 2021
@saostad saostad closed this as completed Jan 7, 2021
@josephrocca
Contributor

josephrocca commented Oct 17, 2021

saostad's approach 3 doesn't seem to be working for me (possibly due to the unversioned import). For others who arrive here via Google looking for a quick copy/paste line-by-line file reader, deepakshrma has a neat example that seems to work well:

import { readLine } from "https://raw.githubusercontent.com/deepakshrma/deno-by-example/feacd84bd5cd5b1a630dfcc72afb3cd64de21b91/examples/file_reader.ts";
let lineIterator = await readLine("yourFile.txt");
for await (let line of lineIterator) {
  console.log(line);
}

(I haven't benchmarked it.)
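For readers who would rather not depend on a remote import, the `for await` pattern above can be sketched as an async generator that turns byte chunks into lines. This is only an illustration and not the implementation behind the linked `readLine`. `linesFromChunks` is a hypothetical name, and the sketch assumes UTF-8 input with "\n" line endings (a trailing "\r" is not stripped).

```typescript
// Yields complete lines from a stream of byte chunks.
// Chunk boundaries may fall anywhere, including in the middle of a line.
async function* linesFromChunks(
  chunks: AsyncIterable<Uint8Array>,
): AsyncGenerator<string> {
  const decoder = new TextDecoder();
  let pending = ""; // text after the last newline seen so far
  for await (const chunk of chunks) {
    // `stream: true` keeps an incomplete multi-byte character for the next call.
    pending += decoder.decode(chunk, { stream: true });
    const parts = pending.split("\n");
    // The last element is an unfinished line (or "" if the chunk ended on "\n").
    pending = parts.pop() ?? "";
    for (const line of parts) yield line;
  }
  pending += decoder.decode(); // flush any bytes still held by the decoder
  if (pending.length > 0) yield pending; // final line without a trailing newline
}
```

Any `AsyncIterable<Uint8Array>` can serve as the input. In Node, for example, the stream returned by `fs.createReadStream` is async iterable and yields `Buffer` chunks, which are `Uint8Array` instances. Only one chunk plus the current partial line is held in memory at a time, so file size alone should not be a limit.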
