DFP MultiFileSource optionally poll for file updates #978

dagardner-nv
Contributor

@dagardner-nv dagardner-nv commented Jun 9, 2023

Description

  • Adds two new constructor args to MultiFileSource: watch and watch_interval. When watch is True, the source polls the input file globs every watch_interval seconds for new files.
  • These are exposed as --watch_inputs and --watch_interval on the command line.
  • Misc updates to fix linting warnings
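
For readers unfamiliar with this style of polling, here is a minimal sketch of the general idea. It is not the MultiFileSource implementation from this PR: the function names `poll_new_files` and `watch_files` and the `max_polls` parameter are hypothetical, and the sketch uses only the standard library `glob` module rather than whatever file-system layer the source actually uses.

```python
import glob
import time


def poll_new_files(patterns, seen):
    """Return files matching any glob in `patterns` that are not yet in `seen`.

    `seen` is updated in place so that each file is reported only once.
    """
    matched = set()
    for pattern in patterns:
        matched.update(glob.glob(pattern))
    new_files = sorted(matched - seen)
    seen.update(new_files)
    return new_files


def watch_files(patterns, watch_interval=1.0, max_polls=None):
    """Yield batches of newly seen files, sleeping `watch_interval` seconds between polls.

    `max_polls` (hypothetical, for illustration) bounds the loop; None polls forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        new_files = poll_new_files(patterns, seen)
        if new_files:
            yield new_files
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(watch_interval)
```

Tracking previously seen paths in a set keeps each poll cheap and ensures a file is emitted only once, but it also means that a file rewritten in place under the same name would not be picked up again in this sketch.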

I spent some time looking into what the impacts of this change would be on the rest of the pipeline:

  • Files shouldn't appear in the source directory until they're fully populated, otherwise the pipeline will ingest a partially populated file.
  • DFPFileBatcherStage: Assuming that watch_interval is smaller than the period argument to DFPFileBatcherStage, and that new files are actually new (not historical files that were only recently fetched), most new files will likely be batched together unless they straddle a period boundary. This should be OK and is likely the desired outcome.
  • DFPRollingWindowStage: This should be OK, since the stage appends incoming data to the existing history for the user. However, there is a potential issue if new files are older than files already ingested. This could happen if an outside source populates the directory with files out of creation order.
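
One common way for a producer to satisfy the first point (files only appear once fully populated) is to write to a temporary name and then rename into place. The sketch below is an illustration of that pattern, not something this PR adds: `publish_atomically` is a hypothetical helper, and it assumes the temporary file lives on the same file system as the destination and that the watched glob does not match the `.tmp` suffix.

```python
import os
import tempfile


def publish_atomically(data: bytes, dest_path: str) -> None:
    """Write `data` to a temp file beside `dest_path`, then rename it into place.

    A poller matching on the final name never observes a half-written file,
    because the rename happens only after the contents are flushed to disk.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file in the destination directory so the rename
    # stays within a single file system.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".incoming-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # os.replace overwrites any existing destination in one step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For example, `publish_atomically(b'{"user": "a"}\n', "/data/input/log_0001.json")` (a made-up path) would make the file visible to a `*.json` glob only once it is complete.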

fixes #975

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@dagardner-nv dagardner-nv added non-breaking Non-breaking change feature request New feature or request DO NOT MERGE PR should not be merged; see PR for details 2 - In Progress labels Jun 9, 2023
@dagardner-nv dagardner-nv requested a review from a team as a code owner June 9, 2023 00:07
@dagardner-nv dagardner-nv requested a review from a team as a code owner June 9, 2023 15:46
@dagardner-nv dagardner-nv added 3 - Ready for Review and removed DO NOT MERGE PR should not be merged; see PR for details 2 - In Progress labels Jun 9, 2023
Contributor

@mdemoret-nv mdemoret-nv left a comment

Just the one fix to counting incremental time and we should be good.

@dagardner-nv dagardner-nv requested a review from mdemoret-nv June 13, 2023 23:00
Contributor

@mdemoret-nv mdemoret-nv left a comment

Just one question.

@mdemoret-nv
Contributor

/merge

@rapids-bot rapids-bot bot merged commit c1cc78d into nv-morpheus:branch-23.07 Jun 21, 2023
Labels
feature request New feature or request non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA]: Include the Directory Watcher functionality in the DFP Production example
2 participants