[FEA]: Include the Directory Watcher functionality in the DFP Production example #975

Closed
2 tasks done
nvawood opened this issue Jun 7, 2023 · 1 comment · Fixed by #978
Assignees
Labels
feature request New feature or request Needs Triage Need team to review and classify

Comments

@nvawood

nvawood commented Jun 7, 2023

Is this a new feature, an improvement, or a change to existing functionality?

Change

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

I want Morpheus to run continually and monitor a directory for new files, perform inference on those files, save the output, and then repeat the process when new files are detected. The functionality already exists in morpheus/stages/input/autoencoder_source_stage.py but not in examples/digital_fingerprinting/production/morpheus/dfp/stages/multi_file_source.py.

Describe your ideal solution

A new command line argument in examples/digital_fingerprinting/production/morpheus/dfp_*_pipeline.py to indicate the input_file or input_glob should be continually monitored, e.g.:

@click.option('--watch_directory',
              type=bool,
              default=False,
              help=("The watch directory option instructs this stage to not close down once all files have been read. "
                    "Instead it will read all files that match the 'input_glob' pattern, and then continue to watch "
                    "the directory for additional files. Any new files that are added that match the glob will then "
                    "be processed."))

Describe any alternatives you have considered

It's possible to write wrapper shell scripts that launch new instances of the pipeline when new files are detected, but this is not efficient.

Additional context

The code exists both in Morpheus input stages (appshield_source_stage.py, azure_source_stage.py, cloud_trail_source_stage.py, autoencoder_source_stage.py, and duo_source_stage.py) as well as the ransomware_detection example, but not in the DFP Production example.

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@nvawood nvawood added the feature request New feature or request label Jun 7, 2023
@jarmak-nv jarmak-nv added the Needs Triage Need team to review and classify label Jun 7, 2023
@jarmak-nv
Contributor

Hi @nvawood!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
In the meantime, feel free to add any relevant information to this issue.

@dagardner-nv dagardner-nv self-assigned this Jun 8, 2023
@dagardner-nv dagardner-nv moved this from Todo to In Progress in Morpheus Boards Jun 8, 2023
@rapids-bot rapids-bot bot closed this as completed in #978 Jun 21, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 21, 2023
* Adds two new constructor args to `MultiFileSource`: `watch` and `watch_interval`. When `watch` is True, the source polls the input file globs every `watch_interval` seconds for new files.
* These are exposed as `--watch_inputs` and `--watch_interval` on the command line
* Misc updates to fix linting warnings
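
The polling behavior described above can be illustrated with a minimal, standalone sketch. This is not the `MultiFileSource` implementation: the function name `poll_new_files` and the `max_polls` and `sleep` parameters are hypothetical and exist only so the loop can be stopped and tested. It assumes that "new" means a path that has not matched any glob on an earlier poll.

```python
import glob
import time
from typing import Callable, Iterator, List, Optional, Sequence, Set


def poll_new_files(patterns: Sequence[str],
                   watch_interval: float = 1.0,
                   max_polls: Optional[int] = None,
                   sleep: Callable[[float], None] = time.sleep) -> Iterator[List[str]]:
    """Yield sorted batches of paths matching `patterns` that were not seen on an earlier poll.

    `max_polls` and `sleep` are illustrative hooks (assumptions), not Morpheus arguments.
    """
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        # Expand every glob and collect the current set of matching paths.
        matched: Set[str] = set()
        for pattern in patterns:
            matched.update(glob.glob(pattern))

        # Only paths that were absent on every previous poll are emitted.
        new_files = sorted(matched - seen)
        if new_files:
            seen.update(new_files)
            yield new_files

        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(watch_interval)
```

With `max_polls=None` the generator never finishes, which mirrors a source that keeps watching instead of shutting down once the initial files are read.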


I spent some time looking into what the impacts of this change would be on the rest of the pipeline:
* Files shouldn't appear in the source directory until they're fully populated, otherwise the pipeline will ingest a partially populated file.
* `DFPFileBatcherStage`: Assuming that `watch_interval` is smaller than the `period` argument to `DFPFileBatcherStage`, and assuming that new files are actually new (not historical files recently fetched), most new files will likely be batched together unless they straddle a period boundary. This should be OK and is likely the desired outcome.
* `DFPRollingWindowStage`: This should be OK; the stage appends incoming data to the existing history for the user. However, there is a potential issue if new files are older than files already ingested. This could happen if the directory is populated by an outside source that does not deliver files in creation order.
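
For the first point above, one common way for a producer to avoid exposing partially written files is to write to a temporary name and then rename into place. The sketch below is an assumption about how an upstream producer might behave, not part of Morpheus; `publish_atomically` is a hypothetical helper. It assumes the temporary file lives on the same filesystem as the destination (so the rename is atomic) and that the watched glob does not match the hidden `.tmp-` prefix.

```python
import os
import tempfile


def publish_atomically(data: bytes, dest_path: str) -> None:
    """Write `data` to a hidden temp file next to `dest_path`, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file in the destination directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # os.replace swaps the file in as a single step, so a poller sees either
        # no file or the complete file under the final name.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A producer would call, for example, `publish_atomically(payload, "/data/input/events_0001.json")`, where the path is purely illustrative.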

fixes #975

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #978
@github-project-automation github-project-automation bot moved this from In Progress to Done in Morpheus Boards Jun 21, 2023
Projects
Status: Done
3 participants