
Ingest files from a directory #2622

Closed
cgardens opened this issue Mar 26, 2021 · 10 comments
Assignees
Labels
area/connectors Connector related issues type/enhancement New feature or request

Comments

@cgardens
Contributor

cgardens commented Mar 26, 2021

Tell us about the problem you're trying to solve

  • I have a directory on S3. New files are added to that directory on some cadence. Whenever a new file is added to the directory I want to sync that data to a destination. The names of the files are monotonically increasing and could be used as cursor fields.
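A minimal sketch of the cursor idea above, assuming lexicographically increasing file names (the `new_files` helper is hypothetical, not connector code):

```python
def new_files(file_names, cursor=None):
    """Return files whose names sort after the cursor, oldest first,
    plus the updated cursor value to persist in state."""
    fresh = sorted(n for n in file_names if cursor is None or n > cursor)
    new_cursor = fresh[-1] if fresh else cursor
    return fresh, new_cursor
```

On each sync, the connector would list the S3 prefix, pass the keys and the saved cursor to a helper like this, emit records only for the returned files, and checkpoint the new cursor.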

Describe the solution you’d like

  • This is either a new feature in the file source or a new connector altogether.

Heads up @sherifnada.

Requested by SG. Tagging @roshan.

Issue is synchronized with this Asana task by Unito

@cgardens cgardens added type/enhancement New feature or request area/connectors Connector related issues labels Mar 26, 2021
@sherifnada
Contributor

@roshan would all the files have the same schema/contribute to the same stream?

@roshan
Contributor

roshan commented Mar 26, 2021 via email

@roshan
Contributor

roshan commented Mar 26, 2021

If you would like, it is possible to arrange it so that each prefix represents a stream. I am happy to work with whatever method you think is most general.

@sherifnada
Contributor

@jrhizor
Contributor

jrhizor commented Mar 31, 2021

  • file-name history in state, or a timestamp-based "cursor" (a timestamp would work for cloud storage, but won't work for local disk if you're doing incremental updates)
  • user-defined mapping of regex -> JSON Schema for each stream? Default to object (no normalization)
  • open questions around how this fits with the existing file connector (should it be in the same connector or a separate one?)
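The second bullet could be sketched as a user-configured regex-to-stream mapping; the pattern names and `stream_for_key` helper here are hypothetical illustrations, not a proposed API:

```python
import re

# Hypothetical user-supplied config: each stream is defined by a key pattern.
STREAM_PATTERNS = {
    "orders": re.compile(r"^orders/.*\.csv$"),
    "events": re.compile(r"^events/.*\.jsonl$"),
}

def stream_for_key(key, patterns=STREAM_PATTERNS):
    """Return the first stream whose regex matches the object key, else None."""
    for stream, pattern in patterns.items():
        if pattern.match(key):
            return stream
    return None  # unmatched keys could be skipped or routed to a default stream
```

Whether unmatched keys error out, get skipped, or fall into a catch-all "object" stream (no normalization) is exactly the open question raised above.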

@harshithmullapudi
Contributor

Wouldn't it be good to solve this independently for sources like S3 and Google Cloud Storage, as the s3-csv tap does?

@harshithmullapudi
Contributor

We have a use case that requires solving this for S3. We're thinking of bringing tap-s3-csv to Airbyte; any thoughts here?

@sherifnada sherifnada added this to the Core - 2021-07-07 milestone Jun 30, 2021
@Phlair
Contributor

Phlair commented Jul 5, 2021

@cgardens
Contributor Author

@Phlair can we close this?

@Phlair
Contributor

Phlair commented Aug 23, 2021

@Phlair Phlair closed this as completed Aug 23, 2021

6 participants