Source File: ability to get HTTPS attachments #5537
Comments
Thanks @muutech! Would you be open to creating a PR to submit this feature?
We have opened an issue on smart_open to see whether they can give us some guidance, or whether we would need to use another library (it works with "requests"): piskvorky/smart_open#644
OK, with great help from the smart_open people, the problem seems to be that the library is strict about the optional HTTP header "Accept-Ranges". Services such as "owncloud", "gdrive" and "transfer.sh" do not respond with that header, so when Airbyte calls "seek" it fails. They tell us they will release a new version with this check relaxed. Once it is out, the problem should apparently be solved just by updating the library. Do you see any problem, or anything to keep in mind, when updating it? See the detailed explanation in piskvorky/smart_open#644 (comment)
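To make the header check described above concrete, here is a minimal sketch of a strict "Accept-Ranges" test. It is not smart_open's actual code; the function name `advertises_byte_ranges` is hypothetical, and the input is assumed to be a plain mapping of response header names to values.

```python
from typing import Mapping


def advertises_byte_ranges(headers: Mapping[str, str]) -> bool:
    """Return True only if the response explicitly advertises byte-range support.

    Hypothetical helper for illustration; not part of smart_open or Airbyte.
    """
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalized = {name.lower(): value.strip().lower() for name, value in headers.items()}
    # A strict check treats a missing header the same as "Accept-Ranges: none".
    return normalized.get("accept-ranges") == "bytes"
```

Under a strict check like this, a response that only carries "Content-Disposition: attachment" is treated as non-seekable even if the server would in fact honour range requests, which matches the behaviour reported in this thread.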
smart_open 5.2.1 is already out, with this change. I've tested it, and now the error is different... any idea?
Could it be related to #5110? Thanks!
This seems like a different issue, @muutech: the issue you linked is from the S3 source, which is different from the file source. Can you share the inputs you provided to the file source that resulted in this error?
Sure. format: "excel"
@grubberr thanks. It could be any test.xlsx, but find attached the one I used. I think compression is applied during the HTTP transfer. I tested by uploading it to https://transfer.sh/ Thanks!
@muutech
We need to think about how to handle it better.
I have published |
@muutech any updates on this?
When setting up the connection: Failed to fetch schema. Please try again. You can try the URL if you want. Airbyte version: 0.39.41-alpha
Yes, the problem still exists for transfer.sh in the discover stage.
@muutech thanks for bug, I have updated |
Great, it works! But local CSVs have started to fail: Airbyte tells us the sync succeeded, but the destination is empty (header OK, but no rows). Find the log attached; I do not know whether it is related.
@muutech is it possible to get |
@muutech
Sorry, GitHub mangled my CSV: COLUMNA1;COLUMNA2; In the logs the record object seems to be read OK, but when writing to the CSV or PostgreSQL destination the sync ends "successful" with 0 rows. Thanks!
Tested! Now everything works. Thanks!
Tell us about the problem you're trying to solve
We are trying to download an Excel file shared through "ownCloud" or via "transfer.sh". I think it does not work when the response header is "Content-Disposition: attachment". This header is very common, and supporting it would give access to public shared links from different platforms such as Google Drive.
Describe the solution you’d like
When we use the File connector with HTTPS, it works well if the file is uploaded to a web server that serves it DIRECTLY. However, when the same file is uploaded to an ownCloud environment or to "transfer.sh" and we try to download it, Airbyte says it is not a valid Excel file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/source_file/source.py", line 122, in discover
    streams = list(client.streams)
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 384, in streams
    "properties": self._stream_properties(),
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 372, in _stream_properties
    for df in df_list:
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 327, in load_dataframes
    yield reader(fp, **reader_options)
  File "/usr/local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1056, in __init__
    content=path_or_buffer, storage_options=storage_options
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 942, in inspect_excel_format
    stream.seek(0)
  File "/usr/local/lib/python3.7/site-packages/smart_open/http.py", line 263, in seek
    raise OSError
OSError
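The traceback ends in `stream.seek(0)`, which pandas calls while inspecting the Excel format. One possible workaround, sketched here under the assumption that the file is small enough to hold in memory, is to fall back to buffering the whole body in an `io.BytesIO` when the stream refuses to seek. The helper name `as_seekable` is hypothetical, and this is not what the File connector or smart_open actually does.

```python
import io
from typing import BinaryIO


def as_seekable(stream: BinaryIO) -> BinaryIO:
    """Return a seekable stream, buffering the remaining bytes in memory if needed.

    Hypothetical helper for illustration; assumes the payload fits in memory.
    """
    try:
        # Some streams report seekable() == True but still raise on seek(),
        # so actually attempt the rewind instead of trusting the flag.
        if stream.seekable():
            stream.seek(0)
            return stream
    except (OSError, AttributeError):
        pass
    # Fallback: read whatever is left and wrap it in an in-memory buffer.
    return io.BytesIO(stream.read())
```

The returned object could then be passed to a reader such as `pandas.read_excel` in place of the original HTTP stream. The trade-off is memory use, since the whole attachment is held in RAM.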
The URLs look like this (neither of them works, but it is easy to upload something to transfer.sh):
https://transfer.sh/get/1HpZ3cN/test.xlsx (/get/ is the key to direct download)
https://www.xxxxxxcloud.com/drive/index.php/s/xxxxxxx/download?path=%2F&files=test.xlsx
The only difference we notice between a direct HTTP server and these other options is that their response headers include "Content-Disposition: attachment ..."
Describe the alternative you’ve considered or used
I have looked in smart_open for a related issue but could not find one.