Add file storage option #12590
base: develop
Conversation
I think we probably want to keep the options.py interface restricted; others can override settings if they see fit, but we should focus on enabling the specific options we need.
Hint: if you run `kolibri configure list-env` you will see a complete list of available env vars for configuration (which will also show how your new option can be set as an env var).
kolibri/utils/options.py (Outdated), in the hunk `@@ -359,6 +377,16 @@ def lazy_import_callback_list(value):`

```python
base_option_spec = {
    "FileStorage": {
        "DEFAULT_FILE_STORAGE": {
```
I think I'd want to hew closer to the pattern we have for the Cache and Database options here, and offer simple, pre-specified string options that refer to specific backends. If someone really wants to run a custom backend, they can override the settings file and do what they like.
That way, with specific backends in mind, we can then explicitly enumerate the additional things that need to be specified in each case. For example, the default `file_system` backend value will need a path; if it's a GCS backend then other things might be required (or may be automagically configured in some cases).
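A minimal sketch of the enumerated-backend shape being suggested, mirroring the Cache/Database pattern (all key names, types, and defaults below are hypothetical illustrations, not the actual Kolibri spec):

```python
# Hypothetical option spec: a closed set of backend names, plus the extra
# values each backend needs, instead of a free-form import path.
file_storage_option_spec = {
    "FileStorage": {
        "STORAGE_BACKEND": {
            "type": "option",
            "options": ("file_system", "gcs"),
            "default": "file_system",
            "description": "Which file storage backend Kolibri should use.",
        },
        "STORAGE_PATH": {
            "type": "path",
            "default": "",
            "description": "Directory used by the file_system backend.",
        },
    },
}
```

With this shape, validation can reject anything outside the enumerated options, and each backend's extra requirements (path, bucket, credentials) can be documented next to it.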
kolibri/utils/options.py (Outdated)

```python
except ImportError:
    logger.error("Default file storage is not available.")
    raise VdtValueError(value)
except Exception:
```
Shouldn't ever be catching a bare Exception, unless for very good reason - it can hide a multitude of sins.
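A sketch of the narrower handling being suggested. This is a hypothetical validator, not the PR's code: the real code raises configobj's `VdtValueError`, for which a plain `ValueError` stands in here so the example runs standalone.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def validate_storage_class(value):
    """Load a storage class from a dotted path, catching only the errors a
    bad path can actually raise, rather than a bare Exception."""
    module_path, _, class_name = value.rpartition(".")
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except (ImportError, AttributeError, ValueError) as e:
        # ImportError: module missing; AttributeError: class missing;
        # ValueError: empty module path (no dot in the value).
        logger.error("Storage backend %r could not be loaded: %s", value, e)
        raise ValueError(value)
```

Anything else (say, a `TypeError` from a buggy module import) now propagates instead of being silently swallowed.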
kolibri/utils/options.py (Outdated)

```python
modules = value.split(".")
klass = modules.pop()
module_path = ".".join(modules)
module = importlib.import_module(module_path)
```
Note that Django exposes a utility called `import_string`, which we already use in this module for loading classes by a string dot path, so that seems preferable to use here.
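For reference, `django.utils.module_loading.import_string` does essentially the following — a simplified re-implementation so the example runs without Django installed, exercised with a stand-in dotted path:

```python
import importlib


def import_string(dotted_path):
    """Simplified version of Django's import_string utility."""
    try:
        module_path, class_name = dotted_path.rsplit(".", 1)
    except ValueError:
        raise ImportError("%s doesn't look like a module path" % dotted_path)
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            "Module %r does not define a %r attribute" % (module_path, class_name)
        )


# Stand-in dotted path; in the PR this would be the configured storage class.
storage_cls = import_string("collections.OrderedDict")
```

Using the Django utility collapses the four hand-rolled lines into one call and gives consistent `ImportError` messages.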
kolibri/utils/options.py (Outdated), `@@ -15,6 +15,7 @@`

```python
from configobj import ConfigObj
from configobj import flatten_errors
from configobj import get_extra_values
from django.core.files.storage import Storage
```
This is probably why flake8 wouldn't let pre-commit pass, but it didn't give me useful output.

`@@ -737,6 +766,7 @@ def _get_validator():`

```python
"url_prefix": url_prefix,
"bytes": validate_bytes,
"multiprocess_bool": multiprocess_bool,
"storage_option": storage_option,
```
How will this work with a database-based cache?
Hm - I'm not sure. I assumed this was only related to validation on initialization
I don't see that this would affect anything to do with the cache.
```python
logger.info("File saved - Path: {}".format(file_storage.url(file)))
logger.info("File saved - Size: {}".format(file_storage.size(file)))
```
@jredrejo I've tried several things around here, but no matter what I do the file always shows size 0... I've confirmed that there are users (usernames is full of data, for example) -- so I'm not sure why the writer isn't updating the file object here...
The problem was not in reading the file info, but in writing it: the file had 0 bytes. I have made a commit in this PR to ensure data is flushed into the BytesIO, and now it's working:

```
INFO 2024-12-26 20:45:45,313 Invoking command bulkexportusers
INFO 2024-12-26 20:45:45,373 Creating users csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv
[#######################################################################################################################################################################################-------------------------------------------------------------] 75%
INFO 2024-12-26 20:45:46,767 File saved - Path: https://storage.googleapis.com/kdp-csv-reporting-develop/Kolibri_en_casa_de_admin_3da4_users.csv
INFO 2024-12-26 20:45:47,073 File saved - Size: 482
INFO 2024-12-26 20:45:47,073 Created csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv with 4 lines
```

(I've also rebased develop into the branch because there was already a conflict and more might appear soon.)
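The 0-byte symptom can be reproduced and fixed in isolation; this is a general sketch (not the exact commit). Text written through a `TextIOWrapper` stays in its internal buffer until flushed, which is why the stored file could end up empty:

```python
import io

buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8", newline="")

writer.write("username,full_name\n")
writer.write("admin,Admin User\n")

# Without this flush, the text may still be sitting in the TextIOWrapper's
# internal buffer rather than in the underlying BytesIO.
writer.flush()
buffer.seek(0)  # rewind so a storage backend's save() reads from the start

data = buffer.getvalue()
```

After the flush and rewind, whatever consumes the buffer (e.g. a Django storage `save()`) sees the full CSV payload instead of 0 bytes.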
@jredrejo thank you for retargeting - I've rebased and restructured the PR a bit so the commit history is cleaner and hopefully easier to follow. Ready for code review when you can. Also, could you try it locally with your gcloud credentials? I think I lost something that I had gotten working previously but cannot figure it out right now. EDIT: I figured it out - I had to set up my local ADC creds per https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment
@rtibbles this failing test seems to be related to the original issue itself, possibly, because it is ultimately trying to get a hold of the DB file. That repair_db stuff calls to get the default backup folder, and the local storage uses the MEDIA_ROOT by default. Not sure of the best path forward, so I would appreciate your thoughts.
```python
if not os.environ.get("DEFAULT_FILE_STORAGE"):
    if conf.OPTIONS["FileStorage"]["STORAGE_BACKEND"] == "gcs":
        DEFAULT_FILE_STORAGE = "kolibri.utils.file_storage.KolibriFileStorage"
        BUCKET_NAME = os.getenv("GCS_BUCKET_NAME") or "kdp-csv-reporting-develop"
```
We should have a BUCKET_NAME option in our options.py as well - we definitely should not be hard coding anything specific about any platform in here.
If `GCS_BUCKET_NAME` is an environment variable that is set by default on GCP (but I don't think it is), we can specify specific env vars in the options configuration to draw values from.
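Something like the following could sit alongside the other FileStorage options. All of it is a hypothetical sketch of a Kolibri option entry (including the `envvars` key), not the final spec:

```python
# Hypothetical option entry: the value comes from the options file or from
# the listed environment variables, with no platform-specific default.
gcs_bucket_name_option = {
    "type": "string",
    "default": "",
    "envvars": ("GCS_BUCKET_NAME",),
    "description": "Name of the Google Cloud Storage bucket to store files in.",
}
```

The point is that the bucket name becomes deployment configuration rather than a hard-coded string in settings.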
Rather than subclassing GoogleCloudStorage, we should just set the additional settings here for it.
kolibri/utils/file_storage.py (Outdated)

```python
# https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#google-cloud-storage


class KolibriFileStorage(GoogleCloudStorage):
```
I think here we can override the default values for the location if needed
Not clear to me why this file is necessary - we can just directly use the GoogleCloudStorage class and use settings (with their own options in options.py to allow them to be defined).
https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#settings
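With django-storages, the subclass could indeed be replaced by settings alone. A settings.py sketch assuming django-storages is installed; the values shown are hypothetical placeholders that would come from options.py:

```python
# Point Django's default storage at the stock django-storages GCS backend
# and configure it entirely through its documented GS_* settings.
DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"

GS_BUCKET_NAME = "my-bucket-name"  # hypothetical; should be drawn from options.py
GS_LOCATION = "csv_exports"        # optional prefix inside the bucket
GS_DEFAULT_ACL = None              # rely on bucket-level IAM, not per-object ACLs
```

This keeps all GCS-specific knowledge in configuration, with no Kolibri-side subclass to maintain.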
```python
file_storage = DefaultStorage()


def random_filename():
```
In hindsight, I think that maybe just cleaning up the files after each test could have avoided needing this.
```diff
@@ -150,18 +153,18 @@ def import_exported_csv(self):
         assert len(classroom.get_coaches()) == 1

     def test_dryrun_from_export_csv(self):
-        with open_csv_for_reading(self.filepath) as source:
-            header = next(csv.reader(source, strict=True))
+        source = file_storage.open(self.filepath, "r")
```
Looking back at this, the reassignment of `source` makes this a bit odd to read, in a way that it wasn't when they were created inside of `with` statements.
I'll come back through and do some cleaning.
There seems to be a problem with how files are being handled that is leaving file handles open - this is why things are erroring on Windows, because it is much more sensitive to files not being closed.
I haven't looked through the code in detail to see where this might be happening, but it suggests that something isn't being handled entirely properly.
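One way to make the handle lifetime explicit regardless of backend is to wrap whatever the storage's `open` returns in `contextlib.closing`. This sketch uses a hypothetical `storage_open` callable standing in for a Django `Storage.open` method:

```python
import csv
import io
from contextlib import closing


def read_csv_header(storage_open, name):
    # closing() guarantees the handle is released even if the backend's
    # file object isn't usable as a context manager itself.
    with closing(storage_open(name, "r")) as source:
        return next(csv.reader(source, strict=True))


# Usage with a fake in-memory "storage" for illustration:
header = read_csv_header(
    lambda name, mode: io.StringIO("username,id\nadmin,1\n"), "users.csv"
)
```

On Windows, leaked handles block subsequent opens and deletes, so forcing closure at the call site like this tends to surface the leak immediately.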
The changes here seem to have completely ignored the purpose of the two helper functions `open_csv_for_writing` and `open_csv_for_reading` - I think the diff could be significantly reduced by leaning into the existence of these two functions and consolidating the logic in there, with the small tweak (as you have done) that we change `filepath` to `filename`, so that it is clear it is not a literal disk file path.
We can also probably conditionalize whether we write directly to the file or to memory. Writing large CSV files to memory in a non-cloud environment is probably not going to be a great idea.
The other alternative is we could just write directly to the storage object in either case, unless we know absolutely that it is too slow when writing to GCS? Given it's happening in an asynchronous task, and we have to write to it eventually... I am not sure how writing to memory first necessarily helps.
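A sketch of the consolidation being suggested. The function name follows the existing helper; the `remote` flag and its wiring are assumptions for illustration:

```python
import io


def open_csv_for_writing(filename, remote=False):
    """Single place to decide where CSV text goes: an in-memory buffer when
    a remote storage backend will receive it, the local disk otherwise."""
    if remote:
        return io.StringIO()
    return open(filename, "w", newline="", encoding="utf-8")
```

Callers then stay identical in both environments, and the remote-vs-local decision lives in exactly one function.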
kolibri/core/auth/csv_utils.py (Outdated)

```diff
@@ -174,7 +176,7 @@ def csv_file_generator(facility, filepath, overwrite=True, demographic=False):
         column for column in db_columns if demographic or column not in DEMO_FIELDS
     )

-    csv_file = open_csv_for_writing(filepath)
+    csv_file = io.StringIO()
```
Why are we using `io.StringIO()` here but `io.BytesIO` below?
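The distinction matters because `csv.writer` requires a text stream while storage backends generally save bytes; a minimal illustration:

```python
import csv
import io

# csv.writer only accepts text, so an in-memory CSV starts life in StringIO.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerow(["username", "id"])

# A storage backend's save() generally wants bytes, hence a later BytesIO:
byte_buffer = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
```

So if both appear in one code path, the StringIO is the CSV-writing surface and the BytesIO is the upload payload; mixing them up raises `TypeError`s about str vs bytes.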
```python
"description": """
    The storage backend class that Django will use when managing files. The class given here must implement
    the Django files.storage.Storage class.
""",
```
Let's add additional options here for the ACL, bucket name, etc. - we don't need to expose all the settings that are available, but for anything we need set, we can include it here.
@nucleogenesis and I had a quick sync up to see how to consolidate the file handling here - seems like there are some deficiencies in the Django Storage conformance to the Python file object spec (https://forum.djangoproject.com/t/file-open-to-support-different-encodings/21491) - but an idea there is to wrap the file object in a …
Hi @nucleogenesis - on my end, while regression testing the .csv generation I noticed that it's no longer possible to generate the same .csv file twice. For example, if I go to … Here are the logs from one of my devices though: log.with.the.same.date.range.mp4
Summary
In order to allow us to use a cloud backend for the file storage in Django, this adds a value to options.py which defaults to the Django FileSystemStorage (which it did anyway, but now we make sure of it).
This should make it configurable on BCK such that if there is a module containing a Google Cloud backend class that implements the Django Storage class, then it can be added as the value for the settings.
For example, if we have a new class `GCloudStorage` in `kolibri.core.storage`, then we would use that class if we set the option added here to `kolibri.core.storage.GCloudStorage`.
This is very much a first whack -- one thing I'm not clear on: since the option is named after the name that Django would look to in the env vars (`DEFAULT_FILE_STORAGE`), does the Kolibri options.py machinery automatically apply that setting because of the matching name?
References
Fixes #9441 (or at least begins to address it)
Reviewer guidance
@pcenov - could you please test all workflows that involve:
The changes I've made here should basically have no effect on the user for any of those workflows.
Once this passes regression testing, we can deploy it to a BCK instance and do final testing there.
Testing checklist
PR process
Reviewer checklist
(`yarn` and `pip`)