Add file storage option #12590
base: develop
Conversation
I think we probably want to keep the options.py interface restricted; others can override settings if they see fit, but we should focus on enabling the specific options we need.
Hint: if you run `kolibri configure list-env` you will see a complete list of available env vars for configuration (which will also show how your new option can be set as an env var).
kolibri/utils/options.py (Outdated), in the hunk `@@ -359,6 +377,16 @@ def lazy_import_callback_list(value):`

```python
base_option_spec = {
    "FileStorage": {
        "DEFAULT_FILE_STORAGE": {
```
I think I'd want to hew closer to the pattern we have for the Cache and Database options here, and offer simple, pre-specified string options that refer to specific backends. If someone really wants to run a custom backend, they can override the settings file and do what they like.
That way, with specific backends in mind, we can then explicitly enumerate the additional things that need to be specified in each case. For example, the default `file_system` backend value will need a path; if it's a GCS backend then other things might be required (or may be automagically configured in some cases).
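A minimal sketch of the enumerated-backend shape being suggested, mirroring the Cache/Database pattern (all key names, types, and defaults below are hypothetical illustrations, not the actual Kolibri spec):

```python
# Hypothetical option spec: a closed set of backend names, plus the extra
# values each backend needs, instead of a free-form import path.
file_storage_option_spec = {
    "FileStorage": {
        "STORAGE_BACKEND": {
            "type": "option",
            "options": ("file_system", "gcs"),
            "default": "file_system",
            "description": "Which file storage backend Kolibri should use.",
        },
        "STORAGE_PATH": {
            "type": "path",
            "default": "",
            "description": "Directory used by the file_system backend.",
        },
    },
}
```

With this shape, validation can reject anything outside the enumerated options, and each backend's extra requirements (path, bucket, credentials) can be documented next to it.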
kolibri/utils/options.py (Outdated)

```python
except ImportError:
    logger.error("Default file storage is not available.")
    raise VdtValueError(value)
except Exception:
```
Shouldn't ever be catching a bare Exception, unless for very good reason - it can hide a multitude of sins.
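A sketch of the narrower handling being suggested. This is a hypothetical validator, not the PR's code: the real code raises configobj's `VdtValueError`, for which a plain `ValueError` stands in here so the example runs standalone.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def validate_storage_class(value):
    """Load a storage class from a dotted path, catching only the errors a
    bad path can actually raise, rather than a bare Exception."""
    module_path, _, class_name = value.rpartition(".")
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except (ImportError, AttributeError, ValueError) as e:
        # ImportError: module missing; AttributeError: class missing;
        # ValueError: empty module path (no dot in the value).
        logger.error("Storage backend %r could not be loaded: %s", value, e)
        raise ValueError(value)
```

Anything else (say, a `TypeError` from a buggy module import) now propagates instead of being silently swallowed.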
kolibri/utils/options.py (Outdated)

```python
modules = value.split(".")
klass = modules.pop()
module_path = ".".join(modules)
module = importlib.import_module(module_path)
```
Note that Django exposes a utility called `import_string`, which we already use in this module for loading classes by a string dot path, so that seems preferable to use here.
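For reference, `django.utils.module_loading.import_string` does essentially the following — a simplified re-implementation so the example runs without Django installed, exercised with a stand-in dotted path:

```python
import importlib


def import_string(dotted_path):
    """Simplified version of Django's import_string utility."""
    try:
        module_path, class_name = dotted_path.rsplit(".", 1)
    except ValueError:
        raise ImportError("%s doesn't look like a module path" % dotted_path)
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            "Module %r does not define a %r attribute" % (module_path, class_name)
        )


# Stand-in dotted path; in the PR this would be the configured storage class.
storage_cls = import_string("collections.OrderedDict")
```

Using the Django utility collapses the four hand-rolled lines into one call and gives consistent `ImportError` messages.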
kolibri/utils/options.py (Outdated), `@@ -15,6 +15,7 @@`

```python
from configobj import ConfigObj
from configobj import flatten_errors
from configobj import get_extra_values
from django.core.files.storage import Storage
```
This is probably why flake8 wouldn't let pre-commit pass, but it didn't give me useful output.

`@@ -737,6 +766,7 @@ def _get_validator():`

```python
"url_prefix": url_prefix,
"bytes": validate_bytes,
"multiprocess_bool": multiprocess_bool,
"storage_option": storage_option,
```
How will this work with a database-based cache?
Hm - I'm not sure. I assumed this was only related to validation on initialization
I don't see that this would affect anything to do with the cache.
```python
logger.info("File saved - Path: {}".format(file_storage.url(file)))
logger.info("File saved - Size: {}".format(file_storage.size(file)))
```
@jredrejo I've tried several things around here, but no matter what I do the file always shows size 0... I've confirmed that there are users (usernames is full of data, for example) -- so I'm not sure why the writer isn't updating the file object here...
The problem was not in reading the file info, but in writing it: the file had 0 bytes. I have made a commit in this PR to ensure data is flushed into the BytesIO, and now it's working:

```
INFO 2024-12-26 20:45:45,313 Invoking command bulkexportusers
INFO 2024-12-26 20:45:45,373 Creating users csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv
[#######################################################################################################################################################################################-------------------------------------------------------------] 75%
INFO 2024-12-26 20:45:46,767 File saved - Path: https://storage.googleapis.com/kdp-csv-reporting-develop/Kolibri_en_casa_de_admin_3da4_users.csv
INFO 2024-12-26 20:45:47,073 File saved - Size: 482
INFO 2024-12-26 20:45:47,073 Created csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv with 4 lines
```

(I've also rebased develop into the branch because there was already a conflict and more might appear soon.)
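The 0-byte symptom can be reproduced and fixed in isolation; this is a general sketch (not the exact commit). Text written through a `TextIOWrapper` stays in its internal buffer until flushed, which is why the stored file could end up empty:

```python
import io

buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8", newline="")

writer.write("username,full_name\n")
writer.write("admin,Admin User\n")

# Without this flush, the text may still be sitting in the TextIOWrapper's
# internal buffer rather than in the underlying BytesIO.
writer.flush()
buffer.seek(0)  # rewind so a storage backend's save() reads from the start

data = buffer.getvalue()
```

After the flush and rewind, whatever consumes the buffer (e.g. a Django storage `save()`) sees the full CSV payload instead of 0 bytes.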
@jredrejo thank you for retargeting - I've rebased and restructured the PR a bit so the commit history is cleaner and hopefully easier to follow. Ready for code review when you can. Also, could you try it locally with your gcloud credentials? I think I lost something that I had gotten working previously but cannot figure it out right now. EDIT: I figured it out - I had to set up my local ADC creds per https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment
@rtibbles this failing test seems to be related to the original issue itself, possibly, because it is ultimately trying to get a hold of the DB file. That repair_db stuff calls to get the default backup folder, and the local storage uses the MEDIA_ROOT by default. Not sure of the best path forward, so I would appreciate your thoughts.
```python
if not os.environ.get("DEFAULT_FILE_STORAGE"):
    if conf.OPTIONS["FileStorage"]["STORAGE_BACKEND"] == "gcs":
        DEFAULT_FILE_STORAGE = "kolibri.utils.file_storage.KolibriFileStorage"
        BUCKET_NAME = os.getenv("GCS_BUCKET_NAME") or "kdp-csv-reporting-develop"
```
We should have a BUCKET_NAME option in our options.py as well - we definitely should not be hard coding anything specific about any platform in here.
If `GCS_BUCKET_NAME` is an environment variable that is set by default on GCP (but I don't think it is), we can specify specific env vars in the options configuration to draw values from.
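Something like the following could sit alongside the other FileStorage options. All of it is a hypothetical sketch of a Kolibri option entry (including the `envvars` key), not the final spec:

```python
# Hypothetical option entry: the value comes from the options file or from
# the listed environment variables, with no platform-specific default.
gcs_bucket_name_option = {
    "type": "string",
    "default": "",
    "envvars": ("GCS_BUCKET_NAME",),
    "description": "Name of the Google Cloud Storage bucket to store files in.",
}
```

The point is that the bucket name becomes deployment configuration rather than a hard-coded string in settings.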
Rather than subclassing GoogleCloudStorage, we should just set the additional settings here for it.
kolibri/utils/file_storage.py (Outdated)

```python
# https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#google-cloud-storage


class KolibriFileStorage(GoogleCloudStorage):
```
I think here we can override the default values for the location if needed
Not clear to me why this file is necessary - we can just directly use the GoogleCloudStorage class and use settings (with their own options in options.py to allow them to be defined).
https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#settings
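With django-storages, the subclass could indeed be replaced by settings alone. A settings.py sketch assuming django-storages is installed; the values shown are hypothetical placeholders that would come from options.py:

```python
# Point Django's default storage at the stock django-storages GCS backend
# and configure it entirely through its documented GS_* settings.
DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"

GS_BUCKET_NAME = "my-bucket-name"  # hypothetical; should be drawn from options.py
GS_LOCATION = "csv_exports"        # optional prefix inside the bucket
GS_DEFAULT_ACL = None              # rely on bucket-level IAM, not per-object ACLs
```

This keeps all GCS-specific knowledge in configuration, with no Kolibri-side subclass to maintain.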
```python
file_storage = DefaultStorage()


def random_filename():
```
In hindsight, I think that maybe just cleaning up the files after each test could have avoided needing this.
```diff
@@ -150,18 +153,18 @@ def import_exported_csv(self):
         assert len(classroom.get_coaches()) == 1

     def test_dryrun_from_export_csv(self):
-        with open_csv_for_reading(self.filepath) as source:
-            header = next(csv.reader(source, strict=True))
+        source = file_storage.open(self.filepath, "r")
```
Looking back at this, the reassignment of `source` makes this a bit odd to read, in a way that it wasn't when they were created inside of `with` statements.
I'll come back through and do some cleaning.
There seems to be a problem with how files are being handled that is leaving file handles open - this is why things are erroring on Windows, because it is much more sensitive to files not being closed.
I haven't looked through the code in detail to see where this might be happening, but it suggests that something isn't being handled entirely properly.
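One way to make the handle lifetime explicit regardless of backend is to wrap whatever the storage's `open` returns in `contextlib.closing`. This sketch uses a hypothetical `storage_open` callable standing in for a Django `Storage.open` method:

```python
import csv
import io
from contextlib import closing


def read_csv_header(storage_open, name):
    # closing() guarantees the handle is released even if the backend's
    # file object isn't usable as a context manager itself.
    with closing(storage_open(name, "r")) as source:
        return next(csv.reader(source, strict=True))


# Usage with a fake in-memory "storage" for illustration:
header = read_csv_header(
    lambda name, mode: io.StringIO("username,id\nadmin,1\n"), "users.csv"
)
```

On Windows, leaked handles block subsequent opens and deletes, so forcing closure at the call site like this tends to surface the leak immediately.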
The changes here seem to have completely ignored the purpose of the two helper functions `open_csv_for_writing` and `open_csv_for_reading` - I think the diff could be significantly reduced by leaning into the existence of these two functions and consolidating the logic in there, with the small tweak (as you have done) that we change `filepath` to `filename`, so that it is clear it is not a literal disk file path.
We can also probably conditionalize whether we write directly to the file or to memory. Writing large CSV files to memory in a non-cloud environment is probably not going to be a great idea.
The other alternative is we could just write directly to the storage object in either case, unless we know absolutely that it is too slow when writing to GCS? Given it's happening in an asynchronous task, and we have to write to it eventually... I am not sure how writing to memory first necessarily helps.
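A sketch of the consolidation being suggested. The function name follows the existing helper; the `remote` flag and its wiring are assumptions for illustration:

```python
import io


def open_csv_for_writing(filename, remote=False):
    """Single place to decide where CSV text goes: an in-memory buffer when
    a remote storage backend will receive it, the local disk otherwise."""
    if remote:
        return io.StringIO()
    return open(filename, "w", newline="", encoding="utf-8")
```

Callers then stay identical in both environments, and the remote-vs-local decision lives in exactly one function.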
kolibri/core/auth/csv_utils.py (Outdated)

```diff
@@ -174,7 +176,7 @@ def csv_file_generator(facility, filepath, overwrite=True, demographic=False):
         column for column in db_columns if demographic or column not in DEMO_FIELDS
     )

-    csv_file = open_csv_for_writing(filepath)
+    csv_file = io.StringIO()
```
Why are we using `io.StringIO()` here but `io.BytesIO` below?
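The distinction matters because `csv.writer` requires a text stream while storage backends generally save bytes; a minimal illustration:

```python
import csv
import io

# csv.writer only accepts text, so an in-memory CSV starts life in StringIO.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerow(["username", "id"])

# A storage backend's save() generally wants bytes, hence a later BytesIO:
byte_buffer = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
```

So if both appear in one code path, the StringIO is the CSV-writing surface and the BytesIO is the upload payload; mixing them up raises `TypeError`s about str vs bytes.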
```python
"description": """
    The storage backend class that Django will use when managing files. The class given here must implement
    the Django files.storage.Storage class.
""",
```
Let's add additional options here for the ACL, bucket name, etc. - we don't need to expose all the settings that are available, but for anything we need set, we can include it here.
@nucleogenesis and I had a quick sync up to see how to consolidate the file handling here - seems like there are some deficiencies in the Django Storage conformance to the Python file object spec (https://forum.djangoproject.com/t/file-open-to-support-different-encodings/21491) - but an idea there is to wrap the file object in a …
Hi @nucleogenesis - on my end, while regression testing the .csv generation I noticed that it's no longer possible to generate the same .csv file twice. For example, if I go to … Here are the logs from one of my devices though: log.with.the.same.date.range.mp4
Summary
In order to allow us to use a cloud backend for the file storage in Django, this adds a value to options.py which defaults to the Django FileSystemStorage (which it did anyway, but now we make sure of it).
This should make it configurable on BCK such that if there is a module containing a Google Cloud backend class that implements the Django Storage class, then it can be added as the value for the settings.
For example, if we have a new class `GCloudStorage` in `kolibri.core.storage`, then we would use that class if we set the option added here to `kolibri.core.storage.GCloudStorage`.
This is very much a first whack -- one thing I'm not clear on: since the option is named after the name that Django would look to in the env vars (`DEFAULT_FILE_STORAGE`), does the Kolibri options.py machinery automatically apply that setting because of the matching name?
References
Fixes #9441 (or at least begins to address it)
Reviewer guidance
@pcenov - could you please test all workflows that involve:
The changes I've made here should basically have no effect on the user for any of those workflows.
Once this passes regression testing, we can deploy it to a BCK instance and do final testing there.
Testing checklist
PR process
Reviewer checklist
(`yarn` and `pip`)