Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCHEMATIC-183] Use paths from file view for manifest generation #1529

Merged
merged 101 commits into from
Dec 10, 2024
Merged
Show file tree
Hide file tree
Changes from 68 commits
Commits
Show all changes
101 commits
Select commit Hold shift + click to select a range
8a80ce3
add test for clause method
SageGJ Oct 29, 2024
a2de0c4
add method to process dataset id into query clause
SageGJ Oct 29, 2024
45193a1
use new method for validation
SageGJ Oct 29, 2024
80b6bda
update clause method
SageGJ Oct 29, 2024
d810576
update file based manifest gen test
SageGJ Oct 29, 2024
b1e60af
consolidate filebased manifest gen tests
SageGJ Oct 29, 2024
33796b4
update test layout
SageGJ Oct 29, 2024
62d8b28
use fileview for file paths
SageGJ Oct 30, 2024
8b012d3
add functionality to just return filename
SageGJ Oct 31, 2024
8e670ad
make syn id regex a util function
SageGJ Oct 31, 2024
c90ebf0
add non-api integration test for detFilesInStorageDataset
SageGJ Oct 31, 2024
f9b9bcc
fix typo and mismatched ids
SageGJ Oct 31, 2024
be4f6b6
add case for nested data structure
SageGJ Oct 31, 2024
b39afe2
get nested files as well
SageGJ Nov 1, 2024
566e741
add return type annotation
SageGJ Nov 1, 2024
86fb15d
add docstring
SageGJ Nov 1, 2024
2fa1ad6
add str prefix
SageGJ Nov 1, 2024
cd6713b
revert param name
SageGJ Nov 4, 2024
9438811
change datasetid clause method
SageGJ Nov 4, 2024
e076770
get files in doubly+ nested files
SageGJ Nov 5, 2024
19188ea
add comments
SageGJ Nov 5, 2024
c46a100
add test cases for filtering results
SageGJ Nov 5, 2024
ccead76
add test case for filtered results
SageGJ Nov 7, 2024
26e5539
Update README.md
jaymedina Oct 21, 2024
2f835bd
Update README.md
jaymedina Oct 21, 2024
f228245
Update README.md
jaymedina Oct 22, 2024
ae11b85
updated data model type rules to include error param
andrewelamb Oct 25, 2024
2daacb9
fix validate type attribute to use msg level param
andrewelamb Oct 25, 2024
2982d8e
added error handling
andrewelamb Oct 25, 2024
a1e0783
run black
andrewelamb Oct 25, 2024
450fbdf
Update CODEOWNERS
thomasyu888 Nov 1, 2024
b16bf55
Update scan_repo.yml
thomasyu888 Nov 1, 2024
8d50e1a
Update .github/CODEOWNERS
thomasyu888 Nov 1, 2024
1336fc6
Update .github/workflows/scan_repo.yml
thomasyu888 Nov 1, 2024
c61f39c
Attach additional telemetry data to OTEL traces (#1519)
BryanFauble Nov 1, 2024
ce4d642
feat: added tracing for cross manifest validation and file name valid…
linglp Nov 1, 2024
31f3f1d
Updating contribution doc to expect squash and merge (#1534)
BryanFauble Nov 5, 2024
256403c
[FDS-2491] Integration tests for Schematic API Test plan (#1512)
BryanFauble Nov 5, 2024
856fef6
[FDS-2500] Add Integration Tests for: Manifest Validation (#1516)
jaymedina Nov 6, 2024
d6fc9ad
[FDS-2449] Lock `sphinx` version and update `poetry.lock` (#1530)
jaymedina Nov 7, 2024
08008ae
filter based on filenames if given
SageGJ Nov 8, 2024
d0aa01d
change manifest exclusion method
SageGJ Nov 8, 2024
38cedd5
Update file annotation store process to require filename be present i…
BryanFauble Nov 7, 2024
22f0bba
Revert "Update file annotation store process to require filename be p…
BryanFauble Nov 7, 2024
4580e06
Don't attempt to annotate the table
BryanFauble Nov 7, 2024
ce5c349
Updates for integration test failures (#1537)
BryanFauble Nov 12, 2024
d661b9a
add test for bug case
SageGJ Nov 11, 2024
951a061
update test for table tidyness
SageGJ Nov 11, 2024
89fb9a8
remove unused import
SageGJ Nov 11, 2024
0c9e773
remove etag column if already present when building temp file view
SageGJ Nov 11, 2024
ab4ece7
catch all exceptions to switch to sequential mode
SageGJ Nov 12, 2024
65acb33
update test for updated data
SageGJ Nov 12, 2024
2e6d51f
Revert "update test for updated data"
SageGJ Nov 12, 2024
9a00288
Revert "catch all exceptions to switch to sequential mode"
SageGJ Nov 12, 2024
2170974
catch ValueErrors as well
SageGJ Nov 12, 2024
65fb55d
[FDS-2525] Authenticated export of telemetry data (#1527)
BryanFauble Nov 13, 2024
5e891ef
update mocking for unit tests
SageGJ Nov 13, 2024
4d6fd09
Merge branch 'develop' into fds-2293-file-paths-for-manifest-gen
thomasyu888 Nov 14, 2024
5f5cc43
update test assertions for format
SageGJ Nov 14, 2024
3f90ef0
update tests assertions
SageGJ Nov 14, 2024
f4ad7a1
add mocked integration test for getting dataset files
SageGJ Nov 15, 2024
ee43ad6
use dataset clause method
SageGJ Nov 15, 2024
5a4cd90
Revert "add mocked integration test for getting dataset files"
SageGJ Nov 15, 2024
59c5a69
add mocked test for get files in dataset
SageGJ Nov 15, 2024
d2eee35
clean comments
SageGJ Nov 15, 2024
0eddd47
remove comment
SageGJ Nov 15, 2024
5fcbeb1
add test ids
SageGJ Nov 15, 2024
a1b0c90
remove unneeded param
SageGJ Nov 15, 2024
c43debc
add ids
SageGJ Nov 15, 2024
89a2b9f
use syn store fixture
SageGJ Nov 15, 2024
9cdad89
change variables
SageGJ Nov 15, 2024
51983bc
change to global var
SageGJ Nov 18, 2024
84011b0
change case
SageGJ Nov 18, 2024
3471e18
change case
SageGJ Nov 18, 2024
82082be
update test for dataset clause
SageGJ Nov 19, 2024
c542241
update use of dataset clause method
SageGJ Nov 19, 2024
8a314a7
comments
SageGJ Nov 19, 2024
bc9f479
change test name case
SageGJ Nov 19, 2024
79f5bda
update descriptions
SageGJ Nov 19, 2024
5d0599b
undo development change
SageGJ Nov 20, 2024
c1cd79a
add comment
SageGJ Nov 20, 2024
feb82e8
remove dev work
SageGJ Nov 20, 2024
c1587c6
remove temp test marks
SageGJ Nov 20, 2024
2769d6a
update test for new expected order
SageGJ Nov 20, 2024
dd9d6a9
change method for gathering files in a dataset
SageGJ Nov 20, 2024
800a4cd
update mock test
SageGJ Nov 20, 2024
95cfeed
update other mocked test
SageGJ Nov 20, 2024
9701078
change method for building dataset path
SageGJ Nov 22, 2024
7cd0ce0
wrap path in quotes
SageGJ Nov 22, 2024
65dcc35
update quotes for dataset path
GiaJordan Nov 22, 2024
f9d037d
reformat and add exception
SageGJ Nov 22, 2024
e07f973
add unit test
SageGJ Nov 22, 2024
73227f8
Merge branch 'develop' into fds-2293-file-paths-for-manifest-gen
thomasyu888 Nov 25, 2024
0a00086
fix comment typo
SageGJ Dec 2, 2024
0cca133
raise exception for empty view tables
SageGJ Dec 2, 2024
3261dcc
[SCHEMATIC-183] Update tests - Use magic mock and add parentId (#1554)
thomasyu888 Dec 3, 2024
f0968a2
remove hack related comment
SageGJ Dec 3, 2024
8229e72
add test for new exception
SageGJ Dec 3, 2024
7b0987d
add param back in
SageGJ Dec 3, 2024
9d2ea4a
add integration test
SageGJ Dec 10, 2024
f52bbb9
update var name
SageGJ Dec 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions schematic/manifest/generator.py
Original file line number Diff line number Diff line change
Expand Up @@ -1904,6 +1904,8 @@ def get_manifest(
# TODO: avoid explicitly exposing Synapse store functionality
# just instantiate a Store class and let it decide at runtime/config
# the store type
# TODO: determine which parts of fileview are necessary for `get` operations
# and pass query parameters at object instantiation to avoid having to re-query
if access_token:
# for getting an existing manifest on AWS
store = SynapseStorage(access_token=access_token)
Expand Down
4 changes: 3 additions & 1 deletion schematic/models/validate_attribute.py
Original file line number Diff line number Diff line change
Expand Up @@ -2119,7 +2119,9 @@ def filename_validation(

where_clauses = []

dataset_clause = SynapseStorage.build_clause_from_dataset_id(dataset_scope)
dataset_clause = SynapseStorage.build_clause_from_dataset_id(
dataset_id=dataset_scope
)
where_clauses.append(dataset_clause)

self._login(
Expand Down
50 changes: 35 additions & 15 deletions schematic/store/synapse.py
Original file line number Diff line number Diff line change
Expand Up @@ -417,15 +417,28 @@ def query_fileview(
raise AccessCredentialsError(self.storageFileview)

@staticmethod
def build_clause_from_dataset_id(dataset_id: Optional[str]) -> str:
def build_clause_from_dataset_id(
dataset_id: Optional[str] = None, dataset_folder_list: Optional[list] = None
) -> str:
"""
Method to build a where clause for a Synapse FileView query based on a dataset ID that can be used before an object is initialized.
Args:
dataset_id: Synapse ID of a dataset that should be used to limit the query
dataset_folder_list: List of Synapse IDs of a dataset and all its subfolders that should be used to limit the query
Returns:
clause for the query or an empty string if no dataset ID is provided
"""
return f"parentId='{dataset_id}'" if dataset_id else ""
# Calling this method without specifying synIDs will complete but will not scope the view
if (not dataset_id) and (not dataset_folder_list):
return ""

# This will be used to gather files under a dataset recursively with a fileview query instead of walking
if dataset_folder_list:
search_folders = ", ".join(f"'{synId}'" for synId in dataset_folder_list)
return f"parentId IN ({search_folders})"

# `dataset_id` should be provided when all files are stored directly under the dataset folder
return f"parentId='{dataset_id}'"

def _build_query(
self, columns: Optional[list] = None, where_clauses: Optional[list] = None
Expand Down Expand Up @@ -677,7 +690,7 @@ def getStorageDatasetsInProject(self, projectId: str) -> list[tuple[str, str]]:
def getFilesInStorageDataset(
self, datasetId: str, fileNames: List = None, fullpath: bool = True
) -> List[Tuple[str, str]]:
"""Gets all files in a given dataset folder.
"""Gets all files (excluding manifest files) in a given dataset folder.

Args:
datasetId: synapse ID of a storage dataset.
Expand All @@ -693,28 +706,35 @@ def getFilesInStorageDataset(
"""
file_list = []

# Identify all folders nested under the dataset folder
folders = synapseutils.walk(self.syn, datasetId, includeTypes=["folder"])

# Getting the files directly under the dataset will be the beginning of the query
dataset_clause = f"parentId='{datasetId}' "

# The query will also be ammended to include everything containted in all the subdirectories of the dataset
for subfolder, _, path in folders:
dataset_clause += f"OR parentId='{subfolder[1]}' "
dataset_clause = f"({dataset_clause})"
# Get path to dataset folder from fileview to avoid building a new fileview and walking to determine folders and files within
child_path = self.storageFileviewTable.loc[
self.storageFileviewTable["parentId"] == datasetId, "path"
][0]
parent = child_path.split("/")[0]
dataset_path = f"'{parent}/%'"

# When querying, only include files to exclude entity files and subdirectories
where_clauses = [dataset_clause, "type='file'"]
where_clauses = [f"path like {dataset_path}", "type='file'"]

# Requery the fileview to specifically get the files in the given dataset
self.query_fileview(columns=["id", "path"], where_clauses=where_clauses)

# Exclude manifest files
non_manifest_files = self.storageFileviewTable.loc[
~self.storageFileviewTable["path"].str.contains("manifest"), :
~self.storageFileviewTable["path"].str.contains("synapse_storage_manifest"),
:,
]

# Remove all files that are not in the list of fileNames
if fileNames:
filename_regex = "|".join(fileNames)

matching_files = non_manifest_files["path"].str.contains(
filename_regex, case=False, regex=True
)

non_manifest_files = non_manifest_files.loc[matching_files, :]

# Truncate path if necessary
if not fullpath:
non_manifest_files.path = non_manifest_files.path.apply(os.path.basename)
Expand Down
4 changes: 2 additions & 2 deletions schematic/utils/cli_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
from functools import reduce
from typing import Any, Mapping, Optional, Sequence, Union

from schematic.utils.general import syn_id_regex
from schematic.utils.general import SYN_ID_REGEX

logger = logging.getLogger(__name__)

Expand Down Expand Up @@ -70,7 +70,7 @@ def parse_syn_ids(
if not syn_ids:
return None

project_regex = re.compile(syn_id_regex())
project_regex = re.compile(SYN_ID_REGEX)
valid = project_regex.fullmatch(syn_ids)

if not valid:
Expand Down
13 changes: 2 additions & 11 deletions schematic/utils/general.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@

T = TypeVar("T")

SYN_ID_REGEX = r"(syn\d+\,?)+"


def find_duplicates(_list: list[T]) -> set[T]:
"""Find duplicate items in a list"""
Expand Down Expand Up @@ -332,14 +334,3 @@ def normalize_path(path: str, parent_folder: str) -> str:
if not os.path.isabs(path):
path = os.path.join(parent_folder, path)
return os.path.normpath(path)


def syn_id_regex() -> str:
"""
Regex pattern to match synapse ids
Args:
None
Returns:
str: synID regex pattern
"""
return r"(syn\d+\,?)+"
4 changes: 2 additions & 2 deletions schematic_api/api/openapi/api.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -692,8 +692,8 @@ paths:
- Synapse Storage
/storage/dataset/files:
get:
summary: Get all files in a given dataset folder
description: Get all files in a given dataset folder
summary: Get all files (excluding manifest files) in a given dataset folder
description: Get all files (excluding manifest files) in a given dataset folder
operationId: schematic_api.api.routes.get_files_storage_dataset
security:
- access_token: []
Expand Down
24 changes: 12 additions & 12 deletions tests/integration/test_commands.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,14 @@
import uuid
from io import BytesIO

import numpy as np
import pandas as pd
import pytest
import requests
from openpyxl import load_workbook
from click.testing import CliRunner
import pandas as pd
import numpy as np
from openpyxl import load_workbook

from schematic.configuration.configuration import Configuration, CONFIG
from schematic.configuration.configuration import CONFIG, Configuration
from schematic.manifest.commands import manifest
from schematic.models.commands import model
from tests.conftest import ConfigurationForTesting
Expand Down Expand Up @@ -665,18 +665,18 @@ def test_generate_bulk_rna_google_sheet_manifest(
# Reset config to it's default values
CONFIG.load_config("config_example.yml")

assert result.output.split("\n")[7] == (
assert result.output.split("\n")[8] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[8].startswith(
assert result.output.split("\n")[9].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[9] == (
assert result.output.split("\n")[10] == (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna.csv"
)

google_sheet_url = result.output.split("\n")[8]
google_sheet_url = result.output.split("\n")[9]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"
Expand Down Expand Up @@ -908,18 +908,18 @@ def test_generate_bulk_rna_google_sheet_manifest_with_annotations(
os.remove("tests/data/example.BulkRNA-seqAssay.schema.json")
os.remove("./CLI_gs_bulk_rna_annos.csv")

assert result.output.split("\n")[10] == (
assert result.output.split("\n")[11] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[11].startswith(
assert result.output.split("\n")[12].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[12] == (
assert result.output.split("\n")[13] == (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna_annos.csv"
)

google_sheet_url = result.output.split("\n")[11]
google_sheet_url = result.output.split("\n")[12]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"
Expand Down
Loading
Loading