
[SCHEMATIC-183] Use paths from file view for manifest generation #1529

Merged · 101 commits · Dec 10, 2024
8a80ce3  add test for clause method  (SageGJ, Oct 29, 2024)
a2de0c4  add method to process dataset id into query clause  (SageGJ, Oct 29, 2024)
45193a1  use new method for validation  (SageGJ, Oct 29, 2024)
80b6bda  update clause method  (SageGJ, Oct 29, 2024)
d810576  update file based manifest gen test  (SageGJ, Oct 29, 2024)
b1e60af  consolidate filebased manifest gen tests  (SageGJ, Oct 29, 2024)
33796b4  update test layout  (SageGJ, Oct 29, 2024)
62d8b28  use fileview for file paths  (SageGJ, Oct 30, 2024)
8b012d3  add functionality to just return filename  (SageGJ, Oct 31, 2024)
8e670ad  make syn id regex a util function  (SageGJ, Oct 31, 2024)
c90ebf0  add non-api integration test for detFilesInStorageDataset  (SageGJ, Oct 31, 2024)
f9b9bcc  fix typo and mismatched ids  (SageGJ, Oct 31, 2024)
be4f6b6  add case for nested data structure  (SageGJ, Oct 31, 2024)
b39afe2  get nested files as well  (SageGJ, Nov 1, 2024)
566e741  add return type annotation  (SageGJ, Nov 1, 2024)
86fb15d  add docstring  (SageGJ, Nov 1, 2024)
2fa1ad6  add str prefix  (SageGJ, Nov 1, 2024)
cd6713b  revert param name  (SageGJ, Nov 4, 2024)
9438811  change datasetid clause method  (SageGJ, Nov 4, 2024)
e076770  get files in doubly+ nested files  (SageGJ, Nov 5, 2024)
19188ea  add comments  (SageGJ, Nov 5, 2024)
c46a100  add test cases for filtering results  (SageGJ, Nov 5, 2024)
ccead76  add test case for filtered results  (SageGJ, Nov 7, 2024)
26e5539  Update README.md  (jaymedina, Oct 21, 2024)
2f835bd  Update README.md  (jaymedina, Oct 21, 2024)
f228245  Update README.md  (jaymedina, Oct 22, 2024)
ae11b85  updated data model type rules to include error param  (andrewelamb, Oct 25, 2024)
2daacb9  fix validate type attribute to use msg level param  (andrewelamb, Oct 25, 2024)
2982d8e  added error handling  (andrewelamb, Oct 25, 2024)
a1e0783  run black  (andrewelamb, Oct 25, 2024)
450fbdf  Update CODEOWNERS  (thomasyu888, Nov 1, 2024)
b16bf55  Update scan_repo.yml  (thomasyu888, Nov 1, 2024)
8d50e1a  Update .github/CODEOWNERS  (thomasyu888, Nov 1, 2024)
1336fc6  Update .github/workflows/scan_repo.yml  (thomasyu888, Nov 1, 2024)
c61f39c  Attach additional telemetry data to OTEL traces (#1519)  (BryanFauble, Nov 1, 2024)
ce4d642  feat: added tracing for cross manifest validation and file name valid…  (linglp, Nov 1, 2024)
31f3f1d  Updating contribution doc to expect squash and merge (#1534)  (BryanFauble, Nov 5, 2024)
256403c  [FDS-2491] Integration tests for Schematic API Test plan (#1512)  (BryanFauble, Nov 5, 2024)
856fef6  [FDS-2500] Add Integration Tests for: Manifest Validation (#1516)  (jaymedina, Nov 6, 2024)
d6fc9ad  [FDS-2449] Lock `sphinx` version and update `poetry.lock` (#1530)  (jaymedina, Nov 7, 2024)
08008ae  filter based on filenames if given  (SageGJ, Nov 8, 2024)
d0aa01d  change manifest exclusion method  (SageGJ, Nov 8, 2024)
38cedd5  Update file annotation store process to require filename be present i…  (BryanFauble, Nov 7, 2024)
22f0bba  Revert "Update file annotation store process to require filename be p…  (BryanFauble, Nov 7, 2024)
4580e06  Don't attempt to annotate the table  (BryanFauble, Nov 7, 2024)
ce5c349  Updates for integration test failures (#1537)  (BryanFauble, Nov 12, 2024)
d661b9a  add test for bug case  (SageGJ, Nov 11, 2024)
951a061  update test for table tidyness  (SageGJ, Nov 11, 2024)
89fb9a8  remove unused import  (SageGJ, Nov 11, 2024)
0c9e773  remove etag column if already present when building temp file view  (SageGJ, Nov 11, 2024)
ab4ece7  catch all exceptions to switch to sequential mode  (SageGJ, Nov 12, 2024)
65acb33  update test for updated data  (SageGJ, Nov 12, 2024)
2e6d51f  Revert "update test for updated data"  (SageGJ, Nov 12, 2024)
9a00288  Revert "catch all exceptions to switch to sequential mode"  (SageGJ, Nov 12, 2024)
2170974  catch ValueErrors as well  (SageGJ, Nov 12, 2024)
65fb55d  [FDS-2525] Authenticated export of telemetry data (#1527)  (BryanFauble, Nov 13, 2024)
5e891ef  update mocking for unit tests  (SageGJ, Nov 13, 2024)
4d6fd09  Merge branch 'develop' into fds-2293-file-paths-for-manifest-gen  (thomasyu888, Nov 14, 2024)
5f5cc43  update test assertions for format  (SageGJ, Nov 14, 2024)
3f90ef0  update tests assertions  (SageGJ, Nov 14, 2024)
f4ad7a1  add mocked integration test for getting dataset files  (SageGJ, Nov 15, 2024)
ee43ad6  use dataset clause method  (SageGJ, Nov 15, 2024)
5a4cd90  Revert "add mocked integration test for getting dataset files"  (SageGJ, Nov 15, 2024)
59c5a69  add mocked test for get files in dataset  (SageGJ, Nov 15, 2024)
d2eee35  clean comments  (SageGJ, Nov 15, 2024)
0eddd47  remove comment  (SageGJ, Nov 15, 2024)
5fcbeb1  add test ids  (SageGJ, Nov 15, 2024)
a1b0c90  remove unneeded param  (SageGJ, Nov 15, 2024)
c43debc  add ids  (SageGJ, Nov 15, 2024)
89a2b9f  use syn store fixture  (SageGJ, Nov 15, 2024)
9cdad89  change variables  (SageGJ, Nov 15, 2024)
51983bc  change to global var  (SageGJ, Nov 18, 2024)
84011b0  change case  (SageGJ, Nov 18, 2024)
3471e18  change case  (SageGJ, Nov 18, 2024)
82082be  update test for dataset clause  (SageGJ, Nov 19, 2024)
c542241  update use of dataset clause method  (SageGJ, Nov 19, 2024)
8a314a7  comments  (SageGJ, Nov 19, 2024)
bc9f479  change test name case  (SageGJ, Nov 19, 2024)
79f5bda  update descriptions  (SageGJ, Nov 19, 2024)
5d0599b  undo development change  (SageGJ, Nov 20, 2024)
c1cd79a  add comment  (SageGJ, Nov 20, 2024)
feb82e8  remove dev work  (SageGJ, Nov 20, 2024)
c1587c6  remove temp test marks  (SageGJ, Nov 20, 2024)
2769d6a  update test for new expected order  (SageGJ, Nov 20, 2024)
dd9d6a9  change method for gathering files in a dataset  (SageGJ, Nov 20, 2024)
800a4cd  update mock test  (SageGJ, Nov 20, 2024)
95cfeed  update other mocked test  (SageGJ, Nov 20, 2024)
9701078  change method for building dataset path  (SageGJ, Nov 22, 2024)
7cd0ce0  wrap path in quotes  (SageGJ, Nov 22, 2024)
65dcc35  update quotes for dataset path  (GiaJordan, Nov 22, 2024)
f9d037d  reformat and add exception  (SageGJ, Nov 22, 2024)
e07f973  add unit test  (SageGJ, Nov 22, 2024)
73227f8  Merge branch 'develop' into fds-2293-file-paths-for-manifest-gen  (thomasyu888, Nov 25, 2024)
0a00086  fix comment typo  (SageGJ, Dec 2, 2024)
0cca133  raise exception for empty view tables  (SageGJ, Dec 2, 2024)
3261dcc  [SCHEMATIC-183] Update tests - Use magic mock and add parentId (#1554)  (thomasyu888, Dec 3, 2024)
f0968a2  remove hack related comment  (SageGJ, Dec 3, 2024)
8229e72  add test for new exception  (SageGJ, Dec 3, 2024)
7b0987d  add param back in  (SageGJ, Dec 3, 2024)
9d2ea4a  add integration test  (SageGJ, Dec 10, 2024)
f52bbb9  update var name  (SageGJ, Dec 10, 2024)
2 changes: 2 additions & 0 deletions schematic/manifest/generator.py
@@ -1904,6 +1904,8 @@ def get_manifest(
# TODO: avoid explicitly exposing Synapse store functionality
# just instantiate a Store class and let it decide at runtime/config
# the store type
# TODO: determine which parts of fileview are necessary for `get` operations
# and pass query parameters at object instantiation to avoid having to re-query
if access_token:
# for getting an existing manifest on AWS
store = SynapseStorage(access_token=access_token)
4 changes: 3 additions & 1 deletion schematic/models/validate_attribute.py
@@ -2119,7 +2119,9 @@ def filename_validation(

where_clauses = []

dataset_clause = f"parentId='{dataset_scope}'"
dataset_clause = SynapseStorage.build_clause_from_dataset_id(
dataset_id=dataset_scope
)
where_clauses.append(dataset_clause)

self._login(
165 changes: 70 additions & 95 deletions schematic/store/synapse.py
@@ -19,12 +19,10 @@
import numpy as np
import pandas as pd
import synapseclient
import synapseutils
from opentelemetry import trace
from synapseclient import Annotations as OldAnnotations
from synapseclient import (
Column,
Entity,
EntityViewSchema,
EntityViewType,
File,
@@ -416,6 +414,30 @@ def query_fileview(
else:
raise AccessCredentialsError(self.storageFileview)

@staticmethod
def build_clause_from_dataset_id(
dataset_id: Optional[str] = None, dataset_folder_list: Optional[list] = None
) -> str:
"""
Method to build a where clause for a Synapse FileView query based on a dataset ID that can be used before an object is initialized.
Args:
dataset_id: Synapse ID of a dataset that should be used to limit the query
dataset_folder_list: List of Synapse IDs of a dataset and all its subfolders that should be used to limit the query
Returns:
clause for the query or an empty string if no dataset ID is provided
"""
# Calling this method without specifying synIDs will complete but will not scope the view
if (not dataset_id) and (not dataset_folder_list):
return ""

# This will be used to gather files under a dataset recursively with a fileview query instead of walking
if dataset_folder_list:
search_folders = ", ".join(f"'{synId}'" for synId in dataset_folder_list)
return f"parentId IN ({search_folders})"

# `dataset_id` should be provided when all files are stored directly under the dataset folder
return f"parentId='{dataset_id}'"
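The new static helper can be exercised in isolation. The sketch below reproduces the clause logic from this hunk as a standalone function (a hypothetical free-function form of `SynapseStorage.build_clause_from_dataset_id`) to show what each branch returns:

```python
from typing import Optional


def build_clause_from_dataset_id(
    dataset_id: Optional[str] = None, dataset_folder_list: Optional[list] = None
) -> str:
    # No scoping IDs provided: the query completes but the view is not scoped
    if (not dataset_id) and (not dataset_folder_list):
        return ""
    # Folder list: gather files under a dataset recursively via the fileview
    if dataset_folder_list:
        search_folders = ", ".join(f"'{syn_id}'" for syn_id in dataset_folder_list)
        return f"parentId IN ({search_folders})"
    # Single dataset ID: files sit directly under the dataset folder
    return f"parentId='{dataset_id}'"


# The three branches in action
assert build_clause_from_dataset_id() == ""
assert build_clause_from_dataset_id(dataset_id="syn123") == "parentId='syn123'"
assert (
    build_clause_from_dataset_id(dataset_folder_list=["syn1", "syn2"])
    == "parentId IN ('syn1', 'syn2')"
)
```

Because it is a `@staticmethod`, callers such as `filename_validation` can build the clause before any `SynapseStorage` object (and its fileview query) exists.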

def _build_query(
self, columns: Optional[list] = None, where_clauses: Optional[list] = None
):
@@ -666,7 +688,7 @@ def getStorageDatasetsInProject(self, projectId: str) -> list[tuple[str, str]]:
def getFilesInStorageDataset(
self, datasetId: str, fileNames: List = None, fullpath: bool = True
) -> List[Tuple[str, str]]:
"""Gets all files in a given dataset folder.
"""Gets all files (excluding manifest files) in a given dataset folder.

Args:
datasetId: synapse ID of a storage dataset.
@@ -680,105 +702,58 @@
Raises:
ValueError: Dataset ID not found.
"""
# select all files within a given storage dataset folder (top level folder in
# a Synapse storage project or folder marked with contentType = 'dataset')
walked_path = synapseutils.walk(
self.syn, datasetId, includeTypes=["folder", "file"]
)

current_entity_location = self.synapse_entity_tracker.get(
synapse_id=datasetId, syn=self.syn, download_file=False
)

def walk_back_to_project(
current_location: Entity, location_prefix: str, skip_entry: bool
) -> str:
"""
Recursively walk back up the project structure to get the paths of the
names of each of the directories where we started the walk function.

Args:
current_location (Entity): The current entity location in the project structure.
location_prefix (str): The prefix to prepend to the path.
skip_entry (bool): Whether to skip the current entry in the path. When
this is True it means we are looking at our starting point. If our
starting point is the project itself we can go ahead and return
back the project as the prefix.

Returns:
str: The path of the names of each of the directories up to the project root.
"""
if (
skip_entry
and "concreteType" in current_location
and current_location["concreteType"] == PROJECT_ENTITY
):
return f"{current_location.name}/{location_prefix}"
file_list = []

updated_prefix = (
location_prefix
if skip_entry
else f"{current_location.name}/{location_prefix}"
)
if (
"concreteType" in current_location
and current_location["concreteType"] == PROJECT_ENTITY
):
return updated_prefix
current_location = self.synapse_entity_tracker.get(
synapse_id=current_location["parentId"],
syn=self.syn,
download_file=False,
# Get path to dataset folder by using its children, to avoid cases where the dataset is the scope of the view
if self.storageFileviewTable.empty:
raise ValueError(
f"Fileview {self.storageFileview} is empty, please check the table and the provided synID and try again."
)
return walk_back_to_project(
current_location=current_location,
location_prefix=updated_prefix,
skip_entry=False,

child_path = self.storageFileviewTable.loc[
self.storageFileviewTable["parentId"] == datasetId, "path"
]
if child_path.empty:
raise LookupError(
f"Dataset {datasetId} could not be found in fileview {self.storageFileview}."
)
child_path = child_path.iloc[0]

prefix = walk_back_to_project(
current_location=current_entity_location,
location_prefix="",
skip_entry=True,
)
# Get the dataset path by eliminating the child's portion of the path to account for nested datasets
parent = child_path.split("/")[:-1]
parent = "/".join(parent)

project_id = self.getDatasetProject(datasetId)
project = self.synapse_entity_tracker.get(
synapse_id=project_id, syn=self.syn, download_file=False
)
project_name = project.name
file_list = []
# Format dataset path to be used in table query
dataset_path = f"'{parent}/%'"

# iterate over all results
for dirpath, _, path_filenames in walked_path:
# iterate over all files in a folder
for path_filename in path_filenames:
if ("manifest" not in path_filename[0] and not fileNames) or (
fileNames and path_filename[0] in fileNames
):
# don't add manifest to list of files unless it is specified in the
# list of specified fileNames; return all found files
# except the manifest if no fileNames have been specified
# TODO: refactor for clarity/maintainability

if fullpath:
# append directory path to filename
if dirpath[0].startswith(f"{project_name}/"):
path_without_project_prefix = (
dirpath[0] + "/"
).removeprefix(f"{project_name}/")
path_filename = (
prefix + path_without_project_prefix + path_filename[0],
path_filename[1],
)
else:
path_filename = (
prefix + dirpath[0] + "/" + path_filename[0],
path_filename[1],
)
# When querying, restrict results to files so that folder entities and subdirectories are excluded
where_clauses = [f"path like {dataset_path}", "type='file'"]

# Requery the fileview to specifically get the files in the given dataset
self.query_fileview(columns=["id", "path"], where_clauses=where_clauses)

# Exclude manifest files
non_manifest_files = self.storageFileviewTable.loc[
~self.storageFileviewTable["path"].str.contains("synapse_storage_manifest"),
:,
]

# Remove all files that are not in the list of fileNames
if fileNames:
filename_regex = "|".join(fileNames)

matching_files = non_manifest_files["path"].str.contains(
filename_regex, case=False, regex=True
)

non_manifest_files = non_manifest_files.loc[matching_files, :]

# Truncate path if necessary
if not fullpath:
non_manifest_files.path = non_manifest_files.path.apply(os.path.basename)

# add file name file id tuple, rearranged so that id is first and name follows
file_list.append(path_filename[::-1])
# Return list of files as expected by other methods
file_list = list(non_manifest_files.itertuples(index=False, name=None))

return file_list

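The replacement logic filters the re-queried fileview table with pandas instead of walking the folder tree. A minimal sketch of that filtering, using a hypothetical `filter_dataset_files` helper and an in-memory DataFrame standing in for `storageFileviewTable` (columns `id` and `path`, as requested by `query_fileview`):

```python
import os

import pandas as pd


def filter_dataset_files(fileview: pd.DataFrame, file_names=None, fullpath=True):
    """Sketch of the pandas filtering in the new getFilesInStorageDataset."""
    # Exclude manifest files, mirroring the 'synapse_storage_manifest' check
    files = fileview.loc[
        ~fileview["path"].str.contains("synapse_storage_manifest"), :
    ]
    # Keep only requested file names via a single OR-ed regex
    if file_names:
        pattern = "|".join(file_names)
        files = files.loc[
            files["path"].str.contains(pattern, case=False, regex=True), :
        ]
    # Truncate to basenames when full paths are not requested
    if not fullpath:
        files = files.assign(path=files["path"].apply(os.path.basename))
    # (id, path) tuples, as the callers expect
    return list(files[["id", "path"]].itertuples(index=False, name=None))


view = pd.DataFrame(
    {
        "id": ["syn101", "syn102", "syn103"],
        "path": [
            "project/dataset/a.txt",
            "project/dataset/synapse_storage_manifest.csv",
            "project/dataset/nested/b.txt",
        ],
    }
)
assert filter_dataset_files(view) == [
    ("syn101", "project/dataset/a.txt"),
    ("syn103", "project/dataset/nested/b.txt"),
]
assert filter_dataset_files(view, fullpath=False) == [
    ("syn101", "a.txt"),
    ("syn103", "b.txt"),
]
```

Note that, as in the diff, the joined file names are treated as a regular expression, so a literal dot in a requested name will match any character in the path.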
9 changes: 5 additions & 4 deletions schematic/utils/cli_utils.py
@@ -4,10 +4,11 @@
# pylint: disable=anomalous-backslash-in-string

import logging

from typing import Any, Mapping, Sequence, Union, Optional
from functools import reduce
import re
from functools import reduce
from typing import Any, Mapping, Optional, Sequence, Union

from schematic.utils.general import SYN_ID_REGEX

logger = logging.getLogger(__name__)

@@ -69,7 +70,7 @@ def parse_syn_ids(
if not syn_ids:
return None

project_regex = re.compile("(syn\d+\,?)+")
project_regex = re.compile(SYN_ID_REGEX)
valid = project_regex.fullmatch(syn_ids)

if not valid:
2 changes: 2 additions & 0 deletions schematic/utils/general.py
@@ -24,6 +24,8 @@

T = TypeVar("T")

SYN_ID_REGEX = r"(syn\d+\,?)+"


def find_duplicates(_list: list[T]) -> set[T]:
"""Find duplicate items in a list"""
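The pattern hoisted into `SYN_ID_REGEX` accepts one or more comma-separated Synapse IDs, and `parse_syn_ids` now uses it via `fullmatch`. A quick check of its behavior (note that, as written, the pattern also tolerates a trailing comma):

```python
import re

# Pattern as added to schematic/utils/general.py
SYN_ID_REGEX = r"(syn\d+\,?)+"
project_regex = re.compile(SYN_ID_REGEX)

assert project_regex.fullmatch("syn123") is not None
assert project_regex.fullmatch("syn123,syn456") is not None
# A trailing comma is (perhaps surprisingly) still a full match
assert project_regex.fullmatch("syn123,") is not None
assert project_regex.fullmatch("project-123") is None
```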
4 changes: 2 additions & 2 deletions schematic_api/api/openapi/api.yaml
@@ -692,8 +692,8 @@ paths:
- Synapse Storage
/storage/dataset/files:
get:
summary: Get all files in a given dataset folder
description: Get all files in a given dataset folder
summary: Get all files (excluding manifest files) in a given dataset folder
description: Get all files (excluding manifest files) in a given dataset folder
operationId: schematic_api.api.routes.get_files_storage_dataset
security:
- access_token: []
68 changes: 36 additions & 32 deletions tests/integration/test_commands.py
@@ -4,14 +4,14 @@
import uuid
from io import BytesIO

import numpy as np
import pandas as pd
import pytest
import requests
from openpyxl import load_workbook
from click.testing import CliRunner
import pandas as pd
import numpy as np
from openpyxl import load_workbook

from schematic.configuration.configuration import Configuration, CONFIG
from schematic.configuration.configuration import CONFIG, Configuration
from schematic.manifest.commands import manifest
from schematic.models.commands import model
from tests.conftest import ConfigurationForTesting
@@ -95,14 +95,14 @@ def test_validate_valid_manifest(self, runner: CliRunner) -> None:
# command has no (python) errors, has exit code 0
assert result.exit_code == 0
# command output has success message
assert result.output.split("\n")[4] == (
result_list = result.output.split("\n")
assert (
"Your manifest has been validated successfully. "
"There are no errors in your manifest, "
"and it can be submitted without any modifications."
)
) in result_list
# command output has no validation errors
for line in result.output.split("\n")[4]:
assert not line.startswith("error")
errors = [result for result in result_list if result.startswith("error")]
assert not errors

def test_validate_invalid_manifest(self, runner: CliRunner) -> None:
"""
@@ -504,9 +504,10 @@ def test_generate_empty_excel_manifest(
os.remove("tests/data/example.Biospecimen.schema.json")

# command output has excel file creation message
result_list = result.output.split("\n")
assert (
result.output.split("\n")[7]
== "Find the manifest template using this Excel file path: ./CLI_empty_excel.xlsx"
"Find the manifest template using this Excel file path: ./CLI_empty_excel.xlsx"
in result_list
)

sheet1 = workbook["Sheet1"]
@@ -665,18 +666,19 @@ def test_generate_bulk_rna_google_sheet_manifest(
# Reset config to its default values
CONFIG.load_config("config_example.yml")

assert result.output.split("\n")[7] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[8].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[9] == (
result_list = result.output.split("\n")
assert "Find the manifest template using this Google Sheet URL:" in result_list
assert (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna.csv"
)

google_sheet_url = result.output.split("\n")[8]
) in result_list
google_sheet_result = [
result
for result in result_list
if result.startswith("https://docs.google.com/spreadsheets/d/")
]
assert len(google_sheet_result) == 1
google_sheet_url = google_sheet_result[0]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"
@@ -908,18 +910,19 @@ def test_generate_bulk_rna_google_sheet_manifest_with_annotations(
os.remove("tests/data/example.BulkRNA-seqAssay.schema.json")
os.remove("./CLI_gs_bulk_rna_annos.csv")

assert result.output.split("\n")[10] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[11].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[12] == (
result_list = result.output.split("\n")
assert "Find the manifest template using this Google Sheet URL:" in result_list
assert (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna_annos.csv"
)

google_sheet_url = result.output.split("\n")[11]
) in result_list
google_sheet_result = [
result
for result in result_list
if result.startswith("https://docs.google.com/spreadsheets/d/")
]
assert len(google_sheet_result) == 1
google_sheet_url = google_sheet_result[0]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"
@@ -1177,10 +1180,11 @@ def test_generate_mock_component_excel_manifest(self, runner: CliRunner) -> None
# TODO: remove with https://sagebionetworks.jira.com/browse/SCHEMATIC-202
# Reset config to its default values
CONFIG.load_config("config_example.yml")

# Command output has excel file message
assert result.output.split("\n")[8] == (
result_list = result.output.split("\n")
assert (
"Find the manifest template using this Excel file path: ./CLI_mock_comp.xlsx"
in result_list
)

sheet1 = workbook["Sheet1"]
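Across the test changes above, brittle positional indexing into the CLI output (`result.output.split("\n")[7]`) is replaced with membership checks on the split output. The idea can be distilled into a small hypothetical helper (not part of the codebase):

```python
def assert_cli_success(output: str, expected_message: str) -> None:
    """Check CLI output by membership rather than by a fixed line index,
    so extra or reordered log lines cannot shift the message and break the test."""
    result_list = output.split("\n")
    # The expected message must appear somewhere as a whole line
    assert expected_message in result_list
    # No line of the output may report a validation error
    errors = [line for line in result_list if line.startswith("error")]
    assert not errors


# Passes regardless of where the message appears in the output
assert_cli_success("info: starting\nAll good.\ndone", "All good.")
```

Lines that cannot be matched exactly, such as generated Google Sheet URLs, are instead collected with a `startswith` filter and asserted to be unique, as in the updated `test_generate_bulk_rna_google_sheet_manifest`.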