Add SemanticModel Node Type #7769

Merged
merged 15 commits into from
Jun 8, 2023
Changes from 5 commits
1 change: 1 addition & 0 deletions core/dbt/contracts/files.py
@@ -228,6 +228,7 @@ class SchemaSourceFile(BaseSourceFile):
    groups: List[str] = field(default_factory=list)
    # node patches contain models, seeds, snapshots, analyses
    ndp: List[str] = field(default_factory=list)
    semantic_models: List[str] = field(default_factory=list)
Contributor

semantic_nodes? (see other comment)

Contributor Author

@gshank To clarify, do you think we should change the name in the yml file, or just for our internal property names?

Contributor

Just for the internal property names. I think "semantic_models" in the yaml files is fine; it matches the "models" section we already have. Naming is so hard.

    # any macro patches in this file by macro unique_id.
    mcp: Dict[str, str] = field(default_factory=dict)
    # any source patches in this file. The entries are package, name pairs
26 changes: 18 additions & 8 deletions core/dbt/contracts/graph/manifest.py
@@ -25,20 +25,21 @@
from dbt.contracts.publication import ProjectDependencies, PublicationConfig, PublicModel

from dbt.contracts.graph.nodes import (
    Macro,
    BaseNode,
    Documentation,
    SourceDefinition,
    GenericTestNode,
    Exposure,
    Metric,
    GenericTestNode,
    GraphMemberNode,
    Group,
    UnpatchedSourceDefinition,
    Macro,
    ManifestNode,
    GraphMemberNode,
    ResultNode,
    BaseNode,
    ManifestOrPublicNode,
    Metric,
    ModelNode,
    ResultNode,
    SemanticModel,
    SourceDefinition,
    UnpatchedSourceDefinition,
)
from dbt.contracts.graph.unparsed import SourcePatch, NodeVersion, UnparsedVersion
from dbt.contracts.graph.manifest_upgrade import upgrade_manifest_json
@@ -689,6 +690,7 @@ class Manifest(MacroMethods, DataClassMessagePackMixin, dbtClassMixin):
    public_nodes: MutableMapping[str, PublicModel] = field(default_factory=dict)
    project_dependencies: Optional[ProjectDependencies] = None
    publications: MutableMapping[str, PublicationConfig] = field(default_factory=dict)
    semantic_models: MutableMapping[str, SemanticModel] = field(default_factory=dict)
Contributor

The existing pattern is that the node classes use "Model" and the dictionaries use "nodes": Model nodes are found in the "nodes" dictionary, and "PublicModel" nodes are found in public_nodes. Could this be "semantic_nodes" to preserve that pattern?

Contributor Author

In that case, do you think we just rely on the existing nodes dictionary, and not have a special case for SemanticModel objects? Would it matter that the SemanticModels are not yet linked into the compiled graph?

Contributor

The original "metrics" went in a "metrics" dictionary... This has been a bit ad-hoc and not closely thought through, but it feels like the existing pattern is that SQL things go in nodes (plus seeds...) and yaml-generated things go in their own dictionaries. There are some assumptions in the rest of the code that match, such as the links in the file objects, etc. So I think it still makes sense to put semantic models in their own dictionary. If anything I'd be tempted to separate out some of the existing things that are in the nodes dictionary into their own dictionaries and maybe make some combined "indexes" (for cases where we don't want to loop over "nodes" all the time...)

Contributor

Just wanted to mention that one of the reasons for moving toward more individual dictionaries rather than fewer is that jsonschema and the deserializers currently can't always correctly guess the classes of the objects in the "nodes" dictionary, which leads to hacky workarounds like the big deserialization if statement in the _deserialize method in ParsedNode.

Contributor Author

This is all really good to know. It seems like the right direction to me.
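
The trade-off discussed in this thread can be illustrated with a minimal sketch. The classes below are simplified stand-ins, not the real dbt-core classes, and `deserialize_node` and `deserialize_semantic_models` are hypothetical helpers. The point is only that a mixed `nodes` dictionary needs a dispatch on `resource_type` to choose a class, whereas a dedicated `semantic_models` dictionary has a single known value type.

```python
from dataclasses import dataclass
from typing import Any, Dict


# Simplified stand-ins for illustration only; not the real dbt-core classes.
@dataclass
class ModelNode:
    unique_id: str
    resource_type: str = "model"


@dataclass
class SeedNode:
    unique_id: str
    resource_type: str = "seed"


@dataclass
class SemanticModel:
    unique_id: str
    resource_type: str = "semantic model"


def deserialize_node(data: Dict[str, Any]):
    """Hypothetical dispatch needed when one dictionary holds many classes."""
    resource_type = data["resource_type"]
    if resource_type == "model":
        return ModelNode(**data)
    if resource_type == "seed":
        return SeedNode(**data)
    raise ValueError(f"Unknown resource type: {resource_type}")


def deserialize_semantic_models(
    raw: Dict[str, Dict[str, Any]]
) -> Dict[str, SemanticModel]:
    """A dedicated dictionary has one value type, so no dispatch is needed."""
    return {unique_id: SemanticModel(**data) for unique_id, data in raw.items()}


mixed = deserialize_node({"unique_id": "seed.pkg.a", "resource_type": "seed"})
typed = deserialize_semantic_models(
    {"semantic model.pkg.orders": {"unique_id": "semantic model.pkg.orders"}}
)
```

Every new class added to the mixed dictionary means another branch in the dispatch function, which is the kind of growth the comment about `_deserialize` describes.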


    _doc_lookup: Optional[DocLookup] = field(
        default=None, metadata={"serialize": lambda x: None, "deserialize": lambda x: None}
@@ -1212,6 +1214,11 @@ def add_doc(self, source_file: SourceFile, doc: Documentation):
        self.docs[doc.unique_id] = doc
        source_file.docs.append(doc.unique_id)

    def add_semantic_model(self, source_file: SchemaSourceFile, semantic_model: SemanticModel):
        _check_duplicates(semantic_model, self.semantic_models)
        self.semantic_models[semantic_model.unique_id] = semantic_model
        source_file.semantic_models.append(semantic_model.unique_id)

    # end of methods formerly in ParseResult

    # Provide support for copy.deepcopy() - we just need to avoid the lock!
@@ -1311,6 +1318,9 @@ class WritableManifest(ArtifactMixin):
    public_nodes: Mapping[UniqueID, PublicModel] = field(
        metadata=dict(description=("The public models used in the dbt project"))
    )
    semantic_models: Mapping[UniqueID, SemanticModel] = field(
        metadata=dict(description=("The semantic models defined in the dbt project"))
    )
    metadata: ManifestMetadata = field(
        metadata=dict(
            description="Metadata about the manifest",
66 changes: 50 additions & 16 deletions core/dbt/contracts/graph/nodes.py
@@ -6,29 +6,23 @@
import hashlib

from mashumaro.types import SerializableType
from typing import (
    Optional,
    Union,
    List,
    Dict,
    Any,
    Sequence,
    Tuple,
    Iterator,
)
from typing import Optional, Union, List, Dict, Any, Sequence, Tuple, Iterator, Protocol

from dbt.dataclass_schema import dbtClassMixin, ExtensibleDbtClassMixin

from dbt.clients.system import write_file
from dbt.contracts.files import FileHash
from dbt.contracts.graph.unparsed import (
    Dimension,
    Docs,
    Entity,
    ExposureType,
    ExternalTable,
    FreshnessThreshold,
    HasYamlMetadata,
    MacroArgument,
    MaturityType,
    Measure,
    MetricFilter,
    MetricTime,
    Owner,
@@ -62,12 +56,6 @@
    EmptySnapshotConfig,
    SnapshotConfig,
)
import sys

if sys.version_info >= (3, 8):
    from typing import Protocol
else:
    from typing_extensions import Protocol


# =====================================================================
@@ -552,6 +540,30 @@ def depends_on_macros(self):
        return self.depends_on.macros


@dataclass
class FileSlice(dbtClassMixin, Replaceable):
    """Provides file slice level context about what something was created from.

    Implementation of the dbt-semantic-interfaces `FileSlice` protocol
    """

    filename: str
    content: str
    start_line_number: int
    end_line_number: int


@dataclass
class Metadata(dbtClassMixin, Replaceable):
    """Provides file context about what something was created from.

    Implementation of the dbt-semantic-interfaces `Metadata` protocol
    """

    repo_file_path: str
    file_slice: FileSlice


# ====================================
# CompiledNode subclasses
# ====================================
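
The `FileSlice` and `Metadata` docstrings above describe the classes as implementations of dbt-semantic-interfaces protocols. As a rough sketch of what that means with `typing.Protocol`, a dataclass satisfies a protocol structurally, by having the right attributes, without inheriting from it. `FileSliceLike` and `describe` below are hypothetical names used for illustration, and the real protocol's members are assumed to match the four fields in this diff.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class FileSliceLike(Protocol):
    """Hypothetical stand-in for the protocol in dbt-semantic-interfaces."""

    filename: str
    content: str
    start_line_number: int
    end_line_number: int


@dataclass
class FileSlice:
    # No inheritance from FileSliceLike: matching attributes are enough.
    filename: str
    content: str
    start_line_number: int
    end_line_number: int


def describe(file_slice: FileSliceLike) -> str:
    """Accept anything that structurally matches the protocol."""
    return (
        f"{file_slice.filename}:"
        f"{file_slice.start_line_number}-{file_slice.end_line_number}"
    )


slice_ = FileSlice("models/schema.yml", "semantic_models: ...", 10, 24)
summary = describe(slice_)
```

This also explains why the diff can import `Protocol` directly from `typing`: it has been available there since Python 3.8.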
@@ -1399,6 +1411,28 @@ class Group(BaseNode):
    resource_type: NodeType = field(metadata={"restrict": [NodeType.Group]})


# ====================================
# SemanticModel and related classes
# ====================================


@dataclass
class NodeRelation(dbtClassMixin, Replaceable):
    alias: str
    schema_name: str
    relation_name: str
    database: Optional[str] = None
Contributor

From this comment -- does this still need to include a unified relation_name that respects quoting + include policies?

Contributor

If we do, we'll probably need the adapter.Relation to do something like this during compilation:

adapter = get_adapter(self.config)
relation_cls = adapter.Relation
relation_name = str(relation_cls.create_from(self.config, node))

(from https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/compilation.py#L486-L488)

Contributor

I don't think the SemanticModels actually get compiled, since they're yaml-only. There is a question of whether they need the individual pieces (identifier/schema/database) or just the relation_name...

Contributor Author

@QMalcolm I'm not sure of the answer to this, but I suspect the answer may be yes. What makes sense from the perspective of MetricFlow integration?

Contributor

@MichelleArk Yep! We need to include the unified relation_name, and it's desirable that it respects quoting + include policies.

As for whether we need both the individual pieces and the relation_name: for this PR, yes; long term, we don't have to. MetricFlow is fine with having either the parts or the relation_name. It was suggested that we use this node_relation object because it was a pattern already used elsewhere, hence the current shape of the object. MetricFlow currently uses the relation_name attribute, as seen here, and ignores the individual properties. However, both the individual properties and the relation_name are currently required by the protocol, so for the scope of this PR we should include both. Otherwise we'd first need to update and release a new DSI (which would include a number of other schema changes that have been made), and then propagate all those changes in core. Although the node_relation changes might be trivial in core, propagating the other changes with the new DSI could eat a fair bit of time.
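
To make "a unified relation_name that respects quoting + include policies" concrete, here is a minimal sketch. It is not dbt's implementation; per the thread, the real path would go through `adapter.Relation.create_from`. `render_relation_name` and the dictionary-based policies are hypothetical, as are the example database, schema, and alias values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NodeRelation:
    # Same fields as the NodeRelation in this diff, minus the dbt mixins.
    alias: str
    schema_name: str
    relation_name: str
    database: Optional[str] = None


def render_relation_name(
    database: Optional[str],
    schema_name: str,
    alias: str,
    quote_policy: Dict[str, bool],
    include_policy: Dict[str, bool],
    quote_char: str = '"',
) -> str:
    """Hypothetical helper: join the included parts, quoting each as configured."""
    parts = {"database": database, "schema": schema_name, "identifier": alias}
    rendered = []
    for key, value in parts.items():
        # Skip parts that are absent or excluded by the include policy.
        if value is None or not include_policy.get(key, True):
            continue
        if quote_policy.get(key, False):
            value = f"{quote_char}{value}{quote_char}"
        rendered.append(value)
    return ".".join(rendered)


relation_name = render_relation_name(
    database="analytics",
    schema_name="core",
    alias="orders",
    quote_policy={"database": False, "schema": False, "identifier": True},
    include_policy={"database": True, "schema": True, "identifier": True},
)
node_relation = NodeRelation(
    alias="orders",
    schema_name="core",
    relation_name=relation_name,
    database="analytics",
)
```

Carrying both the parts and the rendered string, as the protocol currently requires, lets a consumer such as MetricFlow read `relation_name` directly and ignore the individual properties.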



@dataclass
class SemanticModel(GraphNode):
    description: Optional[str]
    node_relation: NodeRelation
    entities: Sequence[Entity]
    measures: Sequence[Measure]
    dimensions: Sequence[Dimension]


# ====================================
# Patches
# ====================================
56 changes: 56 additions & 0 deletions core/dbt/contracts/graph/unparsed.py
@@ -661,6 +661,62 @@ def validate(cls, data):
            raise ValidationError("Group owner must have at least one of 'name' or 'email'.")


#
# semantic interfaces unparsed objects
#


@dataclass
class Entity(dbtClassMixin, Replaceable):
    name: str
    type: str  # actually an enum
    description: Optional[str] = None
    role: Optional[str] = None
    expr: Optional[str] = None


@dataclass
class MeasureAggregationParameters(dbtClassMixin, Replaceable):
    percentile: Optional[float] = None
    use_discrete_percentile: bool = False
    use_approximate_percentile: bool = False


@dataclass
class Measure(dbtClassMixin, Replaceable):
    name: str
    agg: str  # actually an enum
    description: Optional[str] = None
    create_metric: Optional[bool] = None
    expr: Optional[str] = None
    agg_params: Optional[MeasureAggregationParameters] = None
    non_additive_dimension: Optional[Dict[str, Any]] = None  # TODO: Refine type as class?
    agg_time_dimension: Optional[str] = None


@dataclass
class Dimension(dbtClassMixin, Replaceable):
    name: str
    type: str  # actually an enum
    description: Optional[str] = None
    is_partition: Optional[bool] = False
    type_params: Optional[Dict[str, Any]] = field(
        default_factory=dict
    )  # TODO: Refine type as class?
    expr: Optional[str] = None
    # TODO metadata: Optional[Metadata] (this would actually be the YML for the dimension)


@dataclass
class UnparsedSemanticModel(dbtClassMixin, Replaceable):
    name: str
    description: Optional[str]
    model: str  # looks like "ref(...)"
    entities: List[Entity] = field(default_factory=list)
    measures: List[Measure] = field(default_factory=list)
    dimensions: List[Dimension] = field(default_factory=list)
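
As a rough illustration of how one `semantic_models` entry maps onto the unparsed classes above, the sketch below uses plain dataclasses with a reduced set of fields. The `from_dict` function is a hypothetical loader written for this example (dbt-core gets the equivalent behaviour from `dbtClassMixin`), and the entry's names and values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Entity:
    name: str
    type: str  # an enum in practice; kept as str, as in the diff
    description: Optional[str] = None


@dataclass
class Measure:
    name: str
    agg: str  # an enum in practice; kept as str, as in the diff
    expr: Optional[str] = None


@dataclass
class UnparsedSemanticModel:
    name: str
    description: Optional[str]
    model: str
    entities: List[Entity] = field(default_factory=list)
    measures: List[Measure] = field(default_factory=list)


def from_dict(data: Dict[str, Any]) -> UnparsedSemanticModel:
    """Hypothetical loader; a simplification of what dbtClassMixin provides."""
    return UnparsedSemanticModel(
        name=data["name"],
        description=data.get("description"),
        model=data["model"],
        entities=[Entity(**e) for e in data.get("entities", [])],
        measures=[Measure(**m) for m in data.get("measures", [])],
    )


# The dictionary a YAML loader might produce for one `semantic_models` entry.
raw = {
    "name": "orders",
    "model": "ref('orders')",
    "entities": [{"name": "order_id", "type": "primary"}],
    "measures": [{"name": "order_total", "agg": "sum", "expr": "amount"}],
}
unparsed = from_dict(raw)
```

The `default_factory=list` fields mean an entry may omit `entities` or `measures` entirely and still load with empty lists.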


def normalize_date(d: Optional[datetime.date]) -> Optional[datetime.datetime]:
    """Convert date to datetime (at midnight), and add local time zone if naive"""
    if d is None:
1 change: 1 addition & 0 deletions core/dbt/node_types.py
@@ -33,6 +33,7 @@ class NodeType(StrEnum):
    Exposure = "exposure"
    Metric = "metric"
    Group = "group"
    SemanticModel = "semantic model"

    @classmethod
    def executable(cls) -> List["NodeType"]:
53 changes: 51 additions & 2 deletions core/dbt/parser/schema_yaml_readers.py
@@ -1,8 +1,13 @@
from dbt.parser.schemas import YamlReader, SchemaParser
from dbt.parser.common import YamlBlock
from dbt.node_types import NodeType
from dbt.contracts.graph.unparsed import UnparsedExposure, UnparsedMetric, UnparsedGroup
from dbt.contracts.graph.nodes import Exposure, Metric, Group
from dbt.contracts.graph.unparsed import (
    UnparsedExposure,
    UnparsedGroup,
    UnparsedMetric,
    UnparsedSemanticModel,
)
from dbt.contracts.graph.nodes import Exposure, Group, Metric, NodeRelation, SemanticModel
from dbt.exceptions import DbtInternalError, YamlParseDictError, JSONValidationError
from dbt.context.providers import generate_parse_exposure, generate_parse_metrics
from dbt.contracts.graph.model_config import MetricConfig, ExposureConfig
@@ -269,3 +274,47 @@ def parse(self):
                raise YamlParseDictError(self.yaml.path, self.key, data, exc)

            self.parse_group(unparsed)


class SemanticModelParser(YamlReader):
    def __init__(self, schema_parser: SchemaParser, yaml: YamlBlock):
        super().__init__(schema_parser, yaml, "semantic_models")
        self.schema_parser = schema_parser
        self.yaml = yaml

    def parse_semantic_model(self, unparsed: UnparsedSemanticModel):
        package_name = self.project.project_name
        unique_id = f"{NodeType.SemanticModel}.{package_name}.{unparsed.name}"
        path = self.yaml.path.relative_path

        fqn = self.schema_parser.get_fqn_prefix(path)
        fqn.append(unparsed.name)

        parsed = SemanticModel(
            description=unparsed.description,
            fqn=fqn,
            name=unparsed.name,
            node_relation=NodeRelation(
                alias="", database="", relation_name="", schema_name=""
            ),  # TODO: arguments
            original_file_path=self.yaml.path.original_file_path,
            package_name=package_name,
            path=path,
            resource_type=NodeType.SemanticModel,
            unique_id=unique_id,
            entities=unparsed.entities,
            measures=unparsed.measures,
            dimensions=unparsed.dimensions,
        )

        self.manifest.add_semantic_model(self.yaml.file, parsed)

    def parse(self):
        for data in self.get_key_dicts():
            try:
                UnparsedSemanticModel.validate(data)
                unparsed = UnparsedSemanticModel.from_dict(data)
            except (ValidationError, JSONValidationError) as exc:
                raise YamlParseDictError(self.yaml.path, self.key, data, exc)

            self.parse_semantic_model(unparsed)
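
The flow in `parse_semantic_model` above (build a `unique_id`, extend the fqn, register the node in the manifest) can be sketched in isolation. Everything below is a simplified stand-in: the `Manifest` class, the inline duplicate check (standing in for `_check_duplicates`), the `fqn_prefix` argument (standing in for the result of `get_fqn_prefix`), and the `jaffle_shop` project name are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(str, Enum):
    # Same value as the enum member added in this diff.
    SemanticModel = "semantic model"


@dataclass
class SemanticModel:
    unique_id: str
    name: str
    fqn: List[str]


@dataclass
class Manifest:
    semantic_models: Dict[str, SemanticModel] = field(default_factory=dict)

    def add_semantic_model(self, semantic_model: SemanticModel) -> None:
        # Stand-in for dbt-core's _check_duplicates helper.
        if semantic_model.unique_id in self.semantic_models:
            raise ValueError(f"Duplicate semantic model: {semantic_model.unique_id}")
        self.semantic_models[semantic_model.unique_id] = semantic_model


def parse_semantic_model(
    manifest: Manifest, package_name: str, fqn_prefix: List[str], name: str
) -> SemanticModel:
    # `.value` is used explicitly so the result does not depend on how the
    # enum formats itself inside an f-string.
    unique_id = f"{NodeType.SemanticModel.value}.{package_name}.{name}"
    parsed = SemanticModel(unique_id=unique_id, name=name, fqn=[*fqn_prefix, name])
    manifest.add_semantic_model(parsed)
    return parsed


manifest = Manifest()
parsed = parse_semantic_model(
    manifest, "jaffle_shop", ["jaffle_shop", "marts"], "orders"
)
```

Because the enum value is `"semantic model"`, the resulting `unique_id` in this sketch contains a space, which is worth keeping in mind wherever unique IDs are split or matched.
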
7 changes: 7 additions & 0 deletions core/dbt/parser/schemas.py
@@ -74,6 +74,7 @@
    "analyses",
    "exposures",
    "metrics",
    "semantic_models",
)


@@ -217,6 +218,12 @@ def parse_file(self, block: FileBlock, dct: Dict = None) -> None:
                group_parser = GroupParser(self, yaml_block)
                group_parser.parse()

            if "semantic_models" in dct:
                from dbt.parser.schema_yaml_readers import SemanticModelParser

                semantic_model_parser = SemanticModelParser(self, yaml_block)
                semantic_model_parser.parse()


Parsed = TypeVar("Parsed", UnpatchedSourceDefinition, ParsedNodePatch, ParsedMacroPatch)
NodeTarget = TypeVar("NodeTarget", UnparsedNodeUpdate, UnparsedAnalysisUpdate, UnparsedModelUpdate)
1 change: 1 addition & 0 deletions core/setup.py
@@ -70,6 +70,7 @@
        "cffi>=1.9,<2.0.0",
        "pyyaml>=5.3",
        "urllib3~=1.0",
        "dbt-semantic-interfaces==0.1.0.dev3",
    ],
    zip_safe=False,
    classifiers=[