add column declarations to dataswarm Glean schema
Summary:
This adds support for column declarations to the Glean schema for the dataswarm indexer so we can support goto definition on columns.

We distinguish two cases:
[1] Columns declared in a Hive table's top-level INSERT SELECT statement
and
[2] Columns declared in a subquery

Example:
```text
my_task = PrestoInsertOperatorWithSchema(
  output_data={"out": output.table("<TABLE:my_table>")},
  select="""
  WITH foo AS (
    SELECT blah <--- SubqueryColumnDeclaration(di.my_task, foo, blah)
    FROM table1
  )
  SELECT blah <--- TableColumnDeclaration(my_table:di, blah)
  FROM foo
  """
)
```

We treat these two cases differently because a Hive table name+namespace is globally unique across the warehouse, so table_name+namespace+column_name is sufficient to uniquely identify a column declaration. By contrast, a subquery name is not globally unique: it is scoped to a single SQL query. In that case we also need the dataswarm task ID to uniquely identify the column declaration.
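As a rough illustration of the two key shapes described above (not the actual Glean schema, which is written in Glean's own schema language and is not shown here), the sketch below models each declaration as a frozen Python dataclass. The class names come from the example; the field names and the second task ID `di.other_task` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableColumnDeclaration:
    # Hypothetical field names. A Hive table is globally unique by
    # name + namespace, so no task ID is needed in the key.
    table_name: str
    namespace: str
    column_name: str


@dataclass(frozen=True)
class SubqueryColumnDeclaration:
    # Hypothetical field names. A subquery name is only unique within one
    # SQL query, so the dataswarm task ID is part of the key.
    task_id: str
    subquery_name: str
    column_name: str


# Two different tasks can each define a subquery "foo" with a column "blah";
# the task ID keeps the two declarations distinct.
a = SubqueryColumnDeclaration("di.my_task", "foo", "blah")
b = SubqueryColumnDeclaration("di.other_task", "foo", "blah")

# The table column is identified without any task ID.
t = TableColumnDeclaration("my_table", "di", "blah")
```

Dropping `task_id` from the subquery key would make `a` and `b` indistinguishable, which is the collision the two-case split is meant to avoid.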

Reviewed By: iamirzhan

Differential Revision: D67097540

fbshipit-source-id: 0b94b0c86adaca5fc126d1396e7de6e139f1de9c
Daniel Ohayon authored and facebook-github-bot committed Jan 8, 2025
1 parent 4208c3d commit f3292e6
Showing 2 changed files with 1,220 additions and 1,166 deletions.