Merge pull request #1 from duneanalytics/add-llm-code
main file for translation
Showing 31 changed files with 1,341 additions and 446 deletions.
@@ -1,16 +1,25 @@
# Pipfile for translation doctstore (to be replaced with docker-compose)
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[[source]]
url = "https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-2.12.0-py3-none-any.whl"
verify_ssl = false
name = "tensorflow"

[packages]
langchain = "0.0.129"
tenacity = "8.2.2"
faiss-cpu = "1.7.3"
openai = "0.27.2"
transformers = "4.27.4"
tiktoken = "0.3.3"
pytest = "7.2.2"
boto3 = "1.26.108"
trino = "0.322.0"
dbt-trino = "1.4.1"
dbt-databricks = "1.2.1"
numpy = "2.0.12"
pre-commit = "2.20.0"
pytest = "7.1.3"
trino = "0.321.0"

[requires]
python_version = "3.9"
python_version = "3.10"
@@ -1,197 +1,32 @@
# Translation Docstore for DuneSQL

Welcome to your [Spellbook](https://youtu.be/o7p0BNt7NHs). Cast a magical incantation to tame the blockchain.
This spike is based loosely on the Langchain example for building a [question answering database on Notion](https://github.com/hwchase17/notion-qa).

📖 Documentation of models can be found [here](https://spellbook-docs.dune.com/#!/overview), with a full example contribution walkthrough [here](https://dune.com/docs/spellbook/getting-started/).

## Methodology
The inspiration for this spike is to turn the [Dune Migration docs](https://dune.com/docs/query/syntax-differences/#syntax-comparison) into a structured vector database that can be used to manage simple translation tasks.

### Heads up
Working on something new? Open a draft PR to let other wizards know you're working on it to help minimize duplicated work.

Vector databases use embedding similarity scores to return the top N most similar documents for a query (https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html?highlight=stuff#the-stuff-chain). If our docstore contains documents on how to do syntax translations, they will be inserted directly into the prompt. This lets us provide a wide range of instructions while preserving precious token space, inserting them only when needed.
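As a rough illustration, here is a minimal sketch of that retrieval step, assuming a FAISS docstore like the one built by the ingestion script later in this diff is already loaded as `store` (the prompt wording is a placeholder, not the actual prompt):

```python
# Sketch: pull the top-N most similar translation rules and stuff them into the prompt.
def build_translation_prompt(store, snippet: str, error: str, n: int = 3) -> str:
    rules = store.similarity_search(f"{snippet}\n{error}", k=n)
    context = "\n\n".join(doc.page_content for doc in rules)
    return (
        "You are translating SQL to DuneSQL (Trino).\n"
        f"Relevant translation rules:\n{context}\n\n"
        f"Failing snippet:\n{snippet}\n"
        f"Error:\n{error}\n"
        "Return the corrected DuneSQL snippet."
    )
```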
Looking for abstractions from the V1 engine? We moved them to [dune-v1-abstractions](https://github.com/duneanalytics/dune-v1-abstractions).

## Considerations
The max tokens constraint is a significant issue when considering LLM translation. For Spellbook, what we have landed on is sending only the lines surrounding an error and translating those. We think this is a better path than trying to translate entire queries that can be thousands of lines long.
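For intuition, a minimal sketch of that windowing, assuming the error message gives us a line number (the function name and window size are placeholders):

```python
# Sketch: translate only a window of lines around the failing line,
# rather than a query that may be thousands of lines long.
def error_window(sql: str, error_line: int, context_lines: int = 5) -> str:
    lines = sql.splitlines()
    start = max(0, error_line - 1 - context_lines)
    end = min(len(lines), error_line + context_lines)
    return "\n".join(lines[start:end])
```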
## Intro

## Tests
Tests include a small set of query snippets that fail in Spellbook and the expected translation.
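A minimal sketch of what one of those tests could look like (the snippet, the expected translation, and the `translate_snippet` helper are illustrative assumptions, not the actual fixtures):

```python
import pytest

# Illustrative case: a snippet that fails in DuneSQL and the translation we expect back.
CASES = [
    ("select '2023-01-01'::date", "select date '2023-01-01'"),
]

@pytest.mark.parametrize("snippet, expected", CASES)
def test_translation(snippet, expected):
    # Hypothetical helper wrapping the docstore lookup plus the LLM call.
    from translation_docstore.main import translate_snippet
    assert translate_snippet(snippet).strip() == expected
```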
Write SQL to transform blockchain data into curated datasets on [dune.com](https://dune.com/home).

## How to run this
Temporary instructions for running this spike.
1) Install the virtual environment using the pipfile (`pipenv install` from the llm-tests root).
(stop yelling at me, I'll convert it to Docker-compose soon)
2) Either enter the virtual environment (`pipenv shell`) or configure your IDE to use the virtual environment.
3) Set your environment variables for OpenAI (`export OPENAI_API_KEY=123`).
4) Run the tests (`pytest` from the llm-tests root).
5) Run the script (`python translation_docstore/main.py --snippet "Your sql to translate" --error "Your Error"`). The script will also run with a default snippet and error; a sketch of its CLI flags follows below.
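A sketch of that CLI surface, assuming `main.py` uses argparse (the default values here are placeholders, not the real defaults):

```python
import argparse

# Sketch of the flags used in step 5 above.
parser = argparse.ArgumentParser(description="Translate a failing SQL snippet to DuneSQL.")
parser.add_argument("--snippet", type=str, default="select '2023-01-01'::date",
                    help="SQL snippet to translate")
parser.add_argument("--error", type=str, default="Cannot cast varchar to date",
                    help="error message returned for the snippet")
args = parser.parse_args()
```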
First-time visitor? Check out how to [get started](#getting-started) below and visit the [Spellbook Getting Started Guide](https://dune.com/docs/spellbook/getting-started/). More tk.

Been here before? An archive of intermediate datasets that were contributed to Dune v1 can be consulted [here](https://github.com/duneanalytics/dune-v1-abstractions).

## Getting Started

### Prerequisites

- Fork this repo and clone your fork locally. See Github's [guide](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) on contributing to projects.
- python 3.9 installed. Our recommendation is to follow the [Hitchhiker's Guide to Python](https://docs.python-guide.org/starting/installation/)
- [pip](https://pip.pypa.io/en/stable/installation/) installed
- [pipenv](https://pypi.org/project/pipenv/) installed
- paths for both pip and pipenv are set (this should happen automatically but sometimes does not). If you run into issues like "pipenv: command not found", try troubleshooting with the pip or pipenv documentation.

### Initial Installation

You can watch the video version of this if you scroll down a bit.

Navigate to the abstraction repo within your CLI (command line interface).
```console
cd user\directory\github\spellbook
# Change this to wherever spellbook is stored locally on your machine.
```

Use the pipfile to create a pipenv.

```console
pipenv install
```

If the env is created successfully, skip ahead to `pipenv shell`.

Our script looks for a static python version, so an error about a wrong python version is fairly likely. If that error occurs, check your python version with:

```console
py --version
```

Now use any text editor to change the python version in the pipfile within the spellbook directory to your python version. You need at least python 3.9.
If you have changed the python version in the pipfile, run `pipenv install` again.
You are now ready to activate this project's virtual environment.
Use:

```console
pipenv shell
```

You have now created a virtual environment for this project. You can read more about virtual environments [here](https://realpython.com/pipenv-guide/).

To initiate the dbt project run:

```console
dbt init
```

Enter the values as shown below:

```console
Which database would you like to use?
[1] databricks
[2] spark

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
host (yourorg.databricks.com): .
http_path (HTTP Path): .
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX):
[1] use Unity Catalog
[2] not use Unity Catalog
Desired unity catalog option (enter a number): 2
schema (default schema that dbt will build objects in): wizard
threads (1 or more) [1]: 2
```

This will not connect to the database, but it gives you access to some dbt actions.
**When you are prompted to choose a schema, please enter `wizard` so we know you are an external contributor.**
Should you make a mistake during this process (not entering `wizard` is the only one that matters), simply quit the CLI and start over.
To pull the dbt project dependencies run:

```console
dbt deps
```

Then, run the following command:

```console
dbt compile
```

dbt compile will compile the Jinja-templated SQL into plain SQL which can be executed in the Dune UI. Your spellbook directory now has a folder named `target` containing plain SQL versions of all models in Dune. If you made changes to the repo before completing all these actions, you can now be certain that at least the compile process works correctly; if there are big errors, the compile process will not complete.
If you haven't made changes to the directory beforehand, you can now start adding, editing, or deleting files within the repository.
Afterwards, simply run `dbt compile` again once you are finished with your work in the directory and test the plain SQL queries on dune.com.

### Coming back

If you have done this installation on your machine once, to get back into dbt, simply navigate to the spellbook repo, run `pipenv shell`, and you can run `dbt compile` again.

### What did I just do?

You now have the ability to compile your dbt model statements and test statements into plain SQL. This allows you to test those queries in the usual dune.com environment and should therefore lead to a better experience while developing spells. Running the queries will immediately give you feedback on typos, logical errors, or mismatches.
This in turn will help us deploy these spells faster and avoid potential mistakes.

We are thinking about better solutions to make more dbt actions available directly, but we also have to consider security.
### How to use dbt to create spells

There are a couple of new concepts to consider when making spells in dbt. The most common ones wizards will encounter are refs, sources, freshness, and tests.

In the body of each query, tables are referred to either as refs, e.g. `{{ ref('1inch_ethereum') }}`, or sources, e.g. `{{ source('ethereum', 'traces') }}`. Refs refer to other dbt models, and they should use the file name, like `1inch_ethereum.sql`, even if the model itself is aliased. Sources refer to "raw" data or tables/views not generated by dbt. Using refs and sources allows us to automatically build dependency trees.

Sources and models are defined in schema.yml files, where tests and other attributes are specified.

The best practice is to add unique and not_null tests on the primary key of every new model. Similarly, a freshness check should be added to every new source (although we will try not to re-test freshness if the source is used elsewhere).

Adding descriptions to tables and columns will help people find and use your tables.

```yaml
models:
  - name: 1inch_ethereum
    description: "Trades on 1inch, a DEX aggregator"
    columns:
      - name: tx_hash
        description: "Table primary key: a transaction hash (tx_hash) is a unique identifier for a transaction."
        tests:
          - unique
          - not_null

sources:
  - name: ethereum
    freshness:
      warn_after: { count: 12, period: hour }
      error_after: { count: 24, period: hour }
    tables:
      - name: traces
        loaded_at_field: block_time
```
See links to more docs on dbt below.

### Generating and serving documentation

To generate documentation and view it as a website, run the following commands:
- `dbt docs generate`
- `dbt docs serve`

You must have set up dbt with `dbt init`, but you don't need database credentials to run these commands.

See the [dbt docs documentation](https://docs.getdbt.com/docs/building-a-dbt-project/documentation) for more information on how to contribute to documentation.

As a preview, you can do [things](https://docs.getdbt.com/reference/resource-properties/description) like:

- Write simple one or many line descriptions of models or columns.
- Write longer descriptions as code blocks using markdown.
- Link to other models in your descriptions.
- Add images / project logos from the repo into descriptions.
- Use HTML in your description.

### Troubleshooting

If you fail to run `dbt compile`, here are some common error messages:

- `Could not find profile named 'spellbook'` <br> Check `~/.dbt/profiles.yml` and make sure there is a profile named `spellbook`. When you run `dbt init` to initiate a project, a profile gets created. Inside `spellbook` you cannot initiate a project with the same name, so you need to run `dbt init spellbook` outside the project so it creates the profile, or create one with a different name and then manually edit the `profiles.yml` file.
- ```console
  Credentials in profile "spellbook", target "dev" invalid: Runtime Error
  http connection method requires additional dependencies.
  Install the additional required dependencies with pip install dbt-spark[PyHive]
  ```
  You've probably selected the `spark` option instead of the `databricks` option when running `dbt init`. Rerun `dbt init`, overwrite the profile, and select the `databricks` option.

### DBT Resources

- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices

## TODO
- Add many more rules to the doc store
- Store the vector db in an accessible place
- Add more unit tests
- Connect to Alex's work on spells, possibly via API?
@@ -0,0 +1,77 @@
import json
import os
from fnmatch import fnmatch

import boto3
from trino import dbapi
from trino.auth import BasicAuthentication
from trino.exceptions import TrinoUserError


class Explain_n_Executer:
    def __init__(self, model_path):
        self.model_path = os.path.join(os.getcwd(), model_path)

    def execute_query(self, explain_stmt):
        """
        Function that executes a query passed as a string against the trino server. We would like to use aws secrets
        manager to authenticate.
        """
        username = os.environ.get('TRINO_USERNAME')
        password = os.environ.get('TRINO_PASSWORD')

        # Creating a connection to the trino server
        trino_host = os.environ.get('TRINO_URL')
        conn = dbapi.connect(
            host=trino_host,
            port=443,
            auth=BasicAuthentication(username, password),
            http_scheme="https",
            client_tags=["routingGroup=sandbox"],
        )
        # Try executing the query; return None on success, the error message if it fails.
        try:
            cursor = conn.cursor()
            cursor.execute(explain_stmt)
            return None
        except TrinoUserError as e:
            return e.message
        except Exception as e:
            print(f"NON_TRINO_ERROR : {e}")
            return None

    def get_sql(self):
        """
        Function that returns the SQL query from a model path.
        """
        with open(self.model_path, 'r') as f:
            self.sql = f.read()

    @staticmethod
    def get_secret():
        """
        Function that fetches a secret from AWS Secrets Manager given a secret ARN.
        Note: may not be needed if we can use the user/pass to authenticate.
        """
        secret_arn = os.environ.get('TRINO_SECRET_ARN')
        session = boto3.session.Session()
        client = session.client(service_name='secretsmanager')
        get_secret_value_response = client.get_secret_value(SecretId=secret_arn)
        secret = get_secret_value_response['SecretString']
        return secret

    def explain_query(self):
        """
        Function that explains a query and stores the response.
        """
        resp = self.execute_query("EXPLAIN (TYPE LOGICAL, FORMAT JSON) " + self.sql)
        if isinstance(resp, str):
            self.explanation = resp

    def explain(self):
        self.get_sql()
        self.explain_query()
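For context, a minimal usage sketch of the class above (the model path is a placeholder; `TRINO_USERNAME`, `TRINO_PASSWORD`, and `TRINO_URL` must be set in the environment):

```python
# Hypothetical usage: point the executer at a compiled model and ask Trino to EXPLAIN it.
executer = Explain_n_Executer("target/compiled/spellbook/models/example_model.sql")
executer.explain()

# `explanation` is only set when Trino returned a user error for the query.
if getattr(executer, "explanation", None):
    print("Trino rejected the query:", executer.explanation)
else:
    print("Query explained without user errors.")
```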
@@ -0,0 +1,41 @@
"""This is the logic for ingesting SQL translation rules logs into LangChain.""" | ||
import argparse | ||
from pathlib import Path | ||
from langchain.text_splitter import CharacterTextSplitter | ||
import faiss | ||
from langchain.vectorstores import FAISS | ||
from langchain.embeddings import OpenAIEmbeddings | ||
import pickle | ||
|
||
parser = argparse.ArgumentParser(description="Which translation rules to ingest?") | ||
parser.add_argument('syntax', nargs='?', type=str, default="spark", help="spark or postgres") | ||
args = parser.parse_args() | ||
syntax = args.syntax | ||
|
||
# Ingest txt files with rules for making translations | ||
ps = list(Path('rules/'+syntax + '/').glob("**/*.txt")) | ||
data = [] | ||
sources = [] | ||
for p in ps: | ||
with open(p) as f: | ||
data.append(f.read()) | ||
sources.append(p) | ||
|
||
# Here we split the documents, as needed, into smaller chunks. | ||
# We do this due to the context limits of the LLMs. | ||
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n") | ||
docs = [] | ||
metadatas = [] | ||
for i, d in enumerate(data): | ||
print(f"Processing {i} of {len(data)}") | ||
splits = text_splitter.split_text(d) | ||
docs.extend(splits) | ||
metadatas.extend([{"source": sources[i]}] * len(splits)) | ||
|
||
|
||
# Here we create a vector store from the documents and save it to disk. | ||
store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas) | ||
faiss.write_index(store.index, f"docs_{syntax}.index") | ||
store.index = None | ||
with open(f"faiss_store_{syntax}.pkl", "wb") as f: | ||
pickle.dump(store, f) |
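A minimal sketch of how the saved artifacts might be loaded back at query time, assuming the file names written above (the example query is a placeholder):

```python
import pickle

import faiss

syntax = "spark"  # or "postgres", matching the ingest argument
index = faiss.read_index(f"docs_{syntax}.index")
with open(f"faiss_store_{syntax}.pkl", "rb") as f:
    store = pickle.load(f)
store.index = index  # reattach the index that was detached before pickling

# Return the most similar translation rules for an error message.
docs = store.similarity_search("Cannot cast varchar to timestamp", k=3)
for doc in docs:
    print(doc.metadata["source"], doc.page_content[:80])
```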