Merge pull request #1 from duneanalytics/add-llm-code
main file for translation
couralex6 authored Apr 7, 2023
2 parents 354c238 + cc110eb commit a94940e
Showing 31 changed files with 1,341 additions and 446 deletions.
23 changes: 16 additions & 7 deletions Pipfile
@@ -1,16 +1,25 @@
# Pipfile for translation docstore (to be replaced with docker-compose)
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[[source]]
url = "https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-2.12.0-py3-none-any.whl"
verify_ssl = false
name = "tensorflow"

[packages]
langchain = "0.0.129"
tenacity = "8.2.2"
faiss-cpu = "1.7.3"
openai = "0.27.2"
transformers = "4.27.4"
tiktoken = "0.3.3"
pytest = "7.2.2"
boto3 = "1.26.108"
trino = "0.322.0"
dbt-trino = "1.4.1"
dbt-databricks = "1.2.1"
numpy = "2.0.12"
pre-commit = "2.20.0"
pytest = "7.1.3"
trino = "0.321.0"


[requires]
python_version = "3.9"
python_version = "3.10"
890 changes: 640 additions & 250 deletions Pipfile.lock

Large diffs are not rendered by default.

213 changes: 24 additions & 189 deletions README.md
@@ -1,197 +1,32 @@
![spellbook-logo@10x](https://user-images.githubusercontent.com/2520869/200791687-76f1bc4f-05d0-4384-a753-e3b5da0e7a4a.png#gh-light-mode-only)
![spellbook-logo-negative_10x](https://user-images.githubusercontent.com/2520869/200865128-426354af-8059-494d-83f7-46947aae271c.png#gh-dark-mode-only)
# Translation Docstore for DuneSQL

Welcome to your [Spellbook](https://youtu.be/o7p0BNt7NHs). Cast a magical incantation to tame the blockchain.
This spike is based loosely on the Langchain example for building a [question answering database on Notion](https://github.com/hwchase17/notion-qa).

📖 Documentation of models can be found [here](https://spellbook-docs.dune.com/#!/overview), with a full example contribution walkthrough [here](https://dune.com/docs/spellbook/getting-started/)
## Methodology
The inspiration for this spike is to turn the [Dune Migration docs](https://dune.com/docs/query/syntax-differences/#syntax-comparison) into a structured vector database that can be used to manage simple translation tasks.

### Heads up
Working on something new? Open a draft PR to let other wizards know you're working on it to help minimize duplicated work.
Vector databases use embedding similarity scores to return the top N documents most similar to a query (see the [LangChain "stuff" chain](https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html?highlight=stuff#the-stuff-chain)). If our docstore contains documents on how to do syntax translations, the most relevant ones are inserted directly into the prompt, as sketched below. This allows us to provide a wide range of instructions while preserving precious token space, inserting them only when needed.
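A minimal sketch of that retrieval step, assuming a FAISS store built by `ingest.py` (the snippet text and variable names are illustrative):

```python
# Minimal sketch: fetch the most similar translation rules and stuff them into a prompt.
# Assumes `store` is a LangChain FAISS vector store built by ingest.py (names illustrative).
failing_snippet = "SELECT date_trunc('day', block_time) FROM ethereum.transactions"

# Top-N rule documents by embedding similarity to the failing snippet.
rule_docs = store.similarity_search(failing_snippet, k=3)

# "Stuff" the retrieved rules directly into the translation prompt.
rules_text = "\n\n".join(doc.page_content for doc in rule_docs)
prompt = (
    "Translate the following Spark SQL snippet to DuneSQL (Trino).\n"
    f"Relevant syntax rules:\n{rules_text}\n\n"
    f"Snippet:\n{failing_snippet}"
)
```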

Looking for abstractions from the V1 engine? We moved them to [dune-v1-abstractions](https://github.com/duneanalytics/dune-v1-abstractions).
## Considerations
The max-tokens constraint is a significant issue when considering LLM translation. For Spellbook, we have landed on sending only the lines surrounding an error and translating those (a rough sketch follows). We think this is a better path than trying to translate entire queries that can be thousands of lines long.
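A rough sketch of that windowing idea, assuming the failing line number is known (this helper is illustrative, not the actual Spellbook implementation):

```python
def lines_around_error(query: str, error_line: int, window: int = 5) -> str:
    """Return only the lines surrounding the failing line of a long query."""
    lines = query.splitlines()
    start = max(error_line - 1 - window, 0)          # error_line is 1-indexed
    end = min(error_line - 1 + window + 1, len(lines))
    return "\n".join(lines[start:end])
```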

## Intro
## Tests
Tests include a small set of query snippets that fail in Spellbook together with their expected translations; an illustrative example is sketched below.
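For illustration, such a test might look like the following (the `translate` import and the error text are hypothetical; the real tests and entry point live under the llm-tests root):

```python
import pytest

# Hypothetical import: the actual entry point lives in translation_docstore/main.py
# and may have a different name/signature.
from translation_docstore.main import translate


@pytest.mark.parametrize(
    "snippet,error,expected_fragment",
    [
        (
            # Spark-style date formatting that fails on DuneSQL (Trino).
            "date_format(block_time, 'yyyy-MM-dd')",
            "date_format pattern not supported",  # illustrative error text
            "'%Y-%m-%d'",  # Trino's date_format uses MySQL-style patterns
        ),
    ],
)
def test_snippet_translation(snippet, error, expected_fragment):
    # The translated snippet should contain the Trino-style format string.
    assert expected_fragment in translate(snippet, error)
```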

Write SQL to transform blockchain data into curated datasets on [dune.com](https://dune.com/home).
## How to run this
Temporary instructions for running this spike.
1) Install virtual environment using the pipfile (`pipenv install` from llm-tests root)
(stop yelling at me, I'll convert it to Docker-compose soon)
2) Either enter the virtual environment (`pipenv shell`) or configure your IDE to use the virtual environment.
3) Set your environment variables for OpenAI (`export OPENAI_API_KEY=123`).
4) Run the tests (`pytest` from llm-tests root).
5) Run the script (`python translation_docstore/main.py --snippet "Your sql to translate" --error "Your Error"`).
The script will also run with a default snippet and error if no arguments are passed.

First-time visitor? Check out how to [get started](#getting-started) below and visit the [Spellbook Getting Started Guide](https://dune.com/docs/spellbook/getting-started/). More tk.

Been here before? An archive of intermediate datasets that were contributed to Dune v1 can be consulted [here](https://github.com/duneanalytics/dune-v1-abstractions).

## Getting Started

### Prerequisites

- Fork this repo and clone your fork locally. See Github's [guide](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) on contributing to projects.
- python 3.9 installed. Our recommendation is to follow the [Hitchhiker's Guide to Python](https://docs.python-guide.org/starting/installation/)
- [pip](https://pip.pypa.io/en/stable/installation/) installed
- [pipenv](https://pypi.org/project/pipenv/) installed
- paths for both pip and pipenv are set (this should happen automatically but sometimes does not). If you run into issues like "pipenv: command not found", try troubleshooting with the pip or pipenv documentation.

### Initial Installation

You can watch the video version of this if you scroll down a bit.

Navigate to the abstraction repo within your CLI (Command line interface).

```console
cd user\directory\github\spellbook
# Change this to wherever spellbooks are stored locally on your machine.
```

Use the pipfile to create a pipenv.

```console
pipenv install
```

If the env is created successfully, skip ahead to `pipenv shell`.

Our script looks for a specific python version, so an error about a wrong python version is fairly likely. If that error occurs, check your python version with:

```console
py --version
```

Now use any text editor program to change the python version in the pipfile within the spellbook directory to your python version. You need to have at least python 3.9.
If you have changed the python version in the pipfile, run `pipenv install` again.

You are now ready to activate this project's virtual environment.
Use:

```console
pipenv shell
```

You have now created a virtual environment for this project. You can read more about virtual environments [here](https://realpython.com/pipenv-guide/).

To initiate the dbt project run:

```console
dbt init
```

Enter the values as shown below:

```console
Which database would you like to use?
[1] databricks
[2] spark

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
host (yourorg.databricks.com): .
http_path (HTTP Path): .
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX):
[1] use Unity Catalog
[2] not use Unity Catalog
Desired unity catalog option (enter a number): 2
schema (default schema that dbt will build objects in): wizard
threads (1 or more) [1]: 2
```

This will not connect to the database, but it gives you access to some dbt actions.
**When you are prompted to choose a schema, please enter `wizard` so we know you are an external contributor.**
Should you make an error during this process (not entering `wizard` is the only one that really matters), simply quit the CLI and start over.

To pull the dbt project dependencies run:

```console
dbt deps
```

Then, run the following command:

```console
dbt compile
```

dbt compile will compile the Jinja- and SQL-templated models into plain SQL which can be executed in the Dune UI. Your spellbook directory now has a folder named `target` containing plain SQL versions of all models in Dune. If you have made changes to the repo before completing all these actions, you can now be certain that at least the compile process works correctly; if there are big errors, the compile process will not complete.
If you haven't made changes to the directory beforehand, you can now start adding, editing, or deleting files within the repository.
Afterwards, simply run `dbt compile` again once you are finished with your work in the directory and test the plain SQL queries on dune.com.

### Coming back

If you have done this installation on your machine once, to get back into dbt, simply navigate to the spellbook repo, run `pipenv shell`, and you can run `dbt compile` again.

### What did I just do?

You now have the ability to compile your dbt model statements and test statements into plain SQL. This allows you to test those queries on the usual dune.com environment and should therefore lead to a better experience while developing spells. Running the queries will immediately give you feedback on typos, logical errors, or mismatches.
This in turn will help us deploy these spells faster and avoid any potential mistakes.

We are thinking about better solutions to make more dbt actions available directly but we also have to consider security.

### How to use dbt to create spells

There are a couple of new concepts to consider when making spells in dbt. The most common ones wizards will encounter are refs, sources, freshness, and tests.

In the body of each query, tables are referred to either as refs, e.g. `{{ ref('1inch_ethereum') }}`, or sources, e.g. `{{ source('ethereum', 'traces') }}`. Refs point to other dbt models and should use the model's file name, like `1inch_ethereum.sql`, even if the model itself is aliased. Sources refer to "raw" data or tables/views not generated by dbt. Using refs and sources allows us to automatically build dependency trees.

Sources and models are defined in schema.yml files where tests and other attributes are defined.

The best practice is to add unique and not_null tests on the primary key of every new model. Similarly, a freshness check should be added to every new source (although we will try not to re-test freshness if the source is used elsewhere).

Adding descriptions to tables and columns will help people find and use your tables.

```yaml
models:
- name: 1inch_ethereum
description: "Trades on 1inch, a DEX aggregator"
columns:
- name: tx_hash
description: "Table primary key: a transaction hash (tx_hash) is a unique identifier for a transaction."
tests:
- unique
- not_null

sources:
- name: ethereum
freshness:
warn_after: { count: 12, period: hour }
error_after: { count: 24, period: hour }
tables:
- name: traces
loaded_at_field: block_time
```
See links to more docs on dbt below.
### Generating and serving documentation:
To generate documentation and view it as a website, run the following commands:
- `dbt docs generate`
- `dbt docs serve`
You must have set up dbt with `dbt init` but you don't need database credentials to run these commands.

See [dbt docs documentation](https://docs.getdbt.com/docs/building-a-dbt-project/documentation) for more information on
how to contribute to documentation.

As a preview, you can do [things](https://docs.getdbt.com/reference/resource-properties/description) like:

- Write simple one or many line descriptions of models or columns.
- Write longer descriptions as code blocks using markdown.
- Link to other models in your descriptions.
- Add images / project logos from the repo into descriptions.
- Use HTML in your description.

### Troubleshooting

If you fail to run `dbt compile`, here are some common error messages:

- `Could not find profile named 'spellbook'` <br> Check `~/.dbt/profiles.yml` and make sure there is a profile named `spellbook`. When you run `dbt init` to initiate a project, a profile gets created. Inside `spellbook` you cannot initiate a project with the same name, so you need to run `dbt init spellbook` outside the project so it creates the profile, or create one under a different name and then manually edit the `profiles.yml` file.
- ```console
Credentials in profile "spellbook", target "dev" invalid: Runtime Error
http connection method requires additional dependencies.
Install the additional required dependencies with pip install dbt-spark[PyHive]
```
You've probably selected the `spark` option instead of the `databricks` option when running `dbt init`. Rerun `dbt init`, overwrite the profile, and select the `databricks` option.

### DBT Resources:

- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices
## TODO
- Add many more rules to the doc store
- Store the vector db in an accessible place
- Add more unit tests
- Connect to Alex's work on spells, possibly via API?
Binary file added docs_spark.index
Binary file not shown.
77 changes: 77 additions & 0 deletions explain_cls.py
@@ -0,0 +1,77 @@
import json
import os
from fnmatch import fnmatch

import boto3
from trino import dbapi
from trino.auth import BasicAuthentication
from trino.exceptions import TrinoUserError


class Explain_n_Executer:
def __init__(self, model_path):
self.model_path = os.path.join(os.getcwd(), model_path)

def execute_query(self, explain_stmt):
"""
Function that executes a query passed as a string against the trino server. We would like to use aws secrets
manager to authenticate.
"""
username = os.environ.get('TRINO_USERNAME')
password = os.environ.get('TRINO_PASSWORD')

# Creating a connection to the trino server
trino_host = os.environ.get('TRINO_URL')
conn = dbapi.connect(
host=trino_host,
port=443,
auth=BasicAuthentication(username, password),
http_scheme="https",
client_tags=["routingGroup=sandbox"],
)
# try executing the query and returning the response. return error if it fails.
try:
cursor = conn.cursor()
cursor.execute(explain_stmt)
return None
except TrinoUserError as e:
return e.message
except Exception as e:
print(f"NON_TRINO_ERROR : {e}")
return None


def get_sql(self):
"""
Function that returns the SQL query from a model path.
"""
with open(self.model_path, 'r') as f:
self.sql = f.read()


@staticmethod
def get_secret():
"""
Function that fetches a secret from AWS Secrets Manager given a secret ARN.
Note: may not be needed if we can use the user/pass to authenticate.
"""
secret_arn = os.environ.get('TRINO_SECRET_ARN')
session = boto3.session.Session()
client = session.client(service_name='secretsmanager')
get_secret_value_response = client.get_secret_value(SecretId=secret_arn)
secret = get_secret_value_response['SecretString']
return secret


def explain_query(self):
"""
Function that explains a query and returns the response.
"""
resp = self.execute_query("EXPLAIN (TYPE LOGICAL, FORMAT JSON) " + self.sql)
if type(resp) == str:
self.explanation = resp

def explain(self):
self.get_sql()
self.explain_query()
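A minimal usage sketch for this class, assuming the Trino connection environment variables (`TRINO_USERNAME`, `TRINO_PASSWORD`, `TRINO_URL`) are set; the model path is illustrative:

```python
from explain_cls import Explain_n_Executer

# Point the executer at a compiled model and ask Trino to EXPLAIN it.
executer = Explain_n_Executer("target/compiled/example_spell.sql")  # illustrative path
executer.explain()

# `explanation` is only set when Trino returns a user error for the EXPLAIN statement.
if getattr(executer, "explanation", None):
    print(executer.explanation)
```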

Binary file added faiss_store_spark.pkl
Binary file not shown.
41 changes: 41 additions & 0 deletions ingest.py
@@ -0,0 +1,41 @@
"""This is the logic for ingesting SQL translation rules logs into LangChain."""
import argparse
from pathlib import Path
from langchain.text_splitter import CharacterTextSplitter
import faiss
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
import pickle

parser = argparse.ArgumentParser(description="Which translation rules to ingest?")
parser.add_argument('syntax', nargs='?', type=str, default="spark", help="spark or postgres")
args = parser.parse_args()
syntax = args.syntax

# Ingest txt files with rules for making translations
ps = list(Path('rules/'+syntax + '/').glob("**/*.txt"))
data = []
sources = []
for p in ps:
with open(p) as f:
data.append(f.read())
sources.append(p)

# Here we split the documents, as needed, into smaller chunks.
# We do this due to the context limits of the LLMs.
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs = []
metadatas = []
for i, d in enumerate(data):
print(f"Processing {i} of {len(data)}")
splits = text_splitter.split_text(d)
docs.extend(splits)
metadatas.extend([{"source": sources[i]}] * len(splits))


# Here we create a vector store from the documents and save it to disk.
store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
faiss.write_index(store.index, f"docs_{syntax}.index")
store.index = None
with open(f"faiss_store_{syntax}.pkl", "wb") as f:
pickle.dump(store, f)
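Because `ingest.py` writes the raw FAISS index and the pickled store separately, reusing the store later means re-attaching the index after loading. A minimal sketch for the spark ruleset (the query string is illustrative):

```python
import pickle

import faiss

# Load the persisted vector store and re-attach its FAISS index
# (ingest.py strips the index before pickling).
index = faiss.read_index("docs_spark.index")
with open("faiss_store_spark.pkl", "rb") as f:
    store = pickle.load(f)
store.index = index

# The store can now serve similarity searches over the translation rules.
docs = store.similarity_search("how do I translate date_trunc?", k=2)
```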
