The Data Department utilizes AWS for data storage (S3), data cleaning (Glue), and querying (Athena). To connect to these services, follow the instructions below:
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- That's it, calls to S3 should work automatically
- You can use the `aws.s3` library in R or `boto3` in Python to upload, download, and manipulate S3 objects
- Certain packages, such as `arrow`, can also read from S3 directly (for example, `read_parquet("s3://bucket-name/object")`); a sketch of this pattern follows the code examples below
You can use the following snippets to test your connection in R or Python:
# Load necessary libraries
library(aws.s3)
# Test connection
aws.s3::bucket_list_df()
# Save object
aws.s3::save_object("s3://bucket-name/object", file = "filename")
# Load necessary packages
import boto3
# Establish connection
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-1'
)
# Test connection
for bucket in s3.buckets.all():
    print(bucket.name)

# Or, depending on your needs
s3 = boto3.client(
    service_name='s3',
    region_name='us-east-1',
)
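As noted above, `arrow` can also read objects straight from S3 without an explicit download step. Below is a minimal R sketch of that pattern, plus an upload via `aws.s3`; the bucket, object, and file names are placeholders, and reading directly from S3 assumes your `arrow` build includes S3 support.
# Load necessary libraries
library(arrow)
library(aws.s3)

# Read a Parquet file directly from S3 (placeholder bucket/object names)
df <- arrow::read_parquet("s3://bucket-name/path/to/object.parquet")

# Upload a local file to S3, assuming you have write access to the bucket
aws.s3::put_object(
  file = "local-file.csv",
  object = "path/to/remote-file.csv",
  bucket = "bucket-name"
)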
You can use the `noctua` R package to pull data from Athena. `noctua` has two useful options for speeding up queries: caching and unload.
- Caching allows you to re-use locally-stored query results rather than constantly re-pulling from AWS. It is useful for situations like rendering a Quarto document, where you need to re-run the whole file but may not want to re-pull data.
- The `unload = TRUE` option uses a different method of storing and transferring query results. It tends to be a bit faster on our hardware, and thus we recommend using it by default. NOTE: The `unload` option only works correctly on `noctua` >= 2.6.3. See the snippet after this list for how to enable both options.
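For illustration, here is a minimal sketch of enabling both options before querying; the cache size of 10 and the test query are just example values borrowed from the full connection example further below.
# Load necessary libraries
library(DBI)
library(noctua)

# Cache up to 10 query results locally and use unload for faster transfers
noctua_options(cache_size = 10, unload = TRUE)

# Establish connection
con <- dbConnect(noctua::athena())

# The first call pulls results from AWS; re-running the identical query
# (e.g. when re-rendering a Quarto document) reuses the cached result
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")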
To set up and use `noctua` in an R project:
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- In your root-level `.Renviron` file, add the environment variables below. Message @SweatyHandshake or @dfsnow for the name of the Athena results bucket. Save the file and restart your R session.

  AWS_REGION=us-east-1
  AWS_ATHENA_S3_STAGING_DIR=$RESULTS_BUCKET

  Make sure that `$RESULTS_BUCKET` begins with `s3://`.
- Run the following code to instantiate your connection and run a test query
# Load necessary libraries
library(DBI)
library(noctua)

# Optionally enable query caching and unload
noctua_options(cache_size = 10, unload = TRUE)

# Establish connection
AWS_ATHENA_CONN_NOCTUA <- dbConnect(
  noctua::athena(),
  # Disable the Connections tab entry for this database. Always use this if
  # you don't want to browse the tables in the Connections tab, since it
  # speeds up instantiating the connection significantly
  rstudio_conn_tab = FALSE
)

# Test the connection
dbGetQuery(
  conn = AWS_ATHENA_CONN_NOCTUA,
  "SELECT year, geoid FROM census.acs5 LIMIT 10"
)
If the connection to Athena is successful but your test query returns zero rows, you may need to install the development version of `noctua`:
remotes::install_github("dyfanjones/noctua")
# Or via renv:
renv::install("dyfanjones/noctua")
In Python, the `pyathena` package is an excellent option for ingesting data from AWS Athena.
As with R, enabling unload via `cursor(unload=True)` uses a different method of storing and transferring query results. It tends to be a bit faster on our hardware, and thus we recommend using it by default.
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- Install the `pyathena` package into your Python environment using `pip install PyAthena`
- Add the following environment variables to your environment
  - Every time we ingest data from Athena, the data has to be created as a file and stored somewhere before it arrives in our coding environment. The `$ATHENA_RESULTS_BUCKET` is our designated bucket for these intermediate files, which in this example is meant to be replaced with our actual bucket name.

  AWS_ATHENA_S3_STAGING_DIR=$ATHENA_RESULTS_BUCKET
  AWS_REGION=us-east-1
- Run the following code to instantiate your connection and run a test query
# Load necessary libraries
import os
import pandas
import pyarrow
from pyathena import connect
from pyathena.pandas.util import as_pandas
from pyathena.pandas.cursor import PandasCursor

# Connect to Athena
cursor = connect(
    # We add '+ "/"' to the end of the line below because enabling unload
    # requires that the staging directory end with a slash
    s3_staging_dir=os.getenv("AWS_ATHENA_S3_STAGING_DIR") + "/",
    region_name=os.getenv("AWS_REGION"),
    cursor_class=PandasCursor,
).cursor(unload=True)

# Define test query. Note that the query CANNOT end with a semi-colon
SQL_QUERY = "SELECT * from default.vw_pin_sale LIMIT 10"

# Execute query and return as pandas df
cursor.execute(SQL_QUERY)
df = cursor.as_pandas()
You will likely need to work with IT admins for permissions to do the following.
- Install Tableau Desktop and Tableau Prep. Verify with a core team member which versions to install. If you plan to publish a file to the CCAO's Tableau Server, the versions of Tableau Desktop and Tableau Prep cannot be more recent than the version of Tableau Server.
- Install the JDBC 2.x Driver with AWS SDK. Then move the downloaded .jar file to `C:\Program Files\Tableau\Drivers` on Windows or `~/Library/Tableau/Drivers` on Mac.
- Make sure the Java SE Development Kit is installed.
- Create a file called `athena.properties` in `C:\Users\$USER\Documents\My Tableau Repository\Datasources` on Windows or `~/Users/$USER/Documents/My Tableau Repository/Datasources` on Mac with the following lines:

  workgroup=read-only-with-scan-limit
  MetadataRetrievalMethod=ProxyAPI

  Add the same file to `C:\Users\$USER\Documents\My Tableau Prep Repository\Datasources` on Windows or `~/Users/$USER/Documents/My Tableau Prep Repository/Datasources` on Mac.
- Open Tableau and on the `Connect` sidebar under `To a Server`, navigate to `Amazon Athena`. On Tableau Prep, select the `+` next to `Connections` and navigate to `Amazon Athena`.
- Message a core team member for the necessary server info and credentials. Tableau will not save the `Secret Access Key` field.
As documented here, when making an Athena query, `noctua` tries to keep the S3 results bucket tidy. It does this by trying to delete the Athena query output from S3. If the query initiator doesn't have permission to do this, it returns the following warning:
Info: (Data scanned: 625.16 GB)
additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied
For the most part, this error message does not affect the query. However, it can cause operations such as knitting to fail. To remedy this, you can disable the deletion behavior using `noctua` caching. Inside of R, use:
noctua_options(cache_size = 10)
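For instance, a minimal sketch of how this might sit at the top of a document that gets knit; the cache size of 10 and the test query are just example values.
# Load necessary libraries
library(DBI)
library(noctua)

# Enabling caching disables noctua's attempt to delete query output from S3,
# which avoids the AccessDenied warning during knitting
noctua_options(cache_size = 10)

# Establish connection and query as usual
con <- dbConnect(noctua::athena())
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")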