The Data Department utilizes AWS for data storage (S3), data cleaning (Glue), and querying (Athena). To connect to these services, follow the instructions below:
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- That's it, calls to S3 should work automatically
- You can use the `aws.s3` library in R or `boto3` in Python to upload, download, and manipulate S3 objects
- Certain packages, such as `arrow`, can also read from S3 directly (for example, `read_parquet("s3://bucket-name/object")`); a sketch of this pattern follows the code examples below
You can use the following snippets to test your connection in R or Python:
# Load necessary libraries
library(aws.s3)
# Test connection
aws.s3::bucket_list_df()
# Save object
aws.s3::save_object("s3://bucket-name/object", file = "filename")
# Load necessary packages
import boto3
# Establish connection
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-1'
)
# Test connection
for bucket in s3.buckets.all():
    print(bucket.name)

# Or, depending on your needs
s3 = boto3.client(
    service_name='s3',
    region_name='us-east-1',
)
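As noted above, `arrow` can also read objects straight from S3 without an explicit download step. Below is a minimal R sketch of that pattern, plus an upload via `aws.s3`; the bucket, object, and file names are placeholders, and reading directly from S3 assumes your `arrow` build includes S3 support.
# Load necessary libraries
library(arrow)
library(aws.s3)

# Read a Parquet file directly from S3 (placeholder bucket/object names)
df <- arrow::read_parquet("s3://bucket-name/path/to/object.parquet")

# Upload a local file to S3, assuming you have write access to the bucket
aws.s3::put_object(
  file = "local-file.csv",
  object = "path/to/remote-file.csv",
  bucket = "bucket-name"
)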
You can use the `noctua` R package to pull data from Athena. `noctua` has two useful options for speeding up queries: caching and unload.
- Caching allows you to re-use locally-stored query results rather than constantly re-pulling from AWS. It is useful for situations like rendering a Quarto document, where you need to re-run the whole file but may not want to re-pull data.
- The `unload = TRUE` option uses a different method of storing and transferring query results. It tends to be a bit faster on our hardware, and thus we recommend using it by default. NOTE: The `unload` option only works correctly on `noctua` >= 2.6.3. See the snippet after this list for how to enable both options.
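For illustration, here is a minimal sketch of enabling both options before querying; the cache size of 10 and the test query are just example values borrowed from the full connection example further below.
# Load necessary libraries
library(DBI)
library(noctua)

# Cache up to 10 query results locally and use unload for faster transfers
noctua_options(cache_size = 10, unload = TRUE)

# Establish connection
con <- dbConnect(noctua::athena())

# The first call pulls results from AWS; re-running the identical query
# (e.g. when re-rendering a Quarto document) reuses the cached result
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")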
To set up and use `noctua` in an R project:
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- In your root-level `.Renviron` file, add the environment variables below. Message @SweatyHandshake or @dfsnow for the name of the Athena results bucket. Save the file and restart your R session.

  AWS_REGION=us-east-1
  AWS_ATHENA_S3_STAGING_DIR=$RESULTS_BUCKET

  Make sure that `$RESULTS_BUCKET` begins with `s3://`.
- Run the following code to instantiate your connection and run a test query
# Load necessary libraries
library(DBI)
library(noctua)

# Optionally enable query caching and unload
noctua_options(cache_size = 10, unload = TRUE)

# Establish connection
AWS_ATHENA_CONN_NOCTUA <- dbConnect(
  noctua::athena(),
  # Disable the Connections tab entry for this database. Always use this if
  # you don't want to browse the tables in the Connections tab, since it
  # speeds up instantiating the connection significantly
  rstudio_conn_tab = FALSE
)

# Test the connection
dbGetQuery(
  conn = AWS_ATHENA_CONN_NOCTUA,
  "SELECT year, geoid FROM census.acs5 LIMIT 10"
)
If the connection to Athena is successful but your test query returns zero rows, you may need to install the development version of `noctua`:
remotes::install_github("dyfanjones/noctua")
# Or via renv:
renv::install("dyfanjones/noctua")
In Python, the `pyathena` package is an excellent option for ingesting data from AWS Athena.
As with R, enabling unload via `cursor(unload=True)` uses a different method of storing and transferring query results. It tends to be a bit faster on our hardware, and thus we recommend using it by default.
- Install and set up the AWS CLI and aws-mfa
- Authenticate with `aws-mfa` via the command line
- Install the `pyathena` package into your Python environment using `pip install PyAthena`
- Add the following environment variables to your environment
  - Every time we ingest data from Athena, the data has to be created as a file and stored somewhere before it arrives in our coding environment. The `$ATHENA_RESULTS_BUCKET` is our designated bucket for these intermediate files, which in this example is meant to be replaced with our actual bucket name.

  AWS_ATHENA_S3_STAGING_DIR=$ATHENA_RESULTS_BUCKET
  AWS_REGION=us-east-1
- Run the following code to instantiate your connection and run a test query
# Load necessary libraries
import os
import pandas
import pyarrow
from pyathena import connect
from pyathena.pandas.util import as_pandas
from pyathena.pandas.cursor import PandasCursor

# Connect to Athena
cursor = connect(
    # We add '+ "/"' to the end of the line below because enabling unload
    # requires that the staging directory end with a slash
    s3_staging_dir=os.getenv("AWS_ATHENA_S3_STAGING_DIR") + "/",
    region_name=os.getenv("AWS_REGION"),
    cursor_class=PandasCursor,
).cursor(unload=True)

# Define test query. Note that the query CANNOT end with a semi-colon
SQL_QUERY = "SELECT * from default.vw_pin_sale LIMIT 10"

# Execute query and return as pandas df
cursor.execute(SQL_QUERY)
df = cursor.as_pandas()
You will likely need to work with IT admins for permissions to do the following.
- Install Tableau Desktop and Tableau Prep. Verify with a core team member which versions to install. If you plan to publish a file to the CCAO's Tableau Server, the versions of Tableau Desktop and Tableau Prep cannot be more recent than the version of Tableau Server.
- Install the JDBC 2.x Driver with AWS SDK. Then move the downloaded .jar file to `C:\Program Files\Tableau\Drivers` on Windows or `~/Library/Tableau/Drivers` on Mac.
- Make sure the Java SE Development Kit is installed.
- Create a file called `athena.properties` in `C:\Users\$USER\Documents\My Tableau Repository\Datasources` on Windows or `~/Users/$USER/Documents/My Tableau Repository/Datasources` on Mac with the following lines:

  workgroup=read-only-with-scan-limit
  MetadataRetrievalMethod=ProxyAPI

  Add the same file to `C:\Users\$USER\Documents\My Tableau Prep Repository\Datasources` on Windows or `~/Users/$USER/Documents/My Tableau Prep Repository/Datasources` on Mac.
- Open Tableau and on the `Connect` sidebar under `To a Server`, navigate to `Amazon Athena`. On Tableau Prep, select the `+` next to `Connections` and navigate to `Amazon Athena`.
- Message a core team member for the necessary server info and credentials. Tableau will not save the `Secret Access Key` field.
As documented here, when making an Athena query, `noctua` tries to keep the S3 results bucket tidy. It does this by trying to delete the Athena query output from S3. If the query initiator doesn't have permission to do this, it returns the following warning:
Info: (Data scanned: 625.16 GB)
additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied
For the most part, this error message does not affect the query. However, it can cause operations such as knitting to fail. To remedy this, you can disable the deletion behavior using `noctua` caching. Inside of R, use:
noctua_options(cache_size = 10)
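For instance, a minimal sketch of how this might sit at the top of a document that gets knit; the cache size of 10 and the test query are just example values.
# Load necessary libraries
library(DBI)
library(noctua)

# Enabling caching disables noctua's attempt to delete query output from S3,
# which avoids the AccessDenied warning during knitting
noctua_options(cache_size = 10)

# Establish connection and query as usual
con <- dbConnect(noctua::athena())
df <- dbGetQuery(con, "SELECT year, geoid FROM census.acs5 LIMIT 10")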