- Inform people of any usage limitations or legal regulations that apply to the dataset.
  For example, if a dataset may not be used for commercial purposes, or may not even be downloaded by the public 🙅‍♂️, please note it here and provide URLs to the relevant limitations or regulations if possible.
- If your dataset requires an application, request, or permission 🤲 from its data owner, please note it here. Please also mention this in the later Data Source/Credit section.
  For example, if you need to register an account, fill in an application form on its website, or send emails to the data owner, please record any necessary information 📝.
- If you are not confident about its usage, we suggest setting your repo to private and discussing such issues with our organiser Matthew via Matthew.Altenburg@anu.edu.au.

Provide a brief introduction to the dataset. People who know nothing about this repo should be able to learn the most basic information and decide whether the dataset is what they are looking for 🧐.
You are expected to tell people the following:
- The time span or geographic coverage of the dataset 🌏
  E.g., the dataset covers teenager vaping information in the ACT from 2010 to 2020.
- The background of the dataset 📋
  E.g., the dataset was established by the ACT Government for public health research in Australia.
- The data owner, if the dataset is not original 👩‍💼
  You should give the essential information about the dataset owner and provide URLs to its publication page, GitHub repo, or personal website whenever possible.

- Where does this dataset come from?
  - Provide information on the website and data owner 👨‍💼.
- What license is the dataset under?
  - If commercial use, modification, publication, or anything else is restricted 👩‍⚖️, please acknowledge the relevant information in the Announcement section.
- (Optional) If this repo also includes a cleaned dataset:
  - Inform people that the cleaned dataset is also available under `(optional)_dataset-name_clean`.
  - If the cleaned dataset is also accessible on Hugging Face 🤗, please mention it here with the link.

After people read your Overview of the Dataset, it's time to let them know what the dataset looks like. In this section, we suggest you provide a bird's-eye view or an X-ray scan of your dataset to help people understand its structure 🔍.
For example,
- If the dataset is split into several topics/categories:
  - What does each topic/category mean?
  - Which file corresponds to which topic/category?
  - How many files/records/rows/pictures/etc. are included?
- If the dataset has multiple data formats:
  - How many formats does the dataset have in total?
  - What are the different formats and what are they used for?
  - What is the total file size for each format?
- If the dataset has deep paths/directories:
  - Why is the dataset organised in this way?
  - Which part corresponds to which data?
  - You can show a tree structure of your dataset by using the `tree` command.
- You don't have to organise your dataset in the same structure; find the best way to treat your dataset 🙇‍♂️.

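The questions above about formats, file counts, and sizes can be answered with a small script rather than by hand. The following is only a minimal sketch using the Python standard library; the function names `summarise_formats` and `format_table` are hypothetical helpers invented for this illustration, and grouping files by extension is an assumption that may not suit every dataset.

```python
from collections import defaultdict
from pathlib import Path


def summarise_formats(root):
    """Return {extension: (file_count, total_bytes)} for every file under root."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    # rglob("*") walks the folder recursively, so deep directories are included.
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lower-case the extension so ".CSV" and ".csv" are counted together.
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in sorted(counts)}


def format_table(summary):
    """Render the summary as a Markdown table that can be pasted into a README."""
    lines = ["| Format | Files | Size (bytes) |", "| --- | --- | --- |"]
    for ext, (count, size) in summary.items():
        lines.append(f"| {ext} | {count} | {size} |")
    return "\n".join(lines)
```

To use it, call `print(format_table(summarise_formats("path/to/your/dataset")))` with your own folder and paste the resulting table into this section.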
Remember the frustration the first time you used GitHub? Please keep in mind that not all GitHub users are tech-savvy, and your guidance here could spark a person's passion for programming 🤓.
Please patiently detail each step as if you were teaching a 6-year-old elementary schooler (no offence to any 6-year-old elementary schooler), or your old self when you struggled with GitHub decades ago.
You've come a long way, and you know how much people will appreciate your work 💙.
- If your dataset can be downloaded from a publication page:
  Provide URLs to the original publication page and let users decide whether they want to use your code.
- Otherwise, if your dataset is collected by using data-scraping or data-crawling scripts 👩‍💻:
  Please provide a detailed tutorial that guides people through installing environments or dependencies, running your code, and storing and post-processing the raw dataset on their local machine.
  For example, we suggest you include:
  - How to use `requirement.txt` to create and install a Python environment in `conda`?
  - How to use `python3 script_name.py` to run your Python code?
  - How to use `curl -o [dataset_name] [dataset_url]` to download from a link?
  - How to use `source script_name.sh` to run your Bash script?
  - How to follow your `notebook_name.ipynb` to walk through each step?
  - If people want to clone your repo, how to set up Git/GitHub and use the `git clone` command 🧑‍💻?
  - (Optional) If you are happy to upload your data cleaning or data collecting scripts, please organise all your tools inside `(optional)_utils`.

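If you would rather give users a Python alternative to the `curl -o` step, a download helper might look like the sketch below. It is an illustration only: `download_dataset` is a hypothetical name, and the URL and output path you pass in are placeholders you would replace with your own dataset's details.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url, destination):
    """Download url to destination, creating parent folders if needed."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A user would then call something like `download_dataset("[dataset_url]", "[dataset_name]")`, mirroring the arguments of the `curl` command in the list above.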
A repository license on GitHub is a legal document 📄 that outlines the terms and conditions under which others can use, modify, and distribute the code in your repository.
When you add a license to your repository, you're giving others permission to use your code in certain ways, while also protecting your own rights 👩⚖️ as the creator of the code.
To choose a license for your GitHub repo, please visit:
- Licensing a repository - GitHub
- Choose an open source license - Choose a License
- License Selector - ufal.github.io
- Click the gear icon ⚙️ on your repo webpage to set up:
  - Description: A short sentence about the repo.
    E.g., The Teenager Vaping Data from ACT Govt
  - Website: A link to the data owner.
  - Topics: Tags that help organise, classify and promote your repo.
    E.g., "australia", "dataset", "vape", "public-health"
- Find and add your `.gitignore` template from gitignore - GitHub.
- How to contribute to your repo?
  If you are looking for people to help you develop data-scraping scripts, or you need people to assist in data cleaning, please invite people to open an Issue 🙋 in your repo, and provide your contact information 📮 if necessary. You can also put this in the (Optional) Announcement section.