Data Science Specialization Capstone Project

This is the Capstone Project for the Johns Hopkins University Data Science Specialization on Coursera.

Project Overview

Excerpt from the project introduction page

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
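To make the idea of "three options for the next word" concrete, here is a minimal sketch of a frequency-based n-gram predictor with backoff. It is an illustration only, not the model built in this repository or the one used by SwiftKey. The toy corpus, the function names `build_ngram_counts` and `predict_next`, and the parameters `n` and `k` are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_ngram_counts(sentences, n=3):
    """Count which word follows each context of 1 to n-1 preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for order in range(1, n):  # context lengths 1 .. n-1
            for i in range(len(tokens) - order):
                context = tuple(tokens[i:i + order])
                counts[context][tokens[i + order]] += 1
    return counts


def predict_next(counts, text, k=3, n=3):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    for order in range(n - 1, 0, -1):  # try the longest context first
        context = tuple(tokens[-order:])
        if len(context) == order and context in counts:
            return [word for word, _ in counts[context].most_common(k)]
    return []


# Toy corpus, invented for illustration only (not the HC Corpora data).
toy_corpus = [
    "I went to the gym",
    "I went to the gym",
    "I went to the gym",
    "I went to the store",
    "I went to the store",
    "I went to the restaurant",
    "she walked in the park",
]

counts = build_ngram_counts(toy_corpus)
suggestions = predict_next(counts, "I went to the")
print(suggestions)  # ['gym', 'store', 'restaurant']
```

With this toy corpus, the two-word context "to the" is followed by "gym" three times, "store" twice, and "restaurant" once, so those are the three suggestions. A realistic model would need a much larger corpus, proper tokenization, and smoothing for unseen contexts.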

Data Source

The data is from a corpus called HC Corpora. As explained on their website:

HC corpora is a collection of corpora for various languages freely available to download. The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. I have strived to search from many different types of sources, such as newspapers, magazines, (personal and professional) blogs and Twitter updates.

The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language. Each entry is tagged with its date of publication. Where user comments are included they will be tagged with the date of the main entry. Each entry is tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog). If possible, each entry is tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it will be tagged with "sports" subject). In many cases it's not feasible to tag the entries or no subject is found by the automated process, in which case the entry is tagged with a '0'. To save space, the subject and type are given as numerical codes.

Note: the automatic language checker sometimes fails to differentiate very similar languages. This is why there is some foreign text in the files.

The raw data can be downloaded from here.

Additional Resources

Appendix - Specialization content

The Specialization is divided into nine courses:

The Data Scientist's Toolbox
R Programming
Getting and Cleaning Data
Exploratory Data Analysis
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Developing Data Products

Notes and course projects are available here.

sebastienplat/datascienceCapstone

Folders and files

Latest commit

History

Repository files navigation

Data Science Specialization Capstone Project

Project Overview

Data Source

Additional Resources

Appendix - Specialization content

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages