proposal.txt

PROJECT NAME
Barred Gates: Predicting College Admissions

TEAM NAME
Team Ivy

BACKGROUND AND MOTIVATION
Every year, four million students apply to college in the United States. The market for admission-related services is $7 billion. The process is fraught with anxiety and uncertainty for students and their parents. A large source of this anxiety is the lack of reasonably accurate forecasts of admission outcomes, in spite of expensive consultants and experienced high school college counselors. The common wisdom is that, once you've reached a certain level of excellence, your admission to top schools is something of a crapshoot. We aim to take a small step toward improving such predictions by making sense of the seeming randomness in the data.
The lack of precision in forecasting does have some rational explanations. Top schools have such an abundance of choice that the final selections among many qualified candidates can be somewhat random. Schools also try to balance for diversity along ethnic, economic, and geographic lines, so their choices depend on each year's application pool.
The lack of forecasting prowess is exacerbated by organizations like the College Board, which has a financial incentive to offer expensive standardized tests and study guides that offer little power to predict the suitability of a given student for a given college.
Through this project, we hope to provide some clarity, precision, and reasonably accurate forecasts of admission probabilities for prospective students. After all, each of us faced this uncertainty when we applied to university, and each of us would have benefited from some quantitative reassurance about our chances. We intend to make this available on a publicly accessible website, which is particularly timely given that many high school seniors are submitting applications as we speak--err, type.

PROJECT OBJECTIVES
First and foremost, we want to know whether there IS any latent rhyme or reason behind college admissions among the cream-of-the-crop schools. Colleges are surprisingly opaque about admissions criteria. All of us would like to find out whether this is because they are trying to be secretive about their algorithms or, more unsettlingly, because admissions decisions come down to unprincipled guidelines and the whim of the selection committee.
Secondly, if we can find clear profiles of admitted students, we want to test the common wisdom and our own intuitions about which factors matter the most. Is it the case that we've all been so heavily focused on SAT scores that we've ignored the importance of breadth in AP courses?
Finally, we aim to supplement our own technical skills by tackling a project with several moving parts. Not only will we use screen-scraping skills to download forum messages, but we will also use language processing to parse those messages into (real-world, messy) data. We will need to confront the issues of missing data and selection bias: the people who report their statistics on web forums are probably not a representative population! We will learn how to evaluate and test several different models we haven't tried before, such as random forest with regression. Lastly, we'll put our new-found visualization and communication skills to the test by designing a reactive web application for our results.
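As a rough illustration of the parsing step described above, the sketch below pulls a few fields out of a free-text forum post using regular expressions. The post layout, the field names, and the parse_post helper are all assumptions made up for this example; real message-board posts are far messier and would need many more patterns. Fields that cannot be found are left as None so that missing data stays explicit rather than being silently filled in.

```python
import re

# Hypothetical patterns; real posts vary widely in wording and layout.
PATTERNS = {
    "sat": re.compile(r"\bSAT\s*[:\-]?\s*(\d{3,4})", re.IGNORECASE),
    "gpa": re.compile(r"GPA[^\d\n]*(\d\.\d{1,2})", re.IGNORECASE),
    "decision": re.compile(r"\b(accepted|rejected|waitlisted|deferred)\b", re.IGNORECASE),
}


def parse_post(text):
    """Return a dict of extracted fields; fields not found are None."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[field] = match.group(1) if match else None
    # Convert types, leaving missing values as None.
    if row["sat"] is not None:
        row["sat"] = int(row["sat"])
    if row["gpa"] is not None:
        row["gpa"] = float(row["gpa"])
    if row["decision"] is not None:
        row["decision"] = row["decision"].lower()
    return row


# An invented post, for illustration only.
sample_post = """Decision: Accepted
SAT: 2250
Unweighted GPA: 3.92
ECs: debate, robotics"""

parsed = parse_post(sample_post)
empty = parse_post("no numbers reported in this post")
```

Keeping None for unreported fields makes it easy to count how much of each column is missing before deciding how to handle it.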

WHAT DATA?
We have three main sources of data. Two will provide student-based data, giving us the credentials of students who were either accepted or rejected by our target colleges, while the third gives us information about each college (details like admission rate and financial aid status). We will scrape the student data from the College Confidential message boards and the CollegeData database. Altogether, we estimate we will have 30,000 rows of data from these two sources. The college-based information will come from the College Board and from the U.S. News and World Report lists of top schools. This part of the data will be small, as we only aim to support 25 schools with our app.

MUST-HAVE FEATURES
What we absolutely must have is at least a linear regression model that takes a user's information (SAT scores, class rank, gender, etc.) and the school they are applying to, and spits out a probability that the user will be accepted to that school. We have chosen probabilities, rather than a simple yes or no answer, because we believe a probability is a more conservative way to report an uncertain prediction. The computation needs to be fast enough that an app can calculate it in real time. The interactive Shiny application is also a critical piece; this project is meant to be helpful for current high school students. Not only will it output the probability of getting in, but it will also tell the user which factors of the application are most important for that school.
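A minimal sketch of a model that turns applicant features into an admission probability might look like the following. Note the swap: it uses logistic regression rather than plain linear regression, since a logistic output always stays between 0 and 1. The feature names, the toy rows, and the helper functions are invented for illustration, and the features are assumed to be standardized (z-scores) before fitting. It is written in plain Python to stay self-contained; a real version would more likely lean on an existing library.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function mapping a score to (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability of admission for one standardized feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up toy rows of standardized features: [SAT z-score, class-rank z-score].
FEATURES = ["sat", "class_rank"]
X = [[1.0, 1.2], [0.8, 0.9], [0.2, -0.5], [-0.7, -0.8], [-1.3, -1.1], [0.9, 0.6]]
y = [1, 1, 0, 0, 0, 1]  # 1 = accepted, 0 = rejected

w, b = fit_logistic(X, y)
p_strong = predict_proba(w, b, [1.0, 1.0])
p_weak = predict_proba(w, b, [-1.0, -1.0])

# With standardized inputs, larger absolute weights suggest more influence.
ranking = sorted(zip(FEATURES, w), key=lambda pair: abs(pair[1]), reverse=True)
```

Because prediction is just a dot product followed by a sigmoid, it is cheap enough to run on every user interaction.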

OPTIONAL FEATURES
We also hope to try SVM and random forest with regression. If there's time, we'll also test an ensemble of all three approaches. On the user side, it would be nice if our app were reactive and allowed students to toggle various elements of their application to see how their likelihood of getting in changes as a function of those factors.
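One simple way to combine three models, sketched here under assumptions, is to average the probabilities they each produce (sometimes called soft voting). The three model functions below are stand-ins that return fixed numbers; they are not fitted models, and their names are hypothetical. An SVM does not produce probabilities directly, so its scores would need to be calibrated before being averaged this way.

```python
def ensemble_proba(models, x, weights=None):
    """Weighted average of the probabilities returned by several models."""
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    return sum(wt * model(x) for wt, model in zip(weights, models)) / total


# Stand-in models returning fixed probabilities, for illustration only.
def regression_model(x):
    return 0.30


def svm_model(x):
    return 0.50


def forest_model(x):
    return 0.40


models = [regression_model, svm_model, forest_model]
applicant = {"sat": 2200}  # hypothetical input; the stand-ins ignore it

p_equal = ensemble_proba(models, applicant)
p_weighted = ensemble_proba(models, applicant, weights=[2.0, 1.0, 1.0])
```

The optional weights leave room to favor whichever model performs best on held-out data.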

DESIGN OVERVIEW
No AWS! We plan to do most of our model fitting and testing in Python, although eventually these calculations will also need to be ported to R so that our project can interface with the Shiny application. We'll use screen scraping and NLTK to get our data. For our models, we will attempt linear regression, SVM, and random forest, all using cross-validation, of course.

VERIFICATION
We plan to verify our models in two ways: by iteratively holding out a portion of the data for testing, and by crowdsourcing, that is, asking our friends and colleagues to try out the user interface with their own information and collecting anecdotal data about how accurate our predictions seem to be.
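The hold-out idea can be sketched as k-fold cross-validation, where each row is used for testing exactly once. The helper name k_fold_indices, the fold count, and the fixed seed are assumptions chosen for this example.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the splits reproducible
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    # A model would be fit on the train_idx rows and scored on the test_idx rows here.
    pass
```

Averaging the score across the k folds gives a steadier estimate than a single train/test split.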

VISUALIZATION AND PRESENTATION
We will create a reactive app with R Shiny. This app will allow users to input their own statistics (such as AP scores, SAT scores, etc.) and the school they wish to query. It will output that student's probability of receiving an offer of admission from that school. It will also tell the user which factors of the application are most important in determining admissions decisions for that school (this will be derived from the regression betas). Finally, because the website will be reactive, users will be able to toggle different elements of their application to see whether adding another AP class or boosting their SAT scores by 100 points would significantly affect their chances of getting in. The video will consist of a live demonstration of the product!
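The toggle feature amounts to re-evaluating the model with one input changed and comparing the two probabilities. The sketch below shows that calculation in Python, where the modeling is planned, rather than in Shiny. The coefficients are made up for a hypothetical school and are not fitted to any data, and a logistic form is assumed so that the output stays between 0 and 1.

```python
import math

# Made-up coefficients for one hypothetical school; not fitted to any data.
COEFS = {"intercept": -12.0, "sat": 0.004, "ap_classes": 0.25}


def admit_probability(profile, coefs=COEFS):
    """Logistic-model probability for a profile given as {feature: value}."""
    z = coefs["intercept"] + sum(coefs[name] * value for name, value in profile.items())
    return 1.0 / (1.0 + math.exp(-z))


def what_if(profile, **changes):
    """Return (before, after, difference) in probability after changing some inputs."""
    updated = {**profile, **changes}  # copy, so the original profile is untouched
    before = admit_probability(profile)
    after = admit_probability(updated)
    return before, after, after - before


student = {"sat": 2100, "ap_classes": 5}
before, after, delta = what_if(student, sat=student["sat"] + 100)
```

In a reactive app, each slider or checkbox change would trigger one such call, with the difference shown next to the new probability.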

TIMELINE
Prior to November 9 (already done): Data structure set up, download messages from College Confidential, decide on data model and factors to include in analysis
Week of November 9: Work on parser scripts for data, use randomly generated data to start building and testing code for analysis
Week of November 14: Test various models using both training/test data splits and having actual users input their stats
Week of November 21 (Thanksgiving): Finalize model to use for user interface, begin to build Shiny app
Week of November 28: Finish building user interface, work on IPython notebook
Week of December 4: Test user interface, record video, turn the project in!!

TEAM MEMBER CONTRIBUTIONS
David: College Confidential message board scraping and parsing, data structure management
Lauren: CollegeData website scraping and parsing, IPython notebook management
Kiran: CollegeData website scraping and parsing, model fitting, stats guru
Morgan: model fitting, R Shiny web app development, video production