Develop an algorithm to give the probability of a certain partner being a relative of the congressperson #98

Closed
cuducos opened this issue Oct 31, 2016 · 7 comments

Comments

@cuducos
Collaborator

cuducos commented Oct 31, 2016

We have all the names of the congresspeople and the name of their parents.

That said, @anaschwendler and I were discussing today the possibility of having an algorithm that receives as input:

  • the name of a congressperson
  • the names of their relatives
  • the name of a partner of a company in which the congressperson has spent some public money

The algorithm would give us the probability of the following hypothesis: the partner and the congressperson are relatives.

We could balance more common family names (e.g. Silva) against less common ones (e.g. Sarney) using internal sources (we have thousands of full names in our dataset, including congresspeople and their parents) or an external database (no idea which, but that should not be a big challenge).
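A minimal sketch of how that balancing could work, assuming an IDF-like weight (rarer name tokens count for more). The `surname_weights` helper, the `PARTICLES` set and the sample names are all made up for illustration; nothing here comes from the project's dataset or code.

```python
import math
from collections import Counter

# Assumption: common Portuguese name particles carry no family signal.
PARTICLES = {"da", "de", "do", "das", "dos", "e"}

def surname_weights(full_names):
    """Map each name token to an IDF-like weight: rarer tokens score higher."""
    counts = Counter()
    for name in full_names:
        # Count each token at most once per person, ignoring particles.
        counts.update({t for t in name.lower().split() if t not in PARTICLES})
    total = len(full_names)
    return {token: math.log(total / n) for token, n in counts.items()}

# Hypothetical names, only to show the effect of the weighting.
names = [
    "Maria da Silva",
    "Pedro Silva Souza",
    "Ana Silva",
    "Lucas Sarney",
]
weights = surname_weights(names)
print(weights["silva"], weights["sarney"])
```

With this weighting, a shared "Sarney" would count as stronger evidence than a shared "Silva", since "Silva" appears in three of the four sample names.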

PS: formally we don't have company partners in our dataset, but it's on our roadmap (and maybe the development of this algorithm doesn't depend on that).

(@g4brielvs feel free to jump in!)

@gabriel-almeida

I was looking for a way to contribute to this project and this issue seems interesting, since I have already solved a similar problem. Here are some ideas:

  • This can be approached with logistic regression, since its output has a probabilistic interpretation. Also, scikit-learn implements it, and the final result is not overly complex to understand.
  • The features for learning could be: the names in common, some measures of string similarity, or a clever combination of both (I need to think more about it). Related package: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html
  • The "negative part" of the training dataset could be built from random combinations of the available people (be aware that using all combinations may be infeasible).
  • Some configurations of the learning algorithm depend on the use case:
    • If it is a kind of triage, just to filter out very bad cases (many false positives but high recall);
    • If it needs to be sure about the answer, even if that lets a few cases slip through (many false negatives but high precision).
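A rough sketch of the feature and negative-sampling ideas from the list above, under stated assumptions: `difflib.SequenceMatcher` from the standard library stands in for a Levenshtein-based similarity, and `pair_features`, `sample_negative_pairs` and all names are hypothetical. The resulting feature rows could then be fed to a classifier such as scikit-learn's `LogisticRegression`, which is not shown here.

```python
import random
from difflib import SequenceMatcher

# Assumption: name particles are ignored when counting names in common.
PARTICLES = {"da", "de", "do", "das", "dos", "e"}

def tokens(name):
    return {t for t in name.lower().split() if t not in PARTICLES}

def pair_features(name_a, name_b):
    """Return [number of shared name tokens, similarity ratio in 0..1]."""
    shared = len(tokens(name_a) & tokens(name_b))
    # difflib's ratio is a stand-in for a Levenshtein-based measure.
    similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return [shared, similarity]

def sample_negative_pairs(people, known_relatives, n_pairs, seed=0):
    """Draw random non-relative pairs instead of enumerating all combinations."""
    rng = random.Random(seed)
    negatives = set()
    attempts = 0
    # Cap the attempts so the loop ends even if too few pairs exist.
    while len(negatives) < n_pairs and attempts < n_pairs * 100:
        attempts += 1
        pair = frozenset(rng.sample(people, 2))
        if pair not in known_relatives:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)

# Made-up people and one made-up known family relationship.
people = [
    "Paulo Andrade Lima",
    "Carla Andrade Lima",
    "Maria da Silva",
    "Bruno Pereira",
    "Ana Souza",
]
known_relatives = {frozenset(("Paulo Andrade Lima", "Carla Andrade Lima"))}

positive_row = pair_features("Paulo Andrade Lima", "Carla Andrade Lima")
negative_rows = [pair_features(a, b)
                 for a, b in sample_negative_pairs(people, known_relatives, 3)]
```

Sampling a fixed number of negative pairs keeps the training set small; the number of all possible pairs grows quadratically with the number of people.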

I might do this in the future, but feel free to steal those ideas :)
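To illustrate the triage versus high-precision trade-off mentioned in the list above, here is a small sketch of how the decision threshold on the predicted probability moves precision and recall. The scores and labels are invented, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(probabilities, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are flagged as relatives."""
    flagged = [p >= threshold for p in probabilities]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Invented model outputs and true labels (1 = relatives, 0 = not relatives).
probabilities = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

# Low threshold: triage, catches every relative but flags some non-relatives.
triage = precision_recall(probabilities, labels, threshold=0.25)
# High threshold: every flagged pair is right, but one relative is missed.
strict = precision_recall(probabilities, labels, threshold=0.75)
print(triage, strict)
```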

@cuducos
Collaborator Author

cuducos commented Nov 3, 2016

This is awesome! Many thanks, @gabriel-almeida!

I'm not sure I have the right skills to code that quickly by myself, but this surely makes this issue way easier. Whoever wants to jump in, make yourself at home ; )

@eldersantos

eldersantos commented Feb 15, 2017

Wikipedia documents a lot of relationships between politicians; maybe it is a good source to scrape data from.

Also, maybe I am being silly, but the names of all partners should be public info too, shouldn't they?

@cuducos
Collaborator Author

cuducos commented Feb 15, 2017

@eldersantos sure thing: we discussed some pros and cons of using Wikipedia for that purpose in #15. Since there were some relevant cons, this issue focuses on detecting family members when we can't find that data elsewhere (Facebook, Wikipedia, etc.). Does that make sense?

@eldersantos

eldersantos commented Feb 15, 2017 via email

@cuducos
Collaborator Author

cuducos commented Feb 15, 2017

No need to say sorry — this link was in fact missing here in this thread ; )

@cuducos cuducos removed this from the Roadmap: Nepotism milestone Mar 24, 2017
@cuducos
Collaborator Author

cuducos commented Mar 24, 2017

@gabriel-almeida has made an awesome contribution in #119; further discussion is welcome there.

@cuducos cuducos closed this as completed Mar 24, 2017
cuducos pushed a commit that referenced this issue Feb 28, 2018