Develop an algorithm to give the probability of a certain partner be a relative of the congressperson #98

cuducos · 2016-10-31T15:34:57Z

We have all the names of the congresspeople and the name of their parents.

That said @anaschwendler and I were discussing today the possibility of having an algorithm that receives as an input:

- the name of a congressperson
- the names of their relatives
- the name of a partner of a company in which the congressperson has spent some public money

The algorithm would give us the probability of the following hypothesis: the partner and the congressperson are relatives.

We could balance more common family names (e.g. Silva) against less common ones (e.g. Sarney) using internal sources (we have thousands of full names in our dataset, including congresspeople and their parents) or an external database (no idea which one, but that should not be a big challenge).
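To make the balancing idea concrete, here is a minimal sketch, not a proposal for the final algorithm. It assumes an IDF-style weight (my choice, nothing decided in this thread) computed from a list of reference names, so that sharing a rare family name counts for more than sharing a common one. The function names and the reference names are hypothetical and made up for illustration.

```python
import math
from collections import Counter

# Hypothetical helper names; the reference names below are made up for the example.
PARTICLES = {"de", "da", "do", "das", "dos", "e"}

def family_tokens(full_name):
    """Lowercased tokens after the first (given) name, minus common particles."""
    tokens = [t for t in full_name.lower().split() if t not in PARTICLES]
    return set(tokens[1:])

def rarity_weights(reference_names):
    """IDF-style weight per family name: the rarer the name, the higher the weight."""
    counts = Counter()
    for name in reference_names:
        counts.update(family_tokens(name))
    total = len(reference_names)
    return {tok: math.log((1 + total) / (1 + n)) for tok, n in counts.items()}

def shared_name_score(name_a, name_b, weights):
    """Sum of the weights of the family names both people share.
    Names missing from the reference list contribute nothing here."""
    shared = family_tokens(name_a) & family_tokens(name_b)
    return sum(weights.get(tok, 0.0) for tok in shared)

reference = ["Maria Silva", "Joao Silva", "Ana Silva Costa", "Carlos Sarney"]
weights = rarity_weights(reference)
common = shared_name_score("Pedro da Silva", "Maria Silva", weights)
rare = shared_name_score("Lucia Sarney", "Carlos Sarney", weights)
print(f"Silva match: {common:.3f}, Sarney match: {rare:.3f}")
```

With a real reference list (our own dataset or an external one), the same score could be fed into whatever model ends up estimating the probability.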

PS: formally we don't have company partners in our dataset, but it's on our roadmap (and maybe the development of this algorithm doesn't depend on that).

(@g4brielvs feel free to jump in!)

gabriel-almeida · 2016-11-02T22:08:06Z

I was looking for some way to contribute to this project and this issue seems interesting, since I have already solved a similar problem. Here are some ideas:

- This can be approached with logistic regression, since it has a probabilistic interpretation. scikit-learn also implements it, and the final result is not overly complex to interpret.
- The features for learning could be the names in common, some measures of string similarity, or a clever combination of both (I need to think more about it). Related package: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html
- The "negative part" of the training dataset could be built from random combinations of the available people (be aware that using all combinations may be infeasible).
- Some configurations of the learning algorithm depend on the use case:
  - If it is a kind of triage, just to filter out very bad cases: accept many false positives in exchange for high recall;
  - If it needs to be sure about the answer, even if it lets a few cases slip through: accept many false negatives in exchange for high precision.
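The ideas above can be sketched roughly as follows. This is only an illustration under several assumptions: `difflib.SequenceMatcher` from the standard library stands in for the Levenshtein package, a hand-rolled gradient descent stands in for scikit-learn's logistic regression so the example has no dependencies, randomly sampled pairs are assumed to be unrelated, and all names and helper functions are made up.

```python
import math
import random
from difflib import SequenceMatcher

def features(name_a, name_b):
    """Two toy features: share of name tokens in common and string similarity."""
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    jaccard = len(a & b) / len(a | b) if a | b else 0.0
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return [jaccard, ratio]

def negative_pairs(people, known_pairs, n, seed=0):
    """Sample n random pairs outside known_pairs instead of using all combinations.
    Assumes unique names and that every known pair comes from `people`."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in known_pairs}
    limit = len(people) * (len(people) - 1) // 2 - len(known)
    if n > limit:
        raise ValueError("not enough unlabelled pairs to sample from")
    sampled = set()
    while len(sampled) < n:
        pair = frozenset(rng.sample(people, 2))
        if pair not in known:
            sampled.add(pair)
    return sorted(tuple(sorted(p)) for p in sampled)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the log-loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict_proba(model, name_a, name_b):
    """Estimated probability that the two names belong to relatives."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(name_a, name_b))) + b)

def precision_recall(probs, labels, threshold):
    """Precision and recall when every pair with prob >= threshold is flagged."""
    flagged = [p >= threshold for p in probs]
    tp = sum(f and l == 1 for f, l in zip(flagged, labels))
    fp = sum(f and l == 0 for f, l in zip(flagged, labels))
    fn = sum((not f) and l == 1 for f, l in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up people: each consecutive pair is labelled as relatives.
people = ["Ana Sarney Lima", "Carlos Sarney Lima", "Bruno Costa Prado",
          "Diana Costa Prado", "Elisa Rocha Melo", "Fabio Rocha Melo",
          "Gil Nunes Teixeira", "Helena Nunes Teixeira"]
relatives = [(people[i], people[i + 1]) for i in range(0, len(people), 2)]
unrelated = negative_pairs(people, relatives, n=len(relatives))
X = [features(a, b) for a, b in relatives + unrelated]
y = [1] * len(relatives) + [0] * len(unrelated)
model = fit_logistic(X, y)

p_rel = predict_proba(model, "Ivo Sarney Lima", "Julia Sarney Lima")
p_unrel = predict_proba(model, "Ivo Sarney Lima", "Katia Borges Pinto")
print(f"same family names: {p_rel:.2f}, different names: {p_unrel:.2f}")
```

For the two use cases, `precision_recall` can be evaluated at different thresholds on held-out labelled pairs: a low threshold suits triage (high recall), a high one suits the "make sure" case (high precision).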

I might do this in the future, but feel free to steal those ideas :)

cuducos · 2016-11-03T11:21:22Z

This is awesome! Many thanks, @gabriel-almeida!

I'm not sure I have the right skills to code that quickly by myself, but this surely makes the issue way easier. Whoever wants to jump in, make yourself at home ; )

eldersantos · 2017-02-15T14:01:57Z

Wikipedia documents a lot of relationships between politicians, so maybe it is a good source to scrape data from.

Also, maybe I am being silly, but the names of all partners should be public info too, shouldn't they?

cuducos · 2017-02-15T14:21:07Z

@eldersantos sure thing, we've discussed some pros & cons of using Wikipedia for that purpose at #15. As there were some relevant cons, this issue is more focused on detecting family members when we can't find that data elsewhere (Facebook, Wikipedia, etc.). Does that make sense?

eldersantos · 2017-02-15T14:46:00Z

Absolutely, sorry for not checking that subject on the other issue. In that case we definitely need an algorithm to solve it. Maybe we can check the literature (academic papers) for a good approach in terms of complexity and time :)


cuducos · 2017-02-15T14:54:21Z

No need to say sorry — this link was in fact missing here in this thread ; )

cuducos · 2017-03-24T13:36:37Z

@gabriel-almeida has made an awesome contribution in #119; further discussion is welcome there.

cuducos added analysis hard labels Oct 31, 2016

cuducos added the hacktoberfest label Nov 3, 2016

gabriel-almeida mentioned this issue Nov 5, 2016

Family name classifier #107

Closed

cuducos modified the milestone: Roadmap: Nepotism Nov 7, 2016

cuducos added work in progress and removed hard labels Nov 7, 2016

This was referenced Dec 8, 2016

Centralized database of information #160

Closed

Find clusters of politicians spending with companies owned by each others relatives #18

Open

marcusrehm mentioned this issue Mar 11, 2017

Plugin to use Neo4j within Jupyter notebooks #200

Merged

cuducos removed this from the Roadmap: Nepotism milestone Mar 24, 2017

cuducos removed the hacktoberfest label Mar 24, 2017

cuducos closed this as completed Mar 24, 2017

cuducos pushed a commit that referenced this issue Feb 28, 2018

Merge pull request #98 from datasciencebr/cuducos-fix-docker-command

4171316

Fix Docker command
