Implement better matching for facility information #70
Comments
I'll take this on today
thanks @simonkassel !
Trying to match facilities from HIFLD to DH and HCRIS would be a great stretch goal. We're probably going to need to add in HIFLD data (per #49 (comment)) soon enough so this would help enable that.
Notebook to generate those maps here https://github.com/simonkassel/covid19-healthsystemcapacity/blob/sk/facility-matching/notebooks/processing/01B_Mapping_Facilities.ipynb
@lossyrob @daveluo I've been working away at this and have made some progress. The notebook where I do it is here and most of the underlying logic here. I'm using a combination of string similarity and distance matching to find plausible pairs, and I'm finding matches for about 85% of the facilities within each dataset. I have been looking over them; they're not all perfect, but they're pretty close. I'm not sure the remaining 15% are necessarily a fault of the matching process. There just seem to be a number of discrepancies between the two datasets. For example, one of them (I think DH) seems to have all the VA facilities but the other doesn't, and there are lots of cases in which there doesn't seem to be any logical DH pair for a correctly geocoded facility in the HCRIS dataset (or vice versa). Sometimes it is even difficult to tell whether two records are a match, just because of the complexity of the medical facilities and whether or not one complex is multiple hospitals, etc. I will post some examples below, but I have been examining them using folium maps that I created for each state (it didn't seem to be able to render the whole datasets). They are here. I'm not sure if there is an easy way to host them, but let me know if you think there would be a good way to do it and it would be useful. I can continue to tinker with this but would be curious to know if you see a best way to proceed from here.
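To make the "string similarity plus distance" idea concrete, here is a minimal sketch of how a plausible-pair check could look. It is not the logic from the linked notebook: the function names, the record fields (`name`, `lat`, `lon`), the 1000 m cutoff, and the 0.6 similarity threshold are all assumptions chosen for illustration, and the facility records are made up.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] using difflib."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_plausible_pair(fac_a, fac_b, max_dist_m=1000.0, min_name_sim=0.6):
    """Hypothetical rule: close enough AND names similar enough.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    dist = haversine_m(fac_a["lat"], fac_a["lon"], fac_b["lat"], fac_b["lon"])
    sim = name_similarity(fac_a["name"], fac_b["name"])
    return dist <= max_dist_m and sim >= min_name_sim


# Made-up records standing in for one HCRIS row and two DH rows.
hcris = {"name": "Mercy General Hospital", "lat": 40.0, "lon": -75.0}
dh_near = {"name": "MERCY GENERAL HOSP", "lat": 40.001, "lon": -75.001}
dh_far = {"name": "MERCY GENERAL HOSP", "lat": 41.0, "lon": -75.0}

print(is_plausible_pair(hcris, dh_near))
print(is_plausible_pair(hcris, dh_far))
```

Requiring both conditions keeps a same-named facility in another city from matching, and keeps unrelated neighbors on a shared campus from matching on distance alone. It does not resolve the harder cases mentioned above, such as one complex that is recorded as several hospitals.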
The goal of this issue is to implement a better matching algorithm for the facility-level processed data.
We use this data to produce the CovidCareMap US Healthcare System Capacity data on a facility level, which is then rolled up to the regional levels.
Currently (at commit b006476), the matching is implemented in spatial_join_facilities and run in the notebook Merge_Facility_Information.
The way this is implemented is as follows:
- Take two geodataframes of point facilities in `EPSG:4326` (call them `left` and `right`).
- Reproject both to `EPSG:5070` Conus Albers for getting better meters distance in the US.
- Buffer `left` by 1000 meters, creating a geodataframe of ~circular polygons.
- Use `sjoin` to find where the buffered facilities in `left` intersect with the point facilities in `right`.
- This can produce matches between a single `left` facility and multiple `right` facilities. Dedupe in the following way:
  - Score each candidate pair with a string-matching library such as `difflib` (e.g. https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher.ratio), FuzzyWuzzy, or rapidfuzz (see "use rapidfuzz instead of fuzzywuzzy" #48), against the hospital name and address columns.
  - Keep only the `right` facility that was joined with the `left` facility based on the best score.
  - Check whether any `right` facilities were matched with multiple `left` facilities; keep only the match that has the best score.
- Merge the `left` and `right` properties into a geodataframe in EPSG:4326. This result will have all facilities from `left`, but not all from `right` (it's a left join).

This methodology is not ideal in that:
- A facility in `left` is spatially matched to `right`, but in reality there should be no match.

This issue is to generate a new matching method that improves what we currently have.
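The two-pass dedupe step of the current method can be sketched roughly as follows. This is not the code in spatial_join_facilities: the function names, the `name` and `address` fields, the equal weighting of the two scores, and the sample records are assumptions for illustration, and the candidate pairs are hard-coded where the real pipeline would get them from the buffered spatial join.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive difflib similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(left_rec, right_rec):
    """Average of name and address similarity (equal weights are an assumption)."""
    return 0.5 * similarity(left_rec["name"], right_rec["name"]) \
        + 0.5 * similarity(left_rec["address"], right_rec["address"])


def dedupe_matches(candidates, left, right):
    """Reduce (left_id, right_id) candidates to at most one match per facility.

    Returns {left_id: right_id or None}, mirroring the left join.
    """
    # Pass 1: for each left facility keep the best-scoring right facility.
    best_for_left = {}
    for l_id, r_id in candidates:
        s = score_pair(left[l_id], right[r_id])
        if l_id not in best_for_left or s > best_for_left[l_id][1]:
            best_for_left[l_id] = (r_id, s)

    # Pass 2: if several left facilities kept the same right facility,
    # keep only the best-scoring one.
    best_for_right = {}
    for l_id, (r_id, s) in best_for_left.items():
        if r_id not in best_for_right or s > best_for_right[r_id][1]:
            best_for_right[r_id] = (l_id, s)

    matches = {l_id: None for l_id in left}
    for r_id, (l_id, _) in best_for_right.items():
        matches[l_id] = r_id
    return matches


# Made-up records; candidate pairs stand in for the spatial join output.
left = {
    "L1": {"name": "Mercy General Hospital", "address": "100 Main St"},
    "L2": {"name": "Mercy General Hospital Annex", "address": "102 Main St"},
}
right = {
    "R1": {"name": "Mercy General Hospital", "address": "100 Main St"},
    "R2": {"name": "Riverside Clinic", "address": "5 River Rd"},
}
candidates = [("L1", "R1"), ("L1", "R2"), ("L2", "R1")]

matches = dedupe_matches(candidates, left, right)
print(matches)
```

Note that a `left` facility that loses its best `right` match in the second pass ends up unmatched rather than falling back to its next-best candidate, which is one place a new method could improve on this one.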
There are other libraries that address the scoring problem; one that I've seen used successfully is [dedupe](https://github.com/dedupeio/dedupe).
CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally, this matching enhancement would be able to join any number of facility datasets.