Implement better matching for facility information #70

lossyrob · 2020-03-25T04:49:13Z

The goal of this issue is to implement a better matching algorithm for the facility-level processed data.

We use this data to produce the CovidCareMap US Healthcare System Capacity data on a facility level, which is then rolled up to the regional levels.

Currently (at commit b006476), the matching is implemented in spatial_join_facilities and run in the notebook Merge_Facility_Information.

The way this is implemented is as follows:

1. As input, take two facility GeoDataFrames in EPSG:4326 (call them left and right).
2. Reproject to EPSG:5070 (CONUS Albers) to get more accurate distances in meters within the US.
3. Buffer left by 1000 meters, creating a GeoDataFrame of roughly circular polygons.
4. Use GeoPandas' sjoin to find where the buffered facilities in left intersect the point facilities in right.
5. Deduplicate matches, since a left facility can intersect multiple right facilities. Dedupe in the following way:
   - Compute a similarity score as a weighted average of the sequence match ratios for the hospital name and address columns. The ratios can be generated by difflib (e.g. https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher.ratio), FuzzyWuzzy, or rapidfuzz (see "use rapidfuzz instead of fuzzywuzzy" #48).
   - For each left facility, choose the joined right facility with the best score.
   - Determine which right facilities were matched with multiple left facilities; keep only the match that has the best score.
6. Process the final dataset of point facility data, which has both left and right properties, into a GeoDataFrame in EPSG:4326. This result will have all facilities from left, but not all from right (it's a left join).
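The deduplication step above can be sketched with the standard library alone. This is a minimal illustration, not the code in spatial_join_facilities: the weights, the `name`/`address` keys, the function names, and the sample records are all assumptions made for the example, and the candidate pairs stand in for the output of the spatial join.

```python
from difflib import SequenceMatcher

# Hypothetical weights; the issue does not state the values actually used.
NAME_WEIGHT = 0.6
ADDRESS_WEIGHT = 0.4


def similarity(left_rec, right_rec):
    """Weighted average of the name and address sequence-match ratios."""
    name_ratio = SequenceMatcher(
        None, left_rec["name"].lower(), right_rec["name"].lower()
    ).ratio()
    address_ratio = SequenceMatcher(
        None, left_rec["address"].lower(), right_rec["address"].lower()
    ).ratio()
    return NAME_WEIGHT * name_ratio + ADDRESS_WEIGHT * address_ratio


def dedupe_matches(candidates, left, right):
    """Reduce spatial-join candidate pairs to at most one match per facility.

    candidates: iterable of (left_id, right_id) pairs from the spatial join.
    Returns a dict mapping left_id -> right_id.
    """
    # For each left facility, keep the best-scoring right facility.
    best_for_left = {}
    for left_id, right_id in candidates:
        score = similarity(left[left_id], right[right_id])
        if left_id not in best_for_left or score > best_for_left[left_id][1]:
            best_for_left[left_id] = (right_id, score)

    # If a right facility was chosen by several left facilities,
    # keep only the pairing with the best score.
    best_for_right = {}
    for left_id, (right_id, score) in best_for_left.items():
        if right_id not in best_for_right or score > best_for_right[right_id][1]:
            best_for_right[right_id] = (left_id, score)

    return {left_id: right_id for right_id, (left_id, _) in best_for_right.items()}


# Made-up records for illustration only.
left = {
    "L1": {"name": "Mercy General Hospital", "address": "100 Main St"},
    "L2": {"name": "Mercy General Clinic", "address": "102 Main St"},
}
right = {
    "R1": {"name": "Mercy General Hospital", "address": "100 Main Street"},
    "R2": {"name": "Valley Imaging Center", "address": "950 Oak Ave"},
}
candidates = [("L1", "R1"), ("L1", "R2"), ("L2", "R1")]
matches = dedupe_matches(candidates, left, right)
```

Here L1 keeps R1, and L2 loses its only candidate because R1 scored better against L1. In the left join described above, L2 would still appear in the result, just without any right-side properties.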

This methodology is not ideal in that:

- It doesn't account for cases where a facility in left is spatially matched to one in right, but in reality there should be no match.
- The way we are scoring matches, using the weighted average of the address and name ratios, is not great. Sometimes it prefers facilities that are clearly not the same, simply because their strings happen to match better.
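One possible way to handle the first limitation is to reject a spatial match whose score falls below a cutoff, so that a left facility can end up with no match at all. The sketch below is only an illustration of that idea: the cutoff value, the function names, and the facility names are assumptions, and it scores on names alone for brevity.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff; a real value would need tuning against reviewed pairs.
MIN_SCORE = 0.75


def name_ratio(left_name, right_name):
    """Case-insensitive sequence-match ratio between two facility names."""
    return SequenceMatcher(None, left_name.lower(), right_name.lower()).ratio()


def accept_match(left_name, right_name, min_score=MIN_SCORE):
    """Keep a spatially joined pair only if the names are similar enough."""
    return name_ratio(left_name, right_name) >= min_score


# Made-up names for illustration only.
same_facility = accept_match("St. Mary Medical Center", "St. Mary Medical Center")
different_facility = accept_match(
    "Veterans Affairs Medical Center", "Sunrise Dental Office"
)
```

A cutoff like this does not address the second limitation: two different facilities with near-identical names and addresses would still pass.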

This issue is to generate a new matching method that improves what we currently have.

There are other libraries that solve the scoring problem. One that I've seen used successfully is [dedupe](https://github.com/dedupeio/dedupe).

CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally this matching enhancement will be able to join any number (N) of facility datasets.


simonkassel · 2020-03-25T14:37:17Z

I'll take this on today

daveluo · 2020-03-25T15:08:11Z

Thanks @simonkassel!

> CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally this matching enhancement will be able to join any number (N) of facility datasets.

Trying to match facilities from HIFLD to DH and HCRIS would be a great stretch goal. We're probably going to need to add in HIFLD data (per #49 (comment)) soon enough so this would help enable that.

simonkassel · 2020-03-26T15:30:08Z

Looks like part of the problem is that the coordinates in the DH data are not great. I'm re-geocoding the DH data using the same process as HCRIS and comparing the results. Here's one example:

The original is orange, while the re-geocoded version is purple.

simonkassel · 2020-03-27T14:31:31Z

The notebook to generate those maps is here: https://github.com/simonkassel/covid19-healthsystemcapacity/blob/sk/facility-matching/notebooks/processing/01B_Mapping_Facilities.ipynb

simonkassel · 2020-03-31T18:42:51Z

@lossyrob @daveluo I've been working away at this and have made some progress. The notebook where I do it is here and most of the underlying logic here. I'm using a combination of string similarity and distance matching to find plausible pairs.
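For readers following along, combining string similarity with distance might look roughly like the sketch below. It is not the logic from the linked notebook: the haversine distance, both thresholds, the record layout, and the coordinates are assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_plausible_pair(a, b, max_distance_m=1000, min_name_ratio=0.8):
    """True if two records are both close together and similarly named.

    Both thresholds are hypothetical defaults, not values from the notebook.
    """
    distance = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return distance <= max_distance_m and ratio >= min_name_ratio


# Made-up records for illustration only.
dh_record = {"name": "Example General Hospital", "lat": 37.0, "lon": -120.0}
hcris_near = {"name": "Example General Hospital", "lat": 37.001, "lon": -120.0}
hcris_far = {"name": "Example General Hospital", "lat": 37.5, "lon": -120.0}

near_is_match = is_plausible_pair(dh_record, hcris_near)
far_is_match = is_plausible_pair(dh_record, hcris_far)
```

Requiring both conditions means an identically named facility tens of kilometers away is rejected, while a nearby facility with an unrelated name is rejected too.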

I'm finding matches for about 85% of the facilities within each dataset. I have been looking over them, and they're not all perfect but they're pretty close. I'm not sure that the remaining 15% are necessarily a fault of the matching process. There just seem to be a number of discrepancies between the two datasets. For example, one of them (I think DH) seems to have all the VA facilities but the other doesn't, and there are lots of cases in which there doesn't seem to be any logical DH pair for a correctly geocoded facility in the HCRIS dataset (or vice versa). Sometimes it is even difficult to tell whether two records are a match, just because of the complexity of the medical facilities and whether or not one complex is multiple hospitals, etc.

I will post some examples below, but I have been examining them using these folium maps that I created for each state (it didn't seem to be able to render the whole datasets at once). They are here. I'm not sure if there is an easy way to host them, but let me know if you think there would be a good way to do it and whether it would be useful.

I can continue to tinker with this but would be curious to know if you see a best way to proceed from here.

simonkassel · 2020-03-31T19:30:32Z

Here's an example from Central California: the purple marker is in the HCRIS dataset and the orange one is in DH, and they correctly did not match. If you zoom in you can see they are at different hospitals, but there is no credible match for either in the other dataset.

In this case, the lower point is a match (two points on top of each other), but the orange marker is a VA facility that doesn't seem to be included in HCRIS.

One more: see the two distantly connected points. They are both in the same network (CPMC), and there are two other centers that match with each other elsewhere in the city. So are these different facilities, or is it an administrative address or something? Kind of hard to say.

lossyrob added the Estimate Hospital System Capacity label Mar 25, 2020

lossyrob mentioned this issue Mar 25, 2020

use rapidfuzz instead of fuzzywuzzy #48

Closed

simonkassel mentioned this issue Mar 26, 2020

match dh and hcris facility data #81

Closed

9 tasks
