Implement better matching for facility information #70

Open
lossyrob opened this issue Mar 25, 2020 · 6 comments

Comments

@lossyrob
Collaborator

The goal of this issue is to implement a better matching algorithm for the facility-level processed data.

We use this data to produce the CovidCareMap US Healthcare System Capacity data on a facility level, which is then rolled up to the regional levels.

Currently (at commit b006476), the matching is implemented in spatial_join_facilities and run in the notebook Merge_Facility_Information.

The way this is implemented is as follows:

  • As input, take two facility GeoDataFrames in EPSG:4326 (call them left and right)
  • Reproject to EPSG:5070 (CONUS Albers) so that distances in meters are more accurate across the US.
  • Buffer left by 1000 meters, creating a GeoDataFrame of roughly circular polygons.
  • Use GeoPandas' sjoin to find where the buffered facilities in left intersect with the point facilities in right.
  • Deduplicate matches - the intersection can happen between a left facility and multiple right facilities. Dedupe in the following way:
  • Process the final dataset of point facility data having both left and right properties into a geodataframe in EPSG:4326. This result will have all facilities from left, but not all from right (it's a left join).
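A rough sketch of the candidate-finding step, for discussion. This is not the code in spatial_join_facilities: it swaps the reproject/buffer/sjoin approach for a plain great-circle distance check using only the standard library, and the function names, record layout, and coordinates are all made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def spatial_left_join(left, right, radius_m=1000):
    """For each left facility, collect the right facilities within radius_m.

    Left facilities with no neighbor are kept with an empty list (left join).
    """
    joined = []
    for lf in left:
        candidates = [
            rf["id"]
            for rf in right
            if haversine_m(lf["lat"], lf["lon"], rf["lat"], rf["lon"]) <= radius_m
        ]
        joined.append({"left_id": lf["id"], "right_ids": candidates})
    return joined


# Hypothetical facilities, not real data.
left = [
    {"id": "L1", "lat": 39.9500, "lon": -75.1600},
    {"id": "L2", "lat": 40.5000, "lon": -76.0000},
]
right = [
    {"id": "R1", "lat": 39.9505, "lon": -75.1602},  # tens of meters from L1
    {"id": "R2", "lat": 39.9550, "lon": -75.1600},  # roughly half a kilometer from L1
    {"id": "R3", "lat": 41.0000, "lon": -77.0000},  # far from both left facilities
]

result = spatial_left_join(left, right)
```

Here L1 picks up two candidates (R1 and R2), which is the situation the dedupe step has to resolve, and L2 stays in the output with no candidates.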

This methodology is not ideal in that:

  • It doesn't account for cases where a facility in left is spatially matched to one in right but, in reality, there should be no match.
  • The way we are scoring matches, using the weighted average of the address and name ratios, is not great. Sometimes it favors facilities that are clearly not the same because their strings happen to match better.
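To make both problems concrete, here is a minimal sketch of weighted-average string scoring. It assumes difflib's SequenceMatcher ratio and equal weights; the ratio function and weights actually used in the repo may differ, and match_score, best_match, min_score, and the sample records are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(left, right, name_weight=0.5, address_weight=0.5):
    """Weighted average of the name and address similarity ratios."""
    total = name_weight + address_weight
    return (
        name_weight * ratio(left["name"], right["name"])
        + address_weight * ratio(left["address"], right["address"])
    ) / total


def best_match(left, candidates, min_score=None):
    """Return the highest-scoring candidate, or None if there are none.

    With min_score=None the best candidate always wins, even when it is a
    poor match. Passing a min_score lets the function report "no match".
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: match_score(left, c))
    if min_score is not None and match_score(left, best) < min_score:
        return None
    return best


# Hypothetical records, not real data.
left = {"name": "Mercy General Hospital", "address": "100 Main St"}
good = {"name": "Mercy General Hosp", "address": "100 Main Street"}
bad = {"name": "Riverside Dental Clinic", "address": "42 Oak Ave"}

picked = best_match(left, [good, bad])             # picks `good`
forced = best_match(left, [bad])                   # still returns `bad`
rejected = best_match(left, [bad], min_score=0.8)  # None: no credible match
```

Without a threshold, the only spatial neighbor wins by default, which is the first problem above. A threshold helps there, but it does nothing for the second problem, where the wrong facility simply has the more similar strings.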

This issue is to generate a new matching method that improves what we currently have.

There are other libraries that address the scoring problem; one that I've seen used successfully is [dedupe](https://github.com/dedupeio/dedupe).

CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally, this matching enhancement would be able to join any number (N) of facility datasets.
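One possible way to get from pairwise matches to an N-way join is to treat each match as an edge and group records into connected components. The sketch below uses a small union-find; it is only an idea, not something implemented in the repo, and group_matches and the record IDs are hypothetical.

```python
def group_matches(records, pairwise_matches):
    """Group records from any number of datasets into matched facilities.

    records: hashable record keys, e.g. (dataset_name, record_id) tuples.
    pairwise_matches: (record_a, record_b) pairs judged to be the same facility.
    Returns a sorted list of sorted groups; unmatched records form singletons.
    """
    parent = {r: r for r in records}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairwise_matches:
        parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical record keys, not real data.
records = [("DH", "a"), ("DH", "b"), ("HCRIS", "x"), ("HIFLD", "p")]
matches = [
    (("DH", "a"), ("HCRIS", "x")),     # from a DH-to-HCRIS matching pass
    (("HCRIS", "x"), ("HIFLD", "p")),  # from an HCRIS-to-HIFLD matching pass
]

groups = group_matches(records, matches)
```

A caveat with this design: matches chain transitively, so one bad pairwise match can merge two otherwise separate facilities.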

@simonkassel
Contributor

I'll take this on today

@daveluo
Collaborator

daveluo commented Mar 25, 2020

thanks @simonkassel !

> CovidCareMap.org is currently matching DH and HCRIS data. However there's other datasets we want to bring in (HIFLD being the first). Ideally this matching enhancement can have the ability to join N number of facility data.

Trying to match facilities from HIFLD to DH and HCRIS would be a great stretch goal. We're probably going to need to add in HIFLD data (per #49 (comment)) soon enough so this would help enable that.

@simonkassel
Contributor

Looks like part of the problem is that the coordinates in the DH data are not great. I'm re-geocoding the DH data using the same process as HCRIS and comparing the results. Here's one example:

[Images: m1, m2, m3]

The original location is orange, while the re-geocoded version is purple.

@simonkassel
Contributor

simonkassel commented Mar 27, 2020

@simonkassel
Contributor

simonkassel commented Mar 31, 2020

@lossyrob @daveluo I've been working away at this and have made some progress. The notebook where I do it is here and most of the underlying logic here. I'm using a combination of string similarity and distance matching to find plausible pairs.

I'm finding matches for about 85% of the facilities within each dataset. I have been looking over them and they're not all perfect, but they're pretty close. I'm not sure the remaining 15% are necessarily a fault of the matching process; there just seem to be a number of discrepancies between the two datasets. For example, one of them (I think DH) seems to have all the VA facilities but the other doesn't, and there are lots of cases in which there doesn't seem to be any logical DH pair for a correctly geocoded facility in the HCRIS dataset (or vice versa). Sometimes it is even difficult to tell whether two records are a match, just because of the complexity of medical facilities and whether or not one complex is multiple hospitals, etc.

I will post some examples below but I have been examining them using these folium maps that I created for each state (it didn't seem to be able to render the whole datasets). They are here. I'm not sure if there is an easy way to host them but let me know if you think there would be a good way to do it and it would be useful.

I can continue to tinker with this but would be curious to know if you see a best way to proceed from here.

@simonkassel
Contributor

Here's an example from Central California: the purple marker is in the HCRIS dataset and the orange one is in DH, and they correctly did not match. If you zoom in you can see they are at different hospitals, but there is no credible match for either in the other dataset.

[3 images]

In this case, the lower point is a match (two points on top of each other), but the orange marker is a VA facility that doesn't seem to be included in HCRIS.
[2 images]

One more: see the two distantly connected points. They are both in the same network (CPMC), and there are two other centers that match with each other elsewhere in the city. So are these different facilities, or is one an administrative address or something? Kind of hard to say.
[2 images]
