
Data Gap #174

Merged: 57 commits merged into main from data-gap on Nov 25, 2024

Conversation

@i-be-snek (Collaborator) commented Oct 17, 2024

This PR fills the data gap between event levels.

Taken as is from #173 (updated checklist, 2024/11/03):

  1. Admin_Areas: if the joint set of admin areas in L2/L3 is larger than the set in L1, the extra areas are automatically filled into L1.
  2. Impact categories: if L1 is NULL but impact information is found in L2 or L3, there are several cases:
     a. L2 is not NULL and L3 is NULL: sum the numbers from L2 and fill them into L1.
     b. L2 is NULL and L3 is not NULL:
        i. for the same admin_area with impacts in different locations, sum the numbers and fill a new record in L2;
        ii. next, sum the numbers from L2 to fill in L1.
     c. L2 and L3 are both not NULL:
        i. for the same admin_area with impacts in different locations, sum the numbers in L3 and compare with the number in L2: if the L3 sum is smaller, revise "Num_Min" in L2 down to it; if it is larger, raise "Num_Max" in L2 (ideally this would not happen, because we require the model to produce the total in L2);
        ii. if an admin_area in L3 is not in L2, sum the numbers and create a new record in L2;
        iii. next, sum all the numbers from L2 and fill L1 with the range.
     Special case: if L1 is not NULL but smaller than the summed L2 impact, use the aggregated L2 number.
  3. Time: the time information in L2 and L3 is normally missing; for any record without a year, fill the year in L2 and L3 from L1.

Check #101 for more details.
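
For concreteness, here is a minimal sketch of rule 2a, assuming pandas DataFrames that share an Event_ID key and hypothetical Num_Min/Num_Max impact columns (an illustration only, not the actual Database/fill_data_gap.py implementation):

```python
import pandas as pd

def fill_l1_impact_from_l2(l1: pd.DataFrame, l2: pd.DataFrame) -> pd.DataFrame:
    # Total each event's L2 records; min_count=1 keeps the sum NaN when every
    # L2 value is missing, so we never write a spurious 0 into L1.
    totals = l2.groupby("Event_ID")[["Num_Min", "Num_Max"]].sum(min_count=1)
    l1 = l1.set_index("Event_ID")
    # Rule 2a: only fill events whose L1 impact is NULL; existing values stay.
    missing = l1["Num_Min"].isna() & l1["Num_Max"].isna()
    l1.loc[missing, ["Num_Min", "Num_Max"]] = totals  # aligned on Event_ID
    return l1.reset_index()
```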

Update: ⚠️
To apply the data gap filling, currency conversion, and inflation adjustment, it is easier to work on a de-duplicated version of the dataset. So I'm now modifying the db insertion code to create a de-duplicated copy of the data in parquet -- then we can manipulate that copy easily before the final insertion into the table. I'll work on this while we sort out, in the meantime, how to adjust certain cases of L3->L2 areas and impacts (2) and how to handle the data gap before adjusting for inflation (3).
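
As a rough illustration of that de-duplication step (a sketch only; the paths are placeholders and the actual insertion code may use a different duplicate criterion):

```python
import pandas as pd

df = pd.read_parquet("Database/output/full_run_25/l2.parquet")  # placeholder path
# List-valued columns (e.g. Administrative_Areas_Norm) are unhashable, so
# detect duplicate rows on a stringified view of the frame instead.
df = df[~df.astype(str).duplicated()]
df.to_parquet("Database/output/full_run_25_deduplicated/l2.parquet", index=False)
```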


PR Description

This PR contains a script to fill the data gap as described above, alongside some utility functions in Database/scr/normalize_data.py to facilitate the process.
The PR also contains a copy of the RAW full run (in chunks of 25 each) with the geojson objects de-duplicated, in Database/output/full_run_25_deduplicated. The processed version with the data gap filled is found in Database/output/full_run_25_deduplicated_data_gap.

This PR does not insert the newly processed data into the database because the inflation adjustment and currency conversion steps should happen first so that Damage and Insured Damage categories are normalized properly!

How to test:

  • Run the data gap script:
    poetry run python3 Database/fill_data_gap.py -i Database/output/full_run_25_deduplicated -o Database/output/<OUTPUT_DIR>
  • Check the logs (data_gap.log) to see what the data gap filling has changed (logs exist for everything except the time gap, which only fills the start and end year in L2/L3 when it is not missing from L1). ⚠️ Note that new logs are appended to the same file with a timestamp.
  • Inspect different impact categories manually by loading the parquet files.
  • Run the data gap script on its own output a second time: I found that the filling process creates more instances that need to be filled. I haven't had time to investigate why, so it may be safest to run the script several times. Ideally, if everything worked perfectly, there should be no gap filling left to do when the script is re-run on the files it produced (see the fixed-point sketch below).
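
The re-run advice could be made mechanical by iterating to a fixed point; a sketch, assuming a hypothetical fill_data_gap(df) wrapper around the script's core that reports how many records it changed:

```python
def fill_until_stable(df, fill_data_gap, max_passes: int = 10):
    # Re-apply the gap filling until a pass changes nothing (a fixed point).
    for _ in range(max_passes):
        df, n_changed = fill_data_gap(df)
        if n_changed == 0:
            return df
    raise RuntimeError(f"gap filling did not stabilize within {max_passes} passes")
```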

@liniiiiii (Collaborator)

pls link to this issue #173 (comment) as well, summarize the to do list, thanks

@i-be-snek (Collaborator, Author) commented Oct 19, 2024

> pls link to this issue #173 (comment) as well, summarize the to do list, thanks

Hi @liniiiiii
I am finding it hard to understand this sentence.
I understand there is a to-do list in #173 (comment), thanks for putting it together!

> as well, summarize the to do list, thanks

What do you mean by this? Is this a task for me?

@liniiiiii (Collaborator) commented Oct 19, 2024 via email

@i-be-snek force-pushed the data-gap branch 4 times, most recently from f7e283e to c6ad6b0 on October 31, 2024
@i-be-snek linked an issue on Oct 31, 2024 that may be closed by this pull request
@i-be-snek (Collaborator, Author) commented Nov 1, 2024

@liniiiiii

For the location data gap filling:

> Location, the Admin_Areas in L1, should cover all Admin_Areas in L2, and Admin_Areas in L2 should cover all Admin_Areas in L3

Filling areas found in l2 and l3 into l1 is now implemented (see the sketch at the end of this comment). But I am wondering what to do about filling l3 areas into l2, because in l2 we can have multiple records for the same event.

Here is an example from l2 where we have 3 records for the same event.

l2 Administrative_Areas_Norm -- before data gap filling
index
0    [Belgium]
1    [Netherlands]
2    [United Kingdom]

If, for example, we find that l3 records contain the countries France, Belgium, and the UK, but "France" is not found in any of the l2 records, how should they be filled? Should the result be as shown below?

l2 Administrative_Areas_Norm -- after data gap filling?
index
0    [Belgium, France]
1    [Netherlands, France]
2    [United Kingdom, France]

Because I feel like that introduces errors and distorts the real data. What do you think?
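
For reference, the already-implemented L1 direction can be sketched like this (an illustration, not the actual Database/scr/normalize_data.py utilities), using the list-valued Administrative_Areas_Norm representation shown above:

```python
from itertools import chain

def fill_l1_areas(l1_areas: list[str], l2_records: list[list[str]],
                  l3_records: list[list[str]]) -> list[str]:
    # Rule 1: L1 must cover every admin area mentioned in any L2/L3 record
    # of the same event, so take the union over all lower-level records.
    found_below = set(chain.from_iterable(l2_records + l3_records))
    return sorted(set(l1_areas) | found_below)

# fill_l1_areas(["Belgium"], [["Netherlands"]], [["France"]])
# -> ["Belgium", "France", "Netherlands"]
```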

@liniiiiii (Collaborator)

> If, for example, we find that l3 records contain the countries France, Belgium, and the UK, but "France" is not found in any of the l2 records, how should they be filled? [...] Because I feel like that introduces errors and distorts the real data. What do you think?

@i-be-snek, I think in L2 we need to add a record for France instead of adding it into the other countries, like below. The idea is that we assume the total impact of France is the impact from L3.
0 [Belgium]
1 [Netherlands]
2 [United Kingdom]
3 [France]

I can also imagine another case, like:
L2

0 [Belgium, France]
1 [Belgium]
2 [France]

L3

0 [Belgium]
1 [Belgium]
2 [France]
3 [Germany]

In this case, L2 #0 will not get any information from L3; we neglect the case with several countries in one L2 record. The L2 will be:

0 [Belgium, France] keep original 
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2 
3 [Germany] extend with L3 #3

@i-be-snek (Collaborator, Author)

@liniiiiii so, when I add "Germany" in your last example (index 3), what impact information would it have? We would be synthetically creating an L2 record that doesn't exist. Is that what we want to do?

@i-be-snek (Collaborator, Author)

@liniiiiii what is our motivation for treating l2 records with more than one administrative area in a "special" way?
In the example you gave:

0 [Belgium, France] keep original 
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2 
3 [Germany] extend with L3 #3

Should index l2_0 not be extended with the sum of l3_0, l3_1 (for Belgium) and l3_2 (for France)?

@i-be-snek (Collaborator, Author)

Also, what if we have an l2 case like this one:

0 [India] # event in March, same location
1 [India] # event in April, same location
2 [Burma]
3 [Pakistan]

If l3 has any records for India, would we have to extend both index 0 and 1?

@liniiiiii (Collaborator)

> @liniiiiii so, when I add "Germany" in your last example (index 3), what impact information would it have? We would be synthetically creating an L2 record that doesn't exist. Is that what we want to do?

@i-be-snek, yes, we create a record in L2 for "Germany" because we want to aggregate information to the country level. Let's say there are 20 deaths in Bavaria in L3 but no death information in L2 for "Germany": we automatically take the 20 deaths from L3 and put them in L2 to represent the impact information for Germany.

@liniiiiii (Collaborator)

> what is our motivation for treating l2 records with more than one administrative area in a "special" way? [...] Should index l2_0 not be extended with the sum of l3_0, l3_1 (for Belgium) and l3_2 (for France)?

@i-be-snek, this 0 [Belgium, France] case is a kind of corner case. Ideally we want individual country-level information, not information like this; from an analysis perspective, knowing 20 deaths in total in Belgium and 10 deaths in France in L2 is better. Therefore we just leave these corner cases in the database, and it is not necessary to deal with them.

@liniiiiii (Collaborator)

> Also, what if we have an l2 case like this one: [...] If l3 has any records for India, would we have to extend both index 0 and 1?

@i-be-snek, yes, I think the best way is to extend both, because we know that the date information in L2 and L3 is missing in most cases.

@i-be-snek (Collaborator, Author)

> If l3 has any records for India, would we have to extend both index 0 and 1?

> yes, I think the best way is to extend both, because we know that the date information in L2 and L3 is missing in most cases

Hmm, shouldn't we at least extend based on the date if it's not missing? Also are we basically extending all data from l3 to l2 then, if we decide to extend these two if l3 records of India exist?

I'm trying to understand what to code, sorry for dragging this discussion on for so long... some bits of this are a bit confusing to me.

@liniiiiii (Collaborator)

> Also are we basically extending all data from l3 to l2 then, if we decide to extend these two if l3 records of India exist?

@i-be-snek, what do you mean by "Also are we basically extending all data from l3 to l2 then, if we decide to extend these two if l3 records of India exist?"? No problem, I think we need to clarify this process!

@i-be-snek (Collaborator, Author)

@liniiiiii I'm now following the rules here: #101 (comment)
Do you think that's okay? Are they up to date?

@i-be-snek (Collaborator, Author) commented Nov 3, 2024

@liniiiiii

> what do you mean by "Also are we basically extending all data from l3 to l2 then, if we decide to extend these two if l3 records of India exist?"? No problem, I think we need to clarify this process!

I am asking if we then extend every record of l2 that shares an administrative area with l3 (except for the corner cases).
This is the example you gave earlier:

L2

0 [Belgium, France]
1 [Belgium]
2 [France]

L3

0 [Belgium]
1 [Belgium]
2 [France]
3 [Germany]

In this case, L2 #0 will not get any information from L3; we neglect the case with several countries in one L2 record. The L2 will be:

0 [Belgium, France] keep original 
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2 
3 [Germany] extend with L3 #3

So here you wrote "extend with L3" for l2_2 and l2_3, and we also do the same for l2_1, but by extending with the sum of two entries.

This "extend with L3" part is what confuses me. If a record exists in both L2 and L3, why should we extend the one in L2? I thought the data gap was for inconsistent counts + missing values, but here you are suggesting that we extend L2 with all data from L3 that shares an administrative area with L2.

@liniiiiii (Collaborator)

> This "extend with L3" part is what confuses me. If a record exists in both L2 and L3, why should we extend the one in L2? I thought the data gap was for inconsistent counts + missing values, but here you are suggesting that we extend L2 with all data from L3 that shares an administrative area with L2.

@i-be-snek, I think I described it the same as in #101 (comment); I see a few points are not up to date, and I will update them now. Here, by "extend" I mean making sure the L2 range always covers L3. Let's see an example:

L2 deaths

0 [Belgium, France] 60
1 [Belgium] 10
2 [France] 40

L3

0 [Belgium] 10
1 [Belgium] 5
2 [France] 30
3 [Germany] 10

So, as you can see in the example, the deaths in L3 for Belgium sum to 15 while L2 has 10; therefore the Belgium deaths in L2 should be revised to [10, 15] to make the data consistent across levels. For France, L3 sums to 30 while L2 is 40, so the France deaths in L2 should be changed to [30, 40]; that's what I mean by "extend". And for Germany, we need to create a new record in L2.
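
In code, that "extend" rule for a country present in both levels could look like the sketch below (assuming a single L2 number and a pre-computed L3 sum per country; L3-only countries such as Germany get a new L2 record instead):

```python
def extend_l2_range(l2_num: float, l3_sum: float) -> tuple[float, float]:
    # L2 must always contain the L3 aggregate: a larger L3 sum raises Num_Max,
    # a smaller one lowers Num_Min, and an equal one leaves a point range.
    return (min(l2_num, l3_sum), max(l2_num, l3_sum))

# Belgium: extend_l2_range(10, 15) -> (10, 15)
# France:  extend_l2_range(40, 30) -> (30, 40)
```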

@i-be-snek (Collaborator, Author) commented Nov 15, 2024

@liniiiiii As you can see now, the checklist is complete. However, I noticed that the list does not include discrepancy testing when L1 impacts are not NULL... In that case, we would need to check that the sum of impacts in L2 <= L1, right?

Update: I implemented it anyway :) added to the checklist!

@i-be-snek (Collaborator, Author)

@liniiiiii I just updated the documents after fixing the issues we discussed. I hope you can take a look! There should not be any more duplicates now 🙂

Let me know if you find anything funky 😀

Big thanks! 😁

@liniiiiii (Collaborator)

> I just updated the documents after fixing the issues we discussed. I hope you can take a look! There should not be any more duplicates now 🙂

@i-be-snek, the duplicates and the L2->L1 filling are solved, thanks so much! But I find that some L2 records are originally 0, yet when I read them with fastparquet they show as None, like the buildings_damaged category below (ID AhhLrWJ), and for some L3 records as well. Could you have a look at it, thanks!
[screenshot: buildings_damaged records]

Sorry, I found another issue in the main event filtering: I think the Main Event is converted to None, but the record is not deleted from the database, like below. I think it's the case of a terrorist attack, and another is Ygtqr0d, the geomagnetic storm. Could you just delete them from the database? They are false positives, thanks!

   "Event_ID": "AlQHVgF",
      "Sources": "https://en.wikipedia.org/wiki/2_World_Trade_Center_(1971%E2%80%932001)",
      "Event_Names": "2 World Trade Center (1971\u20132001)",
image image

@i-be-snek (Collaborator, Author)

@liniiiiii

About the Nones turning to zero: I think this will be easy to fix. As for dropping those, I can simply pre-drop any event that doesn't have that value filled. I think they would not be inserted into the database anyway due to its validation rules, but it's better that we delete them now.

@liniiiiii (Collaborator)

> About the Nones turning to zero: I think this will be easy to fix. [...] it's better that we delete them now.

@i-be-snek yes, thanks. I find the impactdb.V1.1.db file contains None values like below; just a quick check for this.
[screenshot]

@i-be-snek (Collaborator, Author)

@liniiiiii

For the first issue, I looked at the original records, and they are NULL converted to None; no zeros there.

[screenshot]

Are you sure it's the right example/Event_ID? 🤔
By the way, I don't think loading it with fastparquet would change 0->None or None->0.
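
If in doubt, the engine question is easy to test directly; a sketch, with a placeholder path and a hypothetical impact column, that reads the same records with both parquet engines and compares:

```python
import pandas as pd

path = "Database/output/full_run_25_deduplicated/l2.parquet"  # placeholder path
for engine in ("pyarrow", "fastparquet"):
    df = pd.read_parquet(path, engine=engine)
    # If an engine were converting 0 <-> None, the two printouts would differ.
    print(engine, df.loc[df["Event_ID"] == "AhhLrWJ", "Num_Min"].tolist())
```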

@liniiiiii (Collaborator)

> For the first issue, I looked at the original records, and they are NULL converted to None; no zeros there. Are you sure it's the right example/Event_ID? 🤔 By the way, I don't think loading it with fastparquet would change 0 -> None or None -> 0.

@i-be-snek, sorry, I think I made a mistake here; the first issue is not a problem. But could you, in this step, also filter the None values, or set up a double check for NULLs in the final db file during the inserting process?

@i-be-snek (Collaborator, Author)

@liniiiiii No worries.

> could you, in this step, also filter the None values, or set up a double check for NULLs in the final db file during the inserting process?

I don't understand. What should we double check?

@liniiiiii (Collaborator)

> I don't understand. What should we double check?

@i-be-snek, I mean this: in the final db file there are still None values in L2, and maybe in L3 as well; I didn't check all of them.
[screenshot]

@i-be-snek (Collaborator, Author)

@liniiiiii

I think you can now check the files again. I've also inserted the data (after the data gap post-processing) into impactdb.v1.0.dg_filled.db. There is also impactdb.v1.2.raw.db, which is the raw database with the validation bug fixed (where None values ended up inside the database in non-nullable columns). That means you can compare impactdb.v1.2.raw.db with impactdb.v1.0.dg_filled.db; the diff between those two databases represents all the fixed inconsistencies.

One thing I wanted to draw your attention to is that many rows are dropped because they have a start year but no end year. You can see those in the error logs for l1. Check:

  • impactdb.v1.1.dg_filled.db_insertion_errors/db_insert_errors_l1_Total_Summary_1732451197.json
  • impactdb.v1.2.raw.db_insertion_errors/db_insert_errors_l1_Total_Summary_1732449229.json

Because of this rule, ~600 events are being thrown away. I noticed that in the schema in the repository, both the start and end year for l1 are non-nullable, but they are nullable in the schema in the journal paper:

Repo: [screenshot of schema]

Paper: [screenshot of schema]

It may be worth setting the record straight on that now so I can modify the rules. Which of the two is correct?

@liniiiiii (Collaborator)

@i-be-snek, thanks for the update. The data gap is fixed: I tested the two db files and compared the difference. Regarding End_Date_Year, please follow the paper rule when adjusting the code. We had a short email discussion with Gabriele before, and we simply set End_Date_Year as nullable for all events, not only for droughts; the end year is also sometimes not captured by the model, as I saw when manually checking a few articles, like the 2016 Vietnam floods, where the start and end year should be the same. You can adapt the code in this branch if you like and then I can review and approve it, or, if you want a new PR for inserting after the currency PR, this PR can be approved and merged, thanks!

@i-be-snek (Collaborator, Author)

@liniiiiii

> thanks for the update. The data gap is fixed: I tested the two db files and compared the difference.

Glad to know the PR is through 🎉

> Regarding End_Date_Year, please follow the paper rule when adjusting the code. We had a short email discussion with Gabriele before, and we simply set End_Date_Year as nullable for all events, not only for droughts; the end year is also sometimes not captured by the model, as I saw when manually checking a few articles, like the 2016 Vietnam floods,

So that should follow the schema in the paper? That would mean we should fix the schema in the guidelines and in the SQL schema files.

And just to be clear on this: according to the schema in the paper, only a start date is required for all event types. Is that correct?

> the start and end year should be the same.

So does that mean I should fill missing end years with the start year in L1? For L2 and L3, the start and end year will automatically be filled by whatever was in L1.

> You can adapt the code in this branch if you like and then I can review and approve it, or, if you want a new PR for inserting after the currency PR, this PR can be approved and merged, thanks!

I think we can fix it in this PR. :)

@liniiiiii (Collaborator)

> So that should follow the schema in the paper? That would mean we should fix the schema in the guidelines and in the SQL schema files. And just to be clear on this: according to the schema in the paper, only a start date is required for all event types. Is that correct?

--- Yes, I will update the guideline and schema soon, as well as the processing rules we have now for the data gap etc., and request a PR later.

> So does that mean I should fill missing end years with the start year in L1? For L2 and L3, the start and end year will automatically be filled by whatever was in L1.

--- Nooo, we just leave it missing. I manually checked a few cases like this; some of the events may simply not be documented with an end year, so we don't assume it is the same as the start.

> I think we can fix it in this PR. :)

--- OK, nice, thanks!
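
A sketch of what the adjusted L1 date validation might look like once the paper rule is adopted (a hypothetical check, not the repository's actual validation code): the start year stays required for all event types, End_Date_Year becomes nullable, and missing end years are left missing rather than backfilled:

```python
def valid_l1_dates(record: dict) -> bool:
    # Start year remains mandatory for every event type.
    if record.get("Start_Date_Year") is None:
        return False
    # End_Date_Year may be None (paper schema); per the discussion above,
    # missing end years stay missing -- never copy the start year into them.
    return True
```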

@i-be-snek (Collaborator, Author)

The change in validation rules (allowing NULL end years) will be applied in a later PR because it depends on the currency one. I'll merge this PR now.

@i-be-snek merged commit 1e98655 into main on Nov 25, 2024
1 check passed
@i-be-snek deleted the data-gap branch on January 17, 2025