Data Gap #174
Conversation
pls link to this issue #173 (comment) as well, summarize the to do list, thanks |
Hi @liniiiiii
What do you mean by this? Is this a task for me? |
Hi Shorouq, I mean this branch can be linked to the issue I tagged as well; it covers the last task for the data gap, which is the same as the issue you linked. In the todo list I just summarised the tasks and put them together, so we have an overview of them, thanks!
On 2024-10-19 16:18:00, Shorouq ***@***.***> wrote:
pls link to this issue #173 (comment) as well, summarize the to do list, thanks
Hi @liniiiiii
I am finding it hard to understand this sentence.
I understand there is a todolist in #173 (comment), thanks for putting it together!
as well, summarize the to do list, thanks
What do you mean by this? Is this a task for me?
|
(branch force-pushed from f7e283e to c6ad6b0)
For the location data gap filling:
Filling areas found in l2 and l3 into l1 is now implemented. But I am wondering what to do with filling l3 areas into l2, because in l2 we can have multiple records for the same event. Here is an example from l2 where we have 3 records for the same event.

l2 Administrative_Areas_Norm -- before data gap filling
index
0 [Belgium]
1 [Netherlands]
2 [United Kingdom]

If, for example, we find that l3 records contain the countries France, Belgium, and the UK, but "France" is not found in any of the l2 records, how should they be filled? Should the result be as shown below?

l2 Administrative_Areas_Norm -- after data gap filling?
index
0 [Belgium, France]
1 [Netherlands, France]
2 [United Kingdom, France]

Because I feel like that introduces errors and distorts the real data. What do you think? |
@i-be-snek , I think in L2 we need to add a record for France instead of adding it into other countries like below. The idea is that we assume the total impact of France is the impact from L3. I can also imagine other cases like:

L2
0 [Belgium, France]
1 [Belgium]
2 [France]

L3
0 [Belgium]
1 [Belgium]
2 [France]
3 [Germany]

In this case, L2 #0 will not get any information from L3; we will neglect the case where several countries appear in one L2 record. The L2 will be:

0 [Belgium, France] keep original
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2
3 [Germany] extend with L3 #3 |
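A rough sketch of that mapping (a hypothetical helper, not the repository's code; each record is represented only by its list of countries, and how the impact values themselves are merged is a separate question):

```python
def match_l3_to_l2(l2_areas, l3_areas):
    """Single-country L2 records are extended with the matching L3 records;
    L3 countries absent from L2 get a brand-new L2 record; multi-country
    L2 records are kept as-is (the neglected corner case)."""
    # Map each country to the single-country L2 records that mention it
    single = {}
    for i, areas in enumerate(l2_areas):
        if len(areas) == 1:
            single.setdefault(areas[0], []).append(i)

    extend_with = {}   # L2 index -> L3 indices whose impacts extend it
    new_records = []   # countries present only in L3: create a new L2 record
    for j, areas in enumerate(l3_areas):
        country = areas[0]
        if country in single:
            for i in single[country]:
                extend_with.setdefault(i, []).append(j)
        elif all(country not in a for a in l2_areas) and country not in new_records:
            new_records.append(country)
    return extend_with, new_records
```

On the Belgium/France/Germany example this yields `extend_with == {1: [0, 1], 2: [2]}` and `new_records == ["Germany"]`.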
@liniiiiii so, when I add "Germany" in your last example (index 3), what impact information would it have? We would be synthetically creating an L2 record that doesn't exist. Is that what we want to do? |
@liniiiiii what is our motivation for treating l2 records with more than one administrative area in a "special" way?

0 [Belgium, France] keep original
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2
3 [Germany] extend with L3 #3

Should index l2_0 not be extended with the sum of l3_0, l3_1 (for Belgium) and l3_2 (for France)? |
Also, what if we have an l2 case like this one:

0 [India] # event in March, same location
1 [India] # event in April, same location
2 [Burma]
3 [Pakistan]

If l3 has any records for India, would we have to extend both index 0 and 1? |
@i-be-snek , yes, we create a record in L2 for "Germany" because we want to aggregate information to the country level. Let's say there are 20 deaths in Bavaria in L3 but no deaths information in L2 for "Germany"; we automatically take the 20 deaths from L3 and put them in L2 to present the impact information for Germany. |
@i-be-snek , yes, I think it's best to extend both, because we know that the date information in L2 and L3 is missing in most cases |
Hmm, shouldn't we at least extend based on the date if it's not missing? Also, are we basically extending all data from l3 to l2 then, if we decide to extend these two when l3 records of India exist? I'm trying to understand what to code, sorry for dragging this discussion on for so long... some bits of this are a bit confusing to me. |
@i-be-snek , what do you mean by "Also, are we basically extending all data from l3 to l2 then, if we decide to extend these two if l3 records of India exist?"? No problem, I think we need to clarify this process! |
@liniiiiii I'm now following the rules here: #101 (comment) |
I am asking if we then extend every record of l2 that shares an administrative area with l3 (except for the corner cases).

L2
0 [Belgium, France]
1 [Belgium]
2 [France]

L3
0 [Belgium]
1 [Belgium]
2 [France]
3 [Germany]

So here you wrote:

0 [Belgium, France] keep original
1 [Belgium] extend with the L3 sum (#0 and #1)
2 [France] extend with L3 #2
3 [Germany] extend with L3 #3

This part |
@i-be-snek , I think I described it the same as #101 (comment); I see a few points are not up to date, and I will update them now. So here, by "extend" I mean making sure that L2 always has a larger range than L3. Let's see an example:
So, as you can see in the example, the deaths in L3 for Belgium are 15 and in L2 are 10; therefore, in L2, the deaths in Belgium should be revised to [10,15] to make the data consistent across levels. For France, L3 is 30 while L2 is 40, so in L2 the deaths in France should be changed to [30,40]; that's what I mean by "extend". And for Germany, we need to create a new record in L2. |
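That reading of "extend" can be sketched with a small hypothetical helper that widens the L2 figure into a range covering the L3 aggregate:

```python
def extend_impact(l2_value: float, l3_sum: float) -> list[float]:
    # Widen the L2 figure into a [low, high] range that covers the
    # aggregated L3 figure, so the two levels stay consistent.
    return [min(l2_value, l3_sum), max(l2_value, l3_sum)]
```

For the example above, `extend_impact(10, 15)` gives `[10, 15]` for Belgium and `extend_impact(40, 30)` gives `[30, 40]` for France; Germany gets a newly created L2 record instead.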
@liniiiiii As you can see now, the checklist is full. However, I noticed that the list does not include discrepancy testing when L1 impacts are not NULL... In that case, we would need to check that the sum of impacts in L2 is <= L1, right? Update: I implemented it anyway :) and added it to the checklist! |
@liniiiiii I just updated the documents after fixing the issues we discussed. I hope you can take a look! There should not be any more duplicates now 🙂 Let me know if you find anything funky 😀 Big thanks! 😁 |
@i-be-snek, the duplicates and the L2-->L1 filling are solved, thanks so much, but I find that in L2, some records are originally

Sorry, I found another issue in the main event filtering: I think the Main Event is converted to None, but the record is not deleted from the database, like below. I think it's the case of the Terrorist attack, and another is Ygtqr0d, the geomagnetic storm. Could you just delete them from the database, because they are false positives, thanks!
|
About the Nones turning to zero, I think this will be easy to fix. As for dropping those, I can simply pre-drop any event that doesn't have that value filled. I think they would not be inserted into the database due to its validation rules but it's better that we delete them now. |
@i-be-snek yes, thanks, I find the impactdb.V1.1.db file contains |
@i-be-snek , sorry I think I made a mistake here, the first issue is not a problem, but could you in this step also filter the |
@liniiiiii No worries.
I don't understand. What should we double check? |
@i-be-snek , I mean this, in the final db file, there are still |
I think you can now check the files again. I've also inserted the data (after the data gap post-processing) into

One thing I wanted to point your attention to is that many rows are dropped because they have a start year but no end year. You can see those in the error logs for l1. Check:
Because of this rule, ~600 events are being thrown away. I noticed that in the schema in the repository, both start and end year for l1 are not nullable, but they are in the schema in the journal paper: It may be worth setting the record straight for that one now so I can modify the rules. Which of the two is correct? |
@i-be-snek , thanks for the update. The data gap is fixed, as I tested the two db files and compared the difference. Regarding End_Date_Year, please follow the paper rule to adjust the code: we had a short email discussion with Gabriele before, and we simply set End_Date_Year to nullable for all events, not only for droughts. Also, the end year is often not captured by the model; I manually checked a few articles like the 2016 Vietnam floods, where the start and end year should be the same. You can adapt the code in this branch if you like and then I can review and approve it, or if you want a new PR for inserting after the currency PR, then this PR can be approved and merged, thanks! |
Glad to know the PR is through 🎉
So that should follow the schema in the paper? That would mean we should fix the schema in the guidelines and in the SQL schema files. And just to be clear on this: according to the schema in the paper, only a start date is required for all event types. Is that correct?
So does that mean I should fill missing end years with the start year in L1? For L2 and L3, the start and end year will automatically be filled by whatever was in L1.
I think we can fix it in this PR. :) |
---Yes, I will update the guideline and schema soon, as well as the processing rules we have now for the data gap etc., and request a PR later
---Nooo, we just leave it missing. I manually checked a few cases like this; maybe some of the events are not documented with an end year, so we don't assume it is the same as the start.
---ok, nice, thanks! |
The change in validation rules (allowing NULL end years) will be applied in a later PR because it depends on the currency one. I'll pass the PR now. |
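Under the paper's schema, the relaxed rule could look roughly like this (a hypothetical check, not the repository's actual validation code): only the start year is required, and a missing end year stays NULL rather than being copied from the start.

```python
def l1_row_is_valid(row: dict) -> bool:
    # Paper schema: Start_Date_Year is required for all event types,
    # End_Date_Year may be NULL (left missing, not filled in).
    return row.get("Start_Date_Year") is not None
```

With this rule, the ~600 events that have a start year but no end year would no longer be thrown away.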
This PR fills the data gap between event levels.
Taken as is from #173 (updated checklist, 2024/11/03):
a. L2 is not NULL, L3 is NULL, sum up the numbers from L2, and fill in L1
b. L2 is NULL, and L3 is not NULL
c. L2 and L3 are not NULL
Special case: if L1 is not null but the sum of the impact is smaller than L2, use the aggregated L2 number
Check #101 for more details.
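Cases (b) and (c) above don't spell out their action here, so the sketch below is an assumed reading (hypothetical helper; `None` stands for NULL): aggregate the finest non-NULL lower level into L1, preferring L2 over L3, and apply the special case when L1 is smaller than the L2 sum.

```python
def fill_l1_impact(l1, l2_impacts, l3_impacts):
    """Assumed reading of the checklist:
    a. L2 not NULL -> sum L2 into L1
    b. L2 NULL, L3 not NULL -> sum L3 into L1
    Special case: L1 exists but is smaller than the L2 sum -> use the L2 sum."""
    l2_sum = sum(l2_impacts) if l2_impacts else None
    l3_sum = sum(l3_impacts) if l3_impacts else None
    if l1 is None:
        return l2_sum if l2_sum is not None else l3_sum
    if l2_sum is not None and l1 < l2_sum:
        return l2_sum  # special case: use the aggregated L2 number
    return l1
```

For instance, `fill_l1_impact(None, [3, 4], [])` returns `7`, and `fill_l1_impact(5, [3, 4], [])` also returns `7` via the special case.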
Update ⚠️
To be able to apply the data gap filling, the currency conversion, and the inflation adjustment, it would be easier to handle a de-duplicated version of the dataset. So I'm now modifying the db insertion code to create a de-duplicated copy of the data in parquet -- then we can manipulate that easily before a final insertion into the table. I'll work on this while we work out the details of how to adjust certain cases of l3->l2 areas and impacts (2) and how to handle the data gap before adjusting for inflation (3)
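The de-duplication idea can be sketched like this (a minimal stdlib version; the actual pipeline presumably uses pandas and parquet, and the record structure here is hypothetical):

```python
import json

def deduplicate(records: list[dict]) -> list[dict]:
    # Drop exact duplicate records (e.g. repeated geojson objects),
    # keeping the first occurrence and preserving order.
    seen, out = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

The working copy produced this way can then be transformed freely (data gap, currency, inflation) before the final database insertion.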
PR Description
This PR contains a script to fill the data gap as described above, alongside some utility functions in `Database/scr/normalize_data.py` to facilitate the process.

The PR also contains a copy of the RAW full run (in chunks of 25 each) with the geojson objects de-duplicated (in `Database/output/full_run_25_deduplicated` and `full_run_25_deduplicated`). The processed version of this with the data gap filled is found in `Database/output/full_run_25_deduplicated_data_gap`.

This PR does not insert the newly processed data into the database because the inflation adjustment and currency conversion steps should happen first so that the Damage and Insured Damage categories are normalized properly!
How to test:
Check the logs (`data_gap.log`) to see what the data gap filling has filled in (logs exist for everything except for the time gap, which only fills l2/l3 with a start and end year if it's not missing from l1).