Refactor process_wildcards and add support for TFM_MIG #166

Merged · 11 commits · Jan 31, 2024

Conversation


@siddharth-krishna (Collaborator) commented Jan 25, 2024:

Fixes #83

Hold off on the merge until I clean up some remaining TODOs. But I thought I'd get your feedback on the main direction of the PR.

@olejandro (Member):

Nice! 🚀 The Demos 4 benchmarks are all at 100% now, except for 4-all, which is apparently missing 1 record. Is there a duplicate record in the ground truth that is causing this (since the missing record is not reported and GDX diff sees no difference)?

"PastYears": model.past_years,
"ModelYears": model.model_years,
}.items():
for k, v in [
Member:

Just curious as to why it is better to use a list of tuples here instead of a dictionary. Should the dictionary below be converted as well?

Collaborator (Author):

Ah, this is because iterating through a set results in a non-deterministic order (see also #50 and #67). This makes it very difficult to debug regressions, because I rely on the --verbose flag and a diff tool to find out which transformation caused the regression. But with non-deterministic behaviour, there are too many changes in the diff.

I'm considering adding a check to CI that runs the tool on a small benchmark ~5-10 times and ensures that all intermediate tables are identical across runs.
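
For illustration, a minimal sketch (hypothetical values, not code from the PR): with PYTHONHASHSEED unset, iterating over a set of strings can produce a different order on each interpreter run, whereas a list of tuples pins the order explicitly.

names = {"PastYears", "ModelYears", "DataYears"}
print(list(names))  # order can differ from one run to the next

for k, v in [
    ("PastYears", ["2000"]),
    ("ModelYears", ["2005", "2010"]),
]:
    print(k, v)  # always iterates in this order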

Collaborator (Author):

I left the dictionary below unchanged because the loop only assigns to a dictionary, so the order doesn't matter.
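
A small sketch of why that case is safe (hypothetical values): the loop body only writes into a dict, and two dicts compare equal regardless of insertion order, so a varying iteration order yields the same result.

result = {}
for key in {"b", "a", "c"}:    # set iteration order may vary between runs ...
    result[key] = key.upper()  # ... but each key is assigned the same value
print(result == {"a": "A", "b": "B", "c": "C"})  # True on every run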

Member:

The benchmark idea sounds good!

@olejandro (Member):

Actually, should we rename process_transform_insert to e.g. process_transform_tables and get rid of the TODO? Also, the warning should say "Dropped xxx transform tables" to indicate that other tfm tables may be included...

@olejandro (Member):

For the TFM_UPD tables, we should actually be adding records instead of modifying existing rows. The reason is the sourcescen field (which we don't currently support), which allows filtering rows by the file they come from. I.e. by using sourcescen one could apply UPD tables to the same data instead of on top of each other (see the sketch after this list):

  • let's say the original data has value 1
  • one update table does *0.5
  • another does *2
  • applying them to the same data would result in 0.5 and 2 (one would overwrite the other, depending on the order in which GAMS reads them)
  • applying them on top of each other would result in 1
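
A minimal sketch of the two behaviours in plain Python (illustrative values only, not code from the tool):

base = 1.0

# Applied to the same source data: two candidate records for one element;
# whichever record GAMS reads last overwrites the other.
upd_half = base * 0.5    # 0.5
upd_double = base * 2.0  # 2.0

# Applied on top of each other: the updates compose.
stacked = base * 0.5 * 2.0  # 1.0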

@siddharth-krishna (Collaborator, Author):

Is there a duplicate record in the ground truth that is causing this (since the missing record is not reported and GDX diff sees no difference)?

Nice catch, yes, COM_PEAK is defined identically in 2 DD files, so our dd_to_csv.py creates a file with a duplicated row:

$ cat benchmarks/csv/DemoS_004-all/COM_PEAK.csv 
REG,COM_GRP
REG1,ELC
REG1,ELC

Is the solution to fix the ground truth, or to modify dd_to_csv.py to pick one definition over the other?

applying them on top of each other would result in 1

Regarding TFM_UPD: I don't understand how you can use sourcescen to apply the updates on top of each other if the behaviour is adding new rows instead of updating existing rows? Unless the new row added by the *0.5 update has the same source_filename column -- but in that case if sourcescen is set to the file containing the original data 1, you'll end up with one row with 0.5 and one row with 1; and you can't end up with one row 0.5 and one row 2, right?

If the new row added by the *0.5 update does not retain the source_filename column, then there's no way to stack the updates and end up with 1.

Unless this is the behaviour of Veda and we really want to emulate it, I prefer the simpler/easier-to-understand behaviour of TFM_UPD which updates the row in place. Thus having update rows *0.5 and *2 results in 1, which seems intuitive.

The trouble with adding 2 rows with 0.5 and 2 to the output is: how should the user determine which row overwrites the other? Is it clear that the update table in the file that is specified last in the command line argument results in the last row?

@olejandro (Member):

Is the solution to fix the ground truth, or to modify dd_to_csv.py to pick one definition over the other?

How about just dropping duplicates when calculating the number of rows used to compute the match?

I guess there is a similar issue when reporting the number of rows per parameter (generated and ground truth).
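
A hypothetical sketch of that approach (variable names assumed, not the tool's actual code):

import pandas as pd

gt = pd.read_csv("benchmarks/csv/DemoS_004-all/COM_PEAK.csv")
n_rows = len(gt.drop_duplicates())  # counts the repeated REG1,ELC row once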

@olejandro (Member) commented Jan 26, 2024:

Regarding TFM_UPD: I don't understand how you can use sourcescen to apply the updates on top of each other if the behaviour is adding new rows instead of updating existing rows?

Well, one would normally use sourcescen to avoid this behaviour rather than to produce it. :-) Applying updates on top of each other is often the problem, not the other way around.

However, once we support sourcescen (btw, we already include info on sourcefile), I believe it will be easy to support applying the updates on top of each other with it as well.

Since you are updating the transform, I believe it makes sense to go ahead and change the behaviour from updating the row in place to adding a new row (without doing anything about sourcescen at this point). I don't think it will cause any regressions, but I may be wrong. :-)

@Antti-L commented Jan 27, 2024:

Replying to @olejandro:

  • let's say the original data has value 1
  • one update table does *0.5
  • another does *2
  • applying them to the same data would result in 0.5 and 2 (one would overwrite the other, depending on the order in which GAMS reads them)
  • applying them on top of each other would result in 1

Remember that there are two important order considerations here. TFM_UPD only works when the TFM_UPD table is present in an alphabetically later scenario than the source data to be used for the UPD, and the final value seen by GAMS will be according to the scenario order as defined by the user for the GAMS run. The final value could thus, in fact, be any one of {1, 0.5, 2}.
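
An illustrative sketch of that order dependence (scenario names hypothetical): each scenario contributes a value for the same parameter element, and the last scenario read wins.

values = {"BASE": 1.0, "UPD_A": 1.0 * 0.5, "UPD_B": 1.0 * 2.0}
for run_order in (
    ["BASE", "UPD_A", "UPD_B"],
    ["BASE", "UPD_B", "UPD_A"],
    ["UPD_A", "UPD_B", "BASE"],
):
    final = None
    for scenario in run_order:
        final = values[scenario]  # later scenarios overwrite earlier ones
    print(run_order, "->", final)  # prints 2.0, then 0.5, then 1.0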

Replying to @siddharth-krishna:

If the new row added by the *0.5 update does not retain the source_filename column, then there's no way to stack the updates and end up with 1.
I prefer the simpler/easier-to-understand behaviour of TFM_UPD which updates the row in place. Thus having update rows *0.5 and *2 results in 1, which seems intuitive.

The idea of TFM_UPD is not to update (change) the source data. The idea is that the scenarios U(i) where the TFM_UPDs are specified will define parameter values that depend on an (alphabetically preceding) source scenario (even when using sourcescen, the source scenario must be alphabetically preceding). The values in the source scenario should not be changed at all. In other words, the resulting value (e.g. 0.5*1) must be inserted into the U(i) scenarios where the TFM_UPD tables are specified, and not into the scenario(s) where the source data is specified. The final values seen by GAMS will depend on the scenario order, as defined by the user for the GAMS run. I suspect that it wouldn't even be possible to implement an "in place" updating approach such that it would work fully consistently with the *.DD files VEDA produces.

In my view, the new row added by the *0.5 update would thus need to be identified with the same scenario name as the TFM_UPD table. That way, "stacked updates" would work as they do when using DD files that VEDA produces.

The trouble with adding 2 rows with 0.5 and 2 to the output is: how should the user determine which row overwrites the other? Is it clear that the update table in the file that is specified last in the command line argument results in the last row?

The user specifies the order of the scenarios for each GAMS run, and therefore you must have a mechanism for defining that order. I am not sure the command line argument is a good way to do that if it includes Excel file names, which do not have a one-to-one correspondence with scenarios: for example, the Base scenario may consist of many Excel files, and each Subres scenario consists of two files.

@siddharth-krishna (Collaborator, Author):

@Antti-L thanks for the details on how Veda handles TFM_UPD.

@olejandro I tried to change the code so that TFM_UPD adds rows to the FI_T table instead of updating it in place. But regression tests fail because additional rows are generated. For example, in Demo 1, the ACT_COST output is:

REG,YEAR,PRC,CUR,VALUE
REG1,2005,IMPNRGZ,MEuro05,1111
REG1,2005,IMPMATZ,MEuro05,1111
REG1,2005,IMPDEMZ,MEuro05,1111
REG1,2005,IMPNRGZ,MEuro05,2222
REG1,2005,IMPMATZ,MEuro05,2222
REG1,2005,IMPDEMZ,MEuro05,2222
REG1,2005,IMPDEMZ,MEuro05,8888

Whereas the ground truth only has rows 4, 5, and 7 (header is row 0). Furthermore, GAMS errors with "Element is redefined".

I guess if we do this we also need to remove duplicate rows for parameters (counting all columns except values)? If so, I can make this change and the duplicate removal in a separate PR.
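
A hypothetical sketch of that deduplication (column names assumed, not the tool's actual schema): treat every column except the value as the key and keep the last record, matching the "last value survives" rule Antti describes below.

import pandas as pd

df = pd.DataFrame({
    "region": ["REG1", "REG1"],
    "year": [2005, 2005],
    "process": ["IMPDEMZ", "IMPDEMZ"],
    "value": [2222, 8888],
})
key_cols = [c for c in df.columns if c != "value"]
df = df.drop_duplicates(subset=key_cols, keep="last")  # keeps the 8888 record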

@Antti-L commented Jan 31, 2024:

I guess if we do this we also need to remove duplicate rows for parameters (counting all columns except values)?

If you are not going to write out a DD file for each scenario (like VEDA does), but only a single DD file for all data, then yes, I think you would need to do that. But I think ideally the tool would write out the data by scenario, because it is more transparent, and each scenario may also have some specific GAMS statements (like $ONEPS for zero handling). Is that not planned? If not, how would you handle e.g. $ONEPS / $OFFEPS, which the user may request for individual scenarios?

However, you could of course also write all scenarios into a single DD file, but successively (all data for one scenario, then all data for the next, etc.). You just need to also use $ONMULTI for that to work (as with separate files).

Ahh, and yes, I see that the TFM_UPD in SysSettings by itself generates one ACT_COST duplicate in the SysSettings scenario (for IMPDEMZ). So yes, these duplicates within scenarios would also have to be eliminated (the last value generated should always survive according to the VEDA rule).

@olejandro (Member):

@Antti-L thanks for your feedback. We should open an issue on $ONEPS / $OFFEPS to look at it in detail. Currently the tool outputs a single DD file.

@siddharth-krishna, ok, I will merge this then. Nice to see that the change in handling TFM_UPD has already resulted in a higher number of correct rows. What should be done first in that new PR is to remove duplicates generated by rows in the same table; this way our approach to measuring regressions will be unaffected.

Comment on lines 1589 to +1594
known_columns = config.known_columns[datatypes.Tag.tfm_ins] | query_columns
if table.tag == datatypes.Tag.tfm_mig:
# Also allow attribute2, year2 etc for TFM_MIG tables
known_columns.update(
(c + "2" for c in config.known_columns[datatypes.Tag.tfm_ins])
)
Member:

Suggested change
known_columns = config.known_columns[datatypes.Tag.tfm_ins] | query_columns
if table.tag == datatypes.Tag.tfm_mig:
# Also allow attribute2, year2 etc for TFM_MIG tables
known_columns.update(
(c + "2" for c in config.known_columns[datatypes.Tag.tfm_ins])
)
known_columns = config.known_columns[datatypes.Tag(table.tag)] | query_columns

Member:

Should this not be table specific?

Member:

Btw, we could also expand the tags file with info on whether a specific column is a query column and generate a list of query columns from there.
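
A hypothetical sketch of that idea (field names assumed, not the current tags-file schema):

tag_entry = {
    "tag_name": "TFM_INS",
    "valid_fields": [
        {"name": "cset_cn", "query_column": True},
        {"name": "value", "query_column": False},
    ],
}
query_columns = {f["name"] for f in tag_entry["valid_fields"] if f["query_column"]}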

"PastYears": model.past_years,
"ModelYears": model.model_years,
}.items():
for k, v in [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark idea sounds good!

@olejandro merged commit 6a9aa98 into main on Jan 31, 2024 (1 check failed).
@olejandro deleted the sidk-tfm-mig branch on January 31, 2024 at 17:47.
@siddharth-krishna (Collaborator, Author):

Oh, I was going to revert the last commit before merging! But no problem, the next PR should bring the additional rows down again.

@olejandro (Member):

Oops... sorry, my fault! It fits well here though. 🤷
