Refactor process_wildcards and add support for TFM_MIG #166

Merged · 11 commits · Jan 31, 2024

Conversation


@siddharth-krishna (Collaborator) commented Jan 25, 2024:

Fixes #83

Hold off on the merge until I clean up some remaining TODOs. But I thought I'd get your feedback on the main direction of the PR.

@olejandro (Member):

Nice! 🚀 The Demos 4 benchmarks are all at 100% now, except for 4-all, which is apparently missing 1 record. Is there a duplicate record in the ground truth that is causing this (since the missing record is not reported and GDX diff sees no difference)?

"PastYears": model.past_years,
"ModelYears": model.model_years,
}.items():
for k, v in [
Member:

Just curious as to why it is better to use a list of tuples here instead of a dictionary. Should the dictionary below be converted as well?

Collaborator (Author):

Ah, this is because iterating through a set results in a non-deterministic order (see also #50 and #67). This makes it very difficult to debug regressions, because I rely on the --verbose flag and a diff tool to find out which transformation caused the regression. But with non-deterministic behaviour, there are too many changes in the diff.

I'm considering adding a check to CI that runs the tool on a small benchmark ~5-10 times and ensures that all intermediate tables are identical across runs.
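
For illustration, a minimal sketch (hypothetical values, not code from the PR): with PYTHONHASHSEED unset, iterating over a set of strings can produce a different order on each interpreter run, whereas a list of tuples pins the order explicitly.

names = {"PastYears", "ModelYears", "DataYears"}
print(list(names))  # order can differ from one run to the next

for k, v in [
    ("PastYears", ["2000"]),
    ("ModelYears", ["2005", "2010"]),
]:
    print(k, v)  # always iterates in this order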

Collaborator (Author):

I left the dictionary below unchanged because the loop only assigns to a dictionary, so the order doesn't matter.
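
A small sketch of why that case is safe (hypothetical values): the loop body only writes into a dict, and two dicts compare equal regardless of insertion order, so a varying iteration order yields the same result.

result = {}
for key in {"b", "a", "c"}:    # set iteration order may vary between runs ...
    result[key] = key.upper()  # ... but each key is assigned the same value
print(result == {"a": "A", "b": "B", "c": "C"})  # True on every run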

Member:

The benchmark idea sounds good!

@olejandro (Member):

Actually, should we rename process_transform_insert to e.g. process_transform_tables and get rid of the TODO? Also, the warning should say "Dropped xxx transform tables" to indicate that other tfm tables may be included...

@olejandro (Member):

For the TFM_UPD tables, we should actually be adding records instead of modifying existing rows. The reason is the sourcescen field (which we don't currently support), which allows filtering rows by the file they come from. I.e. by using sourcescen one could apply UPD tables to the same data instead of on top of each other (see the sketch after this list):

  • let's say the original data has value 1
  • one update table does *0.5
  • another does *2
  • applying them to the same data would result in 0.5 and 2 (one would overwrite the other, depending on the order in which GAMS reads them)
  • applying them on top of each other would result in 1
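
A minimal sketch of the two behaviours in plain Python (illustrative values only, not code from the tool):

base = 1.0

# Applied to the same source data: two candidate records for one element;
# whichever record GAMS reads last overwrites the other.
upd_half = base * 0.5    # 0.5
upd_double = base * 2.0  # 2.0

# Applied on top of each other: the updates compose.
stacked = base * 0.5 * 2.0  # 1.0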

@siddharth-krishna (Collaborator, Author):

Is there a duplicate record in the ground truth that is causing this (since the missing record is not reported and GDX diff sees no difference)?

Nice catch, yes, COM_PEAK is defined identically in 2 DD files, so our dd_to_csv.py creates a file with a duplicated row:

$ cat benchmarks/csv/DemoS_004-all/COM_PEAK.csv 
REG,COM_GRP
REG1,ELC
REG1,ELC

Is the solution to fix the ground truth, or to modify dd_to_csv.py to pick one definition over the other?

applying them on top of each other would result in 1

Regarding TFM_UPD: I don't understand how you can use sourcescen to apply the updates on top of each other if the behaviour is adding new rows instead of updating existing rows? Unless the new row added by the *0.5 update has the same source_filename column -- but in that case if sourcescen is set to the file containing the original data 1, you'll end up with one row with 0.5 and one row with 1; and you can't end up with one row 0.5 and one row 2, right?

If the new row added by the *0.5 update does not retain the source_filename column, then there's no way to stack the updates and end up with 1.

Unless this is the behaviour of Veda and we really want to emulate it, I prefer the simpler/easier-to-understand behaviour of TFM_UPD which updates the row in place. Thus having update rows *0.5 and *2 results in 1, which seems intuitive.

The trouble with adding 2 rows with 0.5 and 2 to the output is: how should the user determine which row overwrites the other? Is it clear that the update table in the file that is specified last in the command line argument results in the last row?

@olejandro (Member):

Is the solution to fix the ground truth, or to modify dd_to_csv.py to pick one definition over the other?

How about just dropping duplicates when calculating the number of rows used to compute the match?

I guess there is a similar issue when reporting the number of rows per parameter (generated and ground truth).
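
A hypothetical sketch of that approach (variable names assumed, not the tool's actual code):

import pandas as pd

gt = pd.read_csv("benchmarks/csv/DemoS_004-all/COM_PEAK.csv")
n_rows = len(gt.drop_duplicates())  # counts the repeated REG1,ELC row once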

@olejandro (Member) commented Jan 26, 2024:

Regarding TFM_UPD: I don't understand how you can use sourcescen to apply the updates on top of each other if the behaviour is adding new rows instead of updating existing rows?

Well, one would normally use sourcescen to avoid this behaviour rather than to produce it. :-) Applying updates on top of each other is often the problem, not the other way around.

However, once we support sourcescen (btw, we already include info on sourcefile), I believe it will be easy to support applying the updates on top of each other with it as well.

Since you are updating the transform, I believe it makes sense to go ahead and change the behaviour from updating the row in place to adding a new row (without doing anything about sourcescen at this point). I don't think it will cause any regressions, but I may be wrong. :-)

@Antti-L commented Jan 27, 2024:

Replying to @olejandro:

  • let's say the original data has value 1
  • one update table does *0.5
  • another does *2
  • applying them to the same data would result in 0.5 and 2 (one would overwrite the other, depending on the order in which GAMS reads them)
  • applying them on top of each other would result in 1

Remember that there are two important order considerations here. TFM_UPD only works when the TFM_UPD table is present in an alphabetically later scenario than the source data to be used for the UPD, and the final value seen by GAMS will be according to the scenario order as defined by the user for the GAMS run. The final value could thus, in fact, be any one of {1, 0.5, 2}.
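
An illustrative sketch of that order dependence (scenario names hypothetical): each scenario contributes a value for the same parameter element, and the last scenario read wins.

values = {"BASE": 1.0, "UPD_A": 1.0 * 0.5, "UPD_B": 1.0 * 2.0}
for run_order in (
    ["BASE", "UPD_A", "UPD_B"],
    ["BASE", "UPD_B", "UPD_A"],
    ["UPD_A", "UPD_B", "BASE"],
):
    final = None
    for scenario in run_order:
        final = values[scenario]  # later scenarios overwrite earlier ones
    print(run_order, "->", final)  # prints 2.0, then 0.5, then 1.0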

Replying to @siddharth-krishna:

If the new row added by the *0.5 update does not retain the source_filename column, then there's no way to stack the updates and end up with 1.
I prefer the simpler/easier-to-understand behaviour of TFM_UPD which updates the row in place. Thus having update rows *0.5 and *2 results in 1, which seems intuitive.

The idea of TFM_UPD is not to update (change) the source data. The idea is that the scenarios U(i) where the TFM_UPDs are specified will define parameter values that depend on an (alphabetically preceding) source scenario (even when using sourcescen, the source scenario must be alphabetically preceding). The values in the source scenario should not be changed at all. In other words, the resulting value (e.g. 0.5*1) must be inserted into the U(i) scenarios where the TFM_UPD tables are specified, and not into the scenario(s) where the source data is specified. The final values seen by GAMS will depend on the scenario order, as defined by the user for the GAMS run. I suspect that it wouldn't even be possible to implement an "in place" updating approach such that it would work fully consistently with the *.DD files VEDA produces.

In my view, the new row added by the *0.5 update would thus need to be identified with the same scenario name as the TFM_UPD table. That way, "stacked updates" would work as they do when using DD files that VEDA produces.

The trouble with adding 2 rows with 0.5 and 2 to the output is: how should the user determine which row overwrites the other? Is it clear that the update table in the file that is specified last in the command line argument results in the last row?

The user specifies the order of the scenarios for each GAMS run, and therefore you must have a mechanism for defining that order. I am not sure the command line argument is a good way to do that if it includes Excel file names, which do not have a one-to-one correspondence with scenarios: for example, the Base scenario may consist of many Excel files, and each Subres scenario consists of two files.

@siddharth-krishna (Collaborator, Author):

@Antti-L thanks for the details on how Veda handles TFM_UPD.

@olejandro I tried to change the code so that TFM_UPD adds rows to the FI_T table instead of updating it in place. But regression tests fail because additional rows are generated. For example, in Demo 1, the ACT_COST output is:

REG,YEAR,PRC,CUR,VALUE
REG1,2005,IMPNRGZ,MEuro05,1111
REG1,2005,IMPMATZ,MEuro05,1111
REG1,2005,IMPDEMZ,MEuro05,1111
REG1,2005,IMPNRGZ,MEuro05,2222
REG1,2005,IMPMATZ,MEuro05,2222
REG1,2005,IMPDEMZ,MEuro05,2222
REG1,2005,IMPDEMZ,MEuro05,8888

Whereas the ground truth only has rows 4, 5, and 7 (header is row 0). Furthermore, GAMS errors with "Element is redefined".

I guess if we do this we also need to remove duplicate rows for parameters (counting all columns except values)? If so, I can make this change and the duplicate removal in a separate PR.
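
A hypothetical sketch of that deduplication (column names assumed, not the tool's actual schema): treat every column except the value as the key and keep the last record, matching the "last value survives" rule Antti describes below.

import pandas as pd

df = pd.DataFrame({
    "region": ["REG1", "REG1"],
    "year": [2005, 2005],
    "process": ["IMPDEMZ", "IMPDEMZ"],
    "value": [2222, 8888],
})
key_cols = [c for c in df.columns if c != "value"]
df = df.drop_duplicates(subset=key_cols, keep="last")  # keeps the 8888 record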

@Antti-L commented Jan 31, 2024:

I guess if we do this we also need to remove duplicate rows for parameters (counting all columns except values)?

If you are not going to write out a DD file for each scenario (like VEDA does), but only a single DD file for all data, then yes, I think you would need to do that. But I think ideally the tool would write out the data by scenario, because it is more transparent, and each scenario may also have some specific GAMS statements (like $ONEPS for zero handling). Is that not planned? If not, how would you handle e.g. $ONEPS / $OFFEPS, which the user may request for individual scenarios?

However, you could of course also write all scenarios into a single DD file, but successively (all data for one scenario, then all data for the next, etc.). You just need to also use $ONMULTI for that to work (as with separate files).

Ahh, and yes, I see that the TFM_UPD in SysSettings by itself generates one ACT_COST duplicate in the SysSettings scenario (for IMPDEMZ). So yes, these duplicates within scenarios would also have to be eliminated (the last value generated should always survive according to the VEDA rule).

@olejandro (Member):

@Antti-L thanks for your feedback. We should open an issue on $ONEPS / $OFFEPS to look at it in detail. Currently the tool outputs a single DD file.

@siddharth-krishna, ok, I will merge this then. Nice to see that the change in handling TFM_UPD has already resulted in a higher number of correct rows. What should be done first in that new PR is to remove duplicates generated by rows in the same table; this way our approach to measuring regressions will be unaffected.

Comment on lines 1589 to +1594
known_columns = config.known_columns[datatypes.Tag.tfm_ins] | query_columns
if table.tag == datatypes.Tag.tfm_mig:
# Also allow attribute2, year2 etc for TFM_MIG tables
known_columns.update(
(c + "2" for c in config.known_columns[datatypes.Tag.tfm_ins])
)
Member:

Suggested change
known_columns = config.known_columns[datatypes.Tag.tfm_ins] | query_columns
if table.tag == datatypes.Tag.tfm_mig:
# Also allow attribute2, year2 etc for TFM_MIG tables
known_columns.update(
(c + "2" for c in config.known_columns[datatypes.Tag.tfm_ins])
)
known_columns = config.known_columns[datatypes.Tag(table.tag)] | query_columns

Member:

Should this not be table specific?

Member:

Btw, we could also expand the tags file with info on whether a specific column is a query column and generate a list of query columns from there.
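
A hypothetical sketch of that idea (field names assumed, not the current tags-file schema):

tag_entry = {
    "tag_name": "TFM_INS",
    "valid_fields": [
        {"name": "cset_cn", "query_column": True},
        {"name": "value", "query_column": False},
    ],
}
query_columns = {f["name"] for f in tag_entry["valid_fields"] if f["query_column"]}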

"PastYears": model.past_years,
"ModelYears": model.model_years,
}.items():
for k, v in [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark idea sounds good!

@olejandro merged commit 6a9aa98 into main on Jan 31, 2024 (1 check failed).
@olejandro deleted the sidk-tfm-mig branch on January 31, 2024 at 17:47.
@siddharth-krishna (Collaborator, Author):

Oh, I was going to revert the last commit before merging! But no problem, the next PR should bring the additional rows down again.

@olejandro (Member):

Oops... sorry, my fault! It fits well here though. 🤷
