concat uses pandas append instead of concat #500

pjuergens · 2021-02-26T15:48:17Z

The implementation of pyam's `concat` method loops over pandas `append` instead of using the pandas `concat` method. Especially for large dataframes with subannual data, `concat` is much faster than `append`.
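To illustrate the difference between the two approaches, here is a minimal sketch using plain pandas objects rather than pyam itself. The helper names (`make_frames`, `concat_incrementally`, `concat_once`) and the toy data are assumptions made up for this example. Because `DataFrame.append` was later removed in pandas 2.0, the incremental variant is written with repeated `pd.concat` calls, which has the same drawback: every step copies all data accumulated so far.

```python
import pandas as pd


def make_frames(n_frames=50, rows=100):
    """Build small series with disjoint MultiIndex entries (toy timeseries data)."""
    frames = []
    for k in range(n_frames):
        index = pd.MultiIndex.from_product(
            [[f"scenario_{k}"], range(2000, 2000 + rows)],
            names=["scenario", "year"],
        )
        frames.append(pd.Series(float(k), index=index, name="value"))
    return frames


def concat_incrementally(frames):
    """Grow the result one object at a time; each step copies everything so far."""
    result = frames[0]
    for frame in frames[1:]:
        result = pd.concat([result, frame])
    return result


def concat_once(frames):
    """Hand all objects to pandas in a single call."""
    return pd.concat(frames)


frames = make_frames()
incremental = concat_incrementally(frames)
single = concat_once(frames)
```

Both functions should produce the same data; the single call just avoids the repeated copying, which is where the speed-up for large inputs is expected to come from. Timing both with `timeit` on realistic data would be the way to confirm the gain for a given case.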

I already implemented pandas `concat` in pyam's `concat` and will push it in a minute. I hope the new feature will find its way into the next release.

pjuergens · 2021-02-26T16:00:34Z

I'm not sure yet how to contribute, i.e. create a branch and upload code :) I'll have a deeper look at it on Monday. For now I'll just post the new code here so I don't forget it... The code replaces/extends the `concat` method in core.py

```python
def concat(dfs, ignore_meta_conflict=False, **kwargs):
    """Concatenate a series of IamDataFrame-like objects

    Parameters
    ----------
    dfs : list of IamDataFrames
        a list of :class:`IamDataFrame` instances
    ignore_meta_conflict : bool, default False
        If False, raise an error if any meta columns present in more than
        one of the objects in `dfs` are not identical.
    kwargs
        Passed to :class:`IamDataFrame(df, **kwargs) <IamDataFrame>`
        for any item of `dfs` that is not already an IamDataFrame

    Returns
    -------
    IamDataFrame

    Raises
    ------
    TypeError
        If `dfs` is a string or not an iterable
    ValueError
        If time domain or other timeseries data index dimension don't match
    """
    if isstr(dfs) or not hasattr(dfs, '__iter__'):
        msg = 'Argument must be a non-string iterable (e.g., list or tuple)'
        raise TypeError(msg)

    # work on a list copy so that tuples are supported and the caller's
    # list is not modified when items are cast to IamDataFrame
    dfs = list(dfs)

    for i in range(len(dfs)):
        # cast to IamDataFrame if necessary (meta conflicts are then ignored)
        if not isinstance(dfs[i], IamDataFrame):
            dfs[i] = IamDataFrame(dfs[i], **kwargs)
            ignore_meta_conflict = True

        if dfs[0].time_col != dfs[i].time_col:
            raise ValueError('Incompatible time format (`year` vs. `time`)')

        if dfs[0]._data.index.names != dfs[i]._data.index.names:
            raise ValueError('Incompatible timeseries data index dimensions')

    _df = dfs[0].copy()

    # merge `meta` tables
    for df in dfs:
        _df.meta = merge_meta(_df.meta, df.meta, ignore_meta_conflict)

    # concatenate data in one call (verify integrity for no duplicates)
    _dfs_data = [df._data for df in dfs]
    _data = pd.concat(_dfs_data, verify_integrity=True)

    # merge extra columns in `data` and set `self._LONG_IDX`
    for df in dfs:
        _df.extra_cols += [i for i in df.extra_cols
                           if i not in _df.extra_cols]
    _df._LONG_IDX = IAMC_IDX + [_df.time_col] + _df.extra_cols
    _df._data = _data.sort_index()

    return _df
```
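The proposal relies on `verify_integrity=True` to reject duplicate index entries. A minimal sketch of that behaviour with plain pandas series follows; the helper `make_series` and the scenario names are hypothetical and only serve this illustration.

```python
import pandas as pd


def make_series(scenario, years, value):
    """Build a toy timeseries indexed by (scenario, year)."""
    index = pd.MultiIndex.from_product(
        [[scenario], years], names=["scenario", "year"]
    )
    return pd.Series(value, index=index, name="value")


a = make_series("scen_a", [2020, 2030], 1.0)
b = make_series("scen_b", [2020, 2030], 2.0)
# shares the entry ("scen_a", 2030) with `a`
overlap = make_series("scen_a", [2030, 2040], 3.0)

# disjoint indexes: the concatenation succeeds
combined = pd.concat([a, b], verify_integrity=True).sort_index()

# overlapping indexes: pandas raises a ValueError
try:
    pd.concat([a, overlap], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```

With this flag, overlapping timeseries data raises an error rather than passing silently into the result, at the cost of an extra uniqueness check on the combined index.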

danielhuppmann · 2021-02-26T16:32:36Z

Thanks for the suggestion - I have a few ideas on how to improve that, but I'll wait until you have mastered the first steps of working with git(hub)... I'm pretty sure that there is some value for you in learning that beyond improving pyam performance. Happy to assist bilaterally if you join our Slack channel, see https://pyam-iamc.readthedocs.io/en/stable/contributing.html

pjuergens · 2021-03-01T14:46:22Z

I think I managed to create the pull request :)

This was referenced Mar 1, 2021

Update concat-method pjuergens/pyam#1

Closed

Concat #501

Closed

danielhuppmann mentioned this issue Mar 21, 2021

Improve performance of pyam.concat() #510

Merged

4 tasks

danielhuppmann closed this as completed in #510 Mar 22, 2021
