Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat uses pandas append instead of concat #500

Closed
pjuergens opened this issue Feb 26, 2021 · 3 comments · Fixed by #510
Closed

concat uses pandas append instead of concat #500

pjuergens opened this issue Feb 26, 2021 · 3 comments · Fixed by #510

Comments

@pjuergens
Copy link
Contributor

The implementation of pyams concat method loops over pandas append instead of using pandas concat method. Especially for large dataframes with subannual data concat is much faster than append.

I already implemented pandas concat in pyams concat and will push it in a minute. I hope the new feature will find its way in the next release.

@pjuergens
Copy link
Contributor Author

I'm not sure yet how to contribute, i.e. create a branch and upload code :) I'll have a deeper look at it on monday. For now I just upload the new Code here for not forgetting it... The code replaces/extends the concat-method in core.py

def concat(dfs, ignore_meta_conflict=False, **kwargs):
    """Concatenate a series of IamDataFrame-like objects

    Parameters
    ----------
    dfs : list of IamDataFrames
        a list of :class:`IamDataFrame` instances
    ignore_meta_conflict : bool, default False
        If False and `other` is an IamDataFrame, raise an error if
        any meta columns present in `self` and `other` are not identical.
    kwargs
        Passed to :class:`IamDataFrame(other, **kwargs) <IamDataFrame>`
        if at least one of dfs is not already an IamDataFrame
    
    Returns
    -------
    IamDataFrame
    
    Raises
    ------
    ValueError
        If time domain or other timeseries data index dimension don't match
    """
    if isstr(dfs) or not hasattr(dfs, '__iter__'):
        msg = 'Argument must be a non-string iterable (e.g., list or tuple)'
        raise TypeError(msg)
    
    for i in range(len(dfs)):
        if not isinstance(dfs[i], IamDataFrame):
            dfs[i] = IamDataFrame(dfs[i], **kwargs)
            ignore_meta_conflict = True
            
        if dfs[0].time_col != dfs[i].time_col:
            raise ValueError('Incompatible time format (`year` vs. `time`)')
        
        if dfs[0]._data.index.names != dfs[i]._data.index.names:
            raise ValueError('Incompatible timeseries data index dimensions')

    _df = dfs[0].copy()
    
    # merge `meta` tables
    for df in dfs:
        _df.meta = merge_meta(_df.meta, df.meta, ignore_meta_conflict)
        
    # concatenate data (verify integrity for no duplicates)
    _dfs_data = [df._data for df in dfs]
    _data = pd.concat(_dfs_data, verify_integrity=True)
    
    # merge extra columns in `data` and set `self._LONG_IDX`
    for df in dfs:
        _df.extra_cols += [i for i in df.extra_cols
                           if i not in _df.extra_cols]
    _df._LONG_IDX = IAMC_IDX + [_df.time_col] + _df.extra_cols
    _df._data = _data.sort_index()

    return _df```

@danielhuppmann
Copy link
Member

Thanks for the suggestion - I have a few ideas on how to improve that, but I'll wait until you have mastered the first steps of working with git(hub)... I'm pretty sure that there is some value for you in learning that beyond improving pyam performance. Happy to assist bilaterally if you join our Slack channel, see https://pyam-iamc.readthedocs.io/en/stable/contributing.html

This was referenced Mar 1, 2021
@pjuergens
Copy link
Contributor Author

I think I managed to create the pull request :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants