PDEP-1: Purpose and guidelines for pandas enhancement proposals #47444

Merged (18 commits) on Aug 3, 2022
Changes from 11 commits
Commits
8bde84f
PDEP-1: Purpose and guidelines for pandas enhancement proposals
datapythonista Jun 21, 2022
0b43492
black
datapythonista Jun 21, 2022
3d9a75b
Update PR number
datapythonista Jun 21, 2022
6d9d34b
Update web/pandas/pdeps/accepted/0001-purpose-and-guidelines.md
datapythonista Jun 21, 2022
a0e6cda
Update web/pandas/pdeps/accepted/0001-purpose-and-guidelines.md
datapythonista Jun 21, 2022
a0d7276
Update web/pandas/pdeps/accepted/0001-purpose-and-guidelines.md
datapythonista Jun 21, 2022
a8295b8
Implemented PDEPs including pandas version
datapythonista Jun 21, 2022
1e408dd
Merge branch 'pdep' of github.com:datapythonista/pandas into pdep
datapythonista Jun 21, 2022
291de8d
Merge remote-tracking branch 'upstream/main' into pdep
datapythonista Jun 25, 2022
2ce2164
Merge remote-tracking branch 'upstream/main' into pdep
datapythonista Jun 27, 2022
ebf1687
Addressed feedback from reviews, couple of visualization fixes, and s…
datapythonista Jun 27, 2022
05d43a5
Merge main
datapythonista Jul 30, 2022
d20de1e
Addressing comments from reviews
datapythonista Jul 30, 2022
9b37d11
Update web/pandas/pdeps/0001-purpose-and-guidelines.md
datapythonista Aug 3, 2022
55b3887
Update web/pandas/pdeps/0001-purpose-and-guidelines.md
datapythonista Aug 3, 2022
8c34db0
Update web/pandas/pdeps/0001-purpose-and-guidelines.md
datapythonista Aug 3, 2022
4f3343b
Update web/pandas/pdeps/0001-purpose-and-guidelines.md
datapythonista Aug 3, 2022
7c1a725
Last review comments and dates updated
datapythonista Aug 3, 2022
71 changes: 35 additions & 36 deletions web/pandas/about/roadmap.md
@@ -15,10 +15,35 @@ fundamental changes to the project that are likely to take months or
years of developer time. Smaller-scoped items will continue to be
tracked on our [issue tracker](https://github.com/pandas-dev/pandas/issues).

See [Roadmap evolution](#roadmap-evolution) for proposing
changes to this document.
The roadmap is defined as a set of major enhancement proposals named PDEPs.
For more information about PDEPs, and how to submit one, please refer to
[PDEP-1](/pdeps/accepted/0001-purpose-and-guidelines.html).

## Extensibility
## PDEPs
simonjayhawkins marked this conversation as resolved.

{% for pdep_type in ["under_discussion", "accepted", "implemented", "rejected"] %}

<h3 id="pdeps-{{pdep_type}}">{{ pdep_type.replace("_", " ").capitalize() }}</h3>

<ul>
{% for pdep in pdeps[pdep_type] %}
<li><a href="{{ pdep.url }}">{{ pdep.title }}</a></li>
{% else %}
<li>There are currently no PDEPs with this status</li>
{% endfor %}
</ul>

{% endfor %}
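
For readers less familiar with Jinja, the loop above can be approximated in plain Python. This is only a sketch: `render_pdep_lists` is a hypothetical helper, not part of `pandas_web.py`, and the sample data is made up. It shows how each status key becomes a heading and how the `{% else %}` branch of the inner loop supplies a fallback item when a list is empty.

```python
STATUSES = ["under_discussion", "accepted", "implemented", "rejected"]


def render_pdep_lists(pdeps):
    """Approximate the template: one <h3> and one <ul> per status."""
    lines = []
    for pdep_type in STATUSES:
        # Same expression as in the template: "under_discussion" -> "Under discussion"
        heading = pdep_type.replace("_", " ").capitalize()
        lines.append(f'<h3 id="pdeps-{pdep_type}">{heading}</h3>')
        lines.append("<ul>")
        items = pdeps.get(pdep_type, [])
        for pdep in items:
            lines.append(f'<li><a href="{pdep["url"]}">{pdep["title"]}</a></li>')
        if not items:
            # Mirrors the {% else %} branch of the Jinja for-loop.
            lines.append("<li>There are currently no PDEPs with this status</li>")
        lines.append("</ul>")
    return "\n".join(lines)


# Hypothetical sample data, for illustration only.
sample = {
    "accepted": [
        {
            "title": "PDEP-1: Purpose and guidelines",
            "url": "/pdeps/accepted/0001-purpose-and-guidelines.html",
        }
    ]
}
html = render_pdep_lists(sample)
```

Unlike the real template, this sketch does no HTML escaping, so it should not be used on untrusted titles.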

## Roadmap points pending a PDEP

<div class="alert alert-warning" role="alert">
pandas is in the process of moving roadmap points to PDEPs (implemented in
June 2022). During the transition, some roadmap points will exist as PDEPs,
Review comment (Member):

Suggested change
- June 2022). During the transition, some roadmap points will exist as PDEPs,
+ July 2022). During the transition, some roadmap points will exist as PDEPs,

to match the PDEP created date? but probably August.

Review comment (Contributor):

yeah switch to August :->

while others will exist as sections below.
</div>

### Extensibility

Pandas `extending.extension-types` allow
for extending NumPy types with custom data types and array storage.
@@ -33,7 +58,7 @@ library, making their behavior more consistent with the handling of
NumPy arrays. We'll do this by cleaning up pandas' internals and
adding new methods to the extension array interface.

## String data type
### String data type

Currently, pandas stores text data in an `object` -dtype NumPy array.
The current implementation has two primary drawbacks: First, `object`
@@ -54,7 +79,7 @@ work, we may need to implement certain operations expected by pandas
users (for example the algorithm used in, `Series.str.upper`). That work
may be done outside of pandas.

## Apache Arrow interoperability
### Apache Arrow interoperability

[Apache Arrow](https://arrow.apache.org) is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned
@@ -65,7 +90,7 @@ data types within pandas. This will let us take advantage of its I/O
capabilities and provide for better interoperability with other
languages and libraries using Arrow.

## Block manager rewrite
### Block manager rewrite

We'd like to replace pandas current internal data structures (a
collection of 1 or 2-D arrays) with a simpler collection of 1-D arrays.
@@ -92,7 +117,7 @@ See [these design
documents](https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals)
for more.

## Decoupling of indexing and internals
### Decoupling of indexing and internals

The code for getting and setting values in pandas' data structures
needs refactoring. In particular, we must clearly separate code that
@@ -107,7 +132,7 @@ Indexing is a complicated API with many subtleties. This refactor will
require care and attention. More details are discussed at
<https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code>

## Numba-accelerated operations
### Numba-accelerated operations

[Numba](https://numba.pydata.org) is a JIT compiler for Python code.
We'd like to provide ways for users to apply their own Numba-jitted
@@ -119,7 +144,7 @@ window contexts). This will improve the performance of
user-defined-functions in these operations by staying within compiled
code.

## Documentation improvements
### Documentation improvements

We'd like to improve the content, structure, and presentation of the
pandas documentation. Some specific goals include
@@ -134,7 +159,7 @@ pandas documentation. Some specific goals include
subsections of the documentation to make navigation and finding
content easier.

## Performance monitoring
### Performance monitoring

Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/)
to monitor for performance regressions. ASV itself is a fabulous tool,
@@ -154,29 +179,3 @@ We'd like to fund improvements and maintenance of these tools to
<https://pyperf.readthedocs.io/en/latest/system.html>
- Build a GitHub bot to request ASV runs *before* a PR is merged.
Currently, the benchmarks are only run nightly.

## Roadmap Evolution

Pandas continues to evolve. The direction is primarily determined by
community interest. Everyone is welcome to review existing items on the
roadmap and to propose a new item.

Each item on the roadmap should be a short summary of a larger design
proposal. The proposal should include

1. Short summary of the changes, which would be appropriate for
inclusion in the roadmap if accepted.
2. Motivation for the changes.
3. An explanation of why the change is in scope for pandas.
4. Detailed design: Preferably with example-usage (even if not
implemented yet) and API documentation
5. API Change: Any API changes that may result from the proposal.

That proposal may then be submitted as a GitHub issue, where the pandas
maintainers can review and comment on the design. The [pandas mailing
list](https://mail.python.org/mailman/listinfo/pandas-dev) should be
notified of the proposal.

When there's agreement that an implementation would be welcome, the
roadmap should be updated to include the summary and a link to the
discussion issue.
3 changes: 3 additions & 0 deletions web/pandas/config.yml
@@ -11,6 +11,7 @@ main:
- pandas_web.Preprocessors.blog_add_posts
- pandas_web.Preprocessors.maintainers_add_info
- pandas_web.Preprocessors.home_add_releases
- pandas_web.Preprocessors.roadmap_pdeps
markdown_extensions:
- toc
- tables
@@ -157,3 +158,5 @@ sponsors:
logo: /static/img/partners/r_studio.svg
kind: partner
description: "Wes McKinney"
roadmap:
pdeps_path: pdeps
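
The new `roadmap.pdeps_path` key is read by the `roadmap_pdeps` preprocessor added to `pandas_web.py` in this PR, which joins it onto the site's source path. A minimal sketch of that lookup follows, assuming a context dictionary shaped like the one `pandas_web.py` builds; the `source_path` value is a placeholder, and `PurePosixPath` is used here (instead of `pathlib.Path`) only so the result looks the same on every platform.

```python
import pathlib

# Placeholder context; the real one is assembled by pandas_web.py from config.yml.
context = {
    "source_path": "web/pandas",
    "roadmap": {"pdeps_path": "pdeps"},
}

# Resolve the PDEP root relative to the website sources.
pdeps_path = pathlib.PurePosixPath(context["source_path"]) / context["roadmap"]["pdeps_path"]

# One subdirectory per status that is stored on disk.
status_dirs = [pdeps_path / status for status in ("accepted", "rejected", "implemented")]
```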
121 changes: 121 additions & 0 deletions web/pandas/pdeps/accepted/0001-purpose-and-guidelines.md
@@ -0,0 +1,121 @@
# PDEP-1: Purpose and guidelines

- Date: 21 June 2022
Review comment (Member):

Maybe "Created" instead of "Date"? Since this will typically the date when the process is started, and not when it was eg accepted, "created" might denote that more correctly (and both NEPs and PEPs seem to use that)

- Status: Under discussion
- Discussion: [#47444](https://github.com/pandas-dev/pandas/pull/47444)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1
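
The header fields above follow a simple `- Key: value` list, so tooling could read them without a full markdown parser. The following is a hypothetical sketch (`parse_pdep_header` is not part of this PR, and the sample text is copied from the header shown here) that collects those fields from the top of a PDEP file. It needs Python 3.9 or later for `str.removeprefix`.

```python
import re

HEADER_LINE = re.compile(r"^- (?P<key>[A-Za-z ]+): (?P<value>.+)$")


def parse_pdep_header(text):
    """Return (title, fields) from the leading part of a PDEP markdown file."""
    lines = text.splitlines()
    title = lines[0].removeprefix("# ").strip()
    fields = {}
    for line in lines[1:]:
        if line.startswith("## "):
            break  # the first section heading ends the header
        match = HEADER_LINE.match(line)
        if match:
            fields[match["key"]] = match["value"].strip()
    return title, fields


sample = """# PDEP-1: Purpose and guidelines

- Date: 21 June 2022
- Status: Under discussion
- Discussion: [#47444](https://github.com/pandas-dev/pandas/pull/47444)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1

## PDEP definition, purpose and scope
"""
title, fields = parse_pdep_header(sample)
```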

## PDEP definition, purpose and scope

A PDEP (pandas enhancement proposal) is a proposal to a **major** change in
Review comment (Member):

Suggested change
- A PDEP (pandas enhancement proposal) is a proposal to a **major** change in
+ A PDEP (pandas enhancement proposal) is a proposal for a **major** change in

? (not fully sure, but "proposal to" sounds like it should be followed by a verb

pandas, in a similar way as a Python [PEP](https://peps.python.org/pep-0001/)
or a NumPy [NEP](https://numpy.org/neps/nep-0000.html).

Bug fixes and conceptually minor changes (e.g. adding a parameter to a function)
are out of the scope of PDEPs. A PDEP should be used for changes that are not
immediate and not obvious, and are expected to require a significant amount of
discussion and require detailed documentation before being implemented.

PDEPs are appropriate for user-facing changes, internal changes and organizational
discussions. Examples of topics worth a PDEP could include moving a module from
pandas to a separate repository, a refactoring of the pandas block manager or
a proposal of a new code of conduct.

## PDEP guidelines

### Target audience

A PDEP is a public document available to anyone, but the main stakeholders to
consider when writing a PDEP are:

- The core development team, who will have the final decision on whether a PDEP
is approved or not
- Developers of pandas and other related projects, and experienced users. Their
Review comment (Member):

Suggested change
- - Developers of pandas and other related projects, and experienced users. Their
+ - Contributors to pandas and other related projects, and experienced users. Their

? (might be a bit more generic wording, as "developer" sounds more code-centric?)

feedback is highly encouraged and appreciated, to make sure all points of view
are taken into consideration
- The wider pandas community, in particular users, who may or may not have feedback
on the proposal, but should know and be able to understand the future direction of
the project

### PDEP authors

Anyone can propose a PDEP, but in most cases developers of pandas itself and related
projects are expected to author PDEPs. If you are unsure if you should be opening
an issue or creating a PDEP, it's probably safe to start by
[opening an issue](https://github.com/pandas-dev/pandas/issues/new/choose), which can
be eventually moved to a PDEP.

### Workflow

#### Submitting a PDEP

Proposing a PDEP is done by creating a PR adding a new file to `web/pdeps/accepted/`.
The file is a markdown file, you can use `web/pdeps/accepted/0001.md` as a reference
Review comment (Member):

I feel like it would be convenient to have an actual blank template PDEP file to be filled out.

It might also be nice to have a script that generates the next PDEP number for you(I anticipate it might be hard to find the next PDEP number if a lot of PDEPs are submitted in the future).

Review comment (Member, Author):

I'd personally leave that for later, once we need it. Those sound like good ideas to me, but I wouldn't implement them initially, I would wait to see how things work first. If we merge like PDEP per month, I think checking the last PDEP number before merging is easier than a system to autogenerate them. And for a template, I'd wait to have few actual PDEPs before deciding if it helps, or if PDEPs are too different from one to another.

Does it make sense to you to start simple and iterate later as we have more experience?

for the expected format.

By default, we expect a PDEP will be accepted, so the PR of a PDEP should be done
Review comment (Member):

Do we want to also have a public comment period on PDEP's like tensorflow has with their RFC's?

We should probably also clarify when voting happens(define what proportion of core team members need to consider a PDEP ready before voting like @mroeschke stated) and how long core team members have to vote.

Review comment (Member, Author):

Happy to discuss and hear other opinions. But to me personally, I wouldn't have a voting period or deadline.

I see PDEPs more like ongoing discussions, with feedback and updates, than just a voting process. Also, I assume tensorflow core devs are mainly google employees, so imposing a deadline makes more sense, as they are supposed to be working on the project X hours. But for a mostly volunteer developed project like pandas, I'd leave it more open.

But again, I'd start by seeing how things work, and if we see PDEPs keep open for too long, and seems like having a timeframe should help, surely worth giving it a try.

Review comment (Member):

> But again, I'd start by seeing how things work, and if we see PDEPs keep open for too long, and seems like having a timeframe should help, surely worth giving it a try.

I agree, just having a PDEP is akin to a new process, so having a process for the PDEP is maybe like introducing another meta process at the same time. So in response to this comment and others related to the process, I agree with @datapythonista that to begin with we just discuss whether we want to use PDEPs, what we want out of a PDEP, and roughly what it should contain and iterate on the gaps over time. The only caveat here, is that once someone has spent effort preparing a PDEP and it is approved that we don't have "blockers" implementing, so that does affect the approval process to some degree.

Review comment (Contributor):

i agree - we need a voting process here - we could emulate that of NEP which is pretty simple

Review comment (Member):

> we need a voting process here

I think what Marc proposed above is to defer a discussion about the decision process (basically our governance model) for a subsequent discussion.

(which sounds good to me to keep the discussion manageable)

Review comment (Contributor):

i am not sure this is true - sure a lot will easily be accepted but some might be controversial and ultimately not accepted

Review comment (Member):

yeah. maybe just qualify this by changing By default, we expect a PDEP will be accepted to something like By default, we expect a PDEP (with some preliminary discussion) will be accepted

before this text we have...

> A PDEP should be used for changes that are not
> immediate and not obvious, and are expected to require a significant amount of
> discussion and require detailed documentation before being implemented.

so the PEP is part of the detailed documentation and an issue should perhaps be the initial part of the discussion. (just as I would normally expect say a bug fix PR to have an associated issue)

and we also have

> If you are unsure if you should be opening
> an issue or creating a PDEP, it's probably safe to start by
> opening an issue, which can
> be eventually moved to a PDEP.

I think that maybe issues should always be opened before submitting a PEP, either as a specific issue or an existing issue concluding that a PEP is required. Maybe we also need to ensure that @pandas-dev/pandas-core is also notified in these cases.

If we have some form of gate for opening a PEP, then most should be accepted.

Review comment (Member):

We could also just leave out the "By default, we expect a PDEP will be accepted" ? To me it doesn't seem to add much, rather than a potentially wrong/confusing message (in general PDEPs are for topics that are not trivial and thus will not always be accepted)

in the `accepted` directory, but we will keep `Status: Under discussion` until it is ready to
be merged. If a PDEP is finally rejected, its status and directory will be updated
by the core team before merging, once the decision is made. Please make sure you
select the option `Allow edits and access to secrets by maintainers` when opening the PR.

#### Accepted PDEP

A PDEP can only be accepted by the core development team, if the proposal is considered
worth implementing. Decisions will be made based on the process detailed in the
[pandas governance document](https://github.com/pandas-dev/pandas-governance/blob/master/governance.md).
In general, more than one approval will be needed before the PR is merged. And
there should not be any `Request changes` review at the time of merging.

Once a PDEP is accepted, anyone can contribute to its implementation, and there is
no fixed completion timeline. pandas is developed by a mix of volunteers and
developers paid from different sources, so development priorities are difficult to
understand or forecast. Companies, institutions or individuals interested in seeing a
PDEP implemented, or in seeing progress on the pandas roadmap in general,
can check how to help on the [contributing page](/contribute.html).

#### Implemented PDEP

Once a PDEP is implemented and available in the main branch of pandas, its
mroeschke marked this conversation as resolved.
status will be changed to implemented, so that it is visible that the PDEP
is no longer part of the roadmap and future plans, but a change that has already
happened. The first pandas version in which the PDEP implementation is
available will also be included in the PDEP.

#### Rejected PDEP

A PDEP can be rejected when the final decision is that its implementation is
not the best for the interests of the project. They are as useful as accepted
Review comment (Member):

The word "accepted" seems a bit strange here, since this is about PDEPs that are not accepted?

Review comment (Member):

This is to contrast rejected PDEPs against accepted PDEPs; i.e. "even rejected PDEPs are useful". I think the wording is okay, but maybe "Rejected PDEPs are just as useful as accepted PDEPs..." would make it more clear?

Review comment (Member):

Ah, I missed the first "as" in the sentence, so I read it as "they are useful as accepted PDEP", instead of "they are as useful as accepted PDEPs", hence the confusion.

PDEPs, since there are discussions that are worth having, and decisions about
changes to pandas being made. They will be merged with `Status: Rejected`, so
there is visibility on what was discussed and what was the outcome of the
discussion. A PDEP can be rejected for different reasons, for example good ideas
that aren't backward-compatible, and the breaking changes aren't considered worth
implementing.

#### Invalid PDEP

For submitted PDEPs that do not contain proper documentation, are out of scope, or
are not useful to the community for any other reason, the PR will be closed after
discussion with the author, instead of merging them as rejected. This is to not
add noise to the list of rejected PDEPs, which should contain documentation as
good as an accepted PDEP, but where the final decision was to not implement the changes.
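
One way to read the workflow subsections above is as a small state machine. The sketch below encodes that reading. The enum, the transition table, and the choice to leave an invalid PDEP out entirely (its PR is closed, so it never receives a merged status) are an interpretation for illustration, not something the PDEP text specifies.

```python
from enum import Enum


class Status(Enum):
    UNDER_DISCUSSION = "Under discussion"
    ACCEPTED = "Accepted"
    IMPLEMENTED = "Implemented"
    REJECTED = "Rejected"


# Allowed moves, as read from the "Workflow" subsections (an assumption).
TRANSITIONS = {
    Status.UNDER_DISCUSSION: {Status.ACCEPTED, Status.REJECTED},
    Status.ACCEPTED: {Status.IMPLEMENTED},
    Status.IMPLEMENTED: set(),
    Status.REJECTED: set(),
}


def can_move(current, target):
    """Return True if a PDEP may go from `current` to `target`."""
    return target in TRANSITIONS[current]
```

For example, `can_move(Status.UNDER_DISCUSSION, Status.REJECTED)` is true under this reading, while `can_move(Status.REJECTED, Status.IMPLEMENTED)` is not.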

## Evolution of PDEPs

Most PDEPs aren't expected to change after being accepted. Once there is agreement on the changes,
and they are implemented, the PDEP will only be useful to understand why the development happened,
and the details of the discussion.

In some cases, however, a PDEP can be updated. Examples are a PDEP defining a procedure or
a policy, like this one (PDEP-1), or a PDEP whose attempted implementation produces
new knowledge that makes the original text obsolete, so that changes are
required. When specific changes need to be made to the original PDEP, it will
be edited, its `Revision: X` label will be increased by one, and a note will be added
to the `PDEP-N history` section. This will let readers understand that the PDEP has
changed and avoid confusion.
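
The revision step described above is mechanical enough to sketch. Assuming the header keeps a `- Revision: N` line and the file ends with its history list (both as in this document), a hypothetical helper could look like the following. `bump_revision`, its arguments, and the sample date and note are illustrative; no such script exists in this PR.

```python
import re


def bump_revision(text, date, note):
    """Increase `- Revision: N` by one and append a dated history entry."""

    def increment(match):
        return f"{match.group(1)}{int(match.group(2)) + 1}"

    updated, count = re.subn(
        r"^(- Revision: )(\d+)$", increment, text, count=1, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError("no '- Revision: N' line found")
    # Assumes the history list is the last block in the file.
    return updated.rstrip("\n") + f"\n- {date}: {note}\n"


original = (
    "# PDEP-1: Purpose and guidelines\n\n"
    "- Revision: 1\n\n"
    "### PDEP-1 History\n\n"
    "- 21 June 2022: Initial version\n"
)
# Made-up date and note, for illustration only.
revised = bump_revision(original, "3 August 2022", "Example entry")
```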

### PDEP-1 History

- 21 June 2022: Initial version
8 changes: 6 additions & 2 deletions web/pandas/static/css/pandas.css
@@ -8,15 +8,19 @@ h1 {
color: #130654;
}
h2 {
font-size: 1.45rem;
font-size: 1.8rem;
font-weight: 700;
color: black;
color: #130654;
}
h3 {
font-size: 1.3rem;
font-weight: 600;
color: black;
}
h3 a {
color: black;
text-decoration: underline dotted !important;
}
a {
color: #130654;
}
54 changes: 54 additions & 0 deletions web/pandas_web.py
@@ -28,6 +28,7 @@
import importlib
import operator
import os
import pathlib
import re
import shutil
import sys
@@ -185,6 +186,59 @@ def home_add_releases(context):
)
return context

    @staticmethod
    def roadmap_pdeps(context):
        """
        PDEP's (pandas enhancement proposals) are not part of the bar
        navigation. They are included as lists in the "Roadmap" page
        and linked from there. This preprocessor obtains the list of
        PDEP's in different status from the directory tree and GitHub.
        """
        context["pdeps"] = {
            "accepted": [],
            "rejected": [],
            "under_discussion": [],
            "implemented": [],
        }
        # accepted, rejected and implemented
        pdeps_path = (
            pathlib.Path(context["source_path"]) / context["roadmap"]["pdeps_path"]
        )
        for status in ("accepted", "rejected", "implemented"):
            status_dir = pdeps_path / status
            if not status_dir.is_dir():
                continue
            for pdep in sorted(status_dir.iterdir()):
                if pdep.suffix != ".md":
                    continue
                html_file = pdep.with_suffix(".html").name
                print(pdep)
                with pdep.open() as f:
                    title = f.readline()[2:]  # removing markdown title "# "
                context["pdeps"][status].append(
                    {
                        "title": title,
                        "url": f"/pdeps/{status}/{html_file}",
                    }
                )

        # under discussion
        github_repo_url = context["main"]["github_repo_url"]
        resp = requests.get(
            "https://api.github.com/search/issues?"
            f"q=is:pr is:open label:PDEP repo:{github_repo_url}"
        )
        if context["ignore_io_errors"] and resp.status_code == 403:
            return context
        resp.raise_for_status()

        for pdep in resp.json()["items"]:
            context["pdeps"]["under_discussion"].append(
                {"title": pdep["title"], "url": pdep["url"]}
            )

        return context
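
The local half of `roadmap_pdeps` (the part that does not call GitHub) can be exercised on its own. The sketch below is a simplified, standalone variant rather than the code from this PR: `collect_local_pdeps` is a hypothetical name, it strips the trailing newline from the title, it omits the `print` call, and it builds a throwaway directory tree so that it runs without the pandas website sources.

```python
import pathlib
import tempfile


def collect_local_pdeps(pdeps_path):
    """Map each on-disk status to a list of {"title", "url"} entries."""
    result = {"accepted": [], "rejected": [], "implemented": []}
    for status in result:
        status_dir = pathlib.Path(pdeps_path) / status
        if not status_dir.is_dir():
            continue
        for pdep in sorted(status_dir.iterdir()):
            if pdep.suffix != ".md":
                continue
            with pdep.open(encoding="utf-8") as f:
                # The first line is the markdown title, e.g. "# PDEP-1: ...".
                title = f.readline().removeprefix("# ").strip()
            html_file = pdep.with_suffix(".html").name
            result[status].append(
                {"title": title, "url": f"/pdeps/{status}/{html_file}"}
            )
    return result


# Throwaway tree with one PDEP and one file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    accepted = pathlib.Path(tmp) / "accepted"
    accepted.mkdir()
    (accepted / "0001-purpose-and-guidelines.md").write_text(
        "# PDEP-1: Purpose and guidelines\n", encoding="utf-8"
    )
    (accepted / "notes.txt").write_text("ignored\n", encoding="utf-8")
    found = collect_local_pdeps(tmp)
```

Sorting the directory listing keeps the rendered list in PDEP-number order as long as file names start with a zero-padded number, which is the naming used in this PR.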


def get_callable(obj_as_str: str) -> object:
"""