Add iterators for non pdo drivers. #2718

Merged 12 commits on Sep 15, 2017

@jenkoian jenkoian commented May 8, 2017

Allows result sets to be iterated over without running out of memory on
large data sets.

--

Before: Iterating (e.g. foreaching) over a large data set would often run out of memory.
After: Iterating over a large data set uses iterators using the cursor so that memory usage remains low.

That's the aim, anyway. I tested the SQLSrv driver manually and it has the desired effect. I could do with some help testing the other drivers, if anyone is able. I would have liked to add some unit tests, but none of the existing code is unit tested and I'm not sure where I would even start; any help here would be appreciated too.

Small update: I have reverted the changes for the mysqli driver as I don't quite understand that driver enough, it seems to behave unlike any of the others, again any help here much appreciated. Also came across the DataAccessTest which should cover most of this, it certainly brought out the issues with the mysqli driver, so that's good.

Another update: I have removed all the stuff with iterators, favouring generators. This makes things a lot easier as we can just re-use the existing fetch functionality. Generators are not without their issues though as they cannot be re-used, see the discussion below for more info, in particular this comment.

Update (20170724): Due to concerns over consistency, both PDO and non-PDO statements can now be iterated but are not rewindable. A rewindable generator has been included for users who wish to use it (possibly this will be removed, though?).

Update (20170914): StatementIterator class created which is now used by all non PDO statements. Extracting to own class makes it easier to test.

@jvasseur

You could use a generator instead of creating classes for the iterators, it would simplify the code needed for the feature.

@jenkoian
Author

@jvasseur agreed that would be easier...as getIterator needs to return a Traversable though not sure it would work? Did consider an anonymous class here and indeed started with that but (can't remember why now) couldn't quite get it to work, although admittedly I didn't spend very long looking into why.

@jvasseur

The Generator class implements the Iterator interface so the return type of getIterator is not a problem.
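
As a minimal sketch of that point, assuming a hypothetical `ArrayBackedStatement` class that stands in for a driver statement (it is not part of DBAL): a `getIterator()` written with `yield` returns a `Generator`, which is an `Iterator` and therefore a `Traversable`, so it satisfies `IteratorAggregate`.

```php
<?php

// Hypothetical stand-in for a driver statement; not a DBAL class.
final class ArrayBackedStatement implements IteratorAggregate
{
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Mimics a cursor: returns the next row, or false when exhausted.
    public function fetch()
    {
        $row = current($this->rows);
        next($this->rows);

        return $row;
    }

    // Using yield makes this method return a Generator,
    // which is an Iterator and therefore a Traversable.
    public function getIterator(): Traversable
    {
        while (($row = $this->fetch()) !== false) {
            yield $row;
        }
    }
}

$statement = new ArrayBackedStatement([['id' => 1], ['id' => 2]]);
$iterator  = $statement->getIterator();
```

Only one row is held by the loop at a time, which is what keeps memory usage low compared with building an array via `fetchAll()`.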

@jenkoian
Author

@jvasseur wow, I did not know that! Thanks, will update PR, this makes it way easier.

@jenkoian
Author

@jvasseur updated PR, thanks! Will likely give it some time for feedback before rebasing so's we can all forget that whole iterator thing ever happened 😆 Have manually tested the SQL server implementation and it works.

@deeky666
Member

While the idea is good, using generators is a BC break when doing the following:

$result = $statement->getIterator();

foreach ($result as $row) {
}

// will error because generators cannot be reused
foreach ($result as $row) {
}

Not sure if we should care though. /cc @Ocramius
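
The behaviour described above can be reproduced without a database. This is only a sketch: `forwardOnlyRows()` is a made-up generator function standing in for the statement's iterator.

```php
<?php

// A generator function standing in for a statement's rows (hypothetical).
function forwardOnlyRows(): Generator
{
    yield ['id' => 1];
    yield ['id' => 2];
}

$result = forwardOnlyRows();

$firstPass = [];
foreach ($result as $row) {
    $firstPass[] = $row;
}

// The generator has finished, so a second traversal throws an Exception.
$secondPassError = null;
try {
    foreach ($result as $row) {
        // never reached
    }
} catch (Exception $e) {
    $secondPassError = $e;
}
```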

@Ocramius
Member

Ocramius commented May 17, 2017 via email

@morozov
Member

morozov commented May 18, 2017

@Ocramius could you elaborate a bit more, besides the technical possibility and BC, what could it be needed for?

Generators are non-rewindable on purpose, and I think the same applies to statements: as a developer, I'd rather be informed that I'm fetching data from the same statement twice and redesign my logic than silently fetch the same data again.

@Ocramius
Member

@morozov we could accumulate the data that was already iterated upon

@deeky666
Member

@morozov I tend to agree with your argumentation but TBH I still find it confusing that generators implement \Iterator although rewind() is definitely something that generators do not support. It might be a matter of taste but when I want an iterator from a statement and it tells me I get \Iterator, I expect it to be an object that implements the interface correctly. There is no word in the documentation about allowing objects not to be rewindable. IMO PHP has brought a lot of confusion with that to the userland and personally I am not a friend of pushing it further.

@Ocramius do you mean "caching" fetched data in the object while iterating? What is wrong with simply executing the statement again on rewind?

@Ocramius
Member

What is wrong with simply executing the statement again on rewind?

Side effects

@deeky666
Member

@Ocramius can you give an example? Fun fact: I have been working on a 3.0 version of the driver component for two weeks or so and have come up with this (draft, mysqli).

@Ocramius
Member

@deeky666 for example, a query that selects values from a sequence will allocate new sequence values, instead of retrieving the same ones over multiple calls.
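
A rough illustration of that side effect, using an in-memory counter in place of a real database sequence. `FakeSequence` and `executeSequenceQuery()` are hypothetical names invented for this sketch.

```php
<?php

// Hypothetical in-memory sequence; stands in for a database sequence.
final class FakeSequence
{
    private $current = 0;

    public function nextValue(): int
    {
        return ++$this->current;
    }
}

// Stands in for executing a query that selects from a sequence:
// every execution allocates a new value.
function executeSequenceQuery(FakeSequence $sequence): Generator
{
    yield ['value' => $sequence->nextValue()];
}

$sequence = new FakeSequence();

$firstRun = iterator_to_array(executeSequenceQuery($sequence), false);
// "Rewinding" by executing again allocates a new value instead of replaying.
$secondRun = iterator_to_array(executeSequenceQuery($sequence), false);
```

The two runs return different rows, so re-executing on `rewind()` would not behave like replaying the original result.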

@deeky666
Member

@Ocramius good point. That's reason enough for me. So then iterators should probably really be forward only here. I don't see how we could make them rewindable then without storing fetched rows in memory.

@jenkoian
Author

I've added a failing test case for the re-usable iterator problem.

@jenkoian
Author

I added an example of a RewindableGenerator to the mysqli driver after I noticed Symfony used a RewindableGenerator, feedback welcome. If this works I will look to roll out to the other drivers.

$data = $this->fetchAll();

return new \ArrayIterator($data);
$that = $this;

You can use $this inside the closure instead of doing that.

Author

Been working on a legacy codebase for too long...thanks

return $g();
}

public function count()

I don't think we need to implement the countable interface here.

Author

Yep, you're right, will remove 👍

@jenkoian jenkoian force-pushed the efficient-iterating branch from 57522c0 to 790a7ec Compare May 19, 2017 16:20

public function getIterator()
{
$g = $this->generator;
Member

return ($this->generator)();

Member

This would simply call the generator callable at each foreach () {} (because each foreach will cause a ->getIterator() call).

That's most likely what we do NOT want, as a query will be repeated.

The test case is as follows:

$result = someQuery();

foreach ($result as $value) {
}

foreach ($result as $value) {
}

self::assertCount(1, $executedQueries);
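
To make the concern concrete, here is a sketch of a wrapper that calls its generator factory on every `getIterator()` call. `FactoryBackedIterator` is a hypothetical name, and the counter stands in for the number of times the statement is executed.

```php
<?php

// Hypothetical wrapper: stores a generator factory and invokes it
// on every getIterator() call, i.e. once per foreach.
final class FactoryBackedIterator implements IteratorAggregate
{
    private $factory;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function getIterator(): Traversable
    {
        return ($this->factory)();
    }
}

$executedQueries = 0;

$result = new FactoryBackedIterator(function () use (&$executedQueries) {
    $executedQueries++; // stands in for executing the statement
    yield ['id' => 1];
});

foreach ($result as $value) {
}

foreach ($result as $value) {
}
```

With this design the counter ends at 2, so an assertion expecting a single executed query would fail.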

$j++;
}

$this->assertEquals($i, $j);
Member

The amount of executed queries is to be checked here - you can use something like SELECT CURRENT_TIME_IN_MICROSECONDS() (or equivalent - I made this SQL up) to verify that the result doesn't change over multiple calls.

Author

I've added a way of counting the number of executed queries using DebugStack take a look, if that doesn't quite get what we need I will do the microseconds thing.

Member

Argh, I just checked: the logger is only used outside statements, not inside them :-(

This means that the logger will only log the creation of the statement and its first execution.

A stateful query would most likely be better. Sorry for not noticing this before!

Author

Thanks for your reply and sorry for my delay. I've pushed something which I'm pretty sure is what you mean. However, I don't think this works. I can't seem to get microseconds. Only thing I can do to get around this is to add a sleep call 😞 . Any ideas?

Author

I think this is likely because of the MySQL version. 5.6 and up seems to support it. Travis is on a pre 5.6 version of MySQL IIRC although I think the new trusty boxes may be newer. So I guess we could update travis to build against the newer versions if we want (if this is indeed the issue!).

Member

We can do that, yes (using trusty on travis-ci). Ideally, we'd test this feature with SQLite in-memory though, unless we want something mysql-specific

Author

afaict, travis is using mariadb which I think supports microseconds, so think it should be fine?

On another note, I noticed the build is failing because PDO doesn't actually support rewinding the iterator (not by default, as I understand it; it may be possible with unbuffered queries). That made me question whether we even need to support this for non-PDO drivers or, if we do, whether we should wrap the PDO statements so they can be rewound too.

Member

@Ocramius Ocramius left a comment

Requires testing against stateful queries

@Ocramius
Member

On another note, I noticed the build is failing because PDO doesn't actually support rewinding the iterator (not by default, as I understand it; it may be possible with unbuffered queries). That made me question whether we even need to support this for non-PDO drivers or, if we do, whether we should wrap the PDO statements so they can be rewound too.

If an iterator cannot be rewound, we'll just need to build an accumulator and rewind ourselves (at the cost of memory): providing inconsistent features is worse than providing no features at all :-)

@jenkoian jenkoian force-pushed the efficient-iterating branch from b078876 to 926c94c Compare June 23, 2017 08:00
@jenkoian
Author

@Ocramius to clarify: if we're going to provide rewindable iterators for non-PDO drivers, should we add something to do the same for PDO drivers? If so, I can take a look at doing that.

@Ocramius
Member

we should add something to do the same for pdo drivers?

At least looking at what is feasible first.

Rewindable iterator with emulation means memory usage.
Rewindable iterator should NOT perform a query a second time.
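
One way those two constraints could be met is an accumulator that caches rows as they are pulled from a forward-only source and replays them on later passes. This is a sketch only: `AccumulatingIterator` is a hypothetical class, not the code in this PR, and its memory use grows with the number of rows consumed.

```php
<?php

// Hypothetical accumulator: replays rows already fetched and pulls new ones
// from the forward-only source on demand. The source is never restarted.
final class AccumulatingIterator implements IteratorAggregate
{
    private $source;
    private $cache = [];
    private $started = false;

    public function __construct(Iterator $source)
    {
        $this->source = $source;
    }

    public function getIterator(): Traversable
    {
        for ($i = 0; ; $i++) {
            if (!array_key_exists($i, $this->cache)) {
                // Advance the source only when this position is not cached yet.
                if (!$this->started) {
                    $this->source->rewind();
                    $this->started = true;
                } else {
                    $this->source->next();
                }

                if (!$this->source->valid()) {
                    return;
                }

                $this->cache[$i] = $this->source->current();
            }

            yield $this->cache[$i];
        }
    }
}

$executions = 0;
$rows = (function () use (&$executions) {
    $executions++; // stands in for executing the query
    yield 'a';
    yield 'b';
})();

$result = new AccumulatingIterator($rows);
$first  = iterator_to_array($result, false);
$second = iterator_to_array($result, false);
```

Both passes see the same rows while the underlying generator body starts only once; the price is that every consumed row stays in `$cache`.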

@jenkoian
Author

@Ocramius I've committed an example of a PDO iterator where we store a cache and can be rewound. Just wanted to remind ourselves of what the PR intentions were and where we're at now.

Initial: Non-PDO drivers were running out of memory when iterating over large data sets because fetchAll() was used, which essentially loads all results into memory. A generator was added to MysqliStatement (with a view to rolling it out to the other non-PDO drivers too) to solve this issue.

Iterator couldn't be reiterated or rewound: A RewindableGenerator was added so that the iterator could be rewound and thus iterated over multiple times.

PDO couldn't be reiterated: After adding an iterator that could be reiterated (rewound) for non-PDO drivers, I noticed that PDO drivers could not actually be reiterated. I'm not sure if this is a bug or by design.

Latest: I've added an example of a PDOStatementIterator, which wraps PDOStatement and makes it reiterable. This is done via a static cache (basically memoization), which means that iterating over large data sets will likely fill up memory.

So... we've basically gone from non-PDO drivers running out of memory when iterating over large data sets to PDO drivers now running out of memory when iterating over large data sets 😆

Anyway, let me know how you wish to proceed 😄

@Ocramius
Member

@jenkoian the alternative is that we make them ALL non-rewindable, and that's it. That means that the user is left the decision to call iterator_to_array() or equivalent, and potentially having the application run out of memory.

Thoughts?
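
For reference, the opt-in buffering mentioned above would look roughly like this on the caller's side. `forwardOnlyResult()` is a made-up generator standing in for a non-rewindable statement iterator.

```php
<?php

// Stands in for a forward-only statement iterator (hypothetical).
function forwardOnlyResult(): Generator
{
    foreach ([['id' => 1], ['id' => 2], ['id' => 3]] as $row) {
        yield $row;
    }
}

// Opting in to buffering: all rows are held in memory at once,
// but the resulting array can be traversed as often as needed.
$buffered = iterator_to_array(forwardOnlyResult(), false);

$ids = [];
foreach ($buffered as $row) {
    $ids[] = $row['id'];
}
foreach ($buffered as $row) {
    $ids[] = $row['id'];
}
```

The memory cost becomes an explicit choice by the caller rather than a hidden property of the statement.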

@jenkoian
Author

Options as I can see it are:

a) Consistent, BC Break
Make both PDO and NPDO (non PDO) rewindable, this is consistent but breaks BC for PDO as previously large data sets iterated over wouldn't suffer OOM.

b) Consistent, doesn't break BC
Make NPDO iterable without risk of OOM but not rewindable. Leave PDO as non rewindable.

c) Inconsistent, doesn't break BC
Make NPDO iterable without risk of OOM and rewindable. Leave PDO as non rewindable.

If you're looking for consistency, I would say option b is the best choice. IMO option c is the best, and the inconsistency can be tolerated given the shortcomings of PDO.

@Ocramius
Member

Ocramius commented Jul 7, 2017

b) Consistent, doesn't break BC
Make NPDO iterable without risk of OOM but not rewindable. Leave PDO as non rewindable.

Fairly sure we need this scenario then: can't afford OOM

Ian Jenkins added 7 commits September 14, 2017 08:32
…ons.

Only added to mysqli for now, can add to other drivers (assuming they
support resetting the pointer in some way) after feedback.
statements in a rewindable iterator. Skipped the test for iterating
multiple times with a message regarding that iterators are non
rewindable.
@jenkoian jenkoian force-pushed the efficient-iterating branch 2 times, most recently from 21e1740 to 581511e Compare September 14, 2017 07:39
Member

@morozov morozov left a comment

Only a few minor issues. Looks good overall.

public function testGettingIteratorDoesNotCallFetch()
{
$stmt = $this->createMock(Statement::class);
$stmt->expects($this->never())->method('fetch');
Member

I'd also add all other fetch* methods here (fetchAll(), fetchColumn()).
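
The property this test checks follows from how generators work: calling a generator function runs none of its body until iteration starts. Here is a sketch without PHPUnit, using a hypothetical `CountingStatementStub` and `lazyRows()` helper.

```php
<?php

// Hypothetical stub that counts fetch() calls; not a DBAL or PHPUnit class.
final class CountingStatementStub
{
    public $fetchCalls = 0;

    public function fetch()
    {
        $this->fetchCalls++;

        return false; // behaves like an empty result set
    }
}

// Generator-based iteration in the style discussed in this PR.
function lazyRows(CountingStatementStub $stmt): Generator
{
    while (($row = $stmt->fetch()) !== false) {
        yield $row;
    }
}

$stub = new CountingStatementStub();
$rows = lazyRows($stub);

// Creating the generator runs none of its body, so nothing is fetched yet.
$callsBeforeIteration = $stub->fetchCalls;

foreach ($rows as $row) {
}

$callsAfterIteration = $stub->fetchCalls;
```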

@@ -20,6 +20,8 @@
namespace Doctrine\DBAL\Driver\OCI8;

use Doctrine\DBAL\Driver\Statement;
use Doctrine\DBAL\Driver\StatementIterator;
use \IteratorAggregate;
Member

The leading namespace separator is redundant.


public function testIterationCallsFetchOncePerStep()
{
$values = ['foo', '', 'bar', '0', 'baz', 0, 'qux', null, 'quz', false];
Member

We need to add one more value after false to prove that fetching false terminates iteration, while other empty values don't (see original example).
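
A sketch of the behaviour being asked for, assuming the iteration loop compares strictly against `false`. `QueuedStatementStub`, `iterateUntilFalse()` and the trailing `'unreachable'` value are hypothetical and only illustrate the idea.

```php
<?php

// Hypothetical stub whose fetch() returns queued values, then false.
final class QueuedStatementStub
{
    private $values;

    public function __construct(array $values)
    {
        $this->values = $values;
    }

    public function fetch()
    {
        return $this->values ? array_shift($this->values) : false;
    }
}

// Strict comparison: only the boolean false ends iteration,
// while '', '0', 0 and null are yielded like any other value.
function iterateUntilFalse(QueuedStatementStub $stmt): Generator
{
    while (($value = $stmt->fetch()) !== false) {
        yield $value;
    }
}

$stub = new QueuedStatementStub(
    ['foo', '', 'bar', '0', 'baz', 0, 'qux', null, 'quz', false, 'unreachable']
);
$seen = iterator_to_array(iterateUntilFalse($stub), false);
```

A loose check such as `while ($value = $stmt->fetch())` would stop at the first empty string instead, which is the regression the extra value guards against.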

@jenkoian
Author

jenkoian commented Sep 15, 2017

Thanks @morozov - minor issues resolved. Anything else I need to do, or can we get this in? (I will rebase to a single commit once happy this is done) @Ocramius

@Ocramius Ocramius self-assigned this Sep 15, 2017
Member

@Ocramius Ocramius left a comment

Excellent work! It took a while, but we moved from a complex implementation to a really simple, clean and maintainable one. Thanks @jenkoian and @morozov!

@Ocramius
Member

🚢

@jenkoian jenkoian deleted the efficient-iterating branch September 16, 2017 10:26
morozov added a commit to morozov/dbal that referenced this pull request Jan 16, 2020
The method has more limitations and caveats than it provides real use:

1. It fetches all data in memory which is often inefficient (see doctrine#2718).
2. It fetches rows in memory one by one instead of using `fetchAll()`.
4. It doesn't allow to specify the statement fetch mode since it's instantiated internally.
5. It can be easily replaced with:
   ```php
   foreach ($conn->executeQuery($query, $params, $types) as $value) {
      yield $function($value);
   }
   ```
morozov added a commit to morozov/dbal that referenced this pull request May 27, 2020
morozov added a commit to morozov/dbal that referenced this pull request May 27, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 14, 2022