
[Bug] Blocks read inconsistent: expected xxx blocks, actual xxx blocks #124

Closed
xianjingfeng opened this issue Aug 4, 2022 · 13 comments

@xianjingfeng
Member

  1. If we set spark.rss.data.replica.write=2 and spark.rss.data.replica=3, data integrity cannot be guaranteed on any single shuffle server, right?
  2. But the method org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler#readShuffleData only reads from one shuffle server.
@frankliee
Contributor

frankliee commented Aug 4, 2022

Which version did you use?

Do you set spark.rss.data.replica.read=2? It ensures the bitmap metadata of blocks is written to 2 servers.

As long as the read client gets the metadata from 2 of the servers, it can check the integrity of the data from any one server.
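
For reference, a minimal sketch of the settings being discussed, set through a plain SparkConf (illustrative only; the key names come from this thread, the surrounding code is not taken from the Uniffle docs):

```java
import org.apache.spark.SparkConf;

public class ReplicaQuorumConfig {
  public static void main(String[] args) {
    // With N = 3 replicas, W = 2 write acks, R = 2 metadata reads,
    // W + R > N, so any read quorum overlaps the write quorum and the
    // client always sees a complete block bitmap.
    SparkConf conf = new SparkConf()
        .set("spark.rss.data.replica", "3")        // N: total replicas
        .set("spark.rss.data.replica.write", "2")  // W: write quorum
        .set("spark.rss.data.replica.read", "2");  // R: read quorum
  }
}
```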

@xianjingfeng
Member Author

Do you set spark.rss.data.replica.read=2

Yes

As long as the read client gets the metadata from 2 of the servers, it can check the integrity of the data from any one server.

But this step seems to execute before readShuffleData

@xianjingfeng
Member Author

Which version did you use?

internal version 0.5.0-snapshot

@frankliee
Contributor

Do you set spark.rss.data.replica.read=2

Yes

As long as the read client gets the metadata from 2 of the servers, it can check the integrity of the data from any one server.

But this step seems to execute before readShuffleData

The metadata is acquired in advance, but the data integrity check is executed only after all blocks have been fetched.
In the current implementation, the client will only fetch from “the first available” server to avoid extra read cost.
But when the data on this first server is damaged, the final check will report "read inconsistent".
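
To illustrate (a simplified sketch, not the actual Uniffle code; block IDs are simulated as plain longs instead of the real bitmap classes):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: blocks come from a single replica, and the
// expected-vs-actual check happens after all fetches complete.
public class ReadPathSketch {
  public static void main(String[] args) {
    Set<Long> expected = Set.of(1L, 2L, 3L);          // bitmap metadata from the read quorum
    List<Long> firstReplicaBlocks = List.of(1L, 2L);  // block 3 is damaged on this replica

    Set<Long> processed = new HashSet<>(firstReplicaBlocks);

    // A damaged first replica only surfaces here, at the final check.
    if (!processed.equals(expected)) {
      throw new IllegalStateException("Blocks read inconsistent: expected "
          + expected.size() + " blocks, actual " + processed.size() + " blocks");
    }
  }
}
```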

@xianjingfeng
Member Author

I know, but the application will fail

@jerqi
Contributor

jerqi commented Aug 4, 2022

Do you set spark.rss.data.replica.read=2

Yes

As long as the read client gets the metadata from 2 of the servers, it can check the integrity of the data from any one server.

But this step seems to execute before readShuffleData

The metadata is acquired in advance, but the data integrity check is executed only after all blocks have been fetched. In the current implementation, the client will only fetch from “the first available” server to avoid extra read cost. But when the data on this first server is damaged, the final check will report "read inconsistent".

This implementation feels a bit unreasonable to me. Should we read from the next shuffle server when the data isn't complete?
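
Something like this rough sketch (all names hypothetical; each supplier stands in for a per-replica readShuffleData call):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical fallback sketch: keep the set of still-missing block IDs
// and move on to the next replica when the current one cannot supply them all.
public class FallbackSketch {
  static void readWithFallback(Set<Long> expected, List<Supplier<Set<Long>>> replicas) {
    Set<Long> missing = new HashSet<>(expected);
    for (Supplier<Set<Long>> replica : replicas) {
      missing.removeAll(replica.get()); // blocks already read are skipped naturally
      if (missing.isEmpty()) {
        return; // all expected blocks recovered
      }
    }
    throw new IllegalStateException(
        "Blocks read inconsistent: " + missing.size() + " blocks still missing");
  }
}
```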

@xianjingfeng
Member Author

This implementation feels a bit unreasonable to me. Should we read from the next shuffle server when the data isn't complete?

I am trying to do this, and I think it needs to be fixed together with #108.

@frankliee
Contributor

I would be happy to review this PR, but you should avoid fetching redundant blocks from another server (because Spark has already consumed those blocks).
RSS provides some skipping mechanisms for localfile and HDFS.
But I'm worried about memory data. @jerqi
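
For example, a hedged sketch of that skipping idea (hypothetical names, not the existing mechanism): when falling back to another replica, drop segments whose block IDs were already consumed so no record is processed twice.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: filter out segments Spark has already consumed
// before handing a fallback replica's data to the reader.
public class SkipConsumedSketch {
  record Segment(long blockId, byte[] data) {}

  static List<Segment> filterConsumed(List<Segment> fromNextReplica, Set<Long> consumed) {
    return fromNextReplica.stream()
        .filter(seg -> !consumed.contains(seg.blockId()))
        .collect(Collectors.toList());
  }
}
```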

@jerqi
Contributor

jerqi commented Aug 4, 2022

I would be happy to review this PR, but you should avoid fetching redundant blocks from another server (because Spark has already consumed those blocks). RSS provides some skipping mechanisms for localfile and HDFS. But I'm worried about memory data. @jerqi

In my opinion, memory data should also have the data-skipping ability, and our memory-read process should be optimized.

@xianjingfeng
Member Author

Got it.

@frankliee
Contributor

This will change the server's memory storage to add an "index" like HDFS.
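
Roughly, one possible shape for such an index (a hypothetical sketch, not a design proposal):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: map each blockId to its slice inside the in-memory buffer
// so a reading client can locate, or skip, individual blocks, much like
// the index files used for HDFS/localfile storage.
public class MemoryIndexSketch {
  record Slice(int offset, int length) {}

  private final Map<Long, Slice> index = new HashMap<>();

  void onBlockAppended(long blockId, int offset, int length) {
    index.put(blockId, new Slice(offset, length));
  }

  // Returns null when the block is absent and can be skipped.
  Slice locate(long blockId) {
    return index.get(blockId);
  }
}
```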

@jerqi
Contributor

jerqi commented Aug 4, 2022

This will change the server's memory storage to add an "index" like HDFS.

This problem should be discussed in another issue, and we should also have a simple design doc.

jerqi pushed a commit that referenced this issue Nov 28, 2022
### What changes were proposed in this pull request?
Add fallback mechanism for blocks read inconsistent

### Why are the changes needed?
When the data on the first server is damaged, the application will fail. #124 #129

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tests already added.
@jerqi
Contributor

jerqi commented Nov 28, 2022

closed by #276

@jerqi jerqi closed this as completed Nov 28, 2022