Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Novel constructions for Proof-of-Replication #4

Closed
nicola opened this issue Mar 29, 2018 · 14 comments
Closed

Novel constructions for Proof-of-Replication #4

nicola opened this issue Mar 29, 2018 · 14 comments

Comments

@nicola
Copy link

nicola commented Mar 29, 2018

Work in Progress: This is a work in progress. For comments and suggestions contact us at research@protocol.ai

We as Protocol Labs actively support these areas of research with grants, bounties and direct collaborations. We plan to fund research related to these open problems. Reach out if you want to work on or are working on these problems.


Proof-of-Replication

Proof-of-Replication (PoRep) is a new kind of Proof-of-Storage, that can be used to show that some data D has been replicated to its own uniquely dedicated physical storage. This document presents Proof-of-Replication and its related open problems.

Introduction

This document presents Proof-of-Replication (PoRep), motivates its use cases, provides a definition and outlines the list of related open problems. For a more formal presentation of PoRep schemes, please refer to our Technical Report on Proof-of-Replication.

Motivation

Centralized Cloud. The current cloud infrastructure operates in a centralized setting: the cloud provider is trusted by the client to offer their service correctly, for example storing data, or perform computation. The centralized setting makes it difficult for a client to spot malfunctioning and malicious behaviors. This leads to a cloud infrastructure where only trusted providers participate.

Untrusted Cloud. In the past decade there has been extensive work on delegating computation and storage in an untrusted cloud setting. If the cloud provider operations were to be verifiable, then clients would not need to trust the providers. The intuition is that a provider must generate a proof for their the service that a client can verify. In a naive scenario, a client could store the data on a cloud storage provider and challenge the provider to serve the entire data back in order to verify that the storage provider is still storing the data. However, this straw-man solution would be impractical for standard cloud usage. Practical solutions rely on cryptographic assumptions where the storage provider generates a short cryptographic proof that the client can cheaply verify. There are several cryptographic proof systems for proving storage, we refer in particular Proof-of-Retrievability (PoR) and Proof-of-Data-Possession (PDP).

Decentralized Storage Networks. Decentralized Storage Networks are peer-to-peer systems where nodes (either altruistically or with incentives) store data of other peers. Similar to the untrusted cloud setting, client nodes verify cryptographic proofs rather than relying on trust to ensure that other nodes are storing their data.

Proof-of-Replication. Current schemes only guarantee that the storage provider had the file at the time the proof was generated. However, this does not guarantee that the file was stored through time and that some storage was dedicated to that particular file. In addition, in systems where providers are rewarded proportionally to their storage (e.g. the mining reward in Filecoin), current existing schemes cannot be used, since storage providers can lie about their storage. Proof-of-Replication is a novel proof system, where storage providers cannot lie about storing files, since they would have to store a unique physical permutation (that we refer to as replica) for each file or copy they claim to store.

Note: For a more detailed use-cases for Proof-of-Replication, please refer to our Technical Report (Section 1.3).

Problems with current PoR and PDP schemes

State-of-the-art Proof-of-Retrievability (PoR) and Proof-of-Data-Possession schemes used today in verifiable cloud storage and in decentralized storage networks, only guarantee that a provider had possession of the data at the time of the challenge/response interaction and they are subject to the following three attacks:

  • Sybil Attack: Malicious storage providers could pretend to store (and get paid for) more copies than the ones physically stored by creating multiple Sybil identities, but storing the data only once.
  • Outsourcing Attack: Malicious storage providers could commit to store more data than the amount they can physically store, relying on quickly fetching data from other storage providers.
  • Generation Attack: Malicious storage providers could claim to be storing a large amount of data which they are instead efficiently generating on-demand using a small program. If the program is smaller than the purportedly stored data, this inflates the malicious storage provider’s likelihood of winning a block reward in Filecoin, which is proportional to the storage provider’s storage currently in use.

Note: For a more detailed survey of the different Proofs-of-Storage, please refer to our Technical Report (Section 1.2)

Problem Definition

Proof-of-Replication allows a storage provider to convince a user that some data D has been replicated to its own unique storage. This definition is stricter than PoR and PDP, since it forces the prover to store a unique copy of the file.

Definition

Proof-of-Replication enables a prover P to convince a verifier V that P is storing a replica R, a physically independent copy of some data D, unique to P. The scheme is defined by a tuple of polynomial time algorithms (Setup, Prove, Verify). The assumption is that generation of a replica after Setup must be difficult (if not impossible) to generate.

  • Setup: On setup, either a party or both (depending on the scheme) generate a unique permutation of the original data D, which we call replica R.
  • Prove: On receiving a challenge, the prover must generate a proof that it is in possession of the replica and that it was derived from data D. The prover must only be able to respond to the challenge successfully if it is in possession of the replica, since would be difficult (if not impossible) to generate a replica that can be used to generate the proof at this stage
  • Verify: On receiving the proof, the verifier checks the validity of the proof and accepts or rejects the proof.

Note: For a formal definition of Proof-of-Replication, please refer to our Technical Report (Section 2).

Time-bounded Proof-of-Replication (Visual)

Timing assumption. Time-bounded Proof-of-Replication are constructions of PoRep with timing assumptions. The assumption is that generation of the replica (hence the Setup) takes some time t that is substantially larger than the time it takes to produce a proof (hence time(Prove)) and the round-trip time (RTT) for sending a challenge and receiving a proof.

Distinguishing Malicious provers. A malicious prover that does not have R, must obtain it (or generate it), before the Prove step. A verifier can distinguish an honest prover from a malicious prover, since the malicious one will take too long to answer the challenge. A verifier will reject if receiving the proof from the prover takes longer than a timeout (bounded between proving time and sealing time).

The following figure shows the different steps of the Proof of Replication on a timeline. We refer to the the process of generating a replica as "sealing".

Note: For a formal definition of Time-bounded Proof-of-Replication, please refer to our Technical Report (Section 2.1).

Beyond Time-bounded Proof-of-Replication

There exist other Proof-of-Replication schemes that do not rely on timing assumption but instead of trust assumption. For example, the replica could be generated via a multi-party computation with the assumption that all the involved parties will not collude in the future.

Open Problems

While there are some candidate constructions of the Proof-of-Replication, there are still improvements for making such constructions more practical. An ideal PoRep construction should have as many of the following properties as possible.

Desirable properties

  • Transparency: There is no such extra information that a prover can use to generate valid PoRep proofs without storing the replica itself during Prove.
  • Practical Slow Permutation: If Prove time takes time(Prove) and round-trip time is RTT, then the theoretical lower bound for time(Setup) to be secure is RTT+time(Prove). A slow permutation is practical if the time gap between time(Setup) and RTT+time(Prove) is small without breaking security. The smaller the time gap, the more practical the permutation.
  • Short Verification Time: The verification time is short if it is constant in the security parameter O(λ), for a small constant, or sublinear in the size of the data. (Note: For a reasonable data size, e.g. 10GB, a constant verification might concretely take longer than a sublinear verification)
  • Short Proofs: The proofs are short if they are constant in the security parameter O(λ), for a small constant, or sublinear in the size of the data (Note: For a reasonable data size, e.g. 10GB, a constant proof size might be concretely larger than a sublinear proof size)
  • Non-parallelizable Permutation: The permutation should not be parallelizable, a prover should not get any advantage by adding computational power. This is important for the Filecoin network, miners should get their returns proportional to their storage rather than their power. (Note: we allow for exploration of Rational Proof-of-Replication, where parallelization is allowed, however it is more expensive to generate proofs).
  • Fast Parallelizable Decoding: The algorithm for retrieving the original data from the replica can be parallelized.
  • No Size Expansion: The size of the replica should be equivalent (or a constant) to the size of the original data.

Concrete Open Problems

  1. New PoRep constructions: Are there constructions of a transparent PoRep with practically slow permutation, fast verification time, short proofs and parallelizable decoding? (see current Proof-of-Replication construction). In addition, are there radically different constructions that achieve the same goals?
  2. PoRep without timing assumption: Are there other PoRep schemes that are not time-bounded? If so, are there PoRep that do not require a trusted party to generate the replica? Would they work in the Filecoin setting?
  3. Aggregating Proofs: In the setting of Filecoin, we require miners to repeatedly report valid Proof-of-Replication. Is there a practical way to aggregate and or sequentially compose multiple proof of replication (maintaining short proof size and fast verification)? (For more details, see Proof-of-Spacetime Novel constructions for Proof-of-Spacetime #5)

Useful Readings

  • Filecoin Whitepaper (paper)
  • Proof of Replication (paper)
  • Filecoin Proof of Replication motivation (video)
  • Proof of Replication Construction (video, slides)
@nicola nicola changed the title Improved Proof-of-Replication Novel constructions for Proof-of-Replication Mar 29, 2018
@mdetemad
Copy link

It seems that the given PoRep constructions support only static data. What about the updates?

@nicola
Copy link
Author

nicola commented May 11, 2018

@mohammatt, currently you would have to send the updated data (or the diffs) to the prover, both of you must agree on the hash of the data, and you must re-perform the seal.

It would be interesting to look into PoRep with a more efficient data update!

@Ayms
Copy link

Ayms commented May 14, 2018

Sorry to insist with that question, but what prevents a peer A or a group of peers colluding together, having zero storage to pretend storing data D just by sending to a single entity B the PoRep setup for each of them, so B will do the job and store the replicas for data D then get from A (and others) the challenges and reply back via A (and others)?

B will indeed be storing n non deduplicated replicas of data D, but still those that are pretending storing data D are not the right ones

The incentive to do this might be that it's cheaper/easier to store in one big storage than n different ones, or to hide the real location of some contents

@nicola
Copy link
Author

nicola commented May 14, 2018

Hey thanks for your question!

Say that A and B both promise you to store your file F1. Instead of storing it themselves they give to to C. Now, in order to reply PoRep challenges, A and B proxy the challenges to C and get respectively their proofs and send them to you.

Now:

  • If we use PoRep, there is no way that C can store only one copy of F1 (PoRep enforces that there is a unique independent copy for each of the data requests), so A and B don't really get any advantage from storing F1 with C (since they must provide space for F1 either on their machines or on C machine)
  • Now, PoRep by itself only guarantees physically independent copies. There are two ways to go around making sure that your copies are on two different locations:
    • Have a Proof of Location (!) so the prover must send a proof of location and a proof of replication.
    • Make it rational for the prover to not delegate storage: e.g. (1) adding retrieval rewards, C has no incentive two storage two copies of the same file, since C wants to optimize retrieval rewards income by having more files instead of multiple copies of one, (2) putting a collateral for loosing files, now A and B would rather store the files themselves or at least not delegating it to the same provider

@Ayms
Copy link

Ayms commented May 14, 2018

So I am correct and this is a serious issue I think, because A and B (in your example) are sybils

Indeed the incentive does not look completely obvious and C is still storing two copies, now I gave two examples and probably people will find others, A and B could just be home users, cheating and in addition slowing down the network, and from a legal standpoint illegal content could be deemed to be stored by A and B, while C is storing it

Proof of retrieval might not solve this, proof of location, yes, now how to do it? Maybe like I sketched already, even if a kind of brutal (and maybe trivial but I don't see other means for now): overload periodically and sequencially one of the provers, until it starts failing the challenges, challenge the others, those that fail the test are cheating and banned

@mdetemad
Copy link

I agree with @Ayms, PoRep does not prevent outsourcing the outsourced data.
Also, what does the "physically independent copies" mean?

@nicola
Copy link
Author

nicola commented May 16, 2018

If A and B are asked to store the same file, they can outsource to C, but C has to store twice the same file since A and B must be able to prove that they have independent physical copies.

Outsourcing is not prevented in the sense that you cannot outsource your data, but it's prevented in the sense that you cannot rely on someone else storing the data that you should be storing. In other words, if A and B both are meant to store the same file, at any point in time in the network, there must be two copies, A cannot just rely on the copy that B has.

As long as A and B have a way to store their own copies, then, they are not cheating!

@mdetemad
Copy link

@nicola, I think we are talking about different things. What you are saying can be achieved using different replicas/encodings. I.e., whether A and B are storing the data or moving it onto C, there are always two (or whatever number is agreed on) copies. My question is about the "physically independent copies". Does it mean that the two copies are guaranteed to be stored on different machines?

@ZenGround0
Copy link

ZenGround0 commented May 16, 2018

@mohammatt this looks like a problem of lack of shared definitions. I hope this helps.

From @nicola's comments above:

Now, PoRep by itself only guarantees physically independent copies. There are two ways to go around making sure that your copies are on two different locations

"Physically independent copies" means that every bit of each copy is physically stored on hardware somewhere. PoRep does not ensure that the bits of both copies are kept on units of hardware that satisfy any separation requirements (e.g. different disks, different disks controlled by separate machines). @nicola refers to this separate problem above as "making sure that your copies are on two different locations." This concern is certainly useful to think about, but it is separate from the PoRep abstraction's goals.

Yes, you can "outsource" the data in the sense that a single machine A can store replicas 1 and 2. No you cannot outsource the data in the sense @nicola defines in the first comment because machine A cannot store a single copy, say 1, and be able to prove that it is storing copies 1 and 2.

--edit--
This last paragraph is phrased poorly and caused some confusion below. It was an attempt to highlight that outsourcing has a precise meaning in this context. The second paragraph of @nicola's previous comment stated what I was going for more clearly.

@Ayms
Copy link

Ayms commented May 16, 2018

Yes, you can "outsource" the data in the sense that a single machine A can store replicas 1 and 2. No you cannot outsource the data in the sense @nicola defines in the first comment because machine A cannot store a single copy, say 1, and be able to prove that it is storing copies 1 and 2.

???, no, the outsourcing issue can have a cascading effect, here A (1) is better C who delegated 2 (from B) to D, then C can prove that it is storing 1 (from himself) and 2 (from D) when A (1) and/or B (2) get challenged

@mohammatt states that we can't know if A is storing on different devices or just one that might crash, which indeed is misleading in terms of wording in the specs and should be more clear

As well as you should highlight clearly my concern or put it as a problem to solve

If we take IPFS alone, this is not an issue, what we want is just to retrieve the pieces, we don't care if the peers are delegating storage, using only one device, or whatever they do

Now, IPFS alone might not fly, all P2P networks failed because of the lack of incentive for people to run peers/nodes, except bittorrent where the incentive is mostly to download "illegal" copyrighted content and where everybody freeride but not completely in fact, then the network succeeds to sustain itself

That's why there is Filecoin, now the peers are rewarded to store data, from my standpoint if they store everything on only one device that crashes without backup, then it is their problem, we can expect that others are storing the content too, but if we can't be sure about whom is storing what, then that's a different story, because sybils will be rewarded just to do quasi nothing, and it's incoherent with the Filecoin concepts to reward/pay those that are storing

I gave a fraudulent case off this thread, a simple one is the same as above: a group colluding to buy more storage from only one server and sharing the lower costs than having several servers (btw this extends @mohammatt remark too, everything is most likely stored on one physical device only, maybe raid duplicated, but still on one central machine...), a bit like a mining pool but much worse, members just do nothing

If this can't be considered as a major issue then maybe we could wonder why this was one of my first thoughts and just hope that I am the only one in the world to have such bad intents

@nicola
Copy link
Author

nicola commented May 16, 2018

Hey thanks for re-iterating, but I don't understand your attack, I am sorry :(

sybils will be rewarded just to do quasi nothing [..]
members just do nothing

  1. Let me clarify why "members can't just do nothing" is incorrect.
    I will go through the cases, tell me if there is anything missing, and let's try to use the same language:
  • if A and B promise to store the same file
    • if they do "nothing", then they fail to show the proofs when challenged
    • if they give the file to C, then C must store two distinct replicas; in this case:
      • either A and B are paying C off-band and this is fine!
      • or A and B are getting free storage from C (good luck finding a nice C!)
      • in the above two cases, A and B are basically reselling storage, but they also delegate their trust to C
      • if C stores zero files or only one file, then A, B will not be able to provide valid proofs (they need to provide valid proofs for each of copy of the data)!

So Proof-of-Replication guarantees that at any point in time, if A and B promised to find a way to store 2 copies of your data, then there must be two copies sitting around somewhere or A and B will lose their collateral. In this setting, there is no way A and B can lie about the total amount of storage that they can get access to (either their own, found for free somewhere or by paying someone)

if A is storing on different devices or just one that might crash, which indeed is misleading in terms of wording in the specs and should be more clear

  1. Correct, if you find anywhere that PoRep gives guarantee about "two separate distinct hard drives", please report that, since as you are pointing out, that's incorrect.

a group colluding to buy more storage from only one server and sharing the lower costs than having several servers

  1. This is true, however, they will all rely on this one server to be up online and doing the proofs and that all the users in this server don't screw up on each other. This is an equivalent of relying on C.

Now, what you are saying is correct, A and B can go to Amazon and pay amazon to store the files they are meant to store (C=Amazon). However, in this case, they need to pay Amazon, however, they can only pay this strategy rationally only if the they make a profit, meaning that the cost of storage in Filecoin is higher than Amazon.

I hope this clarify!

@Ayms
Copy link

Ayms commented May 16, 2018

Apparently it's difficult for the team to admit the issue and your are playing with words, while you already admitted it from the previous posts

"same" below = same person or colluding ones under the control of one "same"

Hey thanks for re-iterating, but I don't understand your attack, I am sorry :(

Too bad :-)

sybils will be rewarded just to do quasi nothing [..]
members just do nothing

Let me clarify why "members can't just do nothing" is incorrect.

Please see below

I will go through the cases, tell me if there is anything missing, and let's try to use the same language:

Yes, then please you guys stop changing what is doing A,B and C every post, or stop presenting irrelevant situation for them

if A and B promise to store the same file
if they do "nothing", then they fail to show the proofs when challenged
if they give the file to C, then C must store two distinct replicas; in this case:
either A and B are paying C off-band and this is fine!
or A and B are getting free storage from C (good luck finding a nice C!)
in the above two cases, A and B are basically reselling storage, but they also delegate their trust to C
if C stores zero files or only one file, then A, B will not be able to provide valid proofs (they need to provide valid proofs for each of copy of the data)!

Doing nothing === just relaying the data, which indeed is doing nothing in terms of storage, I thought that this was implicit since this is what we are talking about

A,B and C are the same, it's not supposed to be difficult to understand

So Proof-of-Replication guarantees that at any point in time, if A and B promised to find a way to store 2 copies of your data, then there must be two copies sitting around somewhere or A and B will lose their collateral. In this setting, there is no way A and B can lie about the total amount of storage that they can get access to (either their own, found for free somewhere or by paying someone)

Yes, this is the umpteenth time that this is repeated here, this is all what PoRep does, insuring a non deduplicated storage by "we don't know whom"

if A is storing on different devices or just one that might crash, which indeed is misleading in terms of wording in the specs and should be more clear

Correct, if you find anywhere that PoRep gives guarantee about "two separate distinct hard drives", please report that, since as you are pointing out, that's incorrect.

"physically independent copies" is unclear, cf above posts (not from myself)

a group colluding to buy more storage from only one server and sharing the lower costs than having several servers

This is true, however, they will all rely on this one server to be up online and doing the proofs and that all the users in this server don't screw up on each other. This is an equivalent of relying on C.

Please see above, they are the same again, if one server is not enough you can just add others

Now, what you are saying is correct,

Of course

A and B can go to Amazon and pay amazon to store the files they are meant to store (C=Amazon). However, in this case, they need to pay Amazon, however, they can only pay this strategy rationally only if the they make a profit, meaning that the cost of storage in Filecoin is higher than Amazon.

This is an "image" of the fraudulent case that I sent off this thread, funny that you mention it, do you really think that A and B hiding behind an anonymizer network would store on Amazon?

But indeed the incentive to store on Amazon vs Filecoin rewards might be unlikely, as unlikely as storing alone probably, then back to the mining pool collusion

I hope this clarify!

No, but as already suggested you can just mention it clearly as a non issue in the specs, so everybody is aware of it

Are you sure that you guys know how things are working in reality or are you just playing here and trying to feign/elude the problem?

@miyazono miyazono added the open RFP see https://github.com/protocol/research-rfps label May 22, 2018
@miyazono miyazono removed the open RFP see https://github.com/protocol/research-rfps label Jul 9, 2018
@jpeg07
Copy link
Contributor

jpeg07 commented Jan 28, 2022

Closing issue. Current discussions better routed to https://github.com/protocol/cryptonetlab

@jpeg07 jpeg07 closed this as completed Jan 28, 2022
@Ayms
Copy link

Ayms commented Jan 29, 2022

Still outstanding to see that anonymity is still nowhere in your roadmap/issues and/or Web 3 stuff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants