
[DataCap Refresh] <4th> Review of <TOPPOOL> #255

Open · TOPPOOL-LEE opened this issue Dec 13, 2024 · 14 comments
Labels: Diligence Audit in Process (Governance team is reviewing the DataCap distributions and verifying the deals were within standards) · Refresh (Applications received from existing Allocators for a refresh of DataCap allowance)

TOPPOOL-LEE commented Dec 13, 2024

Basic info

  1. Type of allocator: [manual]

  2. Paste your JSON number: [https://github.com/v5 Notary Allocator Application: TOP POOL notary-governance#1046]

  3. Allocator verification: [yes]

  4. Allocator Application

  5. Compliance Report

  6. Previous reviews

Current allocation distribution

Client name | DC granted | Status
Cell Painting Gallery | 2 PiB | old
Stanford University | 1.5 PiB | old
Pangeo Community | 1 PiB | old
U.S. National Library of Medicine | 0.25 PiB | new

I. Cell Painting Gallery

  • DC requested: 5 PiB
  • DC granted so far: 4.94 PiB

II. Dataset Completion
https://registry.opendata.aws/cellpainting-gallery/
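As a quick spot check (a hedged sketch; s3://cellpainting-gallery is the public bucket name taken from the registry page above), the dataset can be listed anonymously with the AWS CLI:

# List the top-level prefixes of the Cell Painting Gallery open-data bucket
aws s3 ls --no-sign-request s3://cellpainting-gallery/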

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?
Initial:
f02984331 - Singapore
f02883857 - Singapore
f02852273 - United Kingdom
f02973061 - Russia
f02889193 - Vietnam

Final:
f02984331 - Singapore
f02883857 - Singapore
f02852273 - United Kingdom
f02973061 - Russia
f02889193 - Vietnam
f01996719 - China

IV. How many replicas has the client declared vs how many been made so far:
6 vs 7. The client added one SP and disclosed it, which better meets the three-continent requirement.

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID | % retrieval | Meets the >75% retrieval target?
f02852273 | 55.67 | No
f02984331 | 42.24 | No
f02883857 | 43.07 | No
f02973061 | 52.64 | No
f02889193 | 44.83 | No
f01996719 | 38.17 | No

We are very concerned about these SPs' retrieval rates and have reminded them several times.
We approved this allocation only after they passed our random sector-retrieval testing (a sketch of such a spot check follows the links below).
TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#42 (comment)
TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#42 (comment)
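For reference, a minimal sketch of the kind of manual spot check described above, assuming the lassie retrieval client and the go-car CLI are installed; the payload CID below is a placeholder standing in for one sampled from the client's deals:

# Fetch a randomly sampled payload CID from the network into a CAR file
# (bafybei... is a placeholder payload CID, not a real deal)
lassie fetch -o sample.car bafybeiplaceholderpayloadcid
# Unpack the CAR and inspect the retrieved content
car extract -f sample.car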

I. Stanford University

  • DC requested: 8 PiB
  • DC granted so far: 6.25 PiB

II. Dataset Completion
http://storage.googleapis.com/thumos14_files/UCF101_videos.zip
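Since the size of this sample becomes a point of contention later in the thread, it can be checked without a full download (a sketch using the URL above; it assumes the host reports Content-Length):

# Print the archive's size in bytes from the HTTP response headers
curl -sI http://storage.googleapis.com/thumos14_files/UCF101_videos.zip | grep -i content-length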

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?
Initial:
f01422327, Japan
f02252024, Japan
f02252023, Japan
f01111110, Vietnam
f01909705, Vietnam
f03232064, Malaysia
f03232134, Malaysia

Final:
f01422327, Japan
f02252024, Japan
f02252023, Japan
f01111110, Vietnam
f01909705, Vietnam
f03232064, Malaysia
f03232134, Malaysia

IV. How many replicas has the client declared vs how many been made so far:
7 vs 7. Although the client entered 4 replicas on the application form, the SPs the client actually works with match the 7 disclosed SPs.
We have reminded them twice to correct the replica count and will continue to urge them.

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID | % retrieval | Meets the >75% retrieval target?
f03232064 | 89.73 | Yes
f03232134 | 86.93 | Yes
f01422327 | 55.67 | No
f02252024 | 91.02 | Yes
f02252023 | 1.56 | No
f01111110 | 52.07 | No
f01909705 | 45.34 | No

I. Pangeo Community

  • DC requested: 7 PiB
  • DC granted so far: 1.75 PiB

II. Dataset Completion
aws s3 ls --no-sign-request s3://cmip6-pds/
aws s3 ls --no-sign-request s3://esgf-world/

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?
Initial:
f03231154, Hong Kong
f03228906, China
f03157910, Shenzhen, China
f03157905, Shenzhen, China
f03218576, US
f03215853, US
f01025366, Qingdao, China
f0122215, Qingdao, China

Final:
f03231154, Hong Kong
f03218576, US
f03215853, US
f03157905, Shenzhen, China

We have made recommendations to the client on SP retrieval rates and SP cooperation.
TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#53 (comment)

IV. How many replicas has the client declared vs how many been made so far:
8 vs 4

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID | % retrieval | Meets the >75% retrieval target?
f03231154 | 60.06 | No
f03218576 | 45.66 | No
f03215853 | 8.77 | No
f03157905 | 27.19 | No

I. U.S. National Library of Medicine

  • DC requested: 14 PiB
  • DC granted so far: 0.25 PiB

II. Dataset Completion
https://ftp.ncbi.nih.gov/biosample/biosample_set.xml.gz

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

IV. How many replicas has the client declared vs how many been made so far:

V. Please provide a list of SPs used for deals and their retrieval rates

First allocation round; no CID bot data is available yet.

Allocation summary

  1. Notes from the Allocator
    We asked Sloan Digital Sky Survey why they use VPNs and closed their application.
    We started supporting new datasets ([DataCap Application] <Byte Tunneling> - <ByteTunneling_data_store_bc_fil_02> TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#58) and rejected applicants re-storing already-stored datasets ([DataCap Application] Digital Earth Africa TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#67).
    Of the four clients in this round, two store new datasets.

  2. Did the allocator report up to date any issues or discrepancies that occurred during the application processing?
    Yes. We pay close attention to clients' applications and operations and to the governance team's comments, and we work according to the governance team's suggestions.
    We notice changes in clients' operations quickly, including changes in retrieval rates and errors caused by network upgrades, and we urge clients to keep improving.
    When the bot has a bug, we report it to the technical team promptly.
    When we suspect a loophole in a client's technical solution, we compare it against the technical solutions of multiple clients.
    When we are unsure whether the Spark retrieval system has been updated, we manually spot-check sectors at random.

  3. What steps have been taken to minimize unfair or risky practices in the allocation process?
    1. When we find that a client is not trustworthy, we reject them directly.
    2. When we are not sure whether a client is trustworthy, we slow down, so that the amount of DataCap we grant them stays relatively small.
    3. We pay close attention to the bot data. When the bot report does not appear, we wait for it, and we continue support only if the CID report is intact.

  4. How did these distributions add value to the Filecoin ecosystem?
    We started looking for new datasets; we now have two and will continue.

  5. Please confirm that you have maintained the standards set forward in your application for each disbursement issued to clients and that you understand the Fil+ guidelines set forward in your application
    Yes

  6. Please confirm that you understand that by submitting this Github request, you will receive a diligence review that will require you to return to this issue to provide updates.
    Yes

filecoin-watchdog added the labels Refresh (Applications received from existing Allocators for a refresh of DataCap allowance) and Awaiting Governance/Watchdog Comment (DataCap Refresh requests awaiting a public verification of the metrics outlined in Allocator App.) on Dec 16, 2024
Kevin-FF-USA (Collaborator) commented:

Hi @TOPPOOL-LEE,
Just wanted to send you a quick thank-you note; we appreciate you compiling this into the form for faster review. The next step will be a community comment or a Watchdog account summary.

TOPPOOL-LEE (Author) commented:

Hello @Kevin-FF-USA,
Glad to receive your reply. We would like to attend the meeting. Could you please reserve time for us? Thank you.

filecoin-watchdog (Collaborator) commented:

@TOPPOOL-LEE
Pangeo

  • Even though the allocator requested the client nearly three weeks ago to add more Storage Providers (SPs) and expand the distribution, this has not yet been implemented. Currently, the data is distributed across two geopolitical regions and three countries, whereas the allocator’s requirements (as stated in their allocation application) specify that the data must be stored on three different continents and in five different countries. These conditions have not been met so far.
    What is the allocator’s strategy to address this issue?
    Will cooperation with the client be discontinued if the conditions remain unmet?

  • So far, only four SPs have been used for deals.

  • One of the SPs has a retrieval rate of less than 15%.

U.S. National Library of Medicine

  • The allocator checked the client's company name and used it as proof that the dataset wasn't stored before, which is a mistake.
  • Was the allocator able to review the data intended for storage? The provided sample contains only 3GB.
  • The data is currently distributed across only two countries and one continent, which violates the allocator’s rules.
  • Has the allocator conducted proper Know Your Customer (KYC) checks?

Stanford University

  • An additional sample check revealed that this dataset originates from https://www.crcv.ucf.edu/data/UCF101.php, not Stanford University.
    Could the client and allocator clarify the dataset's details, including its source, ownership, and whether it is legal to store this data?
  • Was the allocator able to review the data intended for storage? The provided sample contains only 6GB.
  • The data is currently distributed across three countries and one continent.
  • The latest report shows nine replicas instead of the declared seven.
  • One SP has a retrieval rate of 0%.

Cell Painting Gallery

  • As mentioned in the previous review, this dataset has been stored multiple times, as evidenced by a GitHub search. The client should specify which folders are being stored on Filecoin. Narrowing the scope of data should be performed even after the DataCap (DC) is granted, ideally after the last review but preferably before the first allocation.
  • According to the IP checker and the report, the SP (f02984331) is based in Singapore, not Australia.

filecoin-watchdog added the label Awaiting Response from Allocator (If there is a question that was raised in the issue that requires comment before moving forward.) and removed Awaiting Governance/Watchdog Comment (DataCap Refresh requests awaiting a public verification of the metrics outlined in Allocator App.) on Jan 3, 2025
filecoin-watchdog (Collaborator) commented:

@TOPPOOL-LEE
Also, is it clear to you what the connection is between this client of yours and the client of joshua-ne?
There is an undeniable similarity between the names of the clients and their projects.

TOPPOOL-LEE (Author) commented Jan 6, 2025

@filecoin-watchdog Hey, thank you for your patient comments and kind guidance.
We would like to convey our sincere wishes: Happy New Year, and enjoy your holidays.

Pangeo

In the previous round, we allocated a total of 0.75 PiB to this client. They collaborated with four SPs distributed across two continents. Before allocating 1 PiB in this round, we reminded them that they needed to add more SPs.
[screenshot: WX20250106-191307@2x]
The client's remaining quota is 1 PiB (not yet used). The client has paused, and we believe that when they start again, SPs from new continents will join. If no new SPs join before the next round, we will suspend support. Thank you for the reminder; we will wait patiently for them to resume and will closely monitor their progress.

U.S. National Library of Medicine

We are looking for new datasets, rejecting duplicate data storage, and reducing redundancy. We check for duplicate datasets by entering dataset names at https://allocator.tech. If this method is incorrect, how should we properly assess duplication?

The size of the data sample and the size of the data hosted on the website are two different things. The client's data sample was submitted as XML files, and due to limitations of the SQL Server management system, the XML sample can only reach around 3 GB. However, when you unzip it, you will find about 109 GB (a sketch for verifying this follows below). We also looked at https://www.nlm.nih.gov, where you can find more information at https://support.nlm.nih.gov/kbArticle/?pn=KA-04293; the site hosts a large amount of data.
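A minimal sketch of such a size check, using the sample URL from the application above (note that gzip -l under-reports sizes above 4 GiB, so streaming the bytes through wc -c is more reliable):

# Download the compressed sample and measure its true uncompressed size
curl -O https://ftp.ncbi.nih.gov/biosample/biosample_set.xml.gz
gunzip -c biosample_set.xml.gz | wc -c   # prints the uncompressed byte count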

Regarding the geographical locations of the SPs: we have only allocated 256 TiB to this client so far. If you allow us to continue allocating to them, we believe SPs from other continents will join. If you advise us not to allocate to them anymore, we will suspend support.

Stanford University

The dataset you mentioned does come from https://www.crcv.ucf.edu/data/UCF101.php; we were already aware of this before approval. The client's data sample contains 13,320 items, and the UCF101.php page also shows exactly 13,320 items. The data on the UCF101.php site is provided by multiple universities and research institutions, and the UCF101 dataset contains more than two million frames.
[screenshot: WX20250106-184454@2x]

Before we approved, we asked them about the data source and size, and requested that they reduce the total amount of data applied for.
[screenshot: WX20250106-191845@2x]

They disclosed seven SPs; in the latest round, nine SPs participated, because the client added new SPs after the last round of signatures. Please trust that we will question and intervene in the client's actions before supporting them in the next round.

Cell Painting Gallery

SP f02984331 is based in Singapore, but the client's disclosed location was Australia. We reminded the client about this, and it has been corrected. Thank you for the reminder.

In fact, we did not intend to support this client; however, in the last round we promised to support our existing clients. Then our allocator suddenly encountered a bot bug (duplicate-data reports) that caused all old clients' bots to report duplicate data, so we could not support them. Meanwhile, to verify whether the duplicate data came from a bug in a client's technical solution or from the bot itself, we actively sought new clients and found four with different technical solutions, confirming that it was indeed a bot bug. Therefore, in this round we had to support our existing clients, so we continued to support this client. We have now shut it down.

TOPPOOL-LEE (Author) commented Jan 7, 2025

@TOPPOOL-LEE Also, is it clear to you what the connection is between this client of yours and the client of joshua-ne? There is an undeniable similarity between the names of the clients and their projects.

Yes, they are also a client of joshua-ne.

Due to the bot bug, we continuously sought new clients and technical solutions for verification. We found four clients, one of which is joshua-ne's client. Please check #233 (comment); we had disclosed it there.

If we have collaborated with other allocators' clients, we will disclose and admit it honestly. We will remain open, sincere, and brave.

TOPPOOL-LEE (Author) commented Jan 7, 2025

In this round, we followed the advice of the governance team:

  1. We shut down clients who used VPNs, even though they gave clear reasons.
  2. We rejected new clients who applied with duplicate datasets, since redundant data is not welcome.
  3. We started to support old clients according to the advice of the governance team.
  4. We actively tracked each client's disclosed SP count against the actual count, and reminded clients on GitHub.
  5. Most of our clients are distributed across two continents. In the next round, we will actively ask them to cover three continents. Supporting new datasets and finding SPs across three continents will be the core of our next round of allocations.

filecoin-watchdog (Collaborator) commented Jan 7, 2025

@TOPPOOL-LEE
U.S. National Library of Medicine

  • You should verify the proposed datasets against the Filecoin Plus Large Datasets Repository and allocator.tech. My concern is that you searched for the client’s company name rather than the dataset owner's name. The client is not the owner of the dataset. Instead of searching for Byte Tunneling, you should have searched for the U.S. National Library of Medicine. Let me know if you need further clarification.
  • The size explanation you provided is not what I was asking for. You should ensure you review the specific dataset the client intends to store. You referenced a source page containing a vast amount of data, but it’s unlikely the client plans to store all of it. The client must clearly specify the scope of the dataset.

If you allow us to continue allocating for them

I am providing my evaluation as a community member. I don’t have the authority to forbid anything, but I’m glad you value my guidance enough to consider incorporating it. 😊

Stanford University

Since the client provided 13,320 items in their data case, and the UCF101.php website also shows exactly 13,320 items.

That’s correct; however, the dataset is only 6GB. What is being stored beyond that? Were you able to download any portion of the data to confirm its content?
The client declared the dataset size as 2PiB. I have not seen any clarification regarding this inconsistency.

Before we approved, we asked them about the data source and size and requested they reduce the total amount of data applied for.

Did they respond to your request?

You still haven’t clarified whether it is legal to store this dataset. Did the client provide any evidence that they have consent to use this data?

Cell Painting Gallery
As I understand it, you preferred not to support a non-compliant client, but you were uncertain whether the client was being truthful. In such cases, I would suggest waiting until the bot issue is resolved or asking the client for additional reports or evidence to verify their claims. While the latter approach might be more susceptible to manipulation, it is preferable to granting another DataCap without proper validation.

@TOPPOOL-LEE Also, is it clear to you what the connection is between this client of yours and the client of joshua-ne? There is an undeniable similarity between the names of the clients and their projects.

I am not accusing you of anything. My concern is whether this client is compliant with the rules and whether you are aware of the potential parallel cooperation between your client and joshua-ne.

TOPPOOL-LEE (Author) commented Jan 9, 2025

Thank you for your comprehensive guidance.

U.S. National Library of Medicine
We have only approved 256 TiB for them. We checked https://github.com/filecoin-project/filecoin-plus-large-datasets, and it appears these are new datasets.
Their data samples are small; although more than 90% of clients provide data samples of less than 10 GiB, it is crucial to specify which content can be downloaded from the client's website. We will follow up on this.
[screenshot: WX20250109-174528@2x]

Stanford University
We advised them to reduce the total amount they applied for, and they decreased it from 10 PiB to 8 PiB.
Following your suggestion, we randomly sampled data from the client and downloaded it: we took the payload CID corresponding to a random extraction from SP f03230423 (client address f1qjzr2bmjjhhqjwv7nhgvn5ph6ddjhrzokonokea) in TOPPOOL-LEE/Allocator-Pathway-TOP-POOL#55; the corresponding link is https://old.filecoin.tools/101443019.
[screenshots: WechatIMG1252, WechatIMG1253, WechatIMG1251]

We have contacted the client to provide more content; if they do not, we can reject them.
Note: in the last round we ran into the bot bug and invited many clients to test the solution; this client came via joshua-ne, so we relaxed our requirements.

Cell Painting Gallery
This client submitted their application in October, and we conducted a strict review. We signed less than 5 PiB for them, but we performed 10 checks on them, including advising them to wait or provide more evidence when the bot data was wrong; the LDN thread has a total of 97 comments.

Thanks for your suggestions; we need to improve in these three areas:

  1. Ask clients which parts of the dataset they are storing, and focus on the size of the data samples provided.
  2. Require clients to find SPs across three continents.
  3. Conduct a more comprehensive duplicate-dataset check: we will check both https://github.com/filecoin-project/filecoin-plus-large-datasets and https://allocator.tech (a sketch of such a check follows below).
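A minimal sketch of one way to automate that duplicate check, assuming GitHub's public issue-search API (the dataset name in the query is an example placeholder, not from this thread):

# Count issues in the large-datasets repo that mention a dataset name
curl -s "https://api.github.com/search/issues?q=repo:filecoin-project/filecoin-plus-large-datasets+%22Cell+Painting+Gallery%22" | grep -m1 total_count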

Kevin-FF-USA added the label Diligence Audit in Process (Governance team is reviewing the DataCap distributions and verifying the deals were within standards) and removed Awaiting Response from Allocator (If there is a question that was raised in the issue that requires comment before moving forward.) on Jan 9, 2025
filecoin-watchdog (Collaborator) commented:

@TOPPOOL-LEE
Stanford University

  1. I’m not familiar with this tool, and for me, it only allows me to check the piece CID and several parameters of the agreement. Instead, I used datacapstats.io and checked for piece CIDs. I reviewed all the storage providers (SPs) associated with this client to confirm whether their pieces are retrievable or not.
    Unfortunately, none of them makes their data publicly accessible, as there is no HTTP address available to retrieve the data.
  2. Have you been able to download the dataset yourself? Even if so, I, as a regular user of the network, wasn't able to access it. This goes against the principles of open data rules.
  3. This point depends on your response to the second question, as I’m curious whether you’ve confirmed what is actually stored within the piece. As I’ve mentioned several times already, the dataset sample is 6GB in size, and no additional data was declared. Therefore, I’m wondering what exactly is stored in the pieces.
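For context, a minimal sketch of the kind of public HTTP check meant here, assuming the SP exposes a booster-http endpoint serving pieces at /piece/<pieceCID> (the host, port, and piece CID below are placeholders):

# Probe an SP's public HTTP retrieval endpoint for a given piece CID
# (fetches only the first KiB and prints the HTTP status code)
curl -s -o /dev/null -w '%{http_code}\n' -r 0-1023 http://sp.example.com:7777/piece/baga6ea4seaqplaceholderpiececid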

TOPPOOL-LEE (Author) commented Jan 10, 2025

Well, in response to your concerns, we downloaded the data again. Since there is too much content to paste here, we are sharing links instead; you can open any link to check.
Attachment: links (2).txt

TOPPOOL-LEE (Author) commented Jan 10, 2025

[screenshots: WX20250110-184349@2x, 401952859-c73a6245-0031-487d-84da-6932a53a14ae, 401952804-95c6b2e0-37cd-47b9-84ea-7ae45be414ca, WX20250110-184530@2x]

filecoin-watchdog (Collaborator) commented:

@TOPPOOL-LEE I’m glad you were able to retrieve the data! Would it be possible for you to share a method that others could use to access it as well? As it stands, it seems that only you can retrieve it.

What you’ve shared aligns with the sample provided by the client, but it still doesn’t explain why the client required 2 PiB to accommodate the dataset size.

That said, I believe we’ve explored this topic thoroughly. From our discussion, it seems you might not have insight into why the client needed such a large amount of data.

TOPPOOL-LEE (Author) commented Jan 11, 2025

We have made a list that anyone can view: open the text file, copy a link, open it in a browser, and view the content.
Attachment: links.txt
The client used 9 SPs, applied for 8 PiB of data, and needed about 1 PiB of data.
Based on past experience, this client had received affirmation and support, so we trusted them; now, when they apply to us, they face rigorous scrutiny and questioning.
