
[DataCap Refresh] <4th> Review of <IPFSTT> #273

Open
nicelove666 opened this issue Jan 17, 2025 · 9 comments
Labels: Awaiting RKH (refresh request has been verified by Public, Watchdog, and Governance; now awaiting release of DC), DataCap - Doubled, Refresh (applications received from existing allocators for a refresh of DataCap allowance)

Comments

nicelove666 commented Jan 17, 2025

Basic info

  1. Type of allocator: [manual]
  2. Paste your JSON number: [1006]
  3. Allocator verification: [yes]
  4. Allocator Application
  5. Compliance Report
  6. Previous reviews

Current allocation distribution

Client name                            | DC granted
NREL National Solar Radiation Database | 5.75 PiB
HyperAI                                | 3.75 PiB
Sinergise                              | 5.75 PiB
LAMOST DR7                             | 1 PiB

I. NREL National Solar Radiation Database

  • DC requested: 8 PiB
  • DC granted so far: 5.75 PiB

II. Dataset Completion

s3://nrel-pds-nsrdb/ (420.8 TiB)
s3://nrel-pds-nsrdb/v3/ (47.5 TiB)
s3://nrel-pds-nsrdb/v3/tmy/ (4.0 TiB)
s3://nrel-pds-nsrdb/v3/tdy/ (4.0 TiB)
s3://nrel-pds-nsrdb/v3/tgy/ (4.0 TiB)
s3://nrel-pds-nsrdb/v3/puerto_rico/ (114.6 GiB)
s3://nrel-pds-nsrdb/conus/ (48.2 TiB)
s3://nrel-pds-nsrdb/full_disc/ (81.8 TiB)
s3://nrel-pds-nsrdb/meteosat/ (16.1 TiB)
s3://nrel-pds-nsrdb/himawari/ (189.1 TiB)
s3://nrel-pds-nsrdb/india/ (515.3 GiB)

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

Yes (the client disclosed the SPs in advance and amended the application form).

IV. How many replicas has the client declared vs. how many have been made so far?

9 vs 9

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID     | % retrieval | Meets >75% retrieval?
f01999119 | 86.19       | YES
f03253580 | 24.99       | NO
f03286667 | 82.1        | YES
f03282101 | 83.17       | YES
f03312989 | 90.74       | YES
f03260592 | 71.5        | NO
f03220172 | 56.64       | NO
f03291373 | 89.36       | YES
f03226688 | 86.98       | YES
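
For illustration, the pass/fail column follows mechanically from the 75% threshold; a minimal Python sketch using the NREL figures above:

```python
# Illustrative only: derive the pass/fail column from the >75% threshold.
THRESHOLD = 75.0

nrel = {
    "f01999119": 86.19, "f03253580": 24.99, "f03286667": 82.1,
    "f03282101": 83.17, "f03312989": 90.74, "f03260592": 71.5,
    "f03220172": 56.64, "f03291373": 89.36, "f03226688": 86.98,
}

for sp, rate in nrel.items():
    print(sp, rate, "YES" if rate > THRESHOLD else "NO")

passing = sum(rate > THRESHOLD for rate in nrel.values())
print(f"{passing}/{len(nrel)} SPs meet the >{THRESHOLD}% requirement")  # 6/9
```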

I. HyperAI

  • DC requested: 15 PiB
  • DC granted so far: 3.75 PiB

II. Dataset Completion

Xunlei BitTorrent
magnet:?xt=urn:btih:98F8E7FDDC919C0573AD5C99C31DE53D6866E071
magnet:?xt=urn:btih:ABB2DC586B2955AF86BB26FBFADEF227B0BF8DA7
magnet:?xt=urn:btih:AEFC07D18C9836EC5D019FDD5862EBEF770EBDC7
magnet:?xt=urn:btih:D00A14D1A6640DB8FDCDC6689E910F5D3BB07286
magnet:?xt=urn:btih:6DFD3D3B257C54FF86FE64D57AF45EB612DC402C
magnet:?xt=urn:btih:95A27F0CA2429022E0909B0B0BE3CF3FF13BEFC8

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

Yes (the client disclosed the SPs in advance; we then advised them to update the SP list to the latest version).

IV. How many replicas has the client declared vs. how many have been made so far?

9 vs 9

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID     | % retrieval | Meets >75% retrieval?
f02852273 | 38.92       | NO
f02984331 | 32.92       | NO
f02973061 | 37.18       | NO
f02883857 | 1.17        | NO
f02889193 | 24.89       | NO
f01082888 | 6.9         | NO
f03081958 | 63.51       | NO
f01084413 | 42.78       | NO
f01084941 | 34.16       | NO

I. Sinergise

  • DC requested: 10 PiB
  • DC granted so far: 5.75 PiB

II. Dataset Completion

s3://sentinel-cogs/ (16.4 PiB)
s3://sentinel-cogs-inventory/ (3.4 TiB)

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

No (the client disclosed 5 SPs, while the CID report lists 9 SPs). Although the application form was not kept up to date with the latest SPs, they were disclosed in advance on GitHub.

IV. How many replicas has the client declared vs. how many have been made so far?

6 vs 9

V. Please provide a list of SPs used for deals and their retrieval rates

SP ID     | % retrieval | Meets >75% retrieval?
f03220176 | 43.11       | NO
f03252730 | 78.62       | YES
f03218576 | 40.98       | NO
f09693    | 91.38       | YES
f01926635 | 36.33       | NO
f03157905 | 28.24       | NO
f03231154 | 63.51       | NO
f01025366 | 76.55       | YES
f03253580 | 45.47       | NO

Allocation summary

  1. Notes from the Allocator

Our goals for this round were to: (1) support existing clients; (2) find new datasets; (3) ask clients which part of the dataset they have stored.
Of the three new LDNs supported this round, one is a new dataset, two are new on https://allocator.tech/, and two are duplicated in https://github.com/filecoin-project/filecoin-plus-large-datasets/issues. We will strengthen the requirements in the future.

  2. Did the allocator report up to date any issues or discrepancies that occurred during the application processing?

Yes, we follow the progress of all clients and treat them equally. When SP retrieval is slow, data backup is unreasonable, or new SPs are added, we ask questions and suspend support. We recently asked clients to provide more information (individuals to provide ID cards, companies to provide business licenses), and some clients objected. Only one client sent us an email (which we forwarded to the governance team). The client's non-cooperation makes us suspicious. Perhaps we should be more tolerant?

  3. What steps have been taken to minimize unfair or risky practices in the allocation process?

We review clients' technical solutions, regularly generate CID reports to follow up on data distribution, ask clients which part of the data they have stored, and ask them to find new datasets.

  4. How did these distributions add value to the Filecoin ecosystem?

We are actively looking for enterprise clients and new data sets.

  5. Please confirm that you have maintained the standards set forward in your application for each disbursement issued to clients and that you understand the Fil+ guidelines set forward in your application.

Yes

  6. Please confirm that you understand that by submitting this GitHub request, you will receive a diligence review that will require you to return to this issue to provide updates.

Yes

filecoin-watchdog added the Refresh and Awaiting Community/Watchdog Comment labels on Jan 17, 2025.
filecoin-watchdog (Collaborator) commented:

@nicelove666
MODRONS’ LAW

  • The allocator makes sure retrieval is high and regularly checks CID reports, asking the client about any discrepancies found.
  • Instead of increasing the retrieval rate of f03224283, the client added 5 new SPs to the sealing plan. At the same time, they kept adding new data to the SP in question.
  • SPs list used for deals match the provided list.
  • 4 out of 10 SPs have retrieval of 0%.
  • Condition of 3 different continents and 5 different countries was not met.

LAMOST DR7

  • Condition of 3 different continents and 5 different countries was not met.
  • Mixed performance in retrievability.
  • Dataset Duplication: This dataset has been stored multiple times on the network.

Sinergise

  • Website Issue: The company’s website (https://www.sinergise.com/) appears to be non-functional. Furthermore, Sinergise processes or hosts open data, while ownership remains with the original provider (ESA for the Sentinel project).
  • Dataset Duplication: Data from this satellite has been stored multiple times, as shown here: https://github.com/search?q=repo%3Afilecoin-project%2Ffilecoin-plus-large-datasets+s3%3A%2F%2Fsentinel-cogs&type=issues.
  • Without a detailed description or index of what is being stored, it is difficult to determine if the data set was stored before.
    Verifying uniqueness by searching a third-party company name on allocator.tech is insufficient, as it does not directly represent the dataset or the data owner.
  • SP f03231154 was not disclosed in the application, nor were any comments provided about that.
  • "SP’s f03253580, F01025366, F01926635, and f03157905 were disclosed only in the comments, making compliance analysis on that side tedious.
  • Geodiversity Rules: SPs do not meet geodiversity requirements (3 continents, 5 countries).
  • Retrievability: The retrievability rate should be improved, as 77% of SPs have retrievability below 75%.

HyperAI

  • Geodiversity Rules: Similar to the NREL application, all SPs are from a single continent and geographical region, violating geodiversity rules.
  • Retrievability: Retrievability performance is poor, with 20% of SPs reporting "0" retrievability, and 100% scoring below 75%.
  • Sample File Verification: Did the allocator download and confirm the content of the sample files? Was the requested DC justified?
  • f01996719 was not disclosed in the application, nor were any comments provided about that.
  • Data Preparation: There is a need for clearer information on how the dataset can be utilized by the community or the network.

NREL National Solar Radiation Database

filecoin-watchdog added the Awaiting Response from Allocator label and removed the Awaiting Community/Watchdog Comment label on Jan 24, 2025.
nicelove666 (Author) commented Jan 25, 2025

@filecoin-watchdog Thank you for the guidance.
Last round, we collaborated with many clients, which increased the workload for you. Now we are focusing on supporting existing clients. We have approved four clients: NREL National Solar Radiation Database, HyperAI, Sinergise, and LAMOST DR7.

MODRONS’ LAW

  1. We did not approve them this round.
  2. We approved 1.75 PiB, with the last approval on October 24.
  3. At approval time, 65% of SPs had retrieval rates greater than 75%, and we continuously urged them to improve. Of the 10 SPs, only f03224283 had a retrieval rate of 0 (per the 11.8 bot data). On 10.31 we commented, "Fix the retrieval rate of f03224283, then we sign" ([DataCap Application] <MODRONS' LAW> - <MODRONS'LAW-EUD-1> nicelove666/Allocator-Pathway-IPFSTT#72 (comment)). There was no response, so we closed the application.
  4. We wrote the answer in the application: "We require SPs to be distributed on three continents to meet the requirements of decentralization and diversity, but this is not mandatory. If the SPs are only distributed on two continents, I hope they will report it to me truthfully and not fake it through a VPN."
Image

We will work with clients to ensure more decentralized distribution of DCs.
First, they must be honest.

LAMOST DR7

  1. In July, the client submitted the LDN. We reviewed it strictly and paused the allocation three times. We emailed the governance team to seek guidance. In the previous round, we received your guidance and continued to allocate to them.
  2. In the last round, we collaborated with too many new clients. After a lengthy discussion about whether new clients or existing clients are better, we agreed with your viewpoint that "in commercial storage, we need to know who the paying customers are." Therefore, existing clients are better. In this round, we signed 1 PiB for them (they are an existing client). Then we asked them to look for new datasets.
  3. Overall, after six months, the SPs collaborating with this client are still supporting retrieval.
  4. In July, we had not yet discussed the issue of duplicate storage of datasets.

Sinergise

  1. The client provided data cases:

    • s3://sentinel-cogs/ (16.4 PiB)
    • s3://sentinel-cogs-inventory/ (3.4 TiB)
  2. We checked the datasets at allocator.tech. In the next round, we will also check https://github.com/filecoin-project/filecoin-plus-large-datasets/issues. Thank you for the reminder.

  3. We have asked the client which part of the dataset is stored ([DataCap Application] sinergise nicelove666/Allocator-Pathway-IPFSTT#92 (comment)). Perhaps we should ask the client multiple times and then download the files to confirm.

  4. The bot report of December 24 did not show f03231154; it was added after the last signature. We asked the client to provide identification/a business license, but the client seems very upset. Perhaps they guessed that we would not sign, and so did not disclose it in advance?

  5. They added SPs and disclosed f03253580, f01025366, f01926635, and f03157905 in advance in the comments. What should we do?

  6. We do care about the retrieval rate. All SPs support Spark, and we will urge them to keep improving until they reach 75%.

HyperAI

  1. The SPs are on two continents; we will ask the client to find SPs on other continents. Their data backup is not very reasonable, so we have not supported them for two months. If they find new SPs, we can support them again.
  2. 10% of SPs have a retrieval rate of 0. When we checked the filspark dashboard, all SPs were supporting retrieval.
  3. Thank you for the reminder; we have already asked the client.
  4. We agree with your opinion: the client needs to provide clearer information on how the dataset can be utilized by the community or the network.

NREL National Solar Radiation Database

  1. We did not require that SPs must be on three continents, so the client was honest about it. If the geographical location of an SP looks strange, we use detection tools to check. You can list the SPs, and we will check and publish the results.
  2. In the future, we will expand from allocator.tech to https://github.com/filecoin-project/filecoin-plus-large-datasets/issues. Thanks for the reminder.
  3. The SPs are on two continents; we will remind the client to look for more SPs.
  4. Regarding SP f01999119, we will be more careful in the future.

Thank you.

filecoin-watchdog (Collaborator) commented:

@nicelove666

Sinergise
They added SPs and disclosed f03253580, f01025366, f01926635, and f03157905 in advance in the comments. What should we do?

Referring to the update of the list of SPs: it should be updated in the original application each time, so as to keep it in line with the current list.

Regarding geodiversification, remember that you can't create rules and then deny them later. Writing a rule into the allocator's regulations is binding in this sense.

You also didn't address the question about the VPN detection tools that were written into your allocator application. Are they implemented?

Kevin-FF-USA added the Diligence Audit in Process label and removed the Awaiting Response from Allocator label on Jan 27, 2025.
Kevin-FF-USA (Collaborator) commented:

Hi @nicelove666

Thanks for submitting this application for refresh.
Wanted to send you a friendly update: as this works its way through the system, you should see a comment from Galen on behalf of the Governance team this week. If you have any questions or need support until then, please let us know.

Warmly,
-Kevin

nicelove666 (Author) commented Jan 29, 2025

https://www.ipqualityscore.com/user/search
nicelove666/Allocator-Pathway-IPFSTT#97

boost provider storage-ask f03220172
multiaddrs=[/ip4/103.136.34.105/tcp/16888] id=12D3KooWMqMTa6TPh8K4fdWpLYX7dzqsBSZY5dLoWzm4KQENbJWD
Ask: f03220172
Price per GiB: 0.00000000005 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image

boost provider storage-ask f01999119
multiaddrs=[/ip4/222.214.219.199/tcp/12381] id=12D3KooWHuK15dwCZ8LDTZdFMR4AmQFRaHynYZb648dGotPCmQQ7
Ask: f01999119
Price per GiB: 0 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image

boost provider storage-ask f03286667
multiaddrs=[/ip4/117.147.208.212/tcp/60002] id=12D3KooWPkghN7vdSyB3jqJBfV9hVEZhwpHTiPFRowU85SGyWW8E
Ask: f03286667
Price per GiB: 0 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image

boost provider storage-ask f03312989
multiaddrs=[/ip4/116.92.243.3/tcp/17133] id=12D3KooWPVLQMdrtkHbqGrUBRt5ZPvhyEGYPvwEGooY14cCQ2pQV
Ask: f03312989
Price per GiB: 0.00000000001 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image


boost provider storage-ask f03282101
multiaddrs=[/ip4/222.214.219.205/tcp/12400] id=12D3KooWHtV62jZ74WVPa6v1TNkvSy6N7UFALKX8qQpkpTCiKPu8
Ask: f03282101
Price per GiB: 0 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image

boost provider storage-ask f03253580
multiaddrs=[/ip4/39.109.85.6/tcp/1506] id=12D3KooWLWHNKvDwmby2RELtGtxAWCpMrhobRuMK4ervZogxSM8m
Ask: f03253580
Price per GiB: 0.00000000005 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B

Image


galen-mcandrew (Collaborator) commented:

@nicelove666 can you summarize the evidence you are providing here? It appears to be investigations into potential VPN usage, but it would be helpful to get a summary from you, so that I do not misinterpret.

From the above conversation and my investigation, these are some areas raised:

  • Regional distribution to meet initial allocator application requirements
  • Further diligence on dataset preparation, such as indexing and dataset sampling
  • Continuing to increase retrieval standards and updating SP lists with clients

We are requesting an additional 20 PiB of DataCap for this pathway.

nicelove666 (Author) commented Feb 13, 2025

@galen-mcandrew thank you

@nicelove666 can you summarize the evidence you are providing here? It appears to be investigations into potential VPN usage, but it would be helpful to get a summary from you, so that I do not misinterpret.

https://www.ipqualityscore.com is a detection tool that quickly identifies suspicious activity across common fraud vectors such as bad bots, fraudulent transactions, account takeover fraud, proxies and VPNs, fake identities, and account opening abuse.
We use https://www.ipqualityscore.com to detect whether an SP uses a VPN.
This time, we chose the SPs in nicelove666/Allocator-Pathway-IPFSTT#97 for detection.
The process is as follows:
First, we get the SP's information from the Filecoin network, including the deal price, the size limits on deal data, and, most importantly, the IP address and port of the deal service. Then we check whether the SP uses a VPN through https://www.ipqualityscore.com, based on that IP and port.
This is the information we obtained for f03220172:
“boost provider storage-ask f03220172
multiaddrs=[/ip4/103.136.34.105/tcp/16888] id=12D3KooWMqMTa6TPh8K4fdWpLYX7dzqsBSZY5dLoWzm4KQENbJWD
Ask: f03220172
Price per GiB: 0.00000000005 FIL
Verified Price per GiB: 0 FIL
Max Piece size: 32 GiB
Min Piece size: 256 B”
We tested the information at IPQS. Here are the results:

Image

There is a "fraud score" , we can infer whether the SP uses a VPN based on the size of the fraud score. A fraud score of 0 indicates that the SP does not use a VPN.

According to the website: If the score is over 75, the SP may be using VPN.
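
A minimal sketch of this check, assuming IPQS's documented JSON endpoint (https://ipqualityscore.com/api/json/ip/<key>/<ip>) and its fraud_score/vpn response fields; the API key is a placeholder:

```python
# Sketch of the VPN check described above. The endpoint format and the
# fraud_score / vpn response fields follow IPQS's public documentation;
# verify them against https://www.ipqualityscore.com before relying on this.
import re
import requests

IPQS_KEY = "YOUR_API_KEY"  # placeholder

def ip_from_multiaddr(multiaddr: str) -> str:
    # e.g. "/ip4/103.136.34.105/tcp/16888" -> "103.136.34.105"
    m = re.match(r"^/ip4/([\d.]+)/", multiaddr)
    if not m:
        raise ValueError(f"no IPv4 component in {multiaddr!r}")
    return m.group(1)

def check_sp(multiaddr: str) -> dict:
    ip = ip_from_multiaddr(multiaddr)
    resp = requests.get(
        f"https://ipqualityscore.com/api/json/ip/{IPQS_KEY}/{ip}",
        params={"strictness": 1},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Per the text above: a fraud score of 0 suggests no VPN; over 75 is suspicious.
    return {"ip": ip, "fraud_score": data.get("fraud_score"), "vpn": data.get("vpn")}

# f03220172's announced multiaddr, taken from the boost output above:
print(check_sp("/ip4/103.136.34.105/tcp/16888"))
```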

@nicelove666
Copy link
Author

nicelove666 commented Feb 13, 2025

  • Regional distribution to meet initial allocator application requirements

We will strive to require SPs to be distributed across more regions.

  • Further diligence on dataset preparation, such as indexing and dataset sampling

We agree with the proposal here.
We have a mature solution for building dataset indexes. The purpose of an index is to establish the connection between the original data of a dataset and the piece data sealed on the Filecoin network, so as to facilitate orderly sealing and storage of the data as well as query and retrieval.
Indexes are divided into public dataset indexes and enterprise dataset indexes.
For a public dataset, take commoncrawl.org as an example. commoncrawl.org provides a very convenient download method: by passing in the specified dataset parameters, you can easily download the source data. On this basis, we developed a script that downloads the data and packages it into CAR files; started with the paths file of the target dataset as a parameter, it automatically downloads the data and generates the CAR files.
The paths file is an ordered array in which each source data file is one element, so as long as we record the position of each element, we can guarantee that duplicate source data is never packaged: we simply download and package the source files listed in the paths file, in order, into CAR files. When packing, we bundle several source files of different sizes, in array order, into a tar package of roughly 30 GiB and then convert the tar package into a CAR file. After a successful conversion, the piece CID of the CAR file and the positions of the original data it contains are recorded in a CSV file. That CSV file is the index for this part of the dataset; based on it, the corresponding data can easily be located and retrieved from the Filecoin network.
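
A minimal sketch of that pack-and-index loop, under the description above; `download_file` and `tar_to_car` are hypothetical stand-ins for the actual download and CAR-conversion tooling:

```python
# Hypothetical sketch of the ordered pack-and-index step described above.
# `download_file` and `tar_to_car` stand in for the real tooling.
import csv

BATCH_BYTES = 30 * 1024**3  # one tar of roughly 30 GiB per CAR file

def pack_and_index(paths: list[str], index_csv: str) -> None:
    """Pack source files into ~30 GiB batches in paths-file order and record
    (piece_cid, first_index, last_index) rows so no source file is packed twice."""
    with open(index_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["piece_cid", "first_path_index", "last_path_index"])
        batch, start, size = [], 0, 0
        for i, path in enumerate(paths):
            local = download_file(path)        # hypothetical: fetch one source file
            batch.append(local)
            size += local.stat().st_size
            if size >= BATCH_BYTES:
                piece_cid = tar_to_car(batch)  # hypothetical: tar the batch, convert to CAR
                writer.writerow([piece_cid, start, i])
                batch, start, size = [], i + 1, 0
        if batch:                              # flush the final partial batch
            writer.writerow([tar_to_car(batch), start, len(paths) - 1])
```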
For an enterprise dataset, the data has already been classified and sorted, so we only need to convert it into CAR files in order and in batches, and record the index relationship between the CAR files and the source data. For example, suppose we need a certain enterprise's data for November 2024, stored in the directory "/store number/2024-11" on the enterprise's storage server. We package the data in this directory in name order and convert it into CAR files. For each generated CAR file, we record its piece CID and the corresponding data label in a file named "data x-store number-2024-11.csv", for example: baga6ea4seaqe6pxksytiswdbg2phlixkkimu4sgya3nwu4yvyhmrca3gsusjqbi, ddhhmm-ddhhmm, remark.
The CSV file is the data search and retrieval file, and we enter it into our management system. When we need to retrieve data from the Filecoin network, we first find the corresponding CSV file and then locate in it the specific piece CID to retrieve.
Building indexes also helps recover SP data: when an SP loses data through hard disk damage or other problems, the index tells us exactly what to restore, and we have helped SPs recover data many times.
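
For the retrieval side, a correspondingly small lookup sketch, assuming the CSV layout from the example above (piece CID, time-range label, remark):

```python
# Hypothetical sketch: find the piece CID to retrieve from an index CSV whose
# rows look like: baga6ea4seaq..., ddhhmm-ddhhmm, remark
import csv

def find_piece_cid(index_csv: str, wanted_label: str) -> str | None:
    with open(index_csv, newline="") as f:
        for row in csv.reader(f):
            piece_cid, label = row[0].strip(), row[1].strip()
            if label == wanted_label:
                return piece_cid
    return None

# e.g. find_piece_cid("data x-store number-2024-11.csv", "ddhhmm-ddhhmm")
```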

  • Continuing to increase retrieval standards and updating SP lists with clients

Well, we will do it.

Kevin-FF-USA added the Awaiting RKH and DataCap - Doubled labels and removed the Diligence Audit in Process label on Feb 13, 2025.
Kevin-FF-USA (Collaborator) commented:

@nicelove666

Friendly update on this refresh.

We are currently in the process of moving to a Metaallocator. In order for the tooling to work correctly, an allocator can only use the DataCap balance received through direct allocation from Root Key Holders, or the DataCap received through a Metaallocator. As a result, some of the metrics pages, such as Datacapstats, Pulse, and other graphs, might look a little confused during this update.

You will not lose any DataCap, but you will see that your refreshed balance is the amount of DC from the refresh plus the remaining DC the allocator has left.

No action is needed on your part; this is just a friendly note to thank you for your contributions and patience, and to flag that you may notice changes in your DataCap balance while the back end is updated.
