Mongo2Stor - CommonCrawl #3

Open
1 of 2 tasks
amughal opened this issue Apr 2, 2024 · 63 comments

amughal commented Apr 2, 2024

Version

1

DataCap Applicant

Mongo2Stor

Project ID

CommonCrawl

Data Owner Name

Common Crawl

Data Owner Country/Region

United States

Data Owner Industry

Not-for-Profit

Website

https://data.commoncrawl.org

Social Media Handle

https://twitter.com/commoncrawl

Social Media Type

Twitter

What is your role related to the dataset

Data Preparer

Total amount of DataCap being requested

5

Unit for total amount of DataCap being requested

PiB

Expected size of single dataset (one copy)

1

Unit for expected size of single dataset

PiB

Number of replicas to store

5

Weekly allocation of DataCap requested

400

Unit for weekly allocation of DataCap requested

TiB
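The requested figures above are consistent with each other: one 1 PiB copy stored as 5 replicas gives the 5 PiB total. A minimal sketch of that arithmetic follows, assuming binary units (1 PiB = 1024 TiB) and a constant weekly consumption rate. The function names are illustrative and are not part of any Fil+ tooling.

```python
TIB_PER_PIB = 1024  # assumption: binary units, 1 PiB = 1024 TiB


def total_datacap_tib(copy_size_pib: float, replicas: int) -> float:
    """Total DataCap in TiB for a dataset stored as several replicas."""
    return copy_size_pib * replicas * TIB_PER_PIB


def weeks_to_consume(total_tib: float, weekly_tib: float) -> float:
    """Weeks needed to use the total at a constant weekly allocation."""
    return total_tib / weekly_tib


# Values taken from the application form above.
total = total_datacap_tib(copy_size_pib=1, replicas=5)
weeks = weeks_to_consume(total, weekly_tib=400)
print(f"{total:.0f} TiB total, about {weeks:.1f} weeks at 400 TiB/week")
```

At the requested rate, the full 5 PiB would take a little under 13 weeks to consume.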

On-chain address for first allocation

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

  • Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

Mongo2Stor (MongoStorage) operates as a Storage Service Provider and offers data preparation and consulting services in the Filecoin ecosystem. Based in Southern California, USA, Mongo2Stor is FIL Green GOLD certified and is currently working toward full ESPA certification. The founders have extensive experience in networks and systems, have taken part in multiple ESPA sessions and presentations, and were featured in the Zero to One Service Provider Twitter session by Protocol Labs.

This LDN request is a follow-up to #2040, which has been a great success. Data has been stored with prominent Service Providers such as Seal Storage, Simple IPFS Inc. (#1 Ranking QAP), Aligned SaaS provider, PikNik (Medula) and many others.

CommonCrawl has published new monthly archives since the launch of LDN #2040, so about a year's worth of data now needs to be archived and made available on the Filecoin network.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

https://data.commoncrawl.org/crawl-data/index.html
CC-MAIN-2024-10 	February/March 2024 	3.16 	123.50
CC-MAIN-2023-40 	September/October 2023 	3.40 	134.25
CC-MAIN-2023-23 	May/June 2023 	3.10 	119.28
CC-MAIN-2023-06 	January/February 2023 	3.35 	121.19
CC-MAIN-2022-49 	November/December 2022 	3.35 	127.89
CC-MAIN-2022-40 	September/October 2022 	3.15 	115.63
CC-MAIN-2022-33 	August 2022 	2.55 	96.52
CC-MAIN-2022-27 	June/July 2022 	3.10 	116.34
CC-MAIN-2021-43 	October 2021 	3.30 	119.83
CC-MAIN-2021-25 	June 2021 	2.45 	83.79
CC-MAIN-2021-21 	May 2021 	2.60 	93.66
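The listing above carries no column headers. As a rough sanity check, the sketch below sums the last column of each row, on the assumption (not confirmed in this thread) that it is the compressed crawl size in TiB. The rows are copied from the list above; the helper name is illustrative.

```python
# Rows copied from the listing above (tab-separated).
ROWS = """\
CC-MAIN-2024-10\tFebruary/March 2024\t3.16\t123.50
CC-MAIN-2023-40\tSeptember/October 2023\t3.40\t134.25
CC-MAIN-2023-23\tMay/June 2023\t3.10\t119.28
CC-MAIN-2023-06\tJanuary/February 2023\t3.35\t121.19
CC-MAIN-2022-49\tNovember/December 2022\t3.35\t127.89
CC-MAIN-2022-40\tSeptember/October 2022\t3.15\t115.63
CC-MAIN-2022-33\tAugust 2022\t2.55\t96.52
CC-MAIN-2022-27\tJune/July 2022\t3.10\t116.34
CC-MAIN-2021-43\tOctober 2021\t3.30\t119.83
CC-MAIN-2021-25\tJune 2021\t2.45\t83.79
CC-MAIN-2021-21\tMay 2021\t2.60\t93.66
"""


def last_column_total(rows: str) -> float:
    """Sum the final tab-separated field of every non-empty row."""
    total = 0.0
    for line in rows.strip().splitlines():
        fields = [field.strip() for field in line.split("\t")]
        total += float(fields[-1])
    return total


total_tib = last_column_total(ROWS)  # assumption: last column is size in TiB
print(f"{total_tib:.2f} TiB, about {total_tib / 1024:.2f} PiB")
```

If that reading of the column is right, the eleven crawls come to roughly 1.22 PiB for a single copy.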

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

United States

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

Singularity is an excellent tool for CAR generation. I have used it extensively for the other LDN application.

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No. We have checked, and no data from these datasets has previously been sealed on the network.

Please share a sample of the data

CommonCrawl publishes each crawl as many compressed files, covering both the original data and its indexes. Each file list below contains links that can be retrieved individually.

Data Type 	File List 	#Files 	Total Size Compressed (TiB)
Segments 	segment.paths.gz 	100 	
WARC 	warc.paths.gz 	90000 	99.25
WAT 	wat.paths.gz 	90000 	22.99
WET 	wet.paths.gz 	90000 	9.30
Robots.txt 	robotstxt.paths.gz 	90000 	0.18
Non-200 responses 	non200responses.paths.gz 	90000 	3.43
URL index 	cc-index.paths.gz 	302 	0.25
Columnar URL index 	cc-index-table.paths.gz 	900 	0.28
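Each file list in the table is a gzipped text file with one relative path per line. A minimal sketch of turning such a list into download URLs follows, assuming the paths are relative to https://data.commoncrawl.org/ (the website given in this application). The sample path is hypothetical and only illustrates the shape of an entry.

```python
import gzip

BASE_URL = "https://data.commoncrawl.org/"  # assumption: paths are relative to this


def expand_paths(gz_bytes: bytes, base_url: str = BASE_URL) -> list[str]:
    """Decompress a *.paths.gz payload and prefix each entry with the base URL."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical one-line listing, built in memory instead of downloaded.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2024-10/segments/example/warc/example-00000.warc.gz\n"
)
urls = expand_paths(sample)
print(urls[0])
```

In practice the bytes would come from downloading a file such as warc.paths.gz for the chosen crawl; the sketch avoids network access so it can be run as is.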

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

HTTP or FTP server

How did you find your storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you used

No response

Please list the provider IDs and location of the storage providers you will be working with.

f02853198, South America
f01904546, South Korea
f01697248, South Korea
f02846602, USA
f01945089, USA

We are also working with another SP in South America and one in Europe.

How do you plan to make deals to your storage providers

Boost client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

Contributor

datacap-bot bot commented Apr 2, 2024

Application is waiting for allocator review

Contributor

psh0691 commented Apr 2, 2024

Thank you for applying.
Tools related to allocation will be updated this week.
Please bear with us if verification and allocation are slightly delayed.

  1. Have you ever applied for the same DC as the previous LDN?
  2. Have you been assigned all the previously applied DCs?
  3. Please share the link to the previous LDN you applied for with the same content.
  4. Also, if it is the same as the previously requested wallet address, an error may occur, so please give me a different wallet address.

Contributor

datacap-bot bot commented Apr 2, 2024

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

400TiB

DataCap Amount - First Tranche

50TiB

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Contributor

datacap-bot bot commented Apr 2, 2024

DataCap Allocation requested

Multisig Notary address

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

50TiB

Id

6a4dbabe-13d2-43bc-9a81-81b74c2dbb67

Contributor

datacap-bot bot commented Apr 2, 2024

Application is ready to sign

Author

amughal commented Apr 2, 2024

Hello @psh0691 Thank you for your quick response, appreciated. Please see inline replies to the questions:

  1. Have you ever applied for the same DC as the previous LDN?
    A. No, the previous LDN #2040 had a different DC.

  2. Have you been assigned all the previously applied DCs?
    A. About 90% of the DC on the previous LDN was allocated. I believe that, because the Allocators are now approving new applications and their corresponding DC, the remaining DC on the previous LDN will not be processed. I could be wrong about this, so please let me know if you have more details.

  3. Please share the link to the previous LDN you applied for with the same content.
    A. This is a completely different content archive from CommonCrawl. The GitHub link for the previous LDN is:
    MongoStorage - CommonCrawl Archive filecoin-project/filecoin-plus-large-datasets#2040

  4. Also, if it is the same as the previously requested wallet address, an error may occur, so please give me a different wallet address.
    A. It is different.

Please let me know if there are any further questions. I look forward to your reply, and thank you again.

@Blockchain-World-News Blockchain-World-News deleted a comment from datacap-bot bot Apr 2, 2024
@Blockchain-World-News Blockchain-World-News deleted a comment from datacap-bot bot Apr 2, 2024
Author

amughal commented Apr 3, 2024

Thank you for the approval @psh0691. Would it be fine to start by sealing the first 50TB on the first miner, and then use the second tranche for the next two miners? That way, after two tranches, 50TB will be sealed on each of the three miners. Please let me know.

Contributor

psh0691 commented Apr 3, 2024

@amughal
I thought the allocation would require multiple signatures, but it was allocated on the first signature.

It's my first allocator experience, so please understand if there's anything lacking.

The checker bot runs regardless of the allocator's intention, so the next round will be triggered and allocated normally only if the data is stored according to FIL+ rules.

Contributor

psh0691 commented Apr 3, 2024

In the next round, we will see if we can adjust the DC Amount and help you.

Author

amughal commented Apr 4, 2024

Thank you @psh0691

Author

amughal commented Apr 10, 2024

@psh0691 Just an update: due to the STFIL issue, a few SPs have put further sealing on hold.
One SP has successfully sealed about 55% of the 50TB allocation.

Author

amughal commented Apr 11, 2024

@psh0691
One SP, "f02846602" on the US West Coast, can seal more data. Can you please approve a larger next tranche?
Thank you.

Contributor

psh0691 commented Apr 11, 2024

@amughal
You've currently used 55.5% of the DC.
If you use 75% or more, we'll look at it when it's auto-triggered.
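The rule described here can be sketched as a simple threshold check. This is only an illustration of the arithmetic, assuming the 75% threshold applies to the most recent tranche (50TiB at this point in the thread); it is not the datacap-bot's actual implementation, and the function names are made up.

```python
REFILL_THRESHOLD = 0.75  # from the comment above: 75% or more triggers the next round


def usage_fraction(used_tib: float, allocated_tib: float) -> float:
    """Fraction of the allocated DataCap that has been used."""
    return used_tib / allocated_tib


def refill_due(used_tib: float, allocated_tib: float,
               threshold: float = REFILL_THRESHOLD) -> bool:
    """True once usage reaches the threshold."""
    return usage_fraction(used_tib, allocated_tib) >= threshold


first_tranche_tib = 50.0
used_now_tib = 0.555 * first_tranche_tib  # 55.5% used, about 27.75 TiB

print(refill_due(used_now_tib, first_tranche_tib))  # below the threshold
print(refill_due(37.5, first_tranche_tib))          # exactly 75% of 50 TiB
```

Under these assumptions, about 9.75 TiB more would need to be sealed from the first tranche before the next round is triggered.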

Author

amughal commented Apr 11, 2024

Okay thanks for your support. I will complete 75% sealing on this miner.

Author

amughal commented Apr 11, 2024

@amughal You've currently used 55.5% of the DC. If you use 75% or more, we'll look at it when it's auto-triggered.

@psh0691 Please take a look.

Contributor

datacap-bot bot commented Apr 11, 2024

Application is in Refill

@datacap-bot datacap-bot bot added Refill and removed granted labels Apr 11, 2024
Contributor

datacap-bot bot commented Apr 11, 2024

DataCap Allocation requested

Multisig Notary address

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

400TiB

Id

ec1bbcc1-8369-49a8-ae7d-8be7588ff8ed

Author

amughal commented Jul 13, 2024

@psh0691 requesting next allocation. Thanks

Contributor

datacap-bot bot commented Jul 13, 2024

Application is in Refill

Contributor

datacap-bot bot commented Jul 13, 2024

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceaidbvjn2m36vnhuya2cetndj7uvhbj2lmwolhzh2pney42xdkyxi

Address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Datacap Allocated

1PiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

d2437325-e036-40cc-ac1e-48123c0bdba1

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaidbvjn2m36vnhuya2cetndj7uvhbj2lmwolhzh2pney42xdkyxi

@datacap-bot datacap-bot bot added granted and removed Refill labels Jul 13, 2024
Contributor

datacap-bot bot commented Jul 13, 2024

Application is Granted

Contributor

datacap-bot bot commented Jul 13, 2024

Client used 75% of the allocated DataCap. Consider allocating next tranche.

Author

amughal commented Jul 13, 2024

Thank you @psh0691

Contributor

psh0691 commented Jul 13, 2024

checker:manualTrigger

Contributor

datacap-bot bot commented Jul 13, 2024

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 40% of total datacap - f01697248: 45.26%

⚠️ 33.33% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 13.14%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients [2]

✔️ No CID sharing has been observed.

Full report

Click here to view the CID Checker report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
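The distribution warnings in this report come down to a few ratios. The sketch below shows one way such figures could be computed. It is not the checker's actual code, the provider names and numbers are hypothetical, and the 40% limit is simply the one quoted in the warning above.

```python
def provider_shares(sealed_by_provider: dict[str, float]) -> dict[str, float]:
    """Each provider's share of the total sealed DataCap."""
    total = sum(sealed_by_provider.values())
    return {sp: amount / total for sp, amount in sealed_by_provider.items()}


def over_limit(shares: dict[str, float], limit: float) -> dict[str, float]:
    """Providers whose share exceeds the given limit."""
    return {sp: share for sp, share in shares.items() if share > limit}


def zero_retrieval_fraction(success_rates: dict[str, float]) -> float:
    """Fraction of providers with a retrieval success rate of zero."""
    zeros = sum(1 for rate in success_rates.values() if rate == 0)
    return zeros / len(success_rates)


# Hypothetical data: TiB sealed and retrieval success rate per provider.
sealed = {"sp-a": 45.0, "sp-b": 30.0, "sp-c": 25.0}
rates = {"sp-a": 0.0, "sp-b": 0.2, "sp-c": 0.2}

shares = provider_shares(sealed)
flagged = over_limit(shares, limit=0.40)
zero_fraction = zero_retrieval_fraction(rates)
average_rate = sum(rates.values()) / len(rates)
print(flagged, zero_fraction, average_rate)
```

With these made-up inputs, one provider exceeds the 40% limit and one of the three providers has a zero retrieval success rate.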

Contributor

psh0691 commented Aug 19, 2024

@amughal
Data distribution looks good in terms of search success rate, no CID sharing, etc.
However, please secure additional SPs for distribution.

Author

amughal commented Aug 19, 2024

Thank you @psh0691 for taking a look. Yes, we will be adding two more SPs in late August and one in October.

I will keep you posted once I have more solid details.

Thanks again.

Author

amughal commented Oct 8, 2024

Hello @psh0691, I am requesting the next round of approval. Good news: I have added another SP, "f03199233".

Thank you.

Contributor

psh0691 commented Oct 10, 2024

checker:manualTrigger

Contributor

datacap-bot bot commented Oct 10, 2024

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 25% of total datacap - f01697248: 43.89%

⚠️ 40.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ 100.00% of Storage Providers have retrieval success rate less than 75%.

⚠️ The average retrieval success rate is 13.16%

Deal Data Replication

⚠️ 98.54% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients [2]

✔️ No CID sharing has been observed.

Full report

Click here to view the CID Checker report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Contributor

datacap-bot bot commented Oct 10, 2024

Application is in Refill

Contributor

datacap-bot bot commented Oct 10, 2024

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebqps6qwd6sedztjfu4p7ippvlkspcjy4glceaerc5jopgzsrnkl2

Address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Datacap Allocated

2PiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

77fc6acc-ef17-4f88-a70b-2338ccb46c2d

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebqps6qwd6sedztjfu4p7ippvlkspcjy4glceaerc5jopgzsrnkl2

Contributor

datacap-bot bot commented Oct 10, 2024

Application is Granted

@datacap-bot datacap-bot bot added granted and removed Refill labels Oct 10, 2024
Author

amughal commented Oct 10, 2024

Thank you @psh0691
