Where are the text images in CC-OCR? #23

Open
TongkunGuan opened this issue Aug 20, 2022 · 4 comments

@TongkunGuan

Hello!
When I downloaded the data from the "OCR-CC Data (Huge, ~1.3T)" link, I found that the CC-OCR dataset does not contain the text images. Where can I get these images?

@zyang-ur
Contributor

We uploaded the GCC index file at https://tapvqacaption.blob.core.windows.net/data/GoogleCC/Train_GCC-training.tsv

In the file names under "ocr_feat/visu_feat_resx", the number before the "_" is the row number in the index file (both are 0-indexed). For example, "100000_1967358300" corresponds to row 100000 of the index file, which is the soccer match image.
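
The mapping described above can be sketched in Python roughly as follows. This is only an illustration, not code from the repository: the helper names `row_index_from_name` and `lookup_row` are hypothetical, and the sketch assumes the index file is a headerless tab-separated file with one image per line, so that row 0 is the first line.

```python
import csv
from itertools import islice


def row_index_from_name(feature_name):
    """Return the 0-indexed row number encoded before the first "_"."""
    prefix, _, _ = feature_name.partition("_")
    return int(prefix)


def lookup_row(tsv_path, row_index):
    """Return the fields of one 0-indexed row of a headerless TSV file.

    Assumption: the file has no header line, so row 0 is the first line.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside fields as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        row = next(islice(reader, row_index, row_index + 1), None)
    if row is None:
        raise IndexError(f"row {row_index} is past the end of {tsv_path}")
    return row
```

For example, `lookup_row("Train_GCC-training.tsv", row_index_from_name("100000_1967358300"))` would read row 100000 of a locally downloaded copy of the index file.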

@daeing

daeing commented Sep 8, 2022

Is there another way to download the OCR-CC data, such as Google Drive? I cannot download the dataset reliably from my region. Many thanks.

@zyang-ur
Contributor

zyang-ur commented Sep 9, 2022

Unfortunately, the raw CC3M images cannot be shared because of copyright restrictions. If you have a copy of the CC3M images, it should cover all images in OCR-CC. There are also various online tools for downloading CC3M, which might solve or alleviate the network issue.

@daeing

daeing commented Oct 28, 2022

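Hey, could you share the dataset you downloaded? I have been following the azcopy instructions they provide, but the download keeps failing. Could you share a Baidu Netdisk link? Thanks a lot!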
