Metadata doesn't save #1925

Closed
AMNHcjohnson opened this issue Jan 7, 2023 · 24 comments

@AMNHcjohnson

Hi, I am trying to upload a new dataset to GBIF. However, whenever I fill out the metadata section and save it, nothing is saved when I go back to the resource page, and I have to fill the info out all over again. I've done it about 5 times now... What am I doing wrong? Chris

@mike-podolskiy90
Contributor

mike-podolskiy90 commented Jan 7, 2023

@AMNHcjohnson Thank you for contacting us.
I need some more information, please: which IPT version are you using? Are any exceptions displayed?

@mike-podolskiy90 mike-podolskiy90 self-assigned this Jan 7, 2023
@AMNHcjohnson
Author

AMNHcjohnson commented Jan 7, 2023 via email

@mike-podolskiy90
Contributor

I don't remember anything like that. Could you send me your IPT logs, please? Or give me administrator rights for your IPT?
Also, if possible, I would recommend updating your IPT to the most recent version (currently 2.6.3).

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 7, 2023 via email

@mike-podolskiy90
Contributor

I'm sorry, I don't quite understand. You can't see the resource now? Did you delete it, or did it just disappear?
The log file is available to admin users under Administration -> Logs, or you can download it directly from the server: IPT data dir -> logs

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 7, 2023 via email

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 9, 2023 via email

@mike-podolskiy90
Contributor

I'm glad to hear you managed to publish your resource. The question mark in the publication log simply indicates that the validation process was started. As you can see further down in the log, the IPT reported that everything completed successfully.

Which dataset is it, please? After publishing in the IPT, it might take some time for the dataset to be indexed by GBIF.

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 12, 2023 via email

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 12, 2023 via email

@mike-podolskiy90
Contributor

@AMNHcjohnson I'm glad to help, but I don't know which dataset we're talking about. Could you send me the link, please?
And, if possible, create an admin account in your IPT; that would help to diagnose what's going on.

@mike-podolskiy90
Contributor

@ManonGros Could you assist with this please?

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 12, 2023 via email

@mike-podolskiy90
Contributor

mpodolskiy@gbif.org

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 12, 2023 via email

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 13, 2023 via email

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 17, 2023 via email

@ManonGros
Contributor

Hi @AMNHcjohnson I will take a look today

@ManonGros
Contributor

@AMNHcjohnson It looks like we are unable to access the archives from your IPT.
This could be due to some firewall settings. It isn't just this dataset: for example, the last time we were able to access the archive from this dataset (https://www.gbif.org/dataset/a8035a1d-e674-4d2a-bb59-b476af6a3d6d) was in July 2021.
You can find more information in our IPT manual here: https://ipt.gbif.org/manual/en/ipt/latest/installation#opening-the-ipt-to-the-internet

I will close this issue as I don't think this is a problem with the IPT software. Please follow up with us at helpdesk@gbif.org, thanks!

@ManonGros
Contributor

@AMNHcjohnson One of my colleagues noticed that your IPT is behind Cloudflare, which is blocking machine access from our servers. You will need to configure Cloudflare to permit access to at least GBIF's servers, 130.225.43.0/25.
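
As a rough illustration of what an IP allow-list check against that range amounts to, here is a minimal Python sketch using the standard `ipaddress` module. The function name `is_allowed` is hypothetical, and the range is simply the one quoted in the comment above; an actual Cloudflare rule would be configured in Cloudflare itself, not in code like this.

```python
import ipaddress

# Range quoted in the comment above; treat it as an example value,
# not as an authoritative list of GBIF addresses.
GBIF_RANGE = ipaddress.ip_network("130.225.43.0/25")


def is_allowed(client_ip: str, allowed=GBIF_RANGE) -> bool:
    """Return True if client_ip falls inside the allowed network."""
    return ipaddress.ip_address(client_ip) in allowed


if __name__ == "__main__":
    for ip in ("130.225.43.10", "130.225.43.200", "8.8.8.8"):
        print(ip, is_allowed(ip))
```

Note that a /25 covers only 130.225.43.0 through 130.225.43.127. A later comment in this thread mentions 130.225.43.0/24, which spans the full 130.225.43.0 through 130.225.43.255 range, so it is worth confirming with GBIF which of the two to allow.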

@AMNHcjohnson
Author

AMNHcjohnson commented Jan 18, 2023 via email

@bvirgilioamnh

Hey All! AMNH IT Here :)

I'll dig into the logs on our side of things, but my guess is that we're blocking it because it is automated/bot traffic. While we most certainly can add the range to our allow list, it isn't the preferred solution, as it does negate some security controls. We heavily leverage Cloudflare's Bot Management solution to help mitigate aggressive crawlers and data scrapers; unfortunately, some legitimate solutions do run afoul of this. Coincidentally, July 2021 is when we enabled this service within Cloudflare, so that adds up nicely.

Do the GBIF servers send their requests with a specific user agent (e.g. GBIF Metadata Bot v1.0) instead of a generic user agent (e.g. Curl, Python Requests, etc.)? If not, that'd be the first step. From there, you can request that Cloudflare mark the bot as verified. We're happy to leverage our account and support team at Cloudflare to assist with this if necessary.

https://developers.cloudflare.com/bots/reference/verified-bots-policy/

You can submit the bot verification on their Google Form:
https://forms.gle/pWVxfCj6cQgWGxDp9

Source documentation for the Google Form link (because why is Cloudflare using Google Forms for this? I'm not entirely sure...)
https://blog.cloudflare.com/friendly-bots/

-Ben
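
For readers unfamiliar with what "a specific user agent" means in practice, the sketch below builds an HTTP request that identifies itself explicitly instead of relying on a library default. It is only an illustration: the helper `build_request` and the URL `https://ipt.example.org/` are placeholders, and the user agent string is the hypothetical example from the comment above, not one GBIF actually sends.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL and the hypothetical user agent from the comment above.
req = build_request("https://ipt.example.org/", "GBIF Metadata Bot v1.0")

# urllib stores header names in capitalized form ("User-agent").
print(req.get_header("User-agent"))

# Sending it would be: urllib.request.urlopen(req)  (not executed here)
```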

@MattBlissett
Member

Hi Ben,

The IPT tool provides a managed data repository; the purpose is to allow programmatic access to the published data, with GBIF as the primary user.

I have completed the form, though I doubt we meet the scale Cloudflare requires. I think there are only 4 IPT installations behind Cloudflare, and yours is the only one with these tightened security settings. For https://ipt.amnh.org/ we would normally make 8 HTTP requests per week.

Our user agents include COLServer (COLServer/24a3ae9 2022-12-20), org.GBIF.utils/1.16 (Java/11.0.17; M-1800000-25-2; +https://www.gbif.org/), GBIF-Url-Validator and Thumbor/6.7.0. As far as I know, no-one is currently using user agents to allow/block access to an IPT, so we have not made any particular effort to align or maintain these. A few publishers do limit access to 130.225.43.0/24.

Other biodiversity systems or researchers also access IPTs using various tools or scripting languages. In the last week, I can see two researchers/groups have used Python and RStudio to query IPTs at https://cloud.gbif.org/. Blocking Python, Curl, etc. will block these users.

Matt
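
To make the user agent discussion concrete, here is a small Python sketch of what matching on the user agents listed above could look like. The helper `looks_like_gbif` and the prefix list are assumptions derived only from the strings quoted in this comment; as noted, GBIF has not committed to keeping these stable, and a User-Agent header can be set to anything by the client, so this is not a security control on its own.

```python
# Prefixes derived from the user agents quoted in the comment above.
# They are illustrative only and may change without notice.
KNOWN_PREFIXES = (
    "COLServer/",
    "org.GBIF.utils/",
    "GBIF-Url-Validator",
    "Thumbor/",
)


def looks_like_gbif(user_agent: str) -> bool:
    """Return True if the header starts with one of the quoted prefixes."""
    return user_agent.startswith(KNOWN_PREFIXES)


if __name__ == "__main__":
    samples = [
        "org.GBIF.utils/1.16 (Java/11.0.17; M-1800000-25-2; +https://www.gbif.org/)",
        "Thumbor/6.7.0",
        "python-requests/2.28.1",
    ]
    for ua in samples:
        print(looks_like_gbif(ua), ua)
```

A rule like this would also turn away the researchers using Python or R mentioned above, which is one reason IP-level controls may be the simpler option.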

@bvirgilioamnh

Ahh ok, understood. Thanks for submitting it anyways, I'll pass this up to our account rep at Cloudflare just to let them know. Un/fortunately, the way the bot management works is essentially based on "machine learning" (of course taken with a grain of salt 😄) and is built off the reputation of known user agents; we're not explicitly allowing/denying them. We're just given the ability to block automated traffic, allow "good bots", and captcha "likely automated" traffic, and ultimately we try to balance accessibility with excessive scraping (and other more security related issues) across all of our sites.

We'll review implementing IP level controls to address this on our end.

Thanks Matt!
