Remove Google Analytics from Maven Website #660

Open
niallkp opened this issue Feb 2, 2025 · 10 comments

niallkp commented Feb 2, 2025

Bug description

Hi Maven Team!

It's great that you've migrated from Google Analytics (GA) to Matomo, thanks for doing that. However, a large number of pages (52,000+) in the old docs still have GA. The table below shows which components/versions are affected and the number of files containing GA.

Location             File Count
doxia components          6,271
maven components         46,247
TOTAL                    52,518

I'm not sure what the best way to resolve this is, and I guess there might be different solutions for different components. I'm prepared to help with the work, but perhaps someone could triage the best course of action for each component?

  • Perhaps some of the old versions of the docs for components could be removed?
  • Is there a straightforward way to re-generate some of these component versions?
  • If these are never updated in svn, then perhaps we could patch some versions to remove GA?

Regards

Niall

Location Component File Count
doxia doxia-archives/ 5181 => done
doxia doxia-sitetools-archives/ 1015 => done
doxia doxia-tools/doxia-integration-tools 25 => done
doxia doxia-tools-archives/ 50 => done
maven ant-tasks-archives/ 85 => done
maven archetype-archives/ 1639 => done
maven archetypes-archives/ 436 => done
maven core-its 1719 => done
maven enforcer-archives/ 264 => done
maven jxr-archives/ 393 => done
maven maven-indexer/indexer-examples 20 => done
maven maven-indexer-archives/ 511 => done
maven maven-release-archives/ 219 => done
maven plugins/ 127 => done
maven plugins-archives/ 11781 => done
maven plugin-testing-archives/ 386 => done
maven plugin-tools-archives/ 2259 => done
maven pom-archives/ 1239 => done
maven ref/ 1019 => done
maven resolver-archives/ 2237 => done
maven sandbox/plugins 27 => done
maven scm-archives/ 3156 => done
maven shared-archives/ 2518 => done
maven skins-archives/ 2981 => done
maven studies/extension-demo 21 => done
maven surefire-archives/ 6127 => done
maven wagon/ 352 => done
maven wagon-archives/ 6758 => done
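
For anyone wanting to reproduce counts like these on a local checkout, here is a minimal sketch. It is not necessarily how the numbers above were produced: the function name `count_ga` is made up for illustration, and it assumes that a page "has GA" when its .html file contains the string google-analytics.

```shell
#!/bin/bash
# count_ga DIR: for each immediate subdirectory of DIR, print
# "<count> <subdir>", where <count> is the number of .html files
# (at any depth) that contain the string "google-analytics".
# Hypothetical helper, not part of any Maven tooling.
count_ga() {
    local root=$1 sub n
    for sub in "$root"/*/; do
        [ -d "$sub" ] || continue   # no subdirectories: the glob stayed literal
        # grep -l prints each matching file once, so wc -l counts files, not lines
        n=$(find "$sub" -type f -name '*.html' \
                -exec grep -l 'google-analytics' {} + | wc -l | tr -d '[:space:]')
        printf '%s %s\n' "$n" "$(basename "$sub")"
    done
}
```

Running `count_ga components` against a checkout of the website would then give one line per component directory.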

Maven site URL where bug exists

x

niallkp added the bug (Something isn't working) label Feb 2, 2025
@hboutemy
Member

hboutemy commented Feb 2, 2025

Hi @niallkp
good list, thanks
regenerating is not an option: patching is (I did it in the past, just need to remember the search/replace magic formula)
and while at it, we should probably do some cleanup, like keeping only latest patch version for each minor => I'll get consensus on that approach quickly I suppose
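
To make the "keep only the latest patch version for each minor" idea concrete, here is a rough sketch that lists the candidates for removal. It assumes directory names end in -MAJOR.MINOR.PATCH (names with qualifiers such as -alpha-2 are ignored) and that `sort -V` (version sort, available in GNU coreutils) exists; the function name `superseded_versions` is hypothetical.

```shell
#!/bin/bash
# superseded_versions: read names ending in MAJOR.MINOR.PATCH on stdin and
# print every name for which a higher PATCH of the same MAJOR.MINOR exists.
# Lines that do not end in three dot-separated numbers are skipped.
superseded_versions() {
    sort -V | awk '
        !/[0-9]+\.[0-9]+\.[0-9]+$/ { prev_key = ""; next }
        {
            key = $0
            sub(/\.[0-9]+$/, "", key)               # strip ".PATCH"
            if (key == prev_key) print prev_line    # a later patch follows it
            prev_key = key
            prev_line = $0
        }'
}
```

For example, `ls plugins-archives | superseded_versions` would print the directories that have a newer patch release next to them; the output is only a list to review, nothing is deleted.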

@niallkp
Author

niallkp commented Feb 2, 2025

Hi @hboutemy , thanks for the response!

I've added a column to the table above indicating a later minor version (there are 27 of them), so if you could get agreement to delete those, that would be a start.

There are also six Maven plugins where the LATEST version has Google Analytics, and I was wondering whether they need patches to ensure the next release doesn't contain GA, or whether that's already been fixed?

  • maven-antrun-plugin-LATEST
  • maven-docck-plugin-LATEST
  • maven-ejb-plugin-LATEST
  • maven-pdf-plugin-LATEST
  • maven-scripting-plugin-LATEST
  • maven-wrapper-plugin-LATEST

I've become quite adept at doing search/replace to remove GA, so happy to help generate patches if you want - let me know.

@hboutemy
Member

hboutemy commented Feb 4, 2025

@niallkp most of the cleanup done
the only probably missing part is when there are some subdirectories

can you check and report, please, what issues remain? It should be much much less now

@niallkp
Author

niallkp commented Feb 6, 2025

Sorry @hboutemy I messed up. When I checked the maven website out of SVN, I didn't notice it had timed out - so I had only a part of the website locally. So instead of there being 6,000+ files with GA, there are actually 52,000+ files.

Thanks for doing the cleanup of minor versions, it has removed some GA. I have updated the list in the description to reflect the current situation.

There are a few Maven components in the current list that have later minor versions and could also be removed (listed below), but that still leaves a daunting list of docs to fix:

  • ref/3.8.6
  • ref/3.8.7
  • ref/4.0.0-alpha-2
  • ref/4.0.0-alpha-3
  • resolver-archives/resolver-ant-tasks-1.2.0
  • resolver-archives/resolver-ant-tasks-1.3.0

@kwin
Member

kwin commented Feb 7, 2025

Isn't it enough to rely on the ASF CSP to disable Google Analytics for the archived sites?
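
One way to look at what the CSP actually permits is to read the Content-Security-Policy response header. The sketch below only splits that header into directives, one per line, from HTTP headers on stdin; the function name `csp_directives` is hypothetical, and deciding whether a given policy really blocks GA (for example, whether script-src or connect-src list a GA host) is left to the reader.

```shell
#!/bin/bash
# csp_directives: read raw HTTP response headers on stdin and print each
# directive of the Content-Security-Policy header on its own line.
csp_directives() {
    tr -d '\r' |
    awk 'tolower($0) ~ /^content-security-policy:/ {
             sub(/^[^:]*:[ \t]*/, "")            # drop the header name
             n = split($0, parts, /;[ \t]*/)     # directives are ;-separated
             for (i = 1; i <= n; i++)
                 if (parts[i] != "") print parts[i]
         }'
}
```

Usage would be along the lines of `curl -sI https://maven.apache.org/ | csp_directives`.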

@slawekjaranowski
Member

@hboutemy can you share somewhere the commands or script you used for the cleanup?

@hboutemy
Member

VS Code search replace...

@hboutemy
Member

hboutemy commented Feb 18, 2025

for sparse checkout, ignoring big content like javadoc and jxr, I did a small svn-sparse-co.sh script:

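#!/bin/bash
# Sparse SVN checkout of a site directory, skipping bulky generated content
# (javadoc, jxr, static assets), then report how much GA is left in the HTML.

coDir() {
    local url=$1 dir d
    dir=$(basename "$url")
    echo "$url"
    # Fetch files and the immediate subdirectories only (subdirectories arrive empty).
    svn co --depth immediates "$url" || return 1

    cd "$dir" || return 1

    for d in */
    do
      d=${d%/}
      [ -d "$d" ] || continue   # no subdirectories: the glob stayed literal
      case $d in
        xref | xref-test | apidocs | testapidocs | css | img | images | fonts | js | cobertura)
          echo "ignore $d in $(pwd)"
          ;;
        *)
          # Recurse into every subdirectory that is not on the ignore list.
          coDir "$url/$d"
          ;;
      esac
    done
    cd ..
}

coDir "$1"

# Count the lines (not files) mentioning google-analytics in the checked-out HTML.
find "$(basename "$1")" -type f -name '*.html' -exec grep google-analytics {} \; | wc -l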

then in the directory, I search replace with VS Code and svn commit the result

like what I just did:

./svn-sparse-co.sh https://svn.apache.org/repos/asf/maven/website/components/wagon
(VS Code: pick one .html, search for google-analytics, select one GA snippet, replace with snippet taken from view-source:https://maven.apache.org/ )
svn ci -m "replace Google Analytics with Matomo" wagon

human judgement is useful for searching for google-analytics and iteratively extracting all misc corresponding snippets

@hboutemy
Member

@niallkp I think it is all done: can you check please and confirm?

@niallkp
Author

niallkp commented Feb 19, 2025

Wow @hboutemy, that's brilliant - thanks for doing that!

I re-scanned the Doxia website which was all clear. For the Maven website I only found 16 pages still with Google Analytics - in the ref/4.0.0-alpha-3/maven-bom component. I'm attaching a patch to fix that,

relative to the following path:

The patch also changes the docs in the maven-site-plugin to use a JS example other than Google Analytics. That isn't an issue in itself, but it will prevent a false positive on the report to the Privacy Committee, so feel free not to apply that part.

maven-components-ga.patch
