Add a system to manage specific rules for some URL (spring configurable) #182

Merged
merged 1 commit into from
Oct 5, 2015
Conversation

scheylord

Hello everyone,

As we are discussing the future of the OpenWayback project, the BnF would like to present a small feature that may interest some of you and could be added to the OpenWayback project.

This feature was developed by Nicolas Giraud, the previous software engineer of the BnF web archive team. It provides a system for managing URL-specific canonicalization rules. For instance, we have a news website that puts session IDs in its URLs, and we want to remove them during the search and replay process.

The URLs look like:
http://www.ouestfrance-enligne.com/scripts/consult/pdf/PDF_frame.asp?in_ses_id=2047240875806&pdf=S2192220&date=09/09/2014&art_id=68159490&zoom=125,11,78

We define the rules in Spring:

<bean class="org.archive.wayback.util.url.BnfUrlCanonicalizer" id="bnfCanonicalizer">
    <property name="processingRules">
        <list>
            <!-- Ouest France -->
            <bean class="org.archive.wayback.util.url.CanonicalizationRule">
                <property name="pattern" value="ouestfrance-enligne\.com" />
                <property name="processors">
                    <list>
                        <bean class="org.archive.wayback.util.url.UriStripper">
                            <property name="pattern" value="(in_ses_id=[0-9]+)&amp;?" />
                        </bean>
                        <bean class="org.archive.wayback.util.url.UriTranscoder">
                            <property name="pattern" value="&amp;edi=(.+)&amp;" />
                            <property name="sourceEncoding" value="ISO-8859-1" />
                            <property name="targetEncoding" value="UTF-8" />
                        </bean>
                    </list>
                </property>
            </bean>
        </list>
    </property>
</bean>
  • BnfUrlCanonicalizer simply extends AggressiveUrlCanonicalizer.
  • It has a processingRules property, which is a list of canonicalization rules.
  • A CanonicalizationRule consists of a pattern (in our case a specific domain, ouestfrance-enligne.com) and a list of processors that run only when an input URL matches this pattern.
  • The processors can be of several types, but all of them extend the PatternBasedTextProcessor abstract class:
    • UriStripper removes the part of an input URL that matches its pattern.
    • UriTranscoder transcodes the matching part of a URL from one encoding to another.
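To make the relationship between a rule and its processors concrete, here is a minimal, self-contained sketch. It is not the BnF code: the classes `Processor`, `Stripper` and `Rule` are hypothetical stand-ins for PatternBasedTextProcessor, UriStripper and CanonicalizationRule, and it assumes that the rule pattern only needs to be found somewhere in the URL (a `find()`, not a full match). Note that `&amp;` in the Spring XML is simply the XML escape for `&` in the regular expression.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RuleSketch {

    /** Hypothetical stand-in for PatternBasedTextProcessor. */
    abstract static class Processor {
        protected final Pattern pattern;

        Processor(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        abstract String process(String url);
    }

    /** Hypothetical stand-in for UriStripper: deletes every match of the pattern. */
    static class Stripper extends Processor {
        Stripper(String regex) {
            super(regex);
        }

        @Override
        String process(String url) {
            return pattern.matcher(url).replaceAll("");
        }
    }

    /** Hypothetical stand-in for CanonicalizationRule: a guard pattern plus processors. */
    static class Rule {
        private final Pattern guard;
        private final List<Processor> processors;

        Rule(String guardRegex, List<Processor> processors) {
            this.guard = Pattern.compile(guardRegex);
            this.processors = processors;
        }

        /** Runs the processors in order, but only when the guard is found in the URL. */
        String processIfMatches(String url) {
            if (!guard.matcher(url).find()) {
                return url; // assumption: non-matching URLs pass through unchanged
            }
            String result = url;
            for (Processor p : processors) {
                result = p.process(result);
            }
            return result;
        }
    }

    /** Builds the Ouest France rule (stripper only) and applies it to one URL. */
    static String applyOuestFranceRule(String url) {
        Rule rule = new Rule("ouestfrance-enligne\\.com",
                Arrays.<Processor>asList(new Stripper("(in_ses_id=[0-9]+)&?")));
        return rule.processIfMatches(url);
    }

    public static void main(String[] args) {
        System.out.println(applyOuestFranceRule(
                "http://www.ouestfrance-enligne.com/PDF_frame.asp?in_ses_id=123&pdf=S1"));
        System.out.println(applyOuestFranceRule(
                "http://example.org/page?in_ses_id=123&pdf=S1"));
    }
}
```

In this sketch the second URL is left untouched because its host does not match the rule pattern, which is the point of scoping processors to a rule.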

In this example, we want to remove the in_ses_id parameter, so we use a UriStripper processor with the corresponding pattern.
We also want to transcode another parameter that causes trouble, from ISO-8859-1 to UTF-8, so we use a UriTranscoder with the appropriate pattern.
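As a worked illustration of these two steps, the sketch below applies the same two regular expressions with plain `java.util.regex`. The class and method names are hypothetical, and the transcoding semantics are an assumption: the sketch percent-decodes the captured value as ISO-8859-1 and re-encodes it as UTF-8, which may differ from what UriTranscoder actually does. The URL containing `edi=` is also invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OuestFranceExample {

    // Same expressions as in the Spring configuration, with &amp; unescaped to &.
    private static final Pattern SESSION_ID = Pattern.compile("(in_ses_id=[0-9]+)&?");
    private static final Pattern EDI = Pattern.compile("&edi=(.+)&");

    /** Removes the session id parameter and the separator that follows it. */
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    /**
     * Assumed transcoding: decode the captured value as ISO-8859-1 percent-escapes,
     * then re-encode it as UTF-8 percent-escapes.
     */
    static String transcodeEdi(String url) throws UnsupportedEncodingException {
        Matcher m = EDI.matcher(url);
        if (!m.find()) {
            return url;
        }
        String decoded = URLDecoder.decode(m.group(1), "ISO-8859-1");
        String encoded = URLEncoder.encode(decoded, "UTF-8");
        // Replace only the captured group, keeping the surrounding "&edi=" and "&".
        return url.substring(0, m.start(1)) + encoded + url.substring(m.end(1));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String sample = "http://www.ouestfrance-enligne.com/scripts/consult/pdf/PDF_frame.asp"
                + "?in_ses_id=2047240875806&pdf=S2192220&date=09/09/2014"
                + "&art_id=68159490&zoom=125,11,78";
        System.out.println(stripSessionId(sample));

        // Invented URL: %E9 is e-acute in ISO-8859-1, %C3%A9 is the same letter in UTF-8.
        System.out.println(transcodeEdi("http://example.org/a?x=1&edi=%E9dition&y=2"));
    }
}
```

One thing to watch in a sketch like this: `(.+)` is greedy, so if several parameters follow `edi`, the capture extends to the last `&` in the URL. A narrower capture such as `[^&]+` would avoid that, though whether it suits the real site is not something this example can confirm.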

In BnfUrlCanonicalizer, we override the urlStringToKey method from AggressiveUrlCanonicalizer, adding:

for (CanonicalizationRule rule : getProcessingRules()) {
    searchUrl = rule.processIfMatches(new CanonicalizationInput(searchUrl));
}

which transforms the URL when a rule matches.
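For readers who want to see where such a loop could sit, here is a rough, self-contained sketch of the override. `BaseCanonicalizer` is a hypothetical stand-in for AggressiveUrlCanonicalizer (it only lower-cases the URL), rules are modelled as plain `UnaryOperator<String>` functions instead of CanonicalizationRule objects, and running the rules before delegating to the parent is an assumption, since the snippet above does not show the rest of the method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class CanonicalizerOverrideSketch {

    /** Hypothetical stand-in for AggressiveUrlCanonicalizer; the real class does much more. */
    static class BaseCanonicalizer {
        public String urlStringToKey(String url) {
            return url.toLowerCase(Locale.ROOT);
        }
    }

    /** Hypothetical subclass that applies site-specific rules first (assumed ordering). */
    static class RuleBasedCanonicalizer extends BaseCanonicalizer {
        private final List<UnaryOperator<String>> processingRules = new ArrayList<>();

        public List<UnaryOperator<String>> getProcessingRules() {
            return processingRules;
        }

        @Override
        public String urlStringToKey(String url) {
            String searchUrl = url;
            // Each rule receives the output of the previous one, so order matters.
            for (UnaryOperator<String> rule : getProcessingRules()) {
                searchUrl = rule.apply(searchUrl);
            }
            return super.urlStringToKey(searchUrl);
        }
    }

    /** Wires one illustrative rule and canonicalizes a single URL. */
    static String demo(String url) {
        RuleBasedCanonicalizer canonicalizer = new RuleBasedCanonicalizer();
        canonicalizer.getProcessingRules().add(u ->
                u.contains("ouestfrance-enligne.com")
                        ? u.replaceAll("in_ses_id=[0-9]+&?", "")
                        : u);
        return canonicalizer.urlStringToKey(url);
    }

    public static void main(String[] args) {
        System.out.println(
                demo("http://www.ouestfrance-enligne.com/PDF_frame.asp?in_ses_id=42&pdf=S1"));
    }
}
```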

We can contribute to the OpenWayback project all the classes needed to run this feature.

@PsypherPunk
Contributor

Could you update the release notes as per the guidelines as part of this pull request?

I do think there's a bigger conversation to be had around canonicalization, specifically that we should probably be using exactly the same code as is used in Heritrix.

That, however, can wait as it would mean getting said code into webarchive-commons which I fear is no small amount of work.

johnerikhalse added a commit that referenced this pull request Oct 5, 2015
Add a system to manage specific rules for some URL (spring configurable)
@johnerikhalse johnerikhalse merged commit 9c937ba into iipc:master Oct 5, 2015