Absolute links and actual URLs are rendered incorrectly in some cases. #24

Open
mkantautas opened this issue Jun 22, 2017 · 8 comments

@mkantautas

For example, crawling the page http://toastytech.com/evil/ with $linkDepth = 2 gives a lot of incorrect URLs. You may say that this webpage is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)

"/evil/../links/index.html" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://toastytech.com/evil/../links/index.html"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://toastytech.com/evil/"
    "depth" => 1
    "status_code" => 200
    "title" => "Nathan's Links"
    "meta_keywords" => ""
    "meta_description" => ""
    "h1_count" => 1
    "h1_contents" => array:1 [ …1]
@zrashwani
Owner

zrashwani commented Jun 27, 2017

Hello @neorganic,
thank you for pointing out this issue.

This is now fixed in this commit; the output for the link you mentioned is now as follows:

    [/index.html] => Array
        (
            [original_urls] => Array
                (
                    [../index.html] => ../index.html
                )

            [links_text] => Array
                (
                    [] => 
                    [Back to my Toasty Technology Page] => Back to my Toasty Technology Page
                )

            [absolute_url] => http://toastytech.com/index.html
            [external_link] => 
            [visited] => 1
            [frequency] => 3
            [source_link] => http://toastytech.com/evil/
            [depth] => 1
            [status_code] => 200
            [title] => Nathan's Toasty Technology Page
            [meta_keywords] => 
            [meta_description] => 
            [h1_count] => 0
            [h1_contents] => Array
                (
                )

        )

Let me know if there are any other cases.

@mkantautas
Author

 "http://meyrovich.com/2016/10/06/vegan-pesto/" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=facebook" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=twitter" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=tumblr" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=pinterest" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=linkedin" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/category/food/"
    "depth" => 2
  ]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=google-plus-1" => array:8 [▶]
  "http://meyrovich.com/author/admin/page/2/" => array:8 [▶]
  "https://www.facebook.com/policies/cookies/" => array:8 [▶]
  "https://www.facebook.com/recover/initiate?lwv=100" => array:8 [▶]
  "/reg/" => array:8 [▶]
  "/home" => array:8 [▶]
  "https://twitter.com/signup?context=webintent&follow=meyrovich_" => array:8 [▶]
  "/account/begin_password_reset" => array:8 [▶]
  "//support.twitter.com/groups/31-twitter-basics/topics/104-welcome-to-twitter-support/articles/215585-twitter-101-how-should-i-get-started-using-twitter" => array:8 [▶]
  "https://www.tumblr.com/login?redirect_to=https%3A%2F%2Fwww.tumblr.com%2Fwidgets%2Fshare%2Ftool%3FshareSource%3Dlegacy%26canonicalUrl%3D%26url%3Dhttp%253A%252F%252Fmeyrovich.com%252F2016%252F12%252F24%252Ffarmdrop-christmas-feast%252F%26title%3DFarmdrop%2BChristmas%2BFeast%26_format%3Dhtml%26sequence%3Dpreview" => array:8 [▶]
  "https://www.pinterest.com/_/_/about/cookie-policy/" => array:8 [▶]
  "/_/_/about/terms-service/" => array:8 [▶]
  "/_/_/about/privacy/plain.html" => array:8 [▶]
  "/_/_/about/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/about/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]
  "/_/_/blog/" => array:8 [▶]
  "/_/_/business/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/business/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]
  "/_/_/about/privacy/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/about/privacy/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]

Hey, yeah, so the base URL is http://meyrovich.com, and I mainly see two problems here: the links with ///..., and the fact that the crawler follows the ?share links. I think these, and probably other URLs that contain a ?, should be excluded from the list?

@zrashwani
Owner

Hello @neorganic,
A major rewrite of the library has now been done, and the dots issue should be fixed: the PSR-7 implementation (specifically the \GuzzleHttp\Psr7\UriResolver::removeDotSegments method) is now used to normalize URLs and remove dot segments from them.
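
To illustrate what dot-segment removal does to the URL from the original report, here is a minimal standalone sketch of the idea described in RFC 3986, section 5.2.4. It is not the library's code and not Guzzle's implementation; the names `removeDotSegmentsSketch` and `normalizeUrlSketch` are hypothetical, and a real project would presumably call `\GuzzleHttp\Psr7\UriResolver::removeDotSegments` instead.

```php
<?php
// Simplified sketch of RFC 3986 dot-segment removal.
// Hypothetical helper, not the Arachnid or Guzzle implementation.
function removeDotSegmentsSketch(string $path): string
{
    if ($path === '' || $path === '/') {
        return $path;
    }

    $segments = explode('/', $path);
    $kept = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($kept);      // ".." steps back one level
        } elseif ($segment !== '.') {
            $kept[] = $segment;    // ordinary segment, keep it
        }
    }

    $result = implode('/', $kept);
    $last = end($segments);

    if ($path[0] === '/' && ($result === '' || $result[0] !== '/')) {
        $result = '/' . $result;   // an absolute path stays absolute
    } elseif ($result !== '' && ($last === '.' || $last === '..')) {
        $result .= '/';            // a trailing "." or ".." refers to a directory
    }

    return $result;
}

// Hypothetical wrapper: normalizes only the path of an absolute URL.
// User info and fragments are dropped to keep the sketch short.
function normalizeUrlSketch(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    $normalized = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $normalized .= ':' . $parts['port'];
    }
    $normalized .= removeDotSegmentsSketch($parts['path'] ?? '');
    if (isset($parts['query'])) {
        $normalized .= '?' . $parts['query'];
    }

    return $normalized;
}

echo normalizeUrlSketch('http://toastytech.com/evil/../links/index.html'), PHP_EOL;
// prints http://toastytech.com/links/index.html
```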

Also, to exclude links matching specific patterns from the list, you can use the filterLinks method, as mentioned in the readme file. I think that approach is better than dropping every URL with a query string, as some sites depend on query strings to serve different documents.
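
As a rough illustration of pattern-based exclusion, the sketch below drops only the `?share=...` links from the earlier report and keeps every other URL, including ones with different query strings. The helper name `hasShareParameter` is hypothetical, and the sketch filters a plain array rather than calling the library, so it makes no claim about the exact `filterLinks` callback signature.

```php
<?php
// Hypothetical predicate: true when the URL carries a "share" query parameter.
function hasShareParameter(string $url): bool
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return false; // no query string at all
    }

    parse_str($query, $params);

    return array_key_exists('share', $params);
}

// URLs taken from the crawl output quoted earlier in this thread.
$found = [
    'http://meyrovich.com/2016/10/06/vegan-pesto/',
    'http://meyrovich.com/2016/10/06/vegan-pesto/?share=facebook',
    'http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit',
    'http://meyrovich.com/author/admin/page/2/',
];

// Keep everything except the share links.
$kept = array_values(array_filter($found, static function (string $url): bool {
    return !hasShareParameter($url);
}));

print_r($kept);
```

A closure like the one above could probably be handed to filterLinks, assuming its callback receives the link URL and returns true for links to keep; the readme is the authority on the actual signature.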

Also, @neorganic, can you check the new version of the library and confirm whether the issues are fixed in it?

@LunarDevelopment

LunarDevelopment commented Dec 3, 2018

Love the library, bravo!

Some interesting results from the latest version of this library; the results come first, then the code (I recognise the site being scanned isn't perfectly marked up and may be out of scope for the project):

<li itemprop=name><a href=/operational-support-agreement.pdf title="Siteshield" itemprop=url>Siteshield</a></li>

Outputs:
Broken: /operational-support-agreement.pdf from https://www.espressoweb.co.uk/

In other instances the full URL seems to have been resolved correctly, but the wrong protocol has been assumed:

<h2 class=h3><a class="color-grey" href="social-marketing" title="Social Media Marketing from Espresso Web">Engaging Social</a></h2>

Outputs:
Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk
AND
Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk/social-marketing

<?php
require '../vendor/autoload.php';

$time_start = microtime(true);

$url = 'https://www.espressoweb.co.uk/';
$linkDepth = 10;
// Initiate crawl
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();

$collection = new \Arachnid\LinksCollection($links);

// Getting broken links
$brokenLinks = $collection->getBrokenLinks();

foreach ($brokenLinks as $i => $link) {
    echo "++++++++++++++++++++++++++++++" . PHP_EOL;
    echo "Broken: " . $i . " from " . $link->getParentUrl() . PHP_EOL;
//    echo json_encode($links) . PHP_EOL;
}

$time_end = microtime(true);
$execution_time = ($time_end - $time_start) / 60;
$execution_time = number_format((float)$execution_time, 2);

echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
echo "Complete!" . PHP_EOL;
echo 'Total Execution Time: ' . $execution_time . ' Mins' . PHP_EOL;
echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;

die;
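
For reference, the relative hrefs reported above (`/operational-support-agreement.pdf` and `social-marketing`) would normally be resolved against the page they were found on, keeping that page's scheme. The following is a simplified sketch of that resolution, not the library's code: the function name `resolveHrefSketch` is hypothetical, dot segments are not removed, and non-hierarchical links such as `mailto:` are not handled.

```php
<?php
// Hypothetical, simplified resolver for an href found on a page at $base.
// Assumes $base is an absolute URL with a scheme and host.
function resolveHrefSketch(string $base, string $href): string
{
    $b = parse_url($base);

    // Already absolute (has its own scheme): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href) === 1) {
        return $href;
    }

    // Protocol-relative ("//host/path"): reuse the base scheme.
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href;
    }

    $origin = $b['scheme'] . '://' . $b['host'];
    if (isset($b['port'])) {
        $origin .= ':' . $b['port'];
    }

    // Root-relative ("/path"): attach to the base origin.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Path-relative ("page"): replace the last segment of the base path.
    $basePath = $b['path'] ?? '/';
    $directory = substr($basePath, 0, strrpos($basePath, '/') + 1);

    return $origin . $directory . $href;
}

echo resolveHrefSketch('https://www.espressoweb.co.uk/', 'social-marketing'), PHP_EOL;
// prints https://www.espressoweb.co.uk/social-marketing
```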

@zrashwani
Owner

Hello @LunarDevelopment,
Thank you for reporting this issue; your sample script drew my attention to other issues as well.
There was a problem in the getBrokenLinks method of LinksCollection: it only accepted 2xx status codes, so it also classified links as broken when the status code was a 3xx redirect, which is wrong.

I have pushed a commit that should handle some of these cases, and I will test it over the next few days to make sure all such cases are fixed.
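
The classification described above can be sketched as follows. This is only an illustration of the rule "2xx and 3xx are not broken", not the actual getBrokenLinks code; the function name `isBrokenStatusSketch`, the treatment of a missing status as broken, and the sample URLs and status codes are all assumptions made for the example.

```php
<?php
// Hypothetical rule: 2xx (success) and 3xx (redirect) count as healthy.
// Anything else, including a missing status (assumed here to mean the
// request failed), counts as broken.
function isBrokenStatusSketch(?int $statusCode): bool
{
    if ($statusCode === null) {
        return true;
    }

    return $statusCode < 200 || $statusCode >= 400;
}

// Made-up sample data, for illustration only.
$statuses = [
    'https://example.com/'         => 200,
    'https://example.com/old-page' => 301,
    'https://example.com/missing'  => 404,
    'https://example.com/timeout'  => null,
];

// array_filter keeps the original keys, so array_keys yields the broken URLs.
$broken = array_keys(array_filter($statuses, 'isBrokenStatusSketch'));

print_r($broken);
```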

@LunarDevelopment

Happy to help, looking forward to your next release!

@zrashwani
Owner

@LunarDevelopment the issues you were facing should be fixed in the recent release 2.0.1. Please update to the latest release and try again.

@LunarDevelopment

LunarDevelopment commented Dec 11, 2018 via email
