Absolute links and the actual URLs are in some cases being rendered wrongly. #24
Hello @neorganic, this is now done in this commit; the output for the link you mentioned is now as follows:

Let me know if there is any other case.
Hey, yeah, so the base URL is http://meyrovich.com, and I mainly see two problems here: the links with ///..., and that the crawler follows the ?share links. I think these, and probably other URLs that contain ?, should be excluded from the list.
Hello @neorganic. In order to exclude links with specific patterns from the list, you can use the library's link filtering. Also, @neorganic, can you check the new version of the library and confirm whether the issues are fixed in it?
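To illustrate the kind of exclusion discussed above, a filter predicate could look something like the closure below. This is a minimal sketch in plain PHP: the closure itself has no library dependency, and the exact Arachnid method it would be passed to is not shown in this thread, so treat the wiring as an assumption.

```php
<?php
// Hypothetical filter predicate: returns true for URLs the crawler should
// visit, false for ones to skip (e.g. "?share" links and "///" paths).
$shouldCrawl = function (string $url): bool {
    // Skip any URL carrying a query string, such as "?share=facebook".
    if (strpos($url, '?') !== false) {
        return false;
    }
    // Skip URLs whose path contains repeated slashes ("///..."),
    // while allowing the "//" that follows the scheme ("http://").
    if (preg_match('#(?<!:)//#', $url)) {
        return false;
    }
    return true;
};

var_dump($shouldCrawl('http://meyrovich.com/blog/'));           // bool(true)
var_dump($shouldCrawl('http://meyrovich.com/?share=facebook')); // bool(false)
var_dump($shouldCrawl('http://meyrovich.com///broken/path'));   // bool(false)
```

Dropping every URL with a `?` is a blunt rule; if some query-string pages matter to you, narrow the first check to a pattern like `/[?&]share=/`.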
Love the library, bravo! Some interesting results from the latest version of this library; results, then code, below (I recognise the site being scanned isn't a perfectly marked-up site and may be out of scope for the project):

```html
<li itemprop=name><a href=/operational-support-agreement.pdf title="Siteshield" itemprop=url>Siteshield</a></li>
```

Outputs:

Other instances where it seems the full URL has been interpreted okay, but the protocol has not been assumed correctly:

```html
<h2 class=h3><a class="color-grey" href="social-marketing" title="Social Media Marketing from Espresso Web">Engaging Social</a></h2>
```

Outputs:

```php
<?php
require '../vendor/autoload.php';

$time_start = microtime(true);

$url = 'https://www.espressoweb.co.uk/';
$linkDepth = 10;

// Initiate crawl
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();
$collection = new \Arachnid\LinksCollection($links);

// Getting broken links
$brokenLinks = $collection->getBrokenLinks();
foreach ($brokenLinks as $i => $link) {
    echo "++++++++++++++++++++++++++++++" . PHP_EOL;
    echo "Broken: " . $i . " from " . $link->getParentUrl() . PHP_EOL;
    // echo json_encode($links) . PHP_EOL;
}

$time_end = microtime(true);
$execution_time = ($time_end - $time_start) / 60;
$execution_time = number_format((float) $execution_time, 2);

echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
echo "Complete!" . PHP_EOL;
echo 'Total Execution Time: ' . $execution_time . ' Mins' . PHP_EOL;
echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
die;
```
Hello @LunarDevelopment, I have pushed a commit that should handle some of these cases, and I will test it in the next few days to make sure all such cases are fixed.
Happy to help; looking forward to your next release!
@LunarDevelopment, the issues you were facing should be fixed in the recent release 2.0.1; please update to the latest release and try again.
Thank you, I'll give it a go.
For example, the page http://toastytech.com/evil/ with $linkDepth = 2; yields a lot of incorrect URLs. You may say that this webpage is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)
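Relative references like "../yourUrlPath" are supposed to be resolved against the base URL by merging paths and removing dot segments, per RFC 3986. The following is a hypothetical plain-PHP helper sketching that behaviour for illustration; it is not the library's actual implementation, and the name `resolveRelative` is made up (it also ignores query strings, fragments, ports, and credentials for brevity).

```php
<?php
// Illustrative RFC 3986-style resolution of a relative reference against
// a base URL. Hypothetical helper, not part of the Arachnid API.
function resolveRelative(string $base, string $rel): string
{
    // Already-absolute URLs pass through untouched.
    if (preg_match('#^https?://#i', $rel)) {
        return $rel;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($rel !== '' && $rel[0] === '/') {
        // Root-relative reference: replace the whole path.
        $path = $rel;
    } else {
        // Merge with the directory portion of the base path.
        $dir  = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
        $path = $dir . $rel;
    }
    // Remove "." and ".." dot segments (empty segments collapse "///" runs).
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;
        }
        if ($seg === '..') {
            array_pop($out);
            continue;
        }
        $out[] = $seg;
    }
    $resolved = $root . '/' . implode('/', $out);
    // Preserve a trailing slash from the reference.
    if ($rel !== '' && substr($rel, -1) === '/') {
        $resolved .= '/';
    }
    return $resolved;
}

// "../links.htm" steps up out of /evil/:
echo resolveRelative('http://toastytech.com/evil/', '../links.htm') . PHP_EOL;
// http://toastytech.com/links.htm
```

With the same helper, `resolveRelative('https://www.espressoweb.co.uk/', 'social-marketing')` yields `https://www.espressoweb.co.uk/social-marketing`, which is the behaviour the earlier comment expected for the bare `href="social-marketing"` case.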