Client-Based Javascript Webcrawler (Runs in the Web-Browser)


Javascript has come a long way in the past few years - including the ability to load and parse other HTML webpages. This short tutorial will take you through how to develop a webcrawler that can scrape a website for all of its links.

This can be extremely useful for:
• generating the sitemap.xml for search engines
• checking for broken links
• plotting and identifying which webpages link to which other webpages



Websites can have thousands of pages - imagine having to check each page and every sub-page link by hand.
Websites can have thousands of pages, and each page can have hundreds of links. How would you build a site list of all the pages and which pages are connected (linked) to which other pages? And what if you wanted to crawl all your webpages and check them (titles, formatting, load times, broken links..)? Webcrawlers are little scripts that help you do exactly this - and you can write one in Javascript!




This JS webcrawler will only work for webpages whose HTML is generated on the server side (not pages whose content is generated by client-side JS).

For the examples, we'll limit the webcrawler to a single website (i.e., we're not going to scrape the entire internet and external sites) - just the pages under one base url (e.g., all the pages on a site like 'www.cats.com').
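
As a minimal sketch of that restriction (the helper name and example urls below are just for illustration, not part of the crawler itself), you can compare hostnames with the built-in URL class:

// minimal sketch: keep only links whose hostname matches the site we're crawling
function sameSite(link, baseurl)
{
    try {
        const a = new URL(link, baseurl);  // resolves relative links against the base
        const b = new URL(baseurl);
        return a.hostname.replace('www.', '') == b.hostname.replace('www.', '');
    }
    catch (e) {
        return false;                      // not something we can treat as a url
    }
}

console.log( sameSite('/about.html',        'https://www.cats.com') ); // true
console.log( sameSite('https://dogs.com/x', 'https://www.cats.com') ); // false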


Basics - Getting an HTML Webpage (Loading the File)


You can grab a website's contents quickly and easily using the fetch function. For example:

let p = await fetch('https://www.news.com');
let t = await p.text();

console.log('website html:', t);


Be careful of caching when using 'fetch' - you can set the cache option (e.g., { cache: "no-store" }) as an argument if you're repeatedly loading HTML pages - otherwise it might give you back content which is out of date.
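
For instance, a small sketch of the same fetch call with the cache disabled (the url is just a placeholder):

// bypass the browser cache so repeated crawls always get fresh html
let response = await fetch('https://www.news.com', { cache: "no-store" });
let html     = await response.text();
console.log('fresh html length:', html.length);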

Hard Way


The first way of writing a webcrawler is to grab the HTML contents and perform a string search using regular expressions.

let homeurl  = 'https://cats.com';
let iurls    = [ homeurl ]; // pages waiting to be crawled (seed with the home page)
let ourls    = [];          // external links we found but won't crawl
let maxpages = 100;         // safety catch so we can't ever get stuck in an infinite loop

while (iurls.length > 0 && maxpages > 0)
{
    maxpages--;

    const iurl     = iurls.pop();
    const response = await fetch( iurl, { cache: "no-store" } );

    // response.ok     => false
    // response.status => 404

    if ( response.ok == false || response.status == 404 )
    {
        console.log('404 error: ' + iurl);
        continue;
    }

    let text = await response.text();

    // find all the links
    const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;
    let linksx = text.matchAll( linkRx );
    linksx = [...linksx];

    // this is where it gets messy
    // a url can have any of these formats:
    // https:/..
    // http:/..
    // site.com
    // index.html
    // ../index.html
    // ../../../index.html
    // ./test.php
    // mailto:me@happychicken.com

    // for testing, I'll just do a vanilla loop
    for (let kk = 0; kk < linksx.length; kk++)
    {
        let link = linksx[kk][2]; // second capture group holds the href value

        // do checks and fix the link

        if ( link.startsWith('/')  ) { link = homeurl + link;              }
        if ( link.startsWith('./') ) { link = homeurl + link.substring(1); }

        // just a page without any base (e.g., cat/cats.html - need to prepend the base url)
        if ( !link.startsWith( 'https:' ) &&
             !link.startsWith( 'http:'  ) &&
             !link.startsWith( 'www.' ) )
        {
            link = homeurl + '/' + link;
        }

        // if it isn't from this website - we skip reading its contents
        if ( !link.includes( homeurl ) )
        {
            ourls.push( link );
            continue;
        }

        // add the webpage to the input list
        iurls.push( link );

    }// end for kk
}// end while iurls
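
Note that the loop above uses await, so it has to run inside an ES module or an async function - a minimal sketch of the latter (the wrapper is an assumption about how you host the script, not part of the crawler itself):

// wrap the crawl in an async function so 'await' is legal in a classic <script> tag
(async () => {
    // ... paste the crawl loop from above in here ...
    console.log('crawl finished');
})();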


Smarter Way (Built-in Webpage Parser)


When you get the page contents, instead of processing the strings yourself, you can take advantage of the built-in XML/HTML parser - it will automatically fix up the markup and let you run searches as you would on a standard webpage (e.g., getElementsByTagName(..)).

Use the DOMParser to process the page and get the correct 'url' for each link. The secret to making this work is setting the 'base' element for the page (so any relative links are resolved against the correct page url).

let baseEl = htmlDoc.createElement('base');
baseEl.setAttribute('href', urlbase);
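
As a quick, self-contained illustration of why the base element matters (the html string and urlbase value here are made up for the example):

// relative hrefs resolve against the injected <base> element
const urlbase = 'https://cats.com/pets/';
const html    = '<a href="../index.html">home</a> <a href="tabby.html">tabby</a>';

const parser  = new DOMParser();
const htmlDoc = parser.parseFromString(html, 'text/html');

let baseEl = htmlDoc.createElement('base');
baseEl.setAttribute('href', urlbase);
htmlDoc.head.append(baseEl);

for (const a of htmlDoc.getElementsByTagName('a'))
{
    console.log(a.href); // https://cats.com/index.html and https://cats.com/pets/tabby.html
}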


There can be a CORS security issue when trying to crawl websites that the script is not hosted on - of course, one quick and easy solution is to use userscripts to run the JS on that website.

The following gives the improved working version, which crawls a specific domain for all of its links.

let homeurl  = 'cats.com';
let iurls    = [ 'https://' + homeurl ]; // pages waiting to be crawled (seed with the home page)
let ourls    = [];                       // pages we have already visited
let maxiters = 100;                      // safety catch so we can't ever get stuck in an infinite loop

while (iurls.length > 0 && maxiters > 0)
{
    maxiters--;

    let iurl = iurls.pop();
    iurl     = iurl.trim();
    if ( ourls.includes( iurl ) ) { console.log('repeat:', iurl); continue; }

    if ( !iurl.includes( 'https://' + homeurl ) ) { continue; } // only recurse this address

    console.log('iteration:', maxiters, 'url:', iurl, 'num out:', ourls.length, 'remaining:', iurls.length);

    // get contents and find any child links
    const response = await fetch( iurl, { cache: "no-store" } );

    // response.ok     => false
    // response.status => 404

    if ( response.ok == false || response.status == 404 )
    {
        console.log('404 error: ' + iurl);
        continue;
    }

    ourls.push( iurl );

    let text = await response.text();

    iurl = iurl.replace('www.' + homeurl, homeurl); // e.g., www.cats.com to cats.com

    const url     = iurl;
    const domain  = new URL(url);
    const urlhost = domain.protocol + '//' + domain.hostname;
    const urlbody = domain.pathname.substring(0, domain.pathname.lastIndexOf('/') + 1);
    const urlbase = urlhost + urlbody;

    var parser  = new DOMParser();
    var htmlDoc = parser.parseFromString(text, 'text/html');
    let baseEl  = htmlDoc.createElement('base');
    baseEl.setAttribute('href', urlbase);
    htmlDoc.head.append(baseEl);
    var linksx  = htmlDoc.getElementsByTagName('a');
    linksx = [...linksx];

    console.log('url:', url);
    console.log('  friendly url:', urlhost + domain.pathname);
    console.log('  url base:',     urlbase);
    console.log('    num links:',  linksx.length);

    for (let bb = 0; bb < linksx.length; bb++)
    {
        let link = linksx[bb].href;

        if ( !link.includes( 'https://' + homeurl ) ) { continue; }

        if ( ourls.includes( link ) )
        {
            console.log('    link ', bb, ':', link, ' (already in list)');
            continue;
        }
        if ( iurls.includes( link ) )
        {
            console.log('    link ', bb, ':', link, ' (in wait list)');
            continue;
        }
        console.log('    new link ', bb, ':', link);

        link = link.trim();

        link = link.replaceAll('http://', 'https://');

        iurls.push( link );
    }// end for bb

}// end iteration loop

console.log('');
console.log('number unique urls found:', ourls.length);
console.log('list of unique urls:');
console.log( ourls.join("\n") );


Building a SiteMap.xml


Once you've crawled the website, you've got all the URLs and webpages - so you can generate the sitemap.xml that search engines usually want.

// build the '<?xml .. ?>' header from pieces
let xmlsitemap = '<' + '?' + 'xml version="1.0" encoding="UTF-8"' + '?' + '>';

xmlsitemap += `
<!-- generator="SimpleSitemapGenerator/1.2.0" -->
<!-- sitemap-generator-url="https://www.xbdev.net"
     sitemap-generator-version="1.2.0" -->
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
`;

ourls.forEach( (uu)=>{
    uu = uu.replaceAll('&', '&amp;');
    xmlsitemap += `<url><loc>${uu}</loc>
                        <lastmod>2022-10-21</lastmod>
                   </url>\n`;
});
xmlsitemap += `</urlset>`;
xmlsitemap += "\n";

xmlsitemap = xmlsitemap.replaceAll('http://', 'https://');

// add a download button so you can download the sitemap after it's finished
const blob = new Blob([ xmlsitemap ], {type: 'text/xml'});
const elem = window.document.createElement('a');
document.body.appendChild( elem );
elem.innerHTML = 'Download Sitemap.xml';
elem.href      = window.URL.createObjectURL(blob);
elem.download  = 'sitemap.xml';
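
If you'd rather not wait for a click, you can (optionally) trigger the download yourself and release the blob url once the browser has had a moment to start it - a small sketch:

// optional: start the download automatically, then free the blob url
elem.click();
setTimeout( ()=>{ window.URL.revokeObjectURL(elem.href); }, 1000 );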



Things to Try

• Develop a visualization resource showing which pages are linked to which other pages
• Create a table which lists all the pages and how many links are on each page
• Extract other information for each page (e.g., h1 tag, title, metadata, number of words, ...)
• Use a webworker to run the process in the background - so it doesn't lock up when the page loses focus
• Asynchronous non-blocking version (crawl multiple pages at the same time) - see the sketch after this list
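
For the last item, here is a minimal sketch of fetching a batch of waiting urls concurrently with Promise.all (the helper name and batch size are assumptions, not part of the crawler above):

// fetch several waiting urls at the same time instead of one after another
async function crawlBatch(iurls, batchSize = 5)
{
    const batch = iurls.splice(0, batchSize);   // take up to batchSize urls off the wait list

    const pages = await Promise.all( batch.map( async (u)=>{
        const r = await fetch(u, { cache: "no-store" });
        return { url: u, ok: r.ok, html: r.ok ? await r.text() : '' };
    }) );

    for (const page of pages)
    {
        if (!page.ok) { console.log('failed:', page.url); continue; }
        console.log('fetched:', page.url, 'length:', page.html.length);
        // ... parse page.html with DOMParser and push any new links onto iurls ...
    }
}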
