
Googlebot & Site Crawl

Getting Googlebot (Google) to Crawl Your Site

 

Googlebot is Google's web crawling bot, or spider. It collects data from web pages to build a searchable index for the Google search engine. Crawling is simply the process by which Googlebot visits new and updated pages; it uses algorithmic programs to determine which sites to crawl, how often, and how many pages to fetch from each site.

 

As Googlebot visits a website, it detects links (src and href attributes) on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
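
For example, links in markup such as the following (the URLs here are purely hypothetical illustrations) would be picked up and queued for crawling:

<a href="http://www.domainname.com/products.html">Products</a>
<img src="http://www.domainname.com/images/logo.png" alt="Logo" />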

 

If a webmaster wishes to control the information on their site that is available to Googlebot, they can do so with the appropriate directives in a robots.txt file, or by adding the meta tag

 

<meta name="Googlebot" content="nofollow" />

 

to the web page.
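
The robots.txt alternative, for example to keep Googlebot away from the entire site (a minimal sketch; replace / with a specific path to restrict only part of the site), looks like this:

User-agent: Googlebot
Disallow: /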

 

Once you’ve created your robots.txt file, there may be a small delay before Googlebot discovers your changes.

 

Googlebot discovers pages by visiting all of the links on every page it finds, and then follows those links to other web pages. New web pages must be linked from other known pages on the web in order to be crawled and indexed, or be submitted manually by the webmaster.

 

How to Control or Stop Search Engines from Crawling Your Website Using robots.txt

Website owners can instruct search engines on which pages to crawl and index; they can use a robots.txt file to do so.

When a search engine robot wants to visit a website URL, say http://www.domainname.com/index.html (as defined in the directory index), it first checks http://www.domainname.com/robots.txt and looks to see if there are specific directives to follow. Let's suppose it finds the following code in the robots.txt file.

User-agent: *
Disallow: /

 

The "User-agent: *" means this is a directive for all robots. The * symbol means all.
The "Disallow: /" tells the robot that it should not visit any pages on the site.

 

Important considerations when using a robots.txt file:

1) Robots that choose to follow the instructions look for this file and read the directives before visiting the website. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.

2) A robots.txt file on a website functions as a request that the specified robots ignore the specified files or directories during a crawl.

3) For websites with multiple subdomains, each subdomain must have its own robots.txt file. If domainname.com had a robots.txt file but sub.domainname.com did not, the rules that apply for domainname.com would not apply to sub.domainname.com.

4) The robots.txt file is available to the public to view. Anyone can see what sections of your server you don’t want robots to use.

5) Robots can ignore your /robots.txt.

6) Your robots.txt file should be in the root of your domain. In our server's configuration this would be the public_html folder in your account. If your domain is "domainname.com" then bots will look for the file at http://domainname.com/robots.txt. If you have add-on domains and want to use a robots.txt file in those as well, you will need to place a robots.txt file in the folder you specified as the root for the add-on domain.

 

Some examples:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

 

In this example, the site owner tells ALL robots (remember, the * means all) not to crawl four directories on the site (cgi-bin, images, tmp, private). If you do not specify files or folders to be excluded, it is understood that the bot then has permission to crawl those items.

 

To exclude ALL bots from crawling the whole server.
User-agent: *
Disallow: /

 

To allow ALL bots to crawl the whole server.
User-agent: *
Disallow:

 

To exclude A SINGLE bot from crawling the whole server.
User-agent: BadBot
Disallow: /

 

To allow A SINGLE bot to crawl the whole server.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

 

To exclude ALL bots from crawling the ENTIRE server except for one file.
This is a little tricky, since the original robots.txt standard has no 'allow' directive. What you have to do is simply place all the files you do not want crawled into one folder, and then leave the file to be crawled above it. So if we placed all the files we didn't want crawled in a folder called SCT, we'd write the robots.txt rule like this.

 

User-agent: *
Disallow: /SCT

 

Or you can exclude each individual page like this.
User-agent: *
Disallow: /SCT/home.html
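
Note that some major crawlers, including Googlebot, also support a non-standard Allow directive, so an alternative sketch (not every robot will honor it) is to allow the single file and disallow everything else:

User-agent: *
Allow: /home.html
Disallow: /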

 

To create a crawl delay (in seconds) for the whole server. Note that not all crawlers honor the Crawl-delay directive; Googlebot, for example, ignores it.
User-agent: *
Crawl-delay: 10

 

If you wish to block a single page from being indexed, you can add a robots <meta> tag to that page, for example:
<meta name="robots" content="noindex" />

You can learn more about the robots.txt file at http://www.robotstxt.org/

 

Software Development Life Cycle

Software Development Life Cycle Phases

 

SDLC (Software Development Life Cycle) is a conceptual model, or detailed plan, for how to create, develop, implement, and launch software; it describes the stages involved in an information system development project.

 

There are six steps or stages in the SDLC:

 

1. System Requirements Analysis
2. Feasibility study
3. Systems Analysis & Design
4. Code Generation
5. Testing
6. Maintenance

 

System Requirements Analysis

This stage includes a detailed study of the business needs of the application. Design work here covers several levels:

High-level design (what programs are needed and how will they be used?)
Low-level design (how will the individual programs work?)
Interface design (what will the interfaces look like?)
Data design (what data will be required?)

Analysis and design are crucial to the whole development cycle, and much care must be taken during this phase.

 

Feasibility Study

The feasibility study is used to determine whether the project should get the go-ahead, and a well-defined project scope is prepared. If the project is to proceed, the feasibility study will produce a project plan and budget estimates for the future stages of development.

 

Systems Analysis and Design

Here the design of the system that is going to be developed is worked out. In other words, database design, the design of the chosen architecture (frameworks), functional specification design, low-level design documents, high-level design documents (SRS), and so on are produced. Care must be taken in preparing these design documents, because the next phase, the development phase, is based on these documents and designs.

 

Code Generation or Implementation

In this phase the designs are translated into code. Computer programs are written using a conventional programming language. Different high-level programming languages such as C, C++, PHP, and Java are used for coding. Depending on the type of application, the right programming language is chosen to reduce time and cost.

 

Testing

In this phase the application is tested. Normally programs are written as a series of individual modules, and these are subjected to separate, detailed tests (module or unit tests). The separate modules are then brought together and tested as a complete system. Software or a system that is not tested would be treated as poor quality.
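
As a minimal illustration of a module (unit) test, here is a hypothetical sketch in Java (one of the languages mentioned above); the class and method names are invented purely for this example:

// A small module: one unit of functionality that can be tested on its own.
class PriceCalculator {
    // Applies a percentage discount to a price and returns the new price.
    static double applyDiscount(double price, double discountPercent) {
        return price - (price * discountPercent / 100.0);
    }
}

public class PriceCalculatorTest {
    public static void main(String[] args) {
        // Module (unit) test: check one piece of functionality in isolation.
        double result = PriceCalculator.applyDiscount(200.0, 10.0);
        if (Math.abs(result - 180.0) < 1e-9) {
            System.out.println("Module test passed");
        } else {
            System.out.println("Module test failed: expected 180.0, got " + result);
        }
    }
}

Once each module passes its own tests like this one, the modules are combined and the same kind of checks are run against the system as a whole.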

 

Maintenance & Enhancement

Every application will need maintenance. Software will definitely undergo change once it is delivered to the end customer. There may be many reasons for the change: it could happen because of unexpected input values into the system or the addition of some functionality to the system. The software should be developed to accommodate such changes.