Preparing Your ASP.NET Site for Crawling
By Russ Nemhauser
Published: 4/2/2003
Reader Level: Beginner

In this article we'll explore a handful of things you can do to make your web site crawler-friendly. With a few simple steps your site can take full advantage of the Microsoft Search service that comes with SharePoint Portal Server or Site Server 3.0. The same steps also help prepare your site for the large public search engines.

When your enterprise uses a web crawler to help provide search capability, you might consider creating a special page that only your crawler will access. This single page can provide a list of hyperlinks to be followed by the crawler so only the content you specify is indexed and returned when someone does a search. The crawler can be configured to only follow hyperlinks one level deep.

To provide an example, let's assume we all work on a fictitious application for a company called "Global Rentals", or "Global" for short. Global leases and manages commercial real estate, so their corporate intranet is their gateway to tenant and building information.

Global owns 1,000 buildings with 7,500 total tenants. It stands to reason that their intranet has a page called building.aspx and a page called tenant.aspx, which display all available information about buildings and tenants, respectively.

The building managers at Global often search the intranet for information, so let's return to the concept of a crawl start page. This page, which we'll call crawl.aspx, will be plain and simple. It will provide a link to every building and tenant that Global has in their database. To do this we'll call two stored procedures in Global's property management database. These sprocs will return the information necessary to build a link to each of our two pages. We'll create a SqlDataReader to loop through the records returned and call Response.Write to output our links to the page.

C#

private void Page_Load(object sender, System.EventArgs e)
{
    SqlConnection conn = new SqlConnection(ConfigurationSettings.AppSettings["ConnectionString"]);
    SqlCommand cmd = new SqlCommand("s_BuildingList", conn);
    cmd.CommandType = CommandType.StoredProcedure;
    conn.Open();
    SqlDataReader sdr = cmd.ExecuteReader(CommandBehavior.CloseConnection);
    while (sdr.Read())
    {
        Response.Write("<a href=\"building.aspx?buildingid=" + sdr["BuildingID"] + "\">"
            + sdr["BuildingName"] + "</a><br>");
    }
    sdr.Close();
}

VB

Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    Dim conn As SqlConnection = New SqlConnection(ConfigurationSettings.AppSettings("ConnectionString"))
    Dim cmd As SqlCommand = New SqlCommand("s_BuildingList", conn)
    cmd.CommandType = CommandType.StoredProcedure
    conn.Open()
    Dim sdr As SqlDataReader = cmd.ExecuteReader(CommandBehavior.CloseConnection)
    While sdr.Read()
        ' Write one link per building; ToString() keeps this valid under Option Strict
        Response.Write("<a href=""building.aspx?buildingid=" & sdr("BuildingID").ToString() & """>" _
            & sdr("BuildingName").ToString() & "</a><br>")
    End While
    sdr.Close()
End Sub
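The loops above write database values straight into the page. If a building or tenant name contains a character such as & or <, the resulting link may be malformed. The following is a minimal sketch of a link-building helper that encodes its inputs and can be reused for both buildings and tenants. The class name CrawlLinks, the method BuildLink, and the "tenantid" query-string key are assumptions made for illustration and are not part of the original sample.

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the original article): builds one
// crawl-page link and encodes the database values before output.
public static class CrawlLinks
{
    // page:    target page, e.g. "building.aspx"
    // idParam: query-string key, e.g. "buildingid" (or an assumed "tenantid")
    // id:      key value read from the data reader
    // text:    link text, e.g. the building name
    public static string BuildLink(string page, string idParam, object id, string text)
    {
        // Escape the id for use in a URL, then HTML-encode the whole href.
        string href = page + "?" + idParam + "=" + Uri.EscapeDataString(Convert.ToString(id));
        return "<a href=\"" + WebUtility.HtmlEncode(href) + "\">"
            + WebUtility.HtmlEncode(text) + "</a><br>";
    }
}
```

Inside the reader loop, the call might look like Response.Write(CrawlLinks.BuildLink("building.aspx", "buildingid", sdr["BuildingID"], sdr["BuildingName"].ToString())).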

We would obviously create a second SqlCommand and SqlDataReader to output the links to all of our tenants as well. But there is one more thing that we'll do in the Page_Load for crawl.aspx: set Session("crawling") = "true" in VB, or Session["crawling"] = "true" in C#. This flag gives us flexibility on building.aspx and tenant.aspx, because we can choose whether or not to render certain parts of the page when the crawler is the one loading it. For example, we might have an exhaustive JavaScript menu system that navigates through our whole intranet. None of that HTML needs to be indexed and included in our search catalog, so we can use the session variable as a flag telling us whether or not to render it.
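As a rough sketch of how building.aspx or tenant.aspx might read that flag, the check can be isolated in a small helper. The names CrawlFlag, IsCrawling, and navMenu are hypothetical. Note that the flag only reaches the detail pages if the crawler stays in the same ASP.NET session between requests (typically by returning the session cookie), so it is worth verifying that behavior for your crawler.

```csharp
using System;

// Hypothetical helper (not from the original article): interprets the
// value stored under Session["crawling"] by crawl.aspx.
public static class CrawlFlag
{
    // sessionValue is whatever Session["crawling"] returns. It is null
    // when crawl.aspx has not been requested in the current session.
    public static bool IsCrawling(object sessionValue)
    {
        // "as string" yields null for non-string values, which compares as false.
        return string.Equals(sessionValue as string, "true",
            StringComparison.OrdinalIgnoreCase);
    }
}

// Possible use in the Page_Load of building.aspx, where navMenu is an
// assumed server control that wraps the JavaScript menu:
//     navMenu.Visible = !CrawlFlag.IsCrawling(Session["crawling"]);
```

Keeping the test in one place means both detail pages treat a missing or unexpected session value the same way: the full page is rendered.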

Now that we have our start page, we'll need to take care of a couple of things on the pages that we're having crawled. First, each page will need to have a descriptive title. "Tenant Detail" or "Building Detail" don't really offer us much. With that in mind, we'll alter the <title> tag on building.aspx.

HTML

<title id="title" runat="server">Building Detail</title>

This turns our HTML <title> element into a server-side control that we can program against in our code-behind.

C#

protected System.Web.UI.HtmlControls.HtmlGenericControl title;

VB

Protected WithEvents title As System.Web.UI.HtmlControls.HtmlGenericControl

On our building.aspx page we might be getting information from a business object or a database, so we'll set our page's title to the name of the building. For this example, we'll assume the building information was retrieved through the use of output parameters being passed to a stored procedure.

C#

title.InnerText = "Building Profile - " + prmBuildingName.Value;

VB

title.InnerText = "Building Profile - " & prmBuildingName.Value

Now our new, descriptive titles will show up when our users do a search. In most search results, the title is what is shown as the hyperlink to the page. But the title is only the first of at least two important things that we need to take care of on pages that will be crawled. The second is a description <meta> tag.

If we do not provide a description for our page in the form of a <meta> tag, most crawlers will just display the first several words of content that they find on the page. This could include field labels in an HTML table, heading text, or any other content that the crawler can read. What we'll do is set the contents of the description <meta> tag at run time based on the content retrieved from the database. This lets us provide useful information about the building or tenant directly beneath the page's title in the search results, so our users can better choose which result to navigate to. For example, if Global has forty Jiffy Lube tenants in forty different centers, showing the address information in the tenant.aspx description meta tag might prove useful.

HTML

<meta name="description" content="description" id="metaDescription" runat="server" />

We'll reference this <meta> tag in our code-behind just like we did the <title> tag.

C#

protected System.Web.UI.HtmlControls.HtmlGenericControl metaDescription;

VB

Protected WithEvents metaDescription As System.Web.UI.HtmlControls.HtmlGenericControl

In our Page_Load event (assuming we used stored procedure output parameters to retrieve a single tenant's information) we'll build a string for our dynamic page description:

C#

string desc;
desc = prmBuildingName.Value + " - " + prmAddress.Value + " " + prmCity.Value + ", "
    + prmState.Value + " " + prmZip.Value;
// C# indexers use square brackets, not parentheses
metaDescription.Attributes["content"] = desc;

VB

Dim desc As String
desc = prmBuildingName.Value & " - " & prmAddress.Value & " " & prmCity.Value & ", " _
    & prmState.Value & " " & prmZip.Value
metaDescription.Attributes("content") = desc

The same logic could be followed to provide the keywords <meta> tag content.
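One way the keywords content might be assembled is sketched below. It joins whichever field values are available into a comma-separated list, skipping empty and repeated values. The class MetaKeywords, its Build method, and the metaKeywords control named in the comment are assumptions for illustration. The article does not specify which fields belong in the keywords tag.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original article): builds the
// content string for a keywords <meta> tag from a set of field values.
public static class MetaKeywords
{
    public static string Build(params object[] values)
    {
        List<string> parts = new List<string>();
        foreach (object value in values)
        {
            // Convert.ToString returns an empty string for null and DBNull.
            string text = Convert.ToString(value).Trim();
            if (text.Length > 0 && !parts.Contains(text))
            {
                parts.Add(text);
            }
        }
        return string.Join(", ", parts.ToArray());
    }
}

// Possible use in Page_Load, assuming a server-side <meta> tag declared
// with id="metaKeywords" in the same way as metaDescription:
//     metaKeywords.Attributes["content"] =
//         MetaKeywords.Build(prmBuildingName.Value, prmCity.Value, prmState.Value);
```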

Results

What we should now have are two pages that present themselves well when a web crawler indexes them. Our new page titles will help users recognize which building or tenant page they are about to click on, and our new description <meta> tags will provide a brief description of the specific building or tenant that the page will display.



Copyright © 2007 CMP Tech LLC