Preparing Your ASP.NET Site for Crawling
In this article we'll explore a handful of things you can do to make your web site
crawler-friendly. With a few simple steps your site can take full advantage of the Microsoft
Search service that comes with SharePoint Portal Server or Site Server 3.0. The same steps
also prepare it for the large search engines.
When your enterprise uses a web crawler to help provide search capability, you might consider
creating a special page that only your crawler will access. This single page can provide a
list of hyperlinks to be followed by the crawler so only the content you specify is indexed
and returned when someone does a search. The crawler can be configured to only follow
hyperlinks one level deep.
To provide an example, let's assume we all work on a fictitious application for a company
called "Global Rentals", or "Global" for short. Global leases and manages commercial
real estate, so their corporate intranet is their gateway to tenant and building information.
Global owns 1,000 buildings with 7,500 total tenants. It stands to reason that their
Intranet has a page called building.aspx and a page called tenant.aspx, which
display all available information about buildings and tenants, respectively.
The building managers at Global often search the Intranet for information, so let's return
to the concept of a crawl start page. This page, which we'll call crawl.aspx, will
be plain and simple. What we will do is provide a link to every building and tenant that
Global has in their database. To do this we'll call two stored procedures in Global's
property management database. These sprocs will return the information necessary to provide
a link to our two pages. We'll create a SqlDataReader to loop through the records
returned and call Response.Write to output our links to
the page.
C#
// Requires: using System.Configuration; using System.Data; using System.Data.SqlClient;
private void Page_Load(object sender, System.EventArgs e)
{
    SqlConnection conn = new SqlConnection(ConfigurationSettings.AppSettings["ConnectionString"]);
    SqlCommand cmd = new SqlCommand("s_BuildingList", conn);
    cmd.CommandType = CommandType.StoredProcedure;
    conn.Open();

    // CloseConnection closes conn as soon as the reader is closed.
    SqlDataReader sdr = cmd.ExecuteReader(CommandBehavior.CloseConnection);
    while (sdr.Read())
    {
        // One link per building; the crawler follows each one to building.aspx.
        Response.Write("<a href=\"building.aspx?buildingid=" + sdr["BuildingID"] + "\">"
            + sdr["BuildingName"] + "</a><br>");
    }
    sdr.Close();
}
VB
' Requires: Imports System.Configuration, System.Data, System.Data.SqlClient
Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    Dim conn As SqlConnection = New SqlConnection(ConfigurationSettings.AppSettings("ConnectionString"))
    Dim cmd As SqlCommand = New SqlCommand("s_BuildingList", conn)
    cmd.CommandType = CommandType.StoredProcedure
    conn.Open()

    ' CloseConnection closes conn as soon as the reader is closed.
    Dim sdr As SqlDataReader = cmd.ExecuteReader(CommandBehavior.CloseConnection)
    While sdr.Read
        ' One link per building; the crawler follows each one to building.aspx.
        Response.Write("<a href=""building.aspx?buildingid=" & sdr("BuildingID") & """>" _
            & sdr("BuildingName") & "</a><br>")
    End While
    sdr.Close()
End Sub
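The loop above writes database values straight into the markup. As a hedged sketch, the link-building step could be pulled into a small helper that encodes the values first, so that a building name containing a character such as & or < still produces valid HTML. The class name CrawlLinks and the method name BuildLink are hypothetical and not part of the original page; HttpUtility is the standard class in System.Web.

```csharp
using System;
using System.Web;

// Hypothetical helper (an assumption for illustration, not the article's code).
public static class CrawlLinks
{
    // Builds one crawlable link such as
    // <a href="building.aspx?buildingid=12">Name</a><br>
    public static string BuildLink(string page, string idName, object id, object text)
    {
        // URL-encode the id value, then attribute-encode the whole href.
        string url = page + "?" + idName + "=" + HttpUtility.UrlEncode(Convert.ToString(id));

        // HTML-encode the visible text so & and < cannot break the markup.
        return "<a href=\"" + HttpUtility.HtmlAttributeEncode(url) + "\">"
            + HttpUtility.HtmlEncode(Convert.ToString(text)) + "</a><br>";
    }
}
```

Inside the while loop, the call would then read Response.Write(CrawlLinks.BuildLink("building.aspx", "buildingid", sdr["BuildingID"], sdr["BuildingName"])).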
We would obviously create a SqlCommand and SqlDataReader to output the links to all of
our tenants as well. But there is one more thing that we'll do in the Page_Load for
crawl.aspx - the code Session("crawling") = "true" or
Session["crawling"] = "true", depending on your language of
choice. This flag gives us flexibility on building.aspx and tenant.aspx: we can choose whether
or not to render certain parts of each page. That is useful when we want to leave some content
out of the index because the crawler is the one loading the page. For example, we might have
an exhaustive JavaScript menu system that navigates through
our whole Intranet. All this HTML would not need to be indexed and included in our search
catalog, so we could use this session variable as a flag telling us whether or not to render
it.
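As a minimal sketch of how building.aspx or tenant.aspx might read that flag, the check can be isolated in a small helper. The class name CrawlFlag and the control name navMenu used in the usage note are assumptions for illustration only.

```csharp
using System;

// Hypothetical helper (an assumption for illustration, not the article's code).
public static class CrawlFlag
{
    // sessionValue is whatever Session["crawling"] returns:
    // null when crawl.aspx was never loaded in this session.
    public static bool IsCrawling(object sessionValue)
    {
        string flag = sessionValue as string;

        // Case-insensitive comparison against the value set on crawl.aspx.
        return flag != null && string.Compare(flag, "true", true) == 0;
    }
}
```

In the Page_Load of building.aspx, a line such as navMenu.Visible = !CrawlFlag.IsCrawling(Session["crawling"]); would then hide the menu for the crawler, where navMenu stands for whatever server control wraps the menu markup. Note that ASP.NET session state is tied to a session cookie by default, so this approach relies on the crawler returning that cookie on the pages it follows; that is worth verifying for your crawler.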
Now that we have our start page, we'll need to take care of a couple of things on the
pages that we're having crawled. First, each page will need to have a descriptive title.
"Tenant Detail" or "Building Detail" don't really offer us much. With that in mind, we'll
alter the <title> tag on building.aspx.
HTML
<title id="title" runat="server">Building Detail</title>
This turns our HTML <title> element into a server-side
control that we can program against in our code-behind.
C#
protected System.Web.UI.HtmlControls.HtmlGenericControl title;
VB
Protected WithEvents title As System.Web.UI.HtmlControls.HtmlGenericControl
On our building.aspx page we might be getting information from a business object or
a database, so we'll set our page's title to the name of the building. For this example,
we'll assume the building information was retrieved through
the use of output parameters being passed to a stored procedure.
C#
title.InnerText = "Building Profile - " + prmBuildingName.Value;
VB
title.InnerText = "Building Profile - " & prmBuildingName.Value
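The prmBuildingName parameter used above is not shown being created. A minimal sketch of how such an output parameter could be set up follows; the stored procedure name s_BuildingDetail, the parameter names @BuildingID and @BuildingName, and the VarChar(100) size are all assumptions, since the article does not show Global's actual procedure.

```csharp
using System.Data;
using System.Data.SqlClient;

// Hypothetical helper (an assumption for illustration, not the article's code).
public static class BuildingDetailCommand
{
    public static SqlCommand Create(SqlConnection conn, int buildingId)
    {
        // "s_BuildingDetail" is an assumed stored procedure name.
        SqlCommand cmd = new SqlCommand("s_BuildingDetail", conn);
        cmd.CommandType = CommandType.StoredProcedure;

        // Input parameter: which building to look up.
        cmd.Parameters.Add("@BuildingID", SqlDbType.Int).Value = buildingId;

        // Output parameter: filled in by the procedure when it executes.
        SqlParameter prmBuildingName =
            cmd.Parameters.Add("@BuildingName", SqlDbType.VarChar, 100);
        prmBuildingName.Direction = ParameterDirection.Output;

        return cmd;
    }
}
```

After opening the connection and calling cmd.ExecuteNonQuery(), the value is available as cmd.Parameters["@BuildingName"].Value and can be assigned to title.InnerText as shown above.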
Now our new, descriptive titles will show up when our users do a search. The title is
usually what is provided as a hyperlink to the page in most search results. But the title
is only the first of at least two important things that we need to take care of on pages that
will be crawled. The second is a description <meta> tag.
If we do not provide a description for our page in the form of a <meta> tag, most crawlers
will just display the first several words of content that they find on the page. This could
include field labels in an HTML table, heading text, or any other content that the crawler
can read. What we'll do is set the contents of the description <meta> tag at run time based
on the content retrieved from the database. This will really let us provide useful information
about the building or tenant directly beneath the page's title in the search results.
Hopefully our users will be able to use this information to better choose what result to
navigate to. For example, if Global has forty Jiffy Lube tenants in forty different centers,
showing the address information in the tenant.aspx description meta tag might prove useful.
HTML
<meta name="description" content="description" id="metaDescription" runat="server" />
We'll reference this <meta> tag in our code behind just like
we did the <title> tag.
C#
protected System.Web.UI.HtmlControls.HtmlGenericControl metaDescription;
VB
Protected WithEvents metaDescription As System.Web.UI.HtmlControls.HtmlGenericControl
In our Page_Load event (assuming we used stored procedure
output parameters to retrieve a single tenant's information) we'll build a string for our
dynamic page description:
C#
string desc;
desc = prmBuildingName.Value + " - " + prmAddress.Value + " " + prmCity.Value + ", "
    + prmState.Value + " " + prmZip.Value;
// C# indexers use square brackets, not parentheses.
metaDescription.Attributes["content"] = desc;
VB
Dim desc As String
desc = prmBuildingName.Value & " - " & prmAddress.Value & " " & prmCity.Value & ", " _
& prmState.Value & " " & prmZip.Value
metaDescription.Attributes("content") = desc
The same logic could be followed to provide the keywords
<meta> tag content.
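As a hedged sketch of that keywords case, the helper below joins whatever values are available into a comma-separated list and skips anything null, DBNull, or blank. The class name MetaKeywords and the control id metaKeywords in the usage note are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (an assumption for illustration, not the article's code).
public static class MetaKeywords
{
    // Joins the non-empty values with ", " for use as keywords content.
    public static string Build(params object[] values)
    {
        List<string> parts = new List<string>();
        foreach (object value in values)
        {
            // Output parameters can come back as null or DBNull.
            if (value == null || value == DBNull.Value) continue;

            string text = value.ToString().Trim();
            if (text.Length > 0) parts.Add(text);
        }
        return string.Join(", ", parts.ToArray());
    }
}
```

With a <meta name="keywords" id="metaKeywords" runat="server" /> tag declared the same way as metaDescription, Page_Load could then set metaKeywords.Attributes["content"] = MetaKeywords.Build(prmBuildingName.Value, prmCity.Value, prmState.Value);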
Results
What we should now have are two pages that present themselves well when a web
crawler crawls them. Our new page titles will help users recognize which building or tenant
page they are about to click on, and our new description <meta> tags will provide a brief
description of the specific building or tenant that the page will display.