Instructional Technology Portfolio | Design and Development Tools | Project
Login   |  Résumé   |  IT Portfolio   |  Home

Project Entries


Entry 1

Veterinary hospitals track patient demographics that include everything from name of the patient and weight to species and breed and sex and color of the animal. However, every hospital seems to have its own list of species and breeds and the various standard encoding repositories tend to focus on breeds of only one species. Not only are most registries focused on one species, but they often are not comprehensive enough to include all breeds a hospital may see. A prime example is canine species and the different breeds as recognized by the American Kennel Club and the United Kennel Club. Although these two have many overlapping entries, both have entries not found in the other registry, but which are equally valid from the hospital's viewpoint.

There are an amazing number of recognized breeds in the world and many breeds are backed by centuries of breeding records and tracking while many others are recent newcomers only lately recognized by the well established registries. Other breeds are not recognized by the more reputable registries, but nonetheless, have a large following and so deserve some sort of recognition, if only to note their relative lack of legitimacy.

This project has its roots in my extensive work with the UGA veterinary teaching hospital to bring together a list of all species and breeds the hospital regularly sees or might see in the future. The data found here-in begins the initial lists I began to encode with the SNOMED repository and brings together even more species and breeds discovered as I combed the Internet for more detailed information about each species I was working with.

For more on Canine Kennel Clubs, we'll begin with the most well known:

Google's National Kennel Clubs


Entry 2

I have been giving a lot of thought to the underlying technical aspects, especially the database schema how to best capture and represent the data and concepts in the database.

The first step is coming up with some basic terminology around which I am going to think of things and probably one of the best ways of collecting and organizing disperse data is the structure I encountered with the SNOMED® project. The SNOMED® project organizes some 400,000 terms and distinctions around what they call a "concept." Each concept represents something very distinct in the database. That is, there is no two concepts in the database that means the same thing.

So, "concept" is to represent each distinct element in the taxonomy while "description" is to represent syonymns, alternate spellings, and such that are representative of the particular concept. Every concept will have a type associated with it that places it at a specific level in the taxonomy (i.e. Kingdom, Species, Class). Concepts will also have statuses associated with them to indicate whether they are current, obsolete, or duplicate of another concept in the taxonomy. For the web site, the status won't be all that important as only current concepts will be displayed. However, if a web-service is built into the web-site so that data can be served to various clients, then this status becomes important to helping distributed databases maintain integrity in referencing this taxonomy.

To handle concepts that have more than one name (such as a species having several common names and a scientific name), the SNOMED® project introduces a descriptions table that ties back to the concepts table. Additionally, concept A can be related to concept B in some fashion to represent either hierarchy or some other relationship (i.e. "is-a", "same as", "causitive agent").

As good as the organization of SNOMED® is, it falls short of accurately capturing taxonomy data due to the overly broad scope that it attempts to define and encode. For example, you cannot easily distinguish terms by type such as scientific name vs. common name. You can, however, have have names that are preferred terms in one language, but not in another, so I will keep this idea in mind to help make this project a multi-lingual capable project down the road. For now, everything will be US-English (i.e. will not be adding UK-English distinctions such as "colour" at this time). Additionally, SNOMED® does not distinguish concept types such as family, genus, species, etc.

A second site that will be heavily modelled after is the Interagency Taxonomic Information System or ITIS. Because such a good site for taxonomic information already exists, this particular project will be primarily focused on filling out those particular branches where the breed is important and for which there are recognized standards in wide-spread use. In other words, this is not an attempt to recreate a comprehensive database of taxonomic information, but that doesn't mean that as others find the site useful and become involved in adding new information to the site that it doesn't grow into something much more comprehensive.

The vision is that the surfer will have flexibility in being able to "switch views" on the data so that they can browse by scientific names or by common names. Ultimately, it would be interesting to see if preferred terms (that is, those common names most widely used) can be localized even further so that some one in the Southern United States sees "Possum" rather than someone in Canada seeing "Opposum."

Perhaps most importantly, unlike most other taxonomy sites out there, which are strictly static sites that are highly controlled by a limited number of people, the aim of this particular site will be to make it an site that is updatable by anyone. I believe that if many people become involved in maintaining the areas they best understand and know, then the site as a whole gyrates towards a very accurate representation with a minimal of maintenance work on the part of the host administrator (i.e. me).

Many of the taxonomies existing on the web offer up a lot of details on each species, including full taxonomy, descriptions and history of the species, geographical and habitat background, and so on. My initial plan is to simply accurately capture taxonomy hierarchy along with scientific names, common names, and other synonyms. Eventually, I would like to find a way to leverage existing information by hopefully capturing reference links during initial build of the database and then finding ways to link to the wealth of data out there similar to the way search directories work today. Of course, of paramont importance is to always give credit to the source of the information. If sites encountered prove highly useful and also extremely well structured, I will seek to work with the site maintainers on some sort of partnership arrangement to display their data through this site.

With these thoughts in mind, the following initial data schema comes to mind:

create table concept_type (
   typeid            integer  not null constraint concept_type_pk primary key,
   description       varchar(50)
);

insert into concept_type values (0, Kingdom); insert into concept_type values (1, Phylum); insert into concept_type values (2, Class); insert into concept_type values (3, Order); insert into concept_type values (4, Family); insert into concept_type values (5, Genus); insert into concept_type values (6, Species); insert into concept_type values (7, Breed); insert into concept_type values (8, Sub-phylum); insert into concept_type values (9, Sub-class); insert into concept_type values (10, Sub-family); insert into concept_type values (11, Sub-genus); insert into concept_type values (12, Sub-species);

create table concept_status ( statusid integer not null constraint concept_status_pk primary key, description varchar(50) );

insert into concept_status values (0, current); insert into concept_status values (1, obsolete); insert into concept_status values (2, duplicate);

create table concepts ( conceptid integer not null constraint concepts_pk primary key, typeid integer not null concepts_concepttypeid_fk references concept_type(concepttypeid) on delete restrict, fullname varchar(200) not null constraint concepts_desc_un unique, statusid integer not null constraint concepts_status_fk references concept_status(conceptstatusid) );

create table desc_type ( typeid integer not null constraint desc_type_pk primary key, description varchar(50) not null );

insert into desc_type values (0, scientific name); insert into desc_type values (1, peferred name); insert into desc_type values (2, synonym);

create table descriptions ( descriptionid integer not null constraint desc_pk primary key, conceptid integer not null constraint desc_conceptid_fk references concepts(conceptid), description varchar(200) not null, desc_typeid integer not null constraint desc_typeid_fk references desc_type(typeid), statusid integer not null constraint desc_statusid_fk references concept_status(statusid), langid char(2) not null default en );

create table references ( refid integer not null constraint refs_pk primary key, userid integer not null constraint refs_userid_fk references users(userid) on delete cascade, conceptid integer not null constraint refs_conceptid_fk references concepts(conceptid) on delete cascade, title varchar(200) not null, description varchar(4096) not null, source varchar(250) not null, link varchar(250) not null, visible boolean not null default t, datecreated timestamp not null default now() );

References

Taxonomy hierarchy

Nomenclatural Glossary for Zoology

Taxonomy database in SPI

Animal Kingdom Primer

HIM - Species/Breed Code

Lioncrusher Species website

Mammal Species of the World


Entry 3

I began to build the database schema and populate the database with some initial data while also putting some thoughts into the design and layout of the web-site. I have been looking at other taxonomy sites, not only to get ideas about how to design this project's site, but to also do some research on the proper hierarchy and to populate the roots of the tree (i.e. Kingdoms, Phylums) with the correct concepts that will give me enough working data to play around with some design aspects of the site.

The following is the schema I have developed thus far. In thinking about the design of the site, I realized I wanted to attach descriptors and general comments to each comment and know where to find them in the database in order to properly display them. To this end, I needed to add a comments and comments type table. I also needed a table to track relationships between each concept so that I can build a tree-like structure for the taxonomy.

create table concept_type ( typeid integer not null constraint concept_type_pk primary key, description varchar(50) );

insert into concept_type values (0, Kingdom); insert into concept_type values (1, Phylum); insert into concept_type values (2, Class); insert into concept_type values (3, Order); insert into concept_type values (4, Family); insert into concept_type values (5, Genus); insert into concept_type values (6, Species); insert into concept_type values (7, Breed); insert into concept_type values (8, Sub-phylum); insert into concept_type values (9, Sub-class); insert into concept_type values (10, Sub-family); insert into concept_type values (11, Sub-genus); insert into concept_type values (12, Sub-species);

create table concept_status ( statusid integer not null constraint concept_status_pk primary key, description varchar(50) );

insert into concept_status values (0, current); insert into concept_status values (1, obsolete); insert into concept_status values (2, duplicate);

create table concepts ( conceptid integer not null constraint concepts_pk primary key, typeid integer not null constraint concepts_concepttypeid_fk references concept_type(typeid) on delete restrict, fullname varchar(200) not null constraint concepts_desc_un unique, statusid integer not null constraint concepts_status_fk references concept_status(statusid) on delete restrict );

create table origins( orgid integer not null constraint origins_pk primary key, userid integer not null constraint origins_userid_fk references users(userid) on delete cascade, conceptid integer not null constraint origins_conceptid_fk references concepts(conceptid) on delete cascade, title varchar(200) not null, description varchar(4096) not null, source varchar(250) not null, link varchar(250) not null, visible boolean not null default t, datecreated timestamp not null default now() );

create table desc_type ( typeid integer not null constraint desc_type_pk primary key, description varchar(50) not null );

insert into desc_type values (0, scientific name); insert into desc_type values (1, peferred name); insert into desc_type values (2, synonym);

create table descriptions ( descriptionid integer not null constraint desc_pk primary key, conceptid integer not null constraint desc_conceptid_fk references concepts(conceptid) on delete restrict, description varchar(200) not null, desc_typeid integer not null constraint desc_typeid_fk references desc_type(typeid) on delete restrict, statusid integer not null constraint desc_statusid_fk references concept_status(statusid) on delete restrict default 0, langid char(2) not null default en );

create table rel_type ( relid integer not null constraint rel_type_pk primary key, description varchar(25) not null );

insert into rel_type values (0, is a);

create table relationships ( relid integer not null constraint rels_pk primary key, conceptid1 integer not null constraint rels_conceptid1_fk references concepts(conceptid) on delete restrict, typeid integer not null constraint rels_typeid_fk references rel_type(relid) on delete restrict, conceptid2 integer not null constraint rels_conceptid1_fk references concepts(conceptid) on delete restrict, statusid integer not null constraint rels_statusid_fk references concept_status(statusid) default 0 );

create table concept_comment_type ( commenttypeid integer not null constraint concept_comment_type_pk primary key, description varchar(25) not null );

insert into concept_comment_type values (0, descriptor); insert into concept_comment_type values (1, about); insert into concept_comment_type values (2, general);

create table concept_comments ( commentid integer not null constraint concept_comments_pk primary key, commenttypeid integer not null constraint concept_comments_typeid_fk references concept_comment_type(commenttypeid) on delete restrict, conceptid integer not null constraint concept_comments_conceptid_fk references concepts(conceptid) on delete cascade, comment varchar(4000) not null );

create sequence tax_id_seq start 100000;

insert into concepts values (nextval(tax_id_seq), 0, Animalia , 0); insert into concepts values (nextval(tax_id_seq), 0, Plantae , 0); insert into concepts values (nextval(tax_id_seq), 0, Fungi , 0); insert into concepts values (nextval(tax_id_seq), 0, Protista , 0); insert into concepts values (nextval(tax_id_seq), 0, Monera , 0);

insert into concept_comments values (nextval(tax_id_seq), 0, 100004, bacteria, blue-green algae); insert into concept_comments values (nextval(tax_id_seq), 0, 100003, amoeba, euglena, paramecium, diatom); insert into concept_comments values (nextval(tax_id_seq), 0, 100002, mushrooms, water molds, bread molds); insert into concept_comments values (nextval(tax_id_seq), 0, 100001, mosses, ferns, grasses, shrubs, flowering plants, gymnosperms); insert into concept_comments values (nextval(tax_id_seq), 0, 100000, mammals, fish, insects, birds, humans);

References

Introduction to Taxonomy


Entry 4

I begin to work on the design for displaying the data. The first thing I realize is that I need to have a "root node" in the hierarchy from which to construct the tree in order to utilize Zope's built-in tree functions. To facilitate this, I add the following two records plus their corresponding relationships to the tables.

insert into concept_type values (-1, root node);
insert into concepts values (0, -1, root, 0);
insert into relationships 
  select nextval(tax_id_seq), conceptid, 0, 0, 0 from concepts where typeid = 0;

Also needed to add more concept types as follows:

insert into concept_type values (13, Infra-class);
insert into concept_type values (14, Super-order);

My first implementation was with a hierarchical tree in Explorer-like format. However, it became quickly obvious as I began to add data to the database that the tree approach wasn't going to work; at least not in a frameless design. What was happening was that as I drilled down the tree, too much horizontal space began to be taken up on the screen as the indentations of the tree was rendered. Also, as I added more items to the tree, the detailed area to the right of the tree begin to float upwards and off the screen as a result of going deeper into the tree (i.e. you would be on line 300 of the tree, which scrolls the entire window down significantly far enough that the descriptions are also scrolled off the screen, so you see a blank page, instead of useful details about the selected node in the tree.

References

The ABC's of Animal Taxonomy

The EMBL Reptile Database

The Species Concept


Entry 5

When it came time to implement the references for the material on the site, I realized that many entries were going to come from the same site, yet I had specified the origins table to be a one-to-one relationship. That is, each reference/origin was associated with exactly one concept. Rather than re-entering the same reference for each new concept, I changed the database to allow me to enter concepts, then enter origins, then associate origins to one or more concepts as necessary. To do this, I had to drop the conceptid column from the origins table and then add the following new table.

create table concept_origins (
   conceptid    integer  not null constraint co_conceptid_fk references concepts(conceptid) on delete cascade,
   orgid        integer  not null constraint co_orgid_fk references origins(orgid) on delete cascade
);

Now, with most of the site designed, it is time to begin the almost overwhelming process of entering data for species and breeds for the site. To build the initial taxonomic hierarchy, I am largely referencing HITIS for the tree structure and only populating the branches necessary to capture the initial breed information for the database. Since everyone immediately thinks of dogs whenever I mention breeds, this is probably the best category to tackle first.

Since I do not have an army of data entry folks, I am looking at techniques for harvesting pertinent information from existing sites. A techinque known as "web scraping" in the programmer's world of jargon. The first thing I did was visit the American Kennel Club's web-site to get an idea of how they display all their breeds. As it turns out, they have a number of options for browsing all their breeds and the most convenient option is to browse by breed name.

Because of the regularity of how the breeds are presented in table format, I was able to write a python script to parse breed name and first paragraph of description from each breed. With a few stumbles over learning the Python language as I go, I finally come up with the following script:

# Extracts breed name and first paragraph as a descriptor for each and inserts into
# the taxonomic database.

request = container.REQUEST response = request.RESPONSE

# removes the weird spacing and tabs and line breaks def reflow(value):

r = "" for i in range(0, len(value)): if value[i] != \t and value[i] != \n and value[i] != '\r': if value[i] == ' and len(r) > 0 and r[len(r)-1] == ': a = 1 else: r = r + value[i] return r

# Isolate the table of breeds start = input[input.find(firstword):] start = start[:start.find(</table>)]

# process all the breeds in the table while start != "": # get the breed name breed = start[:start.find("")] start = start[start.find(valign="top">)+len(valign="top">):]

# get the first paragraph if breed comments for descriptor descriptor = start[:start.find("
")] if descriptor[0] == '<': descriptor = descriptor[descriptor.find(</a>)+4:] start = start[start.find(</td>):]

# find next breed in table if start.find(<a ) >= 0: start = start[start.find(<a class="white"):] start = start[start.find(">")+1:] else: start = ""

# make breed and descriptor suitable for saving breed = reflow(breed) print breed descriptor = reflow(descriptor) print descriptor print " "

# insert the new breed and descriptor into database conceptid = container.getNextTaxID()[0].nextval

container.addConcept( conceptid = conceptid, typeid = 7, fullname = breed, statusid = 0)

container.addRelationship( relid = container.getNextTaxID()[0].nextval, conceptid1 = conceptid, typeid = 0, conceptid2 = 100372, statusid = 0)

container.addDescription( descriptionid = container.getNextTaxID()[0].nextval, conceptid = conceptid, description = breed, desc_typeid = 1, statusid = 0, langid = en)

container.addComment( conceptid = conceptid, commentid = container.getNextTaxID()[0].nextval, typeid = 0, comment = descriptor)

container.addOrigin(conceptid=conceptid, orgid=100403)

# Let the user know what was processed return printed


Entry 6

After building the interface for adding new data and getting the database populated, I started to work on building in the multiple views of the data so that users would be able to browse by scientific names as well as common names. However, it quickly became evident that some of the groups identified by one or two latin names was hard to transcribe into short common names, so I decided that every entry would have a short descriptor that would describe the scientific name and both terms would be displayed on all pages.

After some additional data entry to populate all the current entries with common terms, this new design took shape and seemed to function well.

Another design decision was to flip the hierarchy tree so that the links read from bottom (more specific to top (more general) rather than as every other taxonomy site does. This made navigation of the site for research purposes much more fluid and easier to move back up one branch as necessary when "hunting around.",


Entry 7

General feedback via the desk-crits was generally very positive and the site layout appears to work well based on the comments given. Some additional changes to put a title on all the pages as well as direct links to all the breeds in the database (this will grow automatically as users add new breeds under different species).

Of course, the one request that nearly every critique asked for was pictures. Due to time constraints and band-width considerations, I was unable to add this to request to the project at this time.


mwlang@cybrains.net
Guest Login   |  Home