Darwin Initiative Project Final Report
DAISY - Digital Automated Identification of Insects: a way to circumvent the Taxonomic Impediment to monitoring biodiversity, and thus implementing the CBD.
Introduction
The inventory and monitoring of biological diversity (Article 7) is a critical part of the implementation of the Convention on Biological Diversity. In practice such activities are usually undertaken using a very few groups of organisms (such as birds or amphibians) which are both of high conservation interest and readily recognisable. There are several disadvantages to using such organisms. First, they are generally high in the food chain, and respond more slowly to environmental perturbation than do invertebrates or micro-organisms. Second, they have a long mean generation time, which also lengthens their response time to such perturbations. Third, many are rare, and sampling can be labour intensive, damaging and difficult.
Many insects do not have these disadvantages, and thus are potentially good indicators of overall species-richness. They can be easily and almost continually sampled, and thus enable assessment to be made of how a particular event or treatment impacts on the environment. Using such indicators there are a variety of other activities that may be undertaken. For example, assessment of the "green-ness" of coffee fincas in El Salvador is currently being undertaken in part by using insect samples (see Darwin-funded El Salvador project).
However, the single most important factor that prevents the use of insects in biodiversity inventory and monitoring projects is the great difficulty non-specialists have in identifying them. Traditional identification keys can only be used by specialists with access to large reference collections and specialist libraries. Such facilities are costly to establish and both expensive and difficult to maintain, especially in tropical countries where the climate accelerates decomposition of biological material.
The objective of this Darwin-funded project was to develop collaboratively with the Universidad de Costa Rica (UCR) a computerized system for the identification of insects.
Initially parasitic wasps were selected as target organisms, but during the course of the project other groups of insects were also used to assess the applicability of the system to a range of insect groups. The results have been very encouraging: an advanced system has been installed at UCR, a duplicate is in the NHM, London, and the software developed has been made available to the National Biodiversity Institute in Costa Rica (INBio).
The project was conducted jointly between the Natural History Museum, London and the Universidad de Costa Rica. From the latter institution two departments were involved, Biology and Computing Sciences. The lead scientist for the project software development was Dr Mark O’Neill, who worked as an independent consultant through the NHM. Dr Ian Gauld of the NHM was project leader. Professor Paul Hanson of the Escuela de Biología, Universidad de Costa Rica, was primarily responsible for the Costa Rican co-ordination and arranged for the procurement and preparation of many insect specimens. Professor Juan Carlos Briceño, Escuela de Informatica, Universidad de Costa Rica, was the lead Costa Rican computer specialist. A small number of other collaborators made various inputs at different times during the work. Most notable of these was Sr Ignacio Solis, formerly a computer-science student at UCR, who following graduation obtained a place at the University of California to study for a higher degree. His programming work on the DAISY front end system undoubtedly helped him obtain this placement, and many of the current DAISY front end visuals are attributable to his efforts.
The DAISY project: general account of project related activity
A prototype DAISY system was installed and partially developed in the Escuela de Biología at UCR. The system now occupies a room adjacent to the insect collections, and at the conclusion of the project a fully operational system is in place. This system comprises two modified computers, an internet connection and a high quality (Moritex) digitizing camera with appropriate hardware. Installed on the system is the software, the development of which constituted the single largest part of this project. The software has been extensively tested, both in Costa Rica and in the U.K., where a similar system is installed in the Department of Entomology at the Natural History Museum. As each routine was developed it was extensively tested using images collected from specially prepared specimens. The technical details relating to software development are given below in a separate section.
Much of the non-software development involved the preparation of a variety of test data sets. These comprise sets of images of the wings of various insects. One of the major image sets used for testing comprised the wings of Enicospilus species, a large and difficult genus of nocturnally active wasps. The preparation of test sets involved a considerable amount of field work, most of it conducted at Zurquí de Moravia, a site on the edge of Braulio Carrillo National Park that we know to be particularly rich in species. Sampling was conducted from dusk until midnight using specially constructed light traps that we had made to order for the work. These and all other field collecting equipment are deposited in Costa Rica and are now used routinely by students at the Universidad de Costa Rica.
Using freshly preserved material collected at light, test specimen sets were prepared by removing the right fore wing and mounting it on a microscope slide in Balsam, to ensure a permanent preparation. Two Costa Ricans were trained in the techniques necessary for making permanent slide mounts. The test sets and associated specimens are now deposited in the Natural History Museum, but a large number of other specimens (circa 500) were also collected and identified during the course of the project. These specimens are deposited in the Museo de Insectos, Universidad de Costa Rica, and some material has been placed in the collections of the Instituto Nacional de Biodiversidad (INBio).
During the course of this project two visits were made by Costa Ricans to London to participate in software development and system testing. As part of the second visit in February 1999 the three Costa Rican scientists present attended a specially commissioned course at Manchester University. This intensive one-day course, organised by Dr Tony Lacey and Dr Neil Thacker of the School of Image Science and Bio-Medical Engineering, explained the latest developments in TINA, a state of the art set of programming libraries used for machine vision and image processing in experimental biophysics. Knowledge of these programming routines was of considerable importance, because it was initially thought that some of the routines could be used in the DAISY programme. Although they were eventually not used, the familiarity with the TINA system that the course imparted prevented duplication of effort and led to a fruitful exchange of ideas.
Although the early focus of our work was wasps, it became apparent that the DAISY system has a wide application, and to investigate this we decided to experiment with other data-sets. Several of these were pre-existing data sets in the NHM or were specially prepared by our U.K. technician, but a new set was acquired by Dr O’Neill and a photographic technician (M. 
Denos) in Costa Rica, using fresh, locally collected specimens of hawk-moths (Sphingidae) present in the INBio collection. We opted for this as older museum specimens have often faded or have their wing scales abraded, which unnecessarily impairs the performance of DAISY; although DAISY can cope with such suboptimal specimens, such imperfections are not desirable in the standard reference database. The hawk-moth data set has become the standard demonstration set, and features in the TV coverage of the project.
The DAISY project: technical computational details
At the start of the DAISY project there were no generic scaleable systems appropriate for a task such as insect recognition, although concurrently other systems have been developed. However, none of these are true automated identification systems designed for a non-expert user. For example, the system built to recognise bee species from their wing venation, implemented by Wittmann and colleagues at the University of Bonn in Germany, requires considerable user intervention. Although this system is capable of high accuracy (e.g. correct classification rates in excess of 95% for certain bee groups, for example Colletes species) it requires the user to accurately locate a large number of vein intersections in the wings. This means that the system is not easily usable by the non-expert unfamiliar with the particular venational terminology (and several alternative systems of venational terminology exist for insects), and it is relatively labour intensive and slow to do single determinations. Furthermore, the Wittmann system, at least in its earlier forms, is simply a sequence of a number of pre-existing computer codes derived from the Photogrammetric and Remote Sensing Communities. Consequently, the system does not lend itself to generic scaleable recognition in the way that the DAISY system does.
The DAISY system is a completely novel idea for insect identification. Basically a system was envisaged that accepts an image and identifies it, without user intervention, by comparing it with other images stored in a reference database. In its initial form (e.g. Weeks, P.J.D., Gauld, I.D., Gaston, K.J. & O’Neill, M.A. 1997. Automating the identification of insects: a new solution to an old problem. Bulletin of Entomological Research, 87: 203-211.) 
the DAISY system was based closely on the PCA-based facial recognition systems which were developed by Matthew Turk and Sandy Pentland at Massachusetts Institute of Technology in the early 1990s, and the prototype DAISY system was essentially a re-implementation of the Turk and Pentland principal component analysis (PCA) face classifier. However, a major modification to the approach advocated by Turk and Pentland was the building of individual classifiers for each morph-class (in the case of DAISY, insect species). There were two major reasons for this. First, it ensured that DAISY was scaleable, by having the "identification engine" access an expandable set of concatenated reference images. Second, by having concatenated images, effectively constructs embracing the observed variation in the morph-class, we were able to recognize variable objects such as insect species. Previously, such systems worked only for invariant objects, such as certain unalterable reference points derived from a human face or fingerprints.
It became clear that if computer-aided taxonomic (CAT) systems like DAISY were to be a practical proposition the amount of computation associated with (a) adding new species to the system or (b) increasing the number of training examples for a given species must be limited. In classical PCA the entire set of training examples is (non-linearly) transformed to a single reduced dimensionality space. While this is not a problem with small numbers of species and closed training sets, it becomes untenable for recognition systems which may contain thousands of individual species and tens of training examples per species because:
Although the prototype DAISY approach worked relatively well (correct classification in > 95% of all identifications for five closely related species of Costa Rican parasitic wasps, and >90% in the case of some thirty species of Palaearctic ceratopogonid flies), following initial testing as part of the Darwin-funded project two further problems with the PCA approach were recognized. First, the PCA approach is very computationally intensive, which means even the modified PCA approach adopted by the prototype DAISY project actually scaled relatively poorly. Second, PCA implicitly assumes that the distribution functions in morph space of objects which are to be recognised are essentially linear. This is clearly not the case for biological objects such as insect wings.
In order to overcome these deficiencies, during the first year of the Darwin-funded DAISY project a second version of DAISY was implemented which used the Lucas n-tuple nearest neighbour classifier (NNC) as opposed to PCA in order to compute the affinity of unknowns to training sets. This classifier has several advantages over the earlier PCA approach. First, NNC is very simple. The unknown is assumed to be in the same class as the training example to which it has the highest affinity. Second, NNC is capable of dealing with non-linear distributions of training objects in morph-space. Third, it scales linearly with increasing training set size and number of species. Fourth, it may easily be implemented in hardware (should the need arise). Fifth, it is amenable to MIMD parallelisation of the classification process over a network of interconnected workstations. Finally, it is capable of supporting sophisticated training set optimisation algorithms with a minimum of extra computational overhead.
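The nearest-neighbour rule described above can be illustrated with a minimal sketch. This is not the DAISY code and not the Lucas n-tuple classifier: it substitutes a plain Euclidean distance for the n-tuple affinity measure, and the function names, labels and three-value "images" are hypothetical. It is intended only to show why the unknown takes the class of its closest training example, and why adding a species requires no recomputation of what is already stored.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(unknown, training_set):
    """Return the label of the training example closest to `unknown`.

    `training_set` is a list of (label, vector) pairs. One distance is
    computed per training example, so the cost grows linearly with the
    number of examples and species.
    """
    label, _ = min(training_set, key=lambda item: distance(unknown, item[1]))
    return label


# Hypothetical toy data: made-up labels and three-value feature vectors.
training_set = [
    ("species_a", [0.1, 0.9, 0.2]),
    ("species_a", [0.2, 0.8, 0.1]),
    ("species_b", [0.9, 0.1, 0.7]),
]

# Adding a species is a simple append; nothing already stored is recomputed.
training_set.append(("species_c", [0.5, 0.5, 0.5]))

print(classify([0.15, 0.85, 0.15], training_set))
```

In a PCA-based classifier, by contrast, adding training examples would normally mean recomputing the reduced space, which is the scaling problem the text describes.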
Tests conducted using the ceratopogonid and parasitic wasp data sets analysed previously by the prototype showed that the NNC version of DAISY performed at least as well as the earlier PCA versions (on the criterion of correct identifications), with a throughput speed at least an order of magnitude better on the same hardware.
The final DAISY system consists of classification engines (florets), which are based on the Lucas n-tuple NNC classifier, together with a number of other components developed or adapted during the course of this project, which facilitate data input and the dissemination of information about the objects classified. These components are:
1) Daisy Front End (DFE): an X11R6-based GUI front end built using the FSF Gnome and GTK+ libraries. DFE is a tool which can be used to capture imagery, mark the boundaries of objects (such as insect wings) within the imagery which are to be identified, and perform a selection of image-processing operations on the input imagery (e.g. contrast enhancement, centering, reflection etc.).
2) IPM: a front end to the floret classification engine which transforms input objects into a standard pose (e.g. invariant to rotation and scale).
3) A virtual HTML generator, which turns a list of probable identifications generated by floret into a link page of HTML. It then launches a standard Web Browser (e.g. Netscape) in order to display this information.
In practice, the exemplar DAISY system described above has been able to discriminate similar species rich in visual information such as hawkmoths (Sphingidae). Specimens of the genus Xylophanes have consistently been identified to species with a very high degree of accuracy (approaching 100%). In its current form, DAISY uses a modified n-tuple NNC which builds an ordered list of distances from the unknown in morph space. This means that the system is able to fail gracefully. 
If it cannot make an identification to species, it is almost always able to say that the unknown X is one of (say) ten species. This means that the system has a useful screening function in the case of sibling species complexes, and species swarms. For example, extensive tests conducted during this project with 55 species of Costa Rican parasitic wasps in the genus Enicospilus (which contains complexes of extremely similar and morphologically rather variable species possessing very similar wing venation) have shown that DAISY is almost always able to say the unknown X is one of four or five possible species. This is extraordinary as non-specialists have extreme difficulty identifying these organisms. The screening function of DAISY will save a lot of time when dealing with speciose tropical biota, as it greatly reduces the number of species that have to be considered. It reduces the identification burden on the expert taxonomist because screening can be performed by relatively inexperienced personnel such as parataxonomists.
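The screening behaviour described above, an ordered list of distances that falls back to a shortlist when no single species is a clear match, can be sketched as follows. This is an illustrative sketch rather than the DAISY implementation: the Euclidean distance, the acceptance threshold, the function names, the species labels and the link pattern species/NAME.html are all assumptions made for the example.

```python
import math
from html import escape


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shortlist(unknown, training_set, max_candidates=5):
    """Rank species by their closest training example to `unknown`.

    Returns up to `max_candidates` (label, distance) pairs, nearest first.
    """
    best = {}
    for label, vector in training_set:
        d = distance(unknown, vector)
        if label not in best or d < best[label]:
            best[label] = d
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:max_candidates]


def describe(ranked, accept_below=0.05):
    """Name one species if the best match is close enough, else list candidates.

    `accept_below` is a hypothetical threshold chosen for this example.
    """
    best_label, best_distance = ranked[0]
    if best_distance < accept_below:
        return f"Identified as {best_label}"
    names = ", ".join(label for label, _ in ranked)
    return f"No confident match; the unknown is probably one of: {names}"


def to_html(ranked):
    """Render the ranked candidates as an ordered list of links."""
    items = "\n".join(
        f'<li><a href="species/{escape(label)}.html">{escape(label)}</a></li>'
        for label, _ in ranked
    )
    return f"<ol>\n{items}\n</ol>"


# Hypothetical toy data: made-up labels and two-value feature vectors.
training_set = [
    ("species_a", [0.1, 0.9]),
    ("species_a", [0.2, 0.8]),
    ("species_b", [0.3, 0.7]),
    ("species_c", [0.9, 0.1]),
]

print(describe(shortlist([0.22, 0.78], training_set)))
candidates = shortlist([0.5, 0.45], training_set, max_candidates=2)
print(describe(candidates))
print(to_html(candidates))
```

Because the ranking is kept rather than only the single best match, a specimen that cannot be named outright still yields a short candidate list that a parataxonomist can pass to a specialist.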
Results of the project
A functional DAISY system has now been developed and installed at the Universidad de Costa Rica with a duplicate at the Natural History Museum. Discussions are now underway in preparation for installing the software at the National Biodiversity Institute (INBio) in Costa Rica. The results of extensive tests clearly indicate the feasibility of the DAISY idea, and demonstrate that it is a very useful tool for automating the identification of insects. In the course of this project DAISY has achieved the following results.
Publicity
The DAISY system has received local attention in Costa Rica, where an article appeared in the principal local newspaper La Nacion. Most recently, the BBC programme Tomorrow’s World (May 30, 2001) ran a feature on the DAISY system demonstrating how it was helping to identify insects in Costa Rican National Parks.
Publications
Weeks, P.J.D., O’Neill, M.A., Gaston, K.J. & Gauld, I.D. 1999. Automating insect identification: exploring the limitations of a prototype system. Journal of Applied Entomology, 123: 1-8.
Gauld, I.D., O’Neill, M.A. & Gaston, K.J. 2000. Driving Miss Daisy: the performance of an automated insect identification system, pp 303-312. In: Austin, A.D. & Dowton, M. (eds.) Hymenoptera: Evolution, Biodiversity and Biological Control. 468 pp. CSIRO, Canberra.
Weeks, P.J.D., O’Neill, M.A., Gaston, K.J. & Gauld, I.D. [in press] Species identification of wasps using principal component associative memories. Image and Vision Computing.
DAISY web site: www.tumblingdice.co.uk/daisy
Future directions
Thanks to the Darwin Initiative funding the DAISY idea has been developed into an adaptable working system, demonstrating its potential for a range of uses. INBio has already expressed interest in using DAISY as a tool to allow identification of a range of Costa Rican insects and provide instant access to information held in its own databases. Scientists at the Universidad de Costa Rica are interested in developing databases to use DAISY to screen for agricultural pests. The use of DAISY was also favourably discussed at a recent international meeting on the implementation of the Global Taxonomy Initiative (GTI) held in Costa Rica for Central American states. I do not think that it is unreasonable to claim that DAISY is a significant contribution to alleviating the "taxonomic impediment" hindering our understanding of this planet’s biodiversity.
Ian D. Gauld
The Natural History Museum, 2001