Ammonoid Taxonomy with Supervised and Unsupervised Machine Learning Algorithms



Introduction
Ammonoids represent a morphologically diverse subclass of extinct cephalopods, ranging from the Paleozoic to the Mesozoic Era (Walker and Ward, ). Found in marine sedimentary rocks, ammonoids are crucial index fossils for biostratigraphy (Cox, ); ammonoid taxonomy is therefore useful for the study of stratigraphic subdivision. Ammonoids also feature substantially in systematic palaeontology (Kennedy, ) and evolutionary biology (Monnet et al., ).
Ammonoid taxonomy utilizes conch morphology, coiling, and aperture shape. Where the shell is preserved, ornaments such as ribs (their direction, spacing, and type) may be used for family classification (De Baets et al., a,b), as well as keels, spines, nodes, and tubercles. Where the shell is eroded or broken, suture lines across the sediment-filled interior are highly diagnostic for ammonoid order (Wiedmann and Kullmann, ; Klug and Hoffmann, ).
Korn ( ) defines a number of conch geometry parameters and proportions for taxonomic study of ammonoids, shown in Table . Individual conchs may be described by terms based on the values of these conch proportions; for example, ammonoids with 0.35 ≤ CWI ≤ 0.6 have a 'discoidal' general conch shape.
Since ammonoids exhibit intraspecific variation (De Baets et al., ), it follows that each species has a typical range of conch proportions which are diagnostic of taxonomy.
The aim of this study is to taxonomize ammonoids by their conch proportions with a range of supervised and unsupervised machine learning algorithms. This presents a novel proof-of-concept approach to ammonoid diagnostics which lays the groundwork for future methods in biostratigraphy, systematic palaeontology, and evolutionary biology.
Umbilical width index: UWI = uw / dm1

Methods

. Data
Ammonoid data were sourced from the Paleobiology Database (PBDB) (Clapham et al., ), a nongovernmental, non-profit public paleontological database supported by the US National Science Foundation. Data were downloaded from PBDB on November, , using the taxon name 'Ammonoidea', which provided N = 19,576 unique specimens. Specimens with missing diameter, whorl height, whorl width, or umbilical width were excluded from analyses. To reduce over- or underfitting of the models, species with fewer than specimens in the dataset were also removed. The remaining data consisted of N = 781 specimens from species. These specimens are
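The two exclusion steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the field names, and the `min_specimens` threshold are all hypothetical (the threshold used in the study is not recoverable from the text), and counting specimens per species after the completeness filter is an assumption.

```python
from collections import Counter

# Hypothetical records; field names are illustrative, not the PBDB schema.
specimens = [
    {"species": "A", "dm": 50.0, "ww": 20.0, "wh": 22.0, "uw": 10.0},
    {"species": "A", "dm": 48.0, "ww": 19.0, "wh": 21.0, "uw": 9.5},
    {"species": "A", "dm": 52.0, "ww": None, "wh": 23.0, "uw": 11.0},  # missing ww
    {"species": "B", "dm": 30.0, "ww": 15.0, "wh": 12.0, "uw": 8.0},
]

REQUIRED = ("dm", "ww", "wh", "uw")


def filter_specimens(records, min_specimens):
    """Drop incomplete specimens, then drop species with too few specimens."""
    # Step 1: keep only specimens with all four measurements present.
    complete = [r for r in records if all(r.get(f) is not None for f in REQUIRED)]
    # Step 2: count specimens per species among the complete records
    # (assumed order of operations) and drop rare species.
    counts = Counter(r["species"] for r in complete)
    return [r for r in complete if counts[r["species"]] >= min_specimens]


# min_specimens=2 is a placeholder value chosen only for this toy data.
kept = filter_specimens(specimens, min_specimens=2)
print(len(kept), sorted({r["species"] for r in kept}))
```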

. Analyses
For each ammonoid, the conch parameters dm1, ww1, wh1, and uw were used to find the conch proportions CWI, WWI, and UWI as defined in Table . All specimens were then plotted to observe apparent within-population variability in three-dimensional proportion space.
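As a worked illustration of this step, the sketch below computes the three proportions for a single specimen. Only UWI = uw / dm1 is recoverable from the table in this text. The forms CWI = ww1 / dm1 and WWI = ww1 / wh1 are assumed here from the usual conch-proportion definitions and should be checked against Korn's table. The function names and the sample measurements are hypothetical.

```python
def conch_proportions(dm, ww, wh, uw):
    """Return the conch proportions for one specimen.

    UWI = uw / dm follows the table row that survives in the text.
    CWI = ww / dm and WWI = ww / wh are ASSUMED definitions.
    """
    if min(dm, ww, wh, uw) <= 0:
        raise ValueError("all conch parameters must be positive")
    return {"CWI": ww / dm, "WWI": ww / wh, "UWI": uw / dm}


def is_discoidal(cwi):
    """'Discoidal' general conch shape: 0.35 <= CWI <= 0.6 (range from the text)."""
    return 0.35 <= cwi <= 0.6


# Hypothetical measurements in millimetres.
props = conch_proportions(dm=50.0, ww=20.0, wh=25.0, uw=10.0)
print(props, is_discoidal(props["CWI"]))
```

Each specimen then becomes one point (CWI, WWI, UWI) in the three-dimensional proportion space used by every model in the study.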

. . Supervised Algorithms
Classification was achieved with supervised machine learning algorithms. The target variable was the specimen species, and the fit data were the conch proportions.
A 5 × 5 nested cross-validation approach was taken to avoid potential bias in performance evaluation due to over-fitting in model selection (Cawley and Talbot, ). This way, estimates for the unbiased generalization performance of each classifier were obtained through test and train accuracies averaged across the outer folds. For the inner folds, a grid search was utilized to select the model parameters which resulted in the best test accuracy. The classifiers implemented, as well as the range of parameters used in cross-validation, are summarized in Table .
Table : Classification Models for Cross-Validation.
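A 5 × 5 nested cross-validation of this kind might be set up as below with scikit-learn. This is a sketch under stated assumptions: the synthetic data stand in for the conch proportions, the Support Vector Machine and its `C` grid are placeholders rather than the parameter ranges from the table, and the use of stratified folds is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for (CWI, WWI, UWI) features and species labels.
X, y = make_blobs(n_samples=150, centers=3, n_features=3, random_state=0)

# Inner folds select hyperparameters; outer folds estimate generalization.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Placeholder grid; the study's actual grids are not reproduced here.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                      cv=inner, scoring="accuracy")

# Each outer fold refits the whole grid search on its training portion,
# so model selection never sees the outer test data.
results = cross_validate(search, X, y, cv=outer, scoring="accuracy",
                         return_train_score=True)

mean_test = results["test_score"].mean()
mean_train = results["train_score"].mean()
print(f"test {mean_test:.3f}, train {mean_train:.3f}")
```

The gap between `mean_train` and `mean_test` is the quantity used later in the discussion as an indicator of over-fitting.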

. . Unsupervised Algorithms
In addition to the supervised classification algorithms, a range of unsupervised clustering algorithms was implemented. Because cross-validation is not a clearly defined concept for unsupervised algorithms, it was not implemented for these methods. The following describes the unsupervised algorithms used and how model selection was approached.
A K-Means clustering model was selected in the usual way with the 'elbow' method: the value of k was chosen at the point beyond which further increases in k yield only diminishing reductions in the sum of squared distances of samples to their nearest cluster center (the 'elbow' of an inertia against k curve).
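The inertia against k curve behind the elbow method can be produced as in the following sketch. The synthetic data and the range of k are assumptions for illustration; the elbow itself is then read off the curve by eye, as in the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the three conch proportions.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

ks = list(range(1, 9))
# inertia_ is the sum of squared distances of samples to their nearest center.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Reduction in inertia gained by each additional cluster; the elbow is
# where these reductions become small.
reductions = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```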
Similarly, a DBSCAN clustering model was selected with an 'elbow' method. First, the minimum number of neighbours around a point required to define a core point was chosen heuristically as m = ln(N). A nearest-neighbours algorithm was then used to calculate the average distance between each point and its ln(N) nearest neighbours. The optimal value of epsilon was then selected as the distance corresponding to the elbow of a distance against data point number curve (with the data points sorted by distance).
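The sorted distance curve used to choose epsilon can be sketched in plain Python as follows. Rounding ln(N) to the nearest integer is an assumption (the text gives m = ln(N) without stating how it was made integral), and the random points are a stand-in for the specimens.

```python
import math
import random


def k_distance_curve(points, k):
    """Mean distance from each point to its k nearest neighbours, sorted ascending."""
    curve = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(sum(dists[:k]) / k)
    return sorted(curve)


random.seed(0)
# Random points standing in for (CWI, WWI, UWI) triples.
points = [(random.random(), random.random(), random.random()) for _ in range(200)]

m = round(math.log(len(points)))  # ln(200) is about 5.3, so m = 5 (rounding assumed)
curve = k_distance_curve(points, m)

# Epsilon is then read off at the elbow of this curve.
print(m, round(curve[0], 4), round(curve[-1], 4))
```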
Building on the DBSCAN model, an OPTICS clustering model was implemented using a DBSCAN cluster method and the same values of epsilon and m.
The final two unsupervised models implemented were Mean Shift and Affinity Propagation clustering. 'Elbow' methods are not well defined for model selection with these two algorithms, so grid searches were used to select the models with the highest Caliński-Harabasz index, or Variance Ratio Criterion (Caliński and Harabasz, ), which favours the model with the densest and best-separated clusters. This internal validation metric is naive to the ground truth labels, and so keeps the model selection process unsupervised.
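For reference, the Caliński-Harabasz index is commonly written as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. The notation below is an assumption introduced for illustration: n samples, k clusters C_q with n_q members and centroid c_q, and overall centroid c.

```latex
\mathrm{CH}(k) = \frac{B_k / (k - 1)}{W_k / (n - k)}, \qquad
B_k = \sum_{q=1}^{k} n_q \,\lVert c_q - c \rVert^2, \qquad
W_k = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2
```

A larger value indicates tight clusters (small W_k) that lie far apart (large B_k). The index is undefined for k = 1, which is consistent with excluding single-cluster models from the grid search.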
For the Mean Shift model, a range of bandwidths with quantile parameters 0.01, 0.02, ..., 0.15 was searched. The upper limit of 0.15 was selected because preliminary analyses showed that models with quantiles > 0.15 predicted only two clusters. Pretending not to know the true number of species, a brief visual inspection of the ammonoid data suggests at least three clusters exist, so it is reasonable to exclude models predicting fewer clusters.
Similarly, for the Affinity Propagation model the range of preference values −0.394, −0.344, ..., −0.044 was searched. The lower limit of −0.394 and upper limit of −0.044 were selected because preliminary analyses showed that models with preference < −0.394 predicted only one cluster, and models with positive preference predicted 781 clusters. Again, visual inspection of the data finds these numbers of clusters to be unrealistic, so it is reasonable to exclude such models.
After model selection, external validation was performed on all unsupervised methods by computing Fowlkes-Mallows scores (Fowlkes and Mallows, ), which compare the classes predicted by clustering to the actual classes via the geometric mean of the pairwise precision and recall. Normalized Mutual Information scores (Strehl and Ghosh, ) were also computed to compare these models with extant literature; however, this metric is not adjusted for chance.
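The Fowlkes-Mallows score described above can be computed directly from pair counts, as in this minimal sketch. The function name and the toy labels are illustrative. A true-positive pair is one placed together in both the true and predicted labelings.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair together in both labelings
        elif same_pred:
            fp += 1  # clustered together but different species
        elif same_true:
            fn += 1  # same species but split across clusters
    if tp == 0:
        return 0.0
    # sqrt(precision * recall) = tp / sqrt((tp + fp) * (tp + fn))
    return tp / sqrt((tp + fp) * (tp + fn))


# Cluster ids are arbitrary, so a relabelled perfect clustering scores 1.
perfect = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
# Lumping everything into one cluster keeps recall at 1 but lowers precision.
lumped = fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 0])
print(perfect, round(lumped, 3))
```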

Results
The final analysis data are summarized in Table . While some species share similar values on average for individual conch proportions, fewer have similar average values for all three conch proportions. Exceptions include {Andidiscus behrendseni, Eoamaltheus multicostatus, and Badouxia canadensis}, and {Arkanites relictus and Retites semiretia}. Standard deviations on mean conch proportions are generally small relative to those proportions, suggesting relatively narrow spreads, which favour a classification approach.
The distribution of ammonoids in three-dimensional proportion space is shown in Figure . Intra-species clustering is readily apparent for most species, especially Badouxia canadensis, Cladiscites beyrichi, Meekoceras gracilitatis, and Karakaschiceras attenuatus, which display the least intra-cluster distance. Within-population variability is considerably greater in the species Arkanites relictus, Paralegoceras sundaicum, and Andidiscus behrendseni.
Unbiased generalization performance estimates for the supervised models are shown in Table .
Figure : The first subplot is identical to Figure . The colour maps between subplots are independent.

Discussion
This research aimed to taxonomize ammonoids by well-defined geometric conch parameters using machine learning algorithms. Supervised models obtained at least 70% accuracy in ammonoid classification. Unsupervised models obtained Fowlkes-Mallows scores of at least 0.5 and Normalized Mutual Information scores of at least 0.6. The latter models predicted between five and 15 species when 11 were present. This proof-of-concept approach to ammonoid classification is novel in its application and may in future provide additional insight for biostratigraphy, systematic palaeontology, and evolutionary biology. A discussion of the models implemented follows.

. Supervised Models
The supervised models' average test accuracies ranged from 0.704 for the Decision Tree to 0.781 for the Support Vector Machine. On this imbalanced dataset, a naive classifier that simply predicts the majority class would achieve an accuracy of 148/781 ≈ 0.190. Relative to this baseline, all the models presented here are far superior. The relative performance of these models, however, is less clear. The dispersions of the individual outer-fold test accuracies, as measured by the standard deviation for each model, were somewhat large relative to the differences in accuracy between models; for this particular set of models, therefore, the generalized test accuracies did not differ substantially.
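The majority-class baseline quoted above is simply the largest class count divided by the total number of specimens. The sketch below shows the calculation on toy labels and then on the two figures given in the text (148 specimens in the largest species, 781 in total). The function name is illustrative.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy example: three specimens of one species, one of another.
toy = majority_baseline_accuracy(["a", "a", "a", "b"])

# Figures from the text: 148 of 781 specimens belong to the largest species.
baseline = 148 / 781
print(toy, round(baseline, 3))
```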
Another result to consider when evaluating which model performed best is the average train accuracy. The largest absolute differences between test and train accuracies were observed for the Decision Tree and K-Nearest Neighbours classifiers. This suggests significant over-fitting in these models, which may lead to poor out-of-sample performance. Compared to single-tree models, tree ensemble methods generally over-fit less (Shalev-Shwartz and Ben-David, ). This is true of the Random Forest and Gradient Boosting models presented in this study, although the reduction is not particularly large. Conversely, the smallest differences between test and train accuracies were observed in the Naive Bayes and Support Vector Machine models, although the latter is memory-intensive.
Based on the above considerations, the Naive Bayes and Support Vector Machine approaches appear to be the most appropriate classifiers for ammonoid taxonomy.
The accuracies of the models demonstrated in this study are highly comparable to related studies using other animal properties as features in supervised classifiers. For example, Gunasekaran and Revathy ( ) classified animal species using acoustics with up to 70.3% overall accuracy, Manohar et al. ( ) classified animal species using images with up to 79.54% overall accuracy, and Atanbori et al. ( ) classified birds using motion with up to 66% accuracy.

. Unsupervised Models
Overall, Fowlkes-Mallows and NMI scores were relatively low across all unsupervised methods. The highest scores were achieved by the Mean Shift and K-Means algorithms respectively. In a dataset consisting of 11 distinct species, the K-Means model estimated that five clusters (species) were present, whereas the Mean Shift algorithm predicted nine. Considering the number of clusters, Fowlkes-Mallows score, and NMI score, the most successful unsupervised model applied to the ammonoid data was the Mean Shift algorithm.
While the DBSCAN and OPTICS algorithms also predicted nine species, their scores were amongst the lowest, and a visual inspection of the clusters makes it clear that both algorithms consistently overestimate the range of conch proportions which define individual species (clusters). This can be seen in their subplots in Figure , which are dominated in all areas of proportion space by single colours. In contrast, the distribution of colours exhibited by the K-Means, Mean Shift, and Affinity Propagation algorithms resembles the actual distribution much more closely.
Visually, Affinity Propagation had the most success in distinguishing between species such as Eoamaltheus multicostatus and Badouxia canadensis, but was oversensitive to intraspecific variation in species such as Owenites koeneni.
The results of this study suggest centroid-based unsupervised clustering algorithms such as K-Means and Mean Shift are best suited to ammonoid taxonomy, while density-based approaches may be less appropriate.
The NMI scores achieved by these models are comparable to extant unsupervised classification literature with similar objectives and numbers of classes. For example, Clink and Klinck ( ) classified 10 primates using acoustics with NMI scores between 0.55 and 0.73, and Saryan et al. ( ) achieved similar scores when classifying 10 plant taxa.

. Strengths
This study's application of supervised and unsupervised methods in machine learning to ammonoid fossils is novel, and the use of nested cross-validation and both internal and external validation metrics is rigorous.

. Limitations
The primary limitation of this study is the small sample size. This may result in overfitting if the algorithm is too complex, or underfitting if the algorithm is too simple (Mehryar et al., ). Either way, a larger number of ammonoid specimens for training would result in better generalization (Mehryar et al., ). In this case, the sample size was limited by the availability of data in PBDB, so alternative data sources are required for better generalization. Machine learning approaches using fewer data under similar circumstances of limited availability have been published in recent months (Shen et al., ; Jiang et al., ).
Furthermore, the data make no distinction between mature and juvenile specimens, nor males and females. However, the range of conch proportion values across ontogeny and sexual dimorphism may be unique to individual species anyway, and therefore diagnostic of ammonoid taxonomy.