BICO is a fast streaming algorithm to compute high quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of SIGMOND Test of Time Award winning algorithm BIRCH with insights from clustering theory to obtain solutions fast while keeping the error regarding the k-means cost function low.

We evaluated BICO experimentally in 3 to test its speed and accuracy.
The original source code that was used to perform the experiments is available here.
Please note that this version is intended for reproducing the experiments only.
We strongly suggest to use the most recent version available here.

Data Sets

We used five data sets.
Tower, Covertype and Census are from the UCI Machine Learning Repository.
BigCross is a subset of the Cartesian product of Tower and Covertype, created by the authors of 1.
The fifth data set is called CalTech128 and consists of 128 SIFT descriptors.
For further information on the data sets and the availability of these, please contact us.

Reference Implementations

BIRCH:
We used the implementation of BIRCH by the authors available here. MacQueen: We used the implementation of MacQueen's k-means implemented in the open source project ESMERALDA. StreamKM++: We used the implementation of Stream-KM++ by the authors available here. StreamLS: We used the implementation of StreamLS by the authors available here.

Testing Environment

The testing environment that we used is based on the environment provided on the StreamKM++ webpage and contains scripts for the automatic execution of BICO and the reference algorithms.

[1] M. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, K. Swierkot: StreamKM++: A Clustering Algorithm for Data Streams. ALENEX 2010.
Also see the webpage of StreamKM++.

[2] D. Arthur, S. Vassilvitskii: k-means++: The advantages of careful seeding. SODA 2007

[3] H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, C. Sohler. ESA 2013