Paper abstract

A Space Efficient Solution to the Frequent String Mining Problem for Many Databases

Adrian Kugel - University of Ulm, Germany
Enno Ohlebusch - University of Ulm, Germany

Session: Pattern Mining
Springer Link: http://dx.doi.org/10.1007/978-3-540-87479-9_14

The frequent string mining problem is to find all substrings of a collection of string databases which satisfy database specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem in such a way that the peak memory consumption is a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of the database, making it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm which calculates the intersection of the results of different databases. This algorithm is based on an algorithm to merge two suffix arrays, and our modification allows us to also calculate the LCP table of the resulting suffix array during the merging.