We've received many inquiries about the datasets used in the paper. Although we've cited the data sources and described how to use them, here is a brief summary of the major datasets to help you get started.
- HDFS log:
This dataset first appeared in the paper "Large-scale system problem detection by mining console logs", published at SOSP'09. The authors have kindly released the dataset, and you can find it here. Please cite their paper if you use this dataset.
A brief introduction to how it is used in our paper:
- The raw log file is demobuild/data/online1/lg/sorted.log, about 1.5 GB unpacked, sorted by timestamp.
- In 200nodes, "mlabel.txt" and "nameIndex.txt" together provide the ground truth labels of blocks; "col_header.txt" contains the source code templates, which can be used to parse each raw log entry into a log key without a separate log parser.
- The dataset contains over 11 million log entries in total. DeepLog used the first 100 thousand normal log entries for training and tested on the rest. The ground truth labels are at the granularity of "blocks", so we first group log entries by their block id ("blk_*"), then train and test per block.
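The grouping step above can be sketched as follows. This is an illustrative snippet, not the paper's actual code: the regex and the sample log lines are assumptions about the typical HDFS log format, where each relevant entry mentions a block id of the form "blk_*".

```python
import re
from collections import defaultdict

# Block ids in HDFS logs look like "blk_38865049064139660" or
# "blk_-6952295868487656571" (the numeric part may be negative).
BLOCK_ID_RE = re.compile(r"blk_-?\d+")

def group_by_block(lines):
    """Return {block_id: [raw log lines]} for lines mentioning a block id."""
    blocks = defaultdict(list)
    for line in lines:
        match = BLOCK_ID_RE.search(line)
        if match:
            blocks[match.group(0)].append(line)
    return dict(blocks)

# Hypothetical sample entries, in the general shape of HDFS DataNode logs.
sample = [
    "081109 203615 148 INFO dfs.DataNode$PacketResponder: Received block blk_38865049064139660 of size 67108864",
    "081109 203807 222 INFO dfs.DataNode$PacketResponder: PacketResponder 0 for block blk_-6952295868487656571 terminating",
]
blocks = group_by_block(sample)
print(sorted(blocks))  # one group per distinct block id
```

Each group can then be labeled normal or abnormal using the per-block ground truth, and trained or tested as one sequence.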
Here are the datasets used by the log key anomaly detection model, parsed from the "sorted.log" above, where each line represents one "block":
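Assuming each line of the parsed file is a space-separated sequence of log key ids for one block (an assumption about the file layout, not confirmed by the source), the sequences can be turned into training pairs for a DeepLog-style next-key prediction model with a sliding window. The window length `h` and the sample sequence below are illustrative:

```python
def make_windows(key_sequence, h=10):
    """Slide a window of length h over one block's log key sequence;
    each window of h keys is paired with the key that follows it,
    which the model learns to predict."""
    samples = []
    for i in range(len(key_sequence) - h):
        samples.append((key_sequence[i:i + h], key_sequence[i + h]))
    return samples

# Hypothetical log key sequence for one block, with a short window for display.
seq = [5, 5, 5, 22, 11, 9, 11, 9, 11, 9, 26, 26, 26]
pairs = make_windows(seq, h=3)
print(pairs[0])  # first training pair: (window, next key)
```

At test time the same windowing is applied; a block is flagged anomalous if some actual next key is not among the model's top predictions for its window.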
- OpenStack cloud log:
This dataset was generated on CloudLab. It is quite easy to start an OpenStack experiment there and generate your own logs; see here.
A few other datasets used in our paper:
- Blue Gene supercomputer log: this log, along with many others, can be found here.
- VAST Challenge 2011 Mini-Challenge 2 log:
This dataset was originally downloaded from here; however, that link is currently unavailable. It may have been moved here.