Labs

Datasets

If you are a researcher or developer interested in recommender systems, reference management software, or PDF title extraction, the following datasets may be of interest to you. For more information, read the papers "The Architecture and Datasets of Docear’s Research Paper Recommender System" and "Docear’s PDF Inspector: Title Extraction from PDF files". Feel free to contact us about a research cooperation.

Research Papers: The research papers dataset contains information about the research papers that Docear’s PDF Spider crawled and about their citations. It covers 9.4 million documents, including 7.95 million citations and 1.8 million URLs to freely available academic PDFs on the Web. The dataset also provides citation positions, i.e. where in a document a citation occurs, which leads to 19.3 million entries in the dataset.
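To illustrate why recording citation positions increases the number of entries, here is a minimal sketch. It assumes, purely for illustration, that each entry is a (citing document, cited document, position) triple; the field names and sample values are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter
from typing import NamedTuple


class CitationEntry(NamedTuple):
    """Hypothetical entry layout: one row per citation occurrence."""
    citing_doc: int   # assumed ID of the document containing the citation
    cited_doc: int    # assumed ID of the document being cited
    position: int     # assumed character offset of the citation in the text


# Invented sample data: document 1 cites document 10 at two positions.
entries = [
    CitationEntry(1, 10, 120),
    CitationEntry(1, 10, 560),
    CitationEntry(1, 11, 300),
    CitationEntry(2, 10, 45),
]


def occurrences_per_citation(rows):
    """Count how often each (citing, cited) pair occurs across all positions."""
    return Counter((row.citing_doc, row.cited_doc) for row in rows)


counts = occurrences_per_citation(entries)
```

In this toy example, three distinct citations produce four entries, because one citation occurs at two positions in the same document.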
Mind-Maps / User Libraries: The mind-maps dataset contains information on 390,613 revisions of 52,202 mind-maps created by 12,038 users. The mind-maps themselves are not included in the dataset for privacy reasons. The information includes the number of nodes, the linked documents, the ID of the user who created the mind-map, and when each mind-map was created.
Users: The user dataset contains information about 8,059 of the 21,439 registered users, namely those who activated recommendations and agreed to have their data analyzed and published. Among other things, the dataset includes the users’ date of registration, gender and age (if provided during registration), usage intensity of Docear, when Docear was last started, when recommendations were last received, the number of created mind-maps, the number of papers, how recommendations were labeled, the number of received recommendations, and click-through rates (CTR) on recommendations.
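For readers unfamiliar with the metric, the following minimal sketch shows the usual definition of click-through rate: the number of clicked recommendations divided by the number of delivered recommendations. The function and parameter names are hypothetical, and how the dataset itself stores or rounds CTR values is not assumed here.

```python
def click_through_rate(clicked: int, delivered: int) -> float:
    """Return clicked / delivered, the usual definition of CTR.

    Both arguments are hypothetical per-user counts. A user with no
    delivered recommendations gets a CTR of 0.0 by convention here.
    """
    if clicked < 0 or delivered < 0:
        raise ValueError("counts must be non-negative")
    if clicked > delivered:
        raise ValueError("clicked cannot exceed delivered")
    if delivered == 0:
        return 0.0
    return clicked / delivered


# Invented example: 5 clicks on 100 delivered recommendations.
example_ctr = click_through_rate(5, 100)
```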
Recommendations: The recommendation dataset contains information on 308,146 recommendations that were delivered to 3,470 users between March 2013 and March 2014. This includes the date of creation and delivery, the time required to generate the recommendations and the corresponding user models, and information on the algorithm that generated the recommendations. The information on the algorithms is manifold: we stored whether stop words were removed, which weighting scheme was applied, whether terms and/or citations were used for the user modelling process, and 28 other variables that are described in more detail in the dataset’s readme file.
PDFs: This dataset contains 500 PDFs that we used to evaluate Docear’s PDF Inspector. The dataset also contains an Excel file with the titles extracted by Docear’s PDF Inspector, Google Scholar, and ParsCit. Feel free to use this dataset to compare your PDF title extraction tool against Docear’s PDF Inspector.
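One possible way to run such a comparison is sketched below. It assumes you have already loaded two equally long lists of titles, one from your own tool and one to compare against; the function names, the normalization rules, and the sample titles are all hypothetical choices for illustration, not the evaluation procedure used for Docear’s PDF Inspector.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = unicodedata.normalize("NFKC", title).casefold()
    title = re.sub(r"[^\w\s]", " ", title)
    return " ".join(title.split())


def title_accuracy(extracted: list[str], reference: list[str]) -> float:
    """Share of titles that match the reference after normalization."""
    if len(extracted) != len(reference):
        raise ValueError("both lists must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        normalize_title(e) == normalize_title(r)
        for e, r in zip(extracted, reference)
    )
    return hits / len(reference)


# Invented sample titles; real values would come from your own tool's
# output and from the titles you choose to compare against.
my_titles = ["Deep  Learning: A Survey", "Wrong Title"]
reference_titles = ["deep learning a survey", "Right Title"]
score = title_accuracy(my_titles, reference_titles)
```

Exact matching after normalization is a deliberately strict choice; a similarity threshold (for example, on edit distance) would be more forgiving of small extraction differences.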