An Exercise in Data Analytics on Bibliographic Data
TL;DR straight to results
Andreas Harth, November 2016
Goal
Create a taxonomy about "data intelligence" in the context of Semantic Web, Linked Data, Stream Reasoning, Internet of Things.
Straightforward Approach
- Use Google: gives high-level definition in marketing-speak
- Use DBLP: results in papers with the term in the title, but the papers do not define "data intelligence" in sufficient depth
Data-driven Approach
Idea: could we use data from the Semantic Web to get an overview of the topic of "data intelligence"?
- Assumption: people work on research topics with some continuity (that is, the topics a researcher works on do not change suddenly)
- If we identify the topic of each researcher, we can derive the likely topics a group of researchers is working on, and hence get an outline of a research field
- Goal: identify topics in a group of researchers
- Goal: create a ranked list of important publications and persons in the group
Groups that likely represent the (vaguely defined) term "data intelligence"
Now, how to get relevant topics, papers and persons in these groups?
Step 1: Identify Data Sources
- DBLP is a database about computer science publications
- Instance data about publications (3.4m) and persons (1.7m)
- Available as RDF (closed beta though)
- AMiner is a site providing scientific network mining.
- Citation network (3.3m DBLP papers, 8.5m citations) extracted from PDFs
- Available in its own data format; publications are referenced by their title string
- Papers include authors, and calls for papers contain lists of people (in our groups: 4 to 27 people)
Step 2: Extract and Prepare Data
- Manually create RDF files with DBLP URIs from author lists or lists of PC members
- Convert AMiner dataset to RDF (adding DBLP identifiers)
- A PageRank calculation over the entire dataset revealed issues in the data (wrong links), which were removed manually
- We need topics of publications, but we only have titles (e.g., "Data intelligence on the Internet of Things")
- Extract topics from titles via a simple heuristic (many possible ways for improvement)
- Query for title
- Remove stop words ("the", "of", "it"...)
- Carry out Porter stemming
- Create bi-grams from titles to give an estimate of topics ("data_intellig", "intellig_internet", "internet_thing")
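The title heuristic above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the stop-word list is a small assumed sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real Porter stemmer (which handles many more cases).

```python
import re

# Assumed, deliberately small stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "of", "it", "on", "a", "an", "in", "and", "for", "to"}

# Hypothetical stand-in for Porter stemming: strips a few common suffixes only.
SUFFIXES = ("ence", "ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def title_bigrams(title):
    """Lower-case, tokenise, drop stop words, stem, and join adjacent stems."""
    tokens = re.findall(r"[a-z]+", title.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return [f"{a}_{b}" for a, b in zip(stems, stems[1:])]


if __name__ == "__main__":
    print(title_bigrams("Data intelligence on the Internet of Things"))
```

For the example title from the text, this sketch yields the three bi-grams listed above. Swapping `crude_stem` for a full Porter stemmer would be the obvious first improvement.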
Step 3: Integrate Data
- Combine RDF datasets: DBLP (beta 2016-07-03, 7.5 GB, 58m triples) and AMiner (2016-07-14, 762 MB, 5.7m triples)
- Take the original DBLP schema (prefix dblp), add citation links (rdfs:seeAlso) and topics generated via the simple heuristic (dcterms:subject).
- Extract subgraphs with SPARQL query for group ${FOCUS} (manually created)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dblp: <http://dblp.org/rdf/schema-2015-01-26#>
DESCRIBE ?x ?y ?paper ?paper1
FROM <focus-${FOCUS}.nt>
FROM <dblp-2016-07-03.nt>
FROM <dblp-citation-good-links.nt>
FROM <ngrams.nt>
WHERE {
?s foaf:focus ?x .
?paper dblp:authoredBy ?x .
?paper dblp:authoredBy ?y .
OPTIONAL { ?paper1 dblp:authoredBy ?y . }
}
The results are RDF graphs containing data about the groups of researchers:
- data-diiot.nt, 31M, 236k triples
- data-cli.nt, 36M, 268k triples
- data-iswc.nt, 116M, 872k triples
- data-eswc.nt, 124M, 930k triples
- data-sr.nt, 183M, 1384k triples
- data-iab.nt, 9.9M, 74k triples
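To make the pattern of the extraction query easier to follow, here is a rough Python sketch that mirrors its WHERE clause over an in-memory list of triples. It is an assumption-laden illustration: the function name `coauthor_neighbourhood` and the tuple-based triple representation are hypothetical, and the sketch only collects the matched resources rather than producing full DESCRIBE output.

```python
FOCUS = "foaf:focus"
AUTHORED_BY = "dblp:authoredBy"


def coauthor_neighbourhood(triples):
    """Return the resources bound to ?x, ?paper, ?y and ?paper1 in the query.

    triples: iterable of (subject, predicate, object) string tuples.
    """
    focus_persons = {o for s, p, o in triples if p == FOCUS}

    # Index: paper -> set of its authors.
    authors_of = {}
    for s, p, o in triples:
        if p == AUTHORED_BY:
            authors_of.setdefault(s, set()).add(o)

    # ?paper: papers with at least one focus person as author.
    papers = {paper for paper, authors in authors_of.items()
              if authors & focus_persons}
    # ?x: focus persons who actually authored one of those papers.
    x = {a for paper in papers for a in authors_of[paper]} & focus_persons
    # ?y: every author of those papers (focus persons and co-authors).
    y = set().union(*(authors_of[paper] for paper in papers))
    # ?paper1: all papers written by anyone in ?y.
    papers1 = {paper for paper, authors in authors_of.items() if authors & y}

    return {"x": x, "paper": papers, "y": y, "paper1": papers1}


if __name__ == "__main__":
    demo = [
        ("group", FOCUS, "alice"),
        ("p1", AUTHORED_BY, "alice"),
        ("p1", AUTHORED_BY, "bob"),
        ("p2", AUTHORED_BY, "bob"),
        ("p3", AUTHORED_BY, "carol"),
    ]
    print(coauthor_neighbourhood(demo))
```

The subgraph therefore covers the group members, their co-authors, and all papers by either, which is why the extracted files are much larger than the 4 to 27 people in each group would suggest.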
Step 4: Rank and Visualise
- Query subgraphs for n-grams and year of paper (restrict to papers published after 2006, to capture only recent topic trends)
- Import into MS Excel, create pivot table with sparklines (2007 - 2016) to indicate popularity over time
- Compute PageRank on subgraphs
- Query top-1000 persons/publications
- Represent as sorted list
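The ranking step can be illustrated with a minimal power-iteration PageRank in Python. This is a sketch under stated assumptions, not the implementation used here: it assumes the subgraph has been reduced to directed (source, target) pairs such as citation links, uses the conventional damping factor of 0.85, and spreads the rank of nodes without outgoing links evenly over all nodes.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    n = len(nodes)

    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is redistributed evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_k(rank, k=1000):
    """Return the k highest-ranked nodes as a sorted list of (node, score)."""
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    citations = [("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
    for node, score in top_k(pagerank(citations), k=3):
        print(node, round(score, 3))
```

Because the scores are computed only over the extracted subgraph, a node's rank reflects its importance relative to that group's neighbourhood, not to the whole of DBLP.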
Step 5: Inspect Results
Authors: Data intelligence on the Internet of Things
Authors: The Clinical Data Intelligence Project - A smart data initiative
Stream Reasoning Workshop 2016 Programme Committee
ISWC 2015 Senior PC
ESWC 2016 Area Chairs
Internet Architecture Board Semantic Interoperability in IoT Workshop Chairs
Web Services and Formal Methods Workshop Chairs (3rd to 11th Edition)
Big Data Value Association Officials
Interpretation of Results
Observations
- The number of publications for all topics drops in 2015/2016, which could be due to delays in the update process of DBLP.
Follow-up Questions
- The top researchers in the Semantic Web community are not from the community (although the PageRank calculation was local to the extracted subgraphs).
- The top papers in the Semantic Web community are not Semantic Web papers (although the PageRank calculation was local to the extracted subgraphs).
- The Stream Reasoning community seems heterogeneous, as top people are from Information Retrieval and Machine Learning. Is the reason that the topic of Stream Reasoning is fairly new?
Do you have an interesting thought regarding the results? Send me an email.
References
- Andreas Thalhammer and Achim Rettinger. PageRank on Wikipedia: Towards General Importance Scores for Entities. In Joint Proceedings of the 5th Workshop on Data Mining and Knowledge Discovery meets Linked Open Data and the 1st International Workshop on Completing and Debugging the Semantic Web (Know@LOD-2016, CoDeS-2016), co-located with the 13th ESWC 2016.
- Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008). pp.990-998.