An Exercise in Data Analytics on Bibliographic Data

Andreas Harth, November 2016

Goal

Create a taxonomy of "data intelligence" in the context of the Semantic Web, Linked Data, Stream Reasoning, and the Internet of Things.

Straightforward Approach

Use Google: gives high-level definition in marketing-speak
Use DBLP: results in papers with the term in the title, but the papers do not define "data intelligence" in sufficient depth

Data-driven Approach

Idea: could we use data from the Semantic Web to get an overview of the topic of "data intelligence"?

Assumption: people work on research topics with some continuity (that is, the topics a researcher works on do not change suddenly)
If we identify the topic of each researcher, we can derive the likely topics a group of researchers is working on, and hence get an outline of a research field
Goal: identify topics in a group of researchers
Goal: create a ranked list of important publications and persons in the group

Groups that likely represent the (vaguely defined) term "data intelligence"

Now, how to get relevant topics, papers and persons in these groups?

Step 1: Identify Data Sources

DBLP is a database about computer science publications
- Instance data about publications (3.4m) and persons (1.7m)
- Available as RDF (closed beta though)
AMiner is a site providing scientific network mining.
- Citation network (3.3m DBLP papers, 8.5m citations) extracted from PDFs
- Available in its own data format; publications are referenced by their title string
Papers include authors, and calls for papers contain lists of people (in our groups: 4 to 27 people)

Step 2: Extract and Prepare Data

Manually create RDF files with DBLP URIs from author lists or lists of PC members

Convert AMiner dataset to RDF (adding DBLP identifiers)
A PageRank calculation over the entire dataset revealed issues in the data (wrong links), which were removed manually
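
Because AMiner references publications by title string, adding DBLP identifiers requires some form of title lookup. How the matching was actually done is not stated here; the following is a minimal sketch that assumes exact matching on normalised titles. The function names and the example.org URIs are hypothetical placeholders, not real DBLP identifiers.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lowercase a title and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_title_index(dblp_records):
    """Map normalised title -> list of URIs (one title may match several papers)."""
    index = defaultdict(list)
    for uri, title in dblp_records:
        index[normalise_title(title)].append(uri)
    return index


def link_citations(citations, index):
    """Turn (citing title, cited title) pairs into (citing URI, cited URI) pairs.

    Pairs where either title is unknown or ambiguous are skipped, on the
    assumption that a wrong link hurts a later ranking more than a missing one.
    """
    links = []
    for citing, cited in citations:
        src = index.get(normalise_title(citing), [])
        dst = index.get(normalise_title(cited), [])
        if len(src) == 1 and len(dst) == 1:
            links.append((src[0], dst[0]))
    return links


# Hypothetical sample data for illustration only.
dblp_records = [
    ("http://example.org/pub/1", "Data intelligence on the Internet of Things."),
    ("http://example.org/pub/2", "A Second Example Paper"),
    ("http://example.org/pub/3", "Introduction"),
    ("http://example.org/pub/4", "Introduction."),
]
citations = [
    ("data intelligence on the internet of things", "A second example paper."),
    ("A Second Example Paper", "Introduction"),  # ambiguous title, skipped
]
index = build_title_index(dblp_records)
links = link_citations(citations, index)
print(links)
```

Skipping ambiguous titles trades recall for precision, which seems a reasonable default given that wrong links had to be removed by hand.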

We need topics of publications, but we only have titles (e.g., "Data intelligence on the Internet of Things")
Extract topics from titles via a simple heuristic (many possible ways for improvement)
- Query for title
- Remove stop words ("the", "of", "it"...)
- Carry out Porter stemming
- Create bi-grams from titles to give an estimate of topics ("data_intellig", "intellig_internet", "internet_thing")
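
The heuristic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the exercise: the stop-word list is a small made-up sample, and `crude_stem` is a deliberately simple suffix stripper that only stands in for Porter stemming so that the sketch runs without third-party packages.

```python
import re

# Small illustrative stop-word list; the list actually used is not given in full.
STOP_WORDS = {"the", "of", "it", "on", "in", "a", "an", "and", "for", "to"}

# Suffixes handled by the stand-in stemmer. This is NOT the Porter algorithm.
SUFFIXES = ("ence", "ing", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def title_to_bigrams(title, stem=crude_stem):
    """Tokenise, drop stop words, stem, and join neighbouring stems with '_'."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    stems = [stem(token) for token in tokens if token not in STOP_WORDS]
    return [f"{left}_{right}" for left, right in zip(stems, stems[1:])]


print(title_to_bigrams("Data intelligence on the Internet of Things"))
```

For real use, a proper Porter stemmer can be passed as the `stem` argument, for example `nltk.stem.PorterStemmer().stem` if NLTK is installed.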

Step 3: Integrate Data

Combine RDF datasets: DBLP (beta 2016-07-03, 7.5 GB, 58m triples) and AMiner (2016-07-14, 762 MB, 5.7m triples)
Take original DBLP schema (prefix dblp), add citation links (rdfs:seeAlso) and topics generated via the simple heuristic (dcterms:subject).
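
As an illustration of the two additions to the schema, the sketch below writes citation links and topics as N-Triples lines. It assumes that topics are stored as plain literals, which is not stated above; the helper names and the example.org URIs are hypothetical.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def citation_triple(citing_uri, cited_uri):
    """One citation link: the citing paper rdfs:seeAlso the cited paper."""
    return f"<{citing_uri}> <{RDFS_SEE_ALSO}> <{cited_uri}> ."


def topic_triple(paper_uri, ngram):
    """One topic: the paper dcterms:subject an n-gram (assumed to be a literal)."""
    return f'<{paper_uri}> <{DCTERMS_SUBJECT}> "{escape_literal(ngram)}" .'


# Hypothetical placeholder URIs, not real DBLP identifiers.
print(citation_triple("http://example.org/pub/1", "http://example.org/pub/2"))
print(topic_triple("http://example.org/pub/1", "data_intellig"))
```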

Extract subgraphs with a SPARQL query, run once per group ${FOCUS} (the group files are manually created, see Step 2)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dblp: <http://dblp.org/rdf/schema-2015-01-26#>

DESCRIBE ?x ?y ?paper ?paper1
FROM <focus-${FOCUS}.nt>
FROM <dblp-2016-07-03.nt>
FROM <dblp-citation-good-links.nt>
FROM <ngrams.nt>
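WHERE {
  # ?x: a member of the focus group
  ?s foaf:focus ?x .
  # ?paper: a publication by ?x; ?y: any author of that publication (including ?x)
  ?paper dblp:authoredBy ?x .
  ?paper dblp:authoredBy ?y .
  # ?paper1: any publication by ?y
  OPTIONAL { ?paper1 dblp:authoredBy ?y . }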
}
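
The ${FOCUS} placeholder has to be filled in once per group. One possible way to do this, sketched here as an assumption rather than as the tooling actually used, is Python's `string.Template`, which uses the same `${...}` syntax. The group names are taken from the result file names listed below; how the query is then executed is left open.

```python
from string import Template

# The query from above, with ${FOCUS} left as a template placeholder.
QUERY_TEMPLATE = Template("""\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dblp: <http://dblp.org/rdf/schema-2015-01-26#>

DESCRIBE ?x ?y ?paper ?paper1
FROM <focus-${FOCUS}.nt>
FROM <dblp-2016-07-03.nt>
FROM <dblp-citation-good-links.nt>
FROM <ngrams.nt>
WHERE {
  ?s foaf:focus ?x .
  ?paper dblp:authoredBy ?x .
  ?paper dblp:authoredBy ?y .
  OPTIONAL { ?paper1 dblp:authoredBy ?y . }
}
""")

GROUPS = ["diiot", "cli", "iswc", "eswc", "sr", "iab"]

# Map each output file name to the query text that would produce it.
queries = {
    f"data-{group}.nt": QUERY_TEMPLATE.substitute(FOCUS=group)
    for group in GROUPS
}

print(queries["data-iswc.nt"])
```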

The results are RDF graphs containing data about the groups of researchers:

data-diiot.nt, 31M, 236k triples
data-cli.nt, 36M, 268k triples
data-iswc.nt, 116M, 872k triples
data-eswc.nt, 124M, 930k triples
data-sr.nt, 183M, 1384k triples
data-iab.nt, 9.9M, 74k triples

Step 4: Rank and Visualise

Query subgraphs for n-grams and publication year (restricted to papers published after 2006, to capture only recent topic trends)
Import into MS Excel, create pivot table with sparklines (2007 - 2016) to indicate popularity over time
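
The pivot step can also be approximated outside of Excel. The sketch below counts papers per n-gram and year and returns one row of ten yearly counts (2007 to 2016) per n-gram, which is the series a sparkline would display. The input rows are made-up sample data, and the function name is hypothetical.

```python
from collections import Counter

YEARS = range(2007, 2017)  # 2007 to 2016 inclusive


def pivot_by_year(rows, years=YEARS):
    """Count (ngram, year) pairs and return {ngram: [count per year]}."""
    counts = Counter((ngram, year) for ngram, year in rows if year in years)
    ngrams = sorted({ngram for ngram, _ in counts})
    return {ngram: [counts[(ngram, year)] for year in years] for ngram in ngrams}


# Hypothetical query result: one (n-gram, publication year) pair per paper.
rows = [
    ("linked_data", 2007),
    ("linked_data", 2007),
    ("linked_data", 2010),
    ("stream_reason", 2016),
    ("semant_web", 2005),  # before 2007, ignored
]

for ngram, series in pivot_by_year(rows).items():
    print(ngram, series)
```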

Compute PageRank on subgraphs
Query top-1000 persons/publications
Represent as sorted list
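
For readers unfamiliar with the ranking step, here is a minimal power-iteration PageRank over a list of directed edges, followed by a top-k selection. It is a generic sketch, not the implementation used for the exercise: the damping factor, the iteration count, the uniform handling of nodes without outgoing links, and the sample edges are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: score} for a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


def top_k(rank, k):
    """Sort by descending score (ties broken by name) and keep the first k."""
    return sorted(rank.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical citation edges: p1 and p2 cite p3, and p3 cites p4.
edges = [("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
rank = pagerank(edges)
for node, score in top_k(rank, 3):
    print(node, round(score, 3))
```

Since both persons and publications are ranked, the real graph presumably also contains authorship links; the function above does not care what the edges mean.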

Step 5: Inspect Results

Authors: Data intelligence on the Internet of Things

Authors: The Clinical Data Intelligence Project - A smart data initiative

Stream Reasoning Workshop 2016 Programme Committee

ISWC 2015 Senior PC

ESWC 2016 Area Chairs

Internet Architecture Board Semantic Interoperability in IoT Workshop Chairs

Web Services and Formal Methods Workshop Chairs (3rd to 11th Edition)

Big Data Value Association Officials

Interpretation of Results

Observations

The number of publications for all topics drops in 2015/2016, which could be due to delays in the update process of DBLP.

Follow-up Questions

The top researchers in the Semantic Web community are not from the community (although the PageRank calculation was local to the extracted subgraphs).
The top papers in the Semantic Web community are not Semantic Web papers (although the PageRank calculation was local to the extracted subgraphs).
The Stream Reasoning community seems heterogeneous, as top people are from Information Retrieval and Machine Learning. Is the reason that the topic of Stream Reasoning is fairly new?

Do you have an interesting thought regarding the results? Send me an email.

References

Andreas Thalhammer and Achim Rettinger. PageRank on Wikipedia: Towards General Importance Scores for Entities. In Joint Proceedings of the 5th Workshop on Data Mining and Knowledge Discovery meets Linked Open Data and the 1st International Workshop on Completing and Debugging the Semantic Web (Know@LOD-2016, CoDeS-2016), co-located with the 13th ESWC 2016.
Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008), pp. 990-998.