A Build System for Web Data
December 12th, 2008
Systems for graphically constructing processing pipelines for web data are mushrooming: [Your company name here] Pipes. These types of systems make for very nice and impressive demos, but they are utterly useless for power users who need to process sizable amounts of web data, let’s say datasets from 1MB to 10GB (the “ETL scenario” if you’re a bit older).
Dealing with data from the web requires all sorts of cleansing and pre-processing steps, and cobbling the various steps together is a major pain if you want to build practical systems. We currently create “processing pipelines” (to use a fancy expression) with bash scripts, but error handling and resuming build operations are tricky. Error handling is important since web data is notoriously noisy, and *will* kill your system at some point. Resuming build operations is important because on large datasets, you don’t want to rerun the entire pipeline if only the last processing step has changed.
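To make the “resume” idea concrete, here is a minimal sketch of the timestamp check that make and Ant rely on: a step runs only if its output is missing or older than one of its inputs. It is written in Python for brevity and is not what we actually run; the names `is_stale` and `run_step` and the line-by-line `transform` callback are made up for illustration. Output goes to a temporary file first, so a step that dies on noisy input never leaves behind a half-written file that looks up to date.

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(name, sources, target, transform, log):
    """Run one pipeline step, but only if its target is out of date.

    transform is a hypothetical per-line function (str -> str);
    log collects (step name, outcome) pairs for inspection.
    """
    if not is_stale(target, sources):
        log.append((name, "skipped"))
        return
    tmp = target + ".tmp"  # write here first, rename only on success
    try:
        with open(tmp, "w", encoding="utf-8") as out:
            for src in sources:
                # errors="replace" keeps undecodable bytes from aborting the read
                with open(src, encoding="utf-8", errors="replace") as f:
                    for line in f:
                        out.write(transform(line))
        os.replace(tmp, target)  # publish the finished file in one step
        log.append((name, "built"))
    except Exception:
        # Clean up so the next run sees a missing target and retries the step
        if os.path.exists(tmp):
            os.remove(tmp)
        log.append((name, "failed"))
        raise
```

Chaining a few `run_step` calls, each reading the previous step’s target, gives a pipeline where rerunning after a failure or a source change only repeats the steps whose inputs are newer than their outputs. Changing the code of a step is not detected by timestamps alone; in a build tool you would list the script itself as one of the sources.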
Thankfully there are already systems out there for processing source code (such as make or Ant), and these systems could easily be adapted to handle data as well. For me, Ant would be the system of choice since it’s relatively compact and thus easy to learn, and already includes a variety of build primitives for file manipulation. What’s needed now are Ant tasks that do things such as parsing, scraping, cleansing, schema and instance matching, reasoning and whatnot, all steps necessary for preprocessing RDF. Using a common platform (Ant) would enormously simplify the integration of various Java-based RDF processing tools.
