Concept and Objectives

Motivation and Main Idea

CUBIST is a visionary approach that leverages Business Intelligence to a new level of precise, meaningful and user-friendly analytics of data. CUBIST follows a best-of-breed approach that combines capabilities of Business Intelligence, Semantic Technologies, and Visual Analytics.

 

CUBIST in a nutshell: Developing an approach for semantic and user-friendly Business Intelligence by
  • augmenting Semantic Technologies with BI capabilities, and
  • providing conceptually relevant and user friendly visual analytics.
 

 

Business Intelligence (BI) refers to technologies, applications and practices of gathering, representing and analyzing data to support decision making in enterprises. According to a 2009 Gartner report, BI software is going to use up to 40% of company IT budgets in the next four years. Constantly growing amounts of data, complicated and rapidly changing interactions in the economy and an emerging trend of incorporating unstructured data into analytics bring new challenges to BI:

Unstructured data. One of the problems is the fact that trends visible in unstructured data (e.g., in the news) need some time to appear in a data warehouse. It is expected by the Data Warehouse Institute that we will soon witness a significant change of paradigm in Data Warehousing and BI: the proportion of unstructured data in all data is estimated to be over 80%, leading to the fraction of structured data used in Data Warehouses decreasing significantly in favour of unstructured data. Information extraction to provide unstructured data sources to BI brings new challenges regarding the quality of extracted data. Currently BI engines do not handle the meaning of data so detecting contradicting information is difficult.

 

Semantic Technologies (ST) are proposed by CUBIST as key elements in new BI systems to extract and utilise the meaning of data to address these challenges. ST refers to a set of technologies that have grown in popularity over the last decade, due to their potential for automation. ST provide a formal foundation to represent and reason over knowledge. Research within the Semantic Web has standardised a set of knowledge representation languages including the Resource Description Framework (RDF), an abstract model for describing things by making statements about resources and their properties. RDF is supported by logical formalisms and related reasoning techniques and query languages such as SPARQL.

The advantages of using ST can be summarised as:

From an application point of view, ST are well suited for:

 

Combining ST and BI: The following table compares classical BI with ST

Feature
Classical BI
Semantic Technologies
Federating data from unstructured and structured sources
bad
good
Explicit meaning of data / explicit relations
bad
good
Querying/accessing huge amounts of data
good
average
OLAP functionalities
good
bad
Agile DW schema
bad
good
Inferring implicit facts
bad
good
ETL flexibility
bad
average
Data reuse
bad
good
Use of external datasources / data location independence
bad/average
good

 

ST introduce benefits to the BI lifecycle (see Figure 1 for comparison of BI and BI+ST lifecycles):

To realise these benefits CUBIST will address issues relating to traditional ST:

Figure 1: lifecycle of classical and semantic BI

Analyzing the Information by Means of Interactive Visualizations: User interaction with data is closely related to its visual representation. Interactive visualization technology displays numerous aspects of multidimensional data using interactive pictures and charts. The colour, size, shape and motion of objects represent the multidimensional aspects of the data. Information visualization has been defined as “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition”. A good visual representation can amplify user cognition by providing more information, faster, with less cognitive effort. BI platform vendors are currently promoting these technologies as an alternative and enrichment to traditional reporting and online analytical processing capabilities. CUBIST will provide BI users with visualization and interaction techniques enhanced by Semantic Technologies, such as Formal Concept Analysis (FCA).

FCA captures hitherto undiscovered patterns in data. Concepts are formalised as groups of objects associated with groups of attributes. Hierarchical relationships between these groups are formed and visualized and can be used for information retrieval. Information organised in this way has a close correlation to human perception. Thus, FCA has been used as the basis for semantic search engines  and as a means of organising information based on its meaning. FCA has also been used to mine data for groups of items and their associated transactions to identify, for example, groups of products that are frequently purchased together.

Objectives

 

Objective 1: Semantic ETL. Unlike classical BI, CUBIST will provide comprehensive methods to bring unstructured data into analytics. SETL will use lineage information, metadata of data sources, semantic descriptions, and meaning of extracted data to support error detection within data sources and the SETL process. Semi-automated mapping of structured data sources to the RDF model will reduce the complexity of data integration. This objective will be approached by leveraging ETL best practises, orchestrated and wrapped by a semantic layer.

Expected Result: Semantically enriched lineage information, error detection within extracted data and CUBIST information warehouse and a SETL component that provides SPARQL endpoints for various data sources.

Objective 2: Semantic Warehouse. CUBIST will employ an RDF triple store and ontology as the backbone for the information warehouse, to improve performance and reduce the complexity of the integration of heterogeneous data sources. ST will enrich BI by enabling the discovery of new implicit information through logical reasoning. The standard RDF query language SPARQL will be extended by OLAP functionalities for complex aggregates and analysis. The information warehouse will use advanced indexing and materialisation techniques known from state-of-the-art data warehousing to improve the performance of the RDF triple store. A layer within the warehouse will integrate the triple store with the FCA-based visual analytics.

Expected Result: OLAP extensions for SPARQL, discovery mechanism for finding implicit information, triple store based information warehouse, integration of triple store with FCA.

Objective 3: FCA-Based Advanced Visual Analytics. On top of the Semantic Warehouse, CUBIST will provide ways to visualize and explore hitherto undiscovered BI using FCA.

Expected Result: High performance formal concept miner, large scale FCA capability, FCA visualization of BI.