CubeSlice, by SDG Computing, Inc.office (715) 262-3601

Session 2 : Finding Measures, Dimensions, Levels, and Member Properties in a Relational Database

From a Two Part Seminar on Getting Started With OLAP

This session explains how to find OLAP structures in a relational database. Thank you to all the participants in the first session of the seminar for your ideas on this topic.

In the first session, I stated that there were two ways of deciding on the structures that you are going to include in your cubes. One of the participants suggested that I should have really said that there are three ways, and I agree. Here is my revised list:

1. Deciding on OLAP structures by looking at the spreadsheets that are currently being used.
2. Deciding on OLAP structures by looking at the elements that are available in the source data.
3. Deciding on OLAP structures by asking the users what they really need to see.

I once heard someone say that the best way to discover what was needed in the cubes was to see what the business analysts were writing in the margins of their spreadsheets. Perceptive analysts realize that the reports they receive don't show all the significant information. They take the data in the reports and do additional calculations or comparisons. They make notes in the margins - and those notes can tell us what is really important.

One example of this would be a report that shows information about orders, including an order date, a required date, and a delivery date - but doesn't show whether or not an order was overdue, or the number of days an order was overdue, or the number of days allowed between order date and required date. An analyst who was concerned about on-time delivery for a particular customer would have to scan through the report to find that customer and then make notes regarding the difference between the various dates and the content of the particular orders. Maybe the customer had overdue orders because of unrealistic delivery times. Maybe the overdue orders were caused by some particular item included in the order.

In designing your cube, you can include the number of days between the various dates. Because it's in a cube, the analyst can easily find the overdue orders and analyze them from any of the available dimensions - by product, by time period, by customer, by customer geographical location, etc.

Our fundamental goal when developing OLAP structures is to find out what perspectives, calculations, and comparisons are important to our users and then to include those elements in our cubes.

What's in the data?

We have a more limited goal in this session. We are focusing on the second method of deciding on cube structures - by looking at the elements that are available in the source data.

Imagine you're in a situation where the users know they want OLAP, but they don't know what should be included in the OLAP structures, so they tell you to include everything that's in the source database. That's what we're going to do in this session. We're going to design an OLAP structure that includes all possible elements from the Northwind database, one of the sample databases installed with SQL Server 2000.

It's a little hard to imagine users asking for everything from a relational database, but it's not at all unusual for users to give a long list of items that should be included in the cubes. We should be able to include any data in our OLAP structures - no matter what it is. So we're going to build some OLAP cubes that considers every table and every field from the Northwind database. If we can't use a table or a field we're going to explain why it can't be used.

Sample Cubes and Client Files

I created an Orders cube, using almost all of the fields in the Northwind database. The cube matches the structure outlined below, except that Customer.Demographics is not included. I left out this dimension because there is no data in the CustomerDemographics table in Northwind. Besides that, the only fields that are not included are the fields in Territories and in Region. Their omission is explained under KEY POINT #20 below.

I made the Orders cube three separate ways:

A simple one that contains only one hierarchy per dimension
One that contains multiple hierarchies within several of the dimensions
One that contains all the hierarchies in separate dimensions

I put each of these cubes in a separate local cube file. I then created a CubeSlice Share file including all three of them. A CubeSlice Share file is a self-extracting EXE, which contains local cube files and a set of views to display them. It can be used on any computer that has the Office XP or Office 2003 Office Web Components installed. I also put the three local cube files in a separate zip file if you want to download and browse them with a different OLAP browser.

Download the CubeSlice Share files containing the three local cube files.

Download the three local cube files in a zip file.

Try browsing each of the cubes and feel how they differ from each other.

The Process of Finding OLAP Structures

KEY POINT #1 - Almost every piece of data in a source database can be used in an OLAP structure - and some of them should be used more than once.


Almost every field in the Northwind database is going to be used either to create a measure, a level, or an additional attribute (member property in SQL Server 2000). Some fields are going to be used both as measures and as attributes.

KEY POINT #2 - Even the key fields in the relational database can be used in the cubes - include them as member properties (when using SQL Server 2000 Analysis Services) or attributes (when using SQL Server 2005 Analysis Services).

Relational databases have key fields which are used to identify an individual record and to maintain referential integrity between tables. In some systems these key fields have meaning to the users. They will refer to a product, for example, by the Product ID number. But even if they don't have any meaning, these key fields can still be valuable for connecting the OLAP cubes to other systems. You can use them in Analysis Services Actions to interface with other systems - opening existing reports or to initiate a process. You could, for example, create an interface from your cubes to the ordering system. As an analyst is browsing a cube and identifies a product that needs to be reordered, he could initiate an action from the cube that orders the product, based on the Product ID value.

KEY POINT #3 - Every non-key numeric field is a candidate for creating a measure - either by itself or together with other fields in a calculation that creates a measure.

Here are the non-key numeric fields in the Northwind database:

Order Details: UnitPrice, Quantity, Discount

Orders: Freight

Products: QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel

Almost all the participants in the first session of the seminar selected Price, Quantity, and Discount as measures, either by themselves or as part of the calculation of another measure. Here are the suggestions that were made for measures based on these fields:

UnitPrice (This measure would normally be used either as an Average or in a calculation for the Extended Price)
Quantity
ExtendedPriceBeforeDiscount
DiscountPercent
DiscountAmount
ExtendedPriceAfterDiscount
Average Item Count
Average Price
Average Margin
Profit

(I don't think the last two can be calculated from the data that is available in Northwind, unless the UnitPrice that is listed in the Product table would be the purchase price. That seems unlikely, since the UnitPrice in the Product table is always the same or more than the UnitPrice listed in the Order Details table. Rather, the Product table seems to be listing the current selling price of the product. All values in the UnitPrice field in Order Details for orders after 4/4/1997 match the UnitPrice for the same product in the Products table, while all the prices before that date are different. This is the kind of analysis that sometimes needs to be done on the source data. Hopefully, there is some documentation or the users can explain the significance of the data in particular fields, but you will often need to see what data is actually there.)

One person suggested TerritorySales, SupplierSales, and ShipperSales as measures. These probably wouldn't be used as measures, because these values would be available by browsing the various dimensions. Dimensions are used to show sales for particular territories, suppliers, or shippers.

Some of the participants also selected UnitsInStock and UnitsOnOrder as measures.

KEY POINT #4 - All measures used in a single cube (or single measure group in SQL 2005) have to use the same level of granularity.

Two participants in the first seminar selected Freight as a measure. It could be a very important measure when analyzing different shippers. It would be easy to use this measure in a cube where the Orders table was used as the primary basis for the fact table in the cube. But it's harder to use Freight in a cube where Order Details is used as the basis for the fact table. All of the measures in a single cube have to be on the same level of granularity. Freight would have to be divided out among the Order Details. This could be done equally, proportionally by Quantity, or proportionally by Extended Price. It would be best, however, if it could be done using the the actual formula that the company used to calculate Freight costs.

In the sample cubes I included a measure for Freight, divided out among the Order Details proportionally by Extended Price Before Discount.

KEY POINT #5 - Measures can also be created by counting values in fields - but Distinct Count is often the kind of counting that's needed.

OrderCount, ProductCount, ShipperCount, and EmployeeCount were all suggested as possible measures. If Order, Product, Shipper, and Employee are all dimensions in a cube, then a simple count of these fields will all give the same value. Each order detail will be associated with one order, one product, one shipper, and one employee. When we count any of them we will get the number of order detail records that are combined to give the value displayed in each cell of the cube.

When users are looking for counts, they often want Distinct Count values. Distinct Count gives us the number of different orders, the number of different products sold, the number of different shippers, or the number of different employees. If we had a distinct count on employees, for example, we could see how many employees have sales in France, how many employees have sales of a particular product, or how many employees had sales in a particular time period.

Analysis Services allows you to define one Distinct Count measure per cube, and you can define additional ones using calculated measures. A calculated measure that creates a Distinct Count, though, will slow down cube browsing in larger cubes.

I think the Distinct Count measure that would make the most sense for this cube (and was selected by the largest number of participants) is OrderCount.

KEY POINT #6 - You can count non-null values in fields that allow null values.

You could create a measure called ShippedCount by counting the number of non-null values in ShippedDate in the Orders table. This field has null values for orders that have not yet been shipped. A Count in Analysis Services ignores null values.

KEY POINT #7 - Measures can be created by comparing the values in date fields.

I think the end users could very likely benefit from comparing OrderDate, RequiredDate, and ShippedDate. Two participants suggested OrderedShippedDiff. I think it would also be useful to include either RequiredShippedDiff or OrderedRequiredDiff. You could include all three, but with two of these measures, the third could be easily derived.

When presented in the cube, these measures would probably be best displayed as averages with a calculated measure.

KEY POINT #8 - Finding the potential measures in a database can help you find the potential fact tables around which cubes can be built.

The measures in the Order Details table all give transactional information. This table could be used as the basis for a fact table for a cube called Orders, which could have dimensions identifying the Product sold, the Customer, the Employee, and the Time of the transaction.

The measures in the Product table give inventory information. This table could be used as the basis for a fact table for a cube called Product Inventory. It would have dimensions such as Product, Supplier, and the Time that the inventory information was recorded.

KEY POINT #9 - Consider every table in the relational database as a candidate for a dimension.

This is easier to do in Northwind, with 13 tables, than in the typical relational database, which can easily have several hundred tables. Still, Northwind's structure is similar to the structure of normal corporate databases, even if it is more simple. I think it is a good small scale model for what you would normally have to do on a much larger scale.

Here are the table names and field names from Northwind:

Categories: CategoryID, CategoryName, Description, Picture

CustomerCustomerDemo: CustomerID, CustomerTypeID

CustomerDemographics: CustomerTypeID, CustomerDesc

Customers: CustomerID, CompanyName, ContactName, Address, City, Region, PostalCode, Country, Phone, Fax

Employees: EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, BirthDate, HireDate, Address, City, Region, PostalCode, Country, HomePhone, Extension, Photo, Notes, Reports To, PhotoPath

EmployeeTerritories: EmployeeID, TerritoryID

Order Details: OrderID, ProductID, UnitPrice, Quantity, Discount

Orders: OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry

Products: ProductID, ProductName, SupplierID, CategoryID, QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel, Discontinued

Region: RegionID, RegionDescription

Shippers: ShipperID, CompanyName, Phone

Suppliers: SupplierID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax, HomePage

Territories: TerritoryID, TerritoryDescription, RegionID

From this list, I would make a list of candidate dimensions as follows:

Category
CustomerDemographic
Customer
Employee
Order Detail
Order
Product
Region
Shipper
Supplier
Territory

The only two tables I have eliminated from consideration as dimensions in this first step are CustomerCustomerDemo and EmployeeTerritories, because they are join tables. All of their fields are included in other tables.

KEY POINT #10 - Consider merging two or more candidate dimensions into a single dimension if the data in them is highly correlated (or is perceived by the end users as being correlated).

We can tell that tables are related by looking at their foreign key relationships. In addition to these formalized database relationships, we have to consider how the tables fit into cognitive groupings. It is not possible to do this by simply looking at the structure of the database. We must use our own common sense - and, in particular, the common sense of the people who are using the data. Do certain types of data belong together? Does some of the data fit together logically in a hierarchy?

Here are some tables that we could consider putting into common dimensions:

Customer and CustomerDemographic - Customer Demographics presents additional information about customers.

Employee, Territory, and Region - An Employee is assigned territories. Territories are assigned to a Region - looks like levels of a hierarchy to me.

Product and Category - Category could be used as a higher level in the Product dimension.

Product and Supplier - Supplier could be used as a higher level in the Product dimension.

You have to weigh the advantages and disadvantages when deciding what entities should be combined together in a dimension. Here is a more detailed analysis of the decisions regarding product, supplier, and category.

Consideration #1. Creating two separate dimensions (or hierarchies) for product (ProductByCategory and ProductBySupplier) gives end users two different ways to summarize information for products.

Consideration #2. Users may want to put ProductByCategory on columns and ProductBySupplier on rows. This will look fine unless the user drills down to the Product level of both dimensions. They can look at Products and Categories while they're looking at Suppliers or they can look at Products and Suppliers while they're looking at Categories.

Consideration #3. If both dimensions are used at the same time and both are drilled down to the Product level, most of the cells of the grid will be empty. It's not logical to have the same level on the rows and on the columns at the same time. One of the members of the seminar asked about a similar situation in the FoodMart 2000 sample database. When users have Stores on one axis and Store Size in Square Feet on the other axis most of the cells are empty, because only one store is valid for each size.

Because of the problems described in Consideration #3, many developers would leave either Supplier or Category as a separate dimension, not combined with Product. But the desires of the users need to be considered. In my experience, users will often want to see data with different rollup options, even if that gives the possibility of creating combinations that don't make any sense.

KEY POINT #11 - Numeric values can be converted into dimensions by dividing the values into ranges.

It would be possible to make a dimension from the Discount field in the Order Detail dimension (though I have not done that in my proposed structure).

Because there aren't many different values, the lowest level of this dimension could be the actual discount percentages. At a higher level, these values would be combined into ranges - 0%, 1-5%, 6-10%, 11-15%, 16-20%, etc.

Analysis Services in SQL Server 2005 provides a way for the server to generate the desired ranges. In SQL Server 2000 ranges have to be created in the source database.

KEY POINT #12 - Tables that contain measures may or may not also have values which can be used to create a dimension.

The Order Details table has all numeric values - two key values and three numeric values which are good candidates for measures. I have decided not to use any of the numeric values in the Order Details table to create a dimension. The other two tables with measures (Orders and Products) have non-measure fields which can be used to create dimensions.

That leaves us with the following dimensions:

Customer
Employee
Product
Shipper
Order

KEY POINT #13 - If a table like Order Details is used as the fact table, the Orders table can be used to create a simple dimension that allows users to view individual orders - but this often doesn't work out very well.

Order doesn't make a very good dimension, because it most likely doesn't have any natural levels. But I've seen a lot of people who want to have this kind of dimension. They want to be able to view the individual orders as they are browsing the cube.

If a dimension like this is necessary, I would suggest adding some kind of levels, even if those levels are somewhat artificial. I created an artificial level for this dimension which divides the OrderID values into ranges of 100 values (10200-10299, 10300-10399, etc.)

It can often be helpful to create a set of local cubes, each with a subset of the Order Details. This can greatly improve browsing speed performance.

KEY POINT #14 - OLAP cubes nearly always have a Time dimension.

We haven't said anything yet about what is one of the most popular and universal dimensions in OLAP cubes - Time. I have seen a few cubes that didn't have a Time dimension, but I think almost all cubes would benefit from one.

KEY POINT #15 - Almost any date field in the relational database can be made into a Time dimension.

There are three date fields in the Orders table - OrderDate, RequiredDate, and ShippedDate. All three are candidates to be the basis for Time dimensions.

The OrderDate time dimension would probably be most used by the marketing and sales department.

The RequiredDate time dimension would be most relevant for the people planning future production.

The ShippedDate time dimension would probably be the best perspective for analyzing past levels of production.

KEY POINT #16 - The data load date can be used as the basis for a Time dimension.

If you are building a cube focused on transactions, date fields in the data will probably be used for the Time dimension. But if you are building an inventory type of cube, the Time dimension will often be based on the date the data is loaded into the data mart.

The Products table has two fields that can be used for tracking inventory - UnitsInStock and UnitsOnOrder. If the values of these fields in the source database were saved on a daily or weekly basis, we could build a cube that could be used to analyze the change in the inventory. The Time dimension for this cube would be the time that the particular values were loaded - in other words, the load date.

KEY POINT #17 - The levels of a Time dimension should match the time periods that analysts want to use to analyze their data.

One date field can be used to create all the levels of a Time dimension - All Time, 5-Year Period, Year, Half-Year, Quarter, Month, WeekInAllTime, WeekInYear, WeekInMonth, DayOfYear, DayOfMonth, and DayOfWeek. You should use the particular levels that the analysts want to see in your company. I have seen many cases where the analysts want two or more separate hierarchies in the Time dimension, because they like to drill down into the Time dimension in different ways.

KEY POINT #18 - Except for the Time dimension, each standard level is usually built from the data in a single field.

For example, you could build a dimension for the Product dimension consisting of Category on a higher level and Product at a lower level. This works well only if each Product is a member of one and only one Category. Because of the key relationship between the Products table and the Categories table in Northwind, this appears to be the case. The names of the members for each level would come from these two fields - CategoryName in Categories and ProductName in Products. These member names are the values that people would see on the rows or columns of a spreadsheet when browsing with this dimension.

KEY POINT #19 - Geography is a convenient source for levels in a dimension.

Geography works well for levels because lower geographical levels are often fully contained in higher ones. Most cities are contained in a particular state and most states are contained within a particular country. Districts and regions are often more of a problem, because their definition can vary within an organization and they can change over time. Metropolitan areas are much harder, because they often don't fit in a hierarchy with states - or even with countries.

There are five geographical hierarchies in the Northwind data - for Customers, Suppliers, ShippingDestination, where Employees live, and the Territories and Regions to which Employees are assigned. Some of them are probably more relevant than others - it's probably more appropriate to analyze Customers by their geographic location than analyzing Employees by where they happen to live.

KEY POINT #20 - End users may want less obvious levels added into a hierarchy for particular business needs.

One of the seminar participant suggested creating the following levels for the Product dimension:

CategoryName
Discontinued
ProductName

This doesn't seem very logical to me, but it might be exactly the way the business analysts want to see their data - and if it's what the business analysts want to see, it's what we should give them - unless we can suggest a solution that they like better than what they originally wanted.

One option here would be to pull Discontinued out of the hierarchy and use it as a separate dimension, with two levels (the All Level and the Discontinued Level) and two members at the lowest level (Yes and No). When browsing the cube, this dimension could be used as a slicer, so that the users could see only the Discontinued products or the ones that had not been Discontinued. It could also be used together with the product dimension on the rows or columns so that the users would see the following levels:

CategoryName
Discontinued

OR

CategoryName
ProductName
Discontinued

This is very easy to do when using Analysis Services in SQL Server 2005. Every attribute in a dimension can have its own attribute hierarchy.

KEY POINT #21 - Use parent-child dimensions when hierarchies have an indefinite depth and are defined by key relationships.

There appears to be a parent-child dimension in the Employees table, because there is a key field, ReportsTo, that links back to the primary key field EmployeeID. This relationship defines a supervisory chain.

This hierarchy has an indefinite depth. If we draw out the organization chart, it could now have 7 levels of supervision. But tomorrow a new employee could be hired and placed under an employee at the 7th level, making a new 8th level. This could be done in the source database simply by putting the appropriate value in the ReportsTo field.

Employees could be organized with a fixed, non-parent-child hierarchy. If, for example, the employee table had fields for Immediate Supervisor, Supervisor, and Shift Lead, we would use those three fields to form fixed levels in a standard dimension.

Because of performance considerations, it is better to use a standard than a parent-child dimension in Analysis Services. If you have a choice, build a standard dimension. But if you have a situation where the hierarchy has an indefinite depth, use a parent-child dimension.

KEY POINT #22 - When you have multiple ways within a dimension that you can drill down, you can create multiple hierarchies for that dimension, or you can create separate dimensions, with each dimension having a single hierarchy.

In the Employee dimension, I could imagine creating 4 hierarchies:

Employee.Identity - A simple listing of employees.
EmployeeName

Employee.Geography - Where employees live.
Country
Region
City
EmployeeName

Employee.Supervisor - Employees and their supervisors.
ParentChild (EmployeeID and Reports To)

Employee.Territory - The regions and territories where employees are assigned.
RegionDescription
TerritoryDescription
EmployeeName

These could also be created as 4 independent dimensions:

EmployeeIdentity
EmployeeGeography
EmployeeSupervisor
EmployeeTerritory

Analysis Services in SQL Server 2005 makes it easier to include multiple hierarchies in the same dimension.

KEY POINT #23 - Data should sometimes not be included if there is a many-to-many relationship between that data and the fact table.

Employee.Territory can't be used because there are many territories assigned to each employee and there is no obvious way of determining which of those territories is represented by each order. In order to use this dimension, there would need to be an additional field in the fact table to identify the territory for the order.

In some situations it makes sense to use a dimension that has a many-to-many relationship to the fact table. A person being admitted to a hospital, for example, may have several illnesses. But, in that case, each of those multiple values is relevant. In the case of Employee Territories, each order would, presumably, come from one particular territory. The data tells us which employee is associated with an order, but not the territory.

Analysis Services in SQL Server 2005 allows you to use many-to-many dimensions in your cubes.

KEY POINT #24 - All other fields (almost) can be used as member properties or attributes.

Member properties provide additional information. They are displayed in different ways by different OLAP client applications. One typical way is to display the member property information when the user hovers over the particular member.

NOTE: In SQL Server 2005 Analysis Services, other fields are created as attributes. Each attribute can have its own hierarchy.

KEY POINT #25 - You may want to create member properties (attributes) for fields that you are also using to create measures.

I used the QuantityPerUnit field as a measure, but also as a member property of the Product. Depending on how your OLAP client tool displays measures and attributes, it may be appropriate to make this information available in more than one way.

KEY POINT #26 - You can't include an image field (such as a picture) directly in an OLAP cube, but you can include a path that points to that picture, which can then be used in an Analysis Services Action to display the picture.

Before you spend a lot of time building in fields to be used in Actions, make sure your OLAP client tool will support the kind of Actions you want to create. Most tools support them to some extent, but I have never seen an OLAP client tool that supports all the capabilities of Actions as they are supported in Analysis Services.

The Finished OLAP Structure

Here is my finished OLAP structure:

Dimension Names and Hierarchy Names are listed together, separated by a dot:

DimensionName.HierarchyName

The Level Names are listed with single indentation under each Dimension and Hierarchy.

The Member Properties (Attributes) are indented under each level. When a level (such as CompanyName) is used in several hierarchies, the member properties for the level are only listed once.

Measures (for the Orders Cube, which I made)

UnitPrice
Quantity
DiscountPercent
DiscountAmount
ExtendedPriceBeforeDiscount
ExtendedPriceAfterDiscount
Freight
Order Count
Order Detail Count
Order Detail Shipped
Percent Shipped
AVGOrderedRequiredDiff
AVGOrderedShippedDiff
AVGRequiredShippedDiff

Measures (for the Product Inventory Cube, which I did not make)

QuantityPerUnit
UnitPrice
UnitsInStock
UnitsOnOrder
ReorderLevel
ValueInStock
ValueOnOrder

Customer.Identity

CompanyName

CustomerID
ContactName
Address
City
Region
PostalCode
Country
Phone
Fax

Customer.Demographics

CustomerDesc

CustomerTypeID

CompanyName

Customer.Geography

Country
Region
City
CompanyName

Employee.Identity

EmployeeName

EmployeeID
LastName
FirstName
Title
TitleOfCourtesy
BirthDate
HireDate
Address
City
Region
PostalCode
Country
HomePhone
Extension
PhotoPath
Notes
Reports To

Employee.Geography

Country
Region
City
EmployeeName

Employee.Supervisor

ParentChild (EmployeeID and Reports To)

Order

OrderRange (Artificially create field on OrderID ranges)

OrderID
OrderDate
RequiredDate
ShippedDate
ShipVia
Freight
ShipName
ShipAddress
ShipCity
ShipRegion
ShipPostalCode
ShipCountry

TimeShipped

Time levels derived from date field

TimeOrdered

Time levels derived from date field

TimeRequired

Time levels derived from date field

ShipTo.Geography

ShipCountry
ShipRegion
ShipCity
ShipName

ShipAddress
ShipPostalCode

Shipper

CompanyName

ShipperID
Phone

Products.Category

CategoryName

CategoryID
Description
Picture

ProductName

ProductID
Supplier
QuantityPerUnit
UnitPrice
UnitsInStock
UnitsOnOrder
ReorderLevel
Discontinued

Products.Supplier

CompanyName

SupplierID
ContactName
ContactTitle
Address
City
Region
PostalCode
Country
Phone
Fax
HomePage

ProductName

ProductID
CategoryName
QuantityPerUnit
UnitPrice
UnitsInStock
UnitsOnOrder
ReorderLevel
Discontinued

Supplier.Geography

Country
Region
City
CompanyName

DiscountPercent

Discount Percent Range
Discount Percent