Sitecore field data corruption

,

When we develop solutions based on Sitecore, we create custom code as well as Sitecore content. The code is stored in a version control system like Git, whereas the Sitecore contents exists inside a number of Sitecore SQL databases. Sometimes it makes sense to make some of the Sitecore content “part” of the code, especially if the code is tightly both to specific content.

To do this, we can use Unicorn, a tool that serializes part of the Sitecore content into code files that can be stored and released as part of the codebase. When the code is deployed, the content is deserialized into the Sitecore databases, ensuring that the code and the Sitecore content are in sync.

When Unicorn deserialized content, it uses a mechanism called Dilithium, which makes the deserializing faster. Unfortunately, Dilithium is not very forgiving to corruption in Sitecore’s databases. While some types of corruption can be mitigated by running Sitecore’s build in database clean up tool, others cannot.

One kind of corruption that that really annoys Dilithium is wrongly placed field values. I recently wrote a tool that cleans up this kind of corruption and realized that it is not only an issue for Sitecore solutions using Unicorn, but for Sitecore solutions in general. In this post I will explain how this kind of corruption arises and how it can be fixed. In doing so, I will also give a brief overview over how Sitecore structures content internally, which might come in handy in other scenarios.

In my solution I relied on a script posted by Mikael Högberg in relation to a bug in the Sitecore Publish Service (https://mikael.com/2019/11/clean-up-sitecore-database-and-avoid-corrupt-published-content/)

First of all, let’s have a look of how Sitecore structures content conceptually: All content in Sitecore is based on templates. A template is a schema that defines the kind of data that a specific piece of content consists of – that is, a list of fields that all content based on this template contains. So, a “Article” template might include a “Title” and a “Body” field definition, whereas the actual field values are stored in each piece of content based on this particular template. This division between “abstract” templates and “concrete” content is found in most systems that handles data although the names might be different.

The Sitecore item

But now let muddy the water a bit, because Sitecore is in many ways “build on Sitecore”. To understand this, lets introduce the Sitecore specific term – an item. An item is a piece of content in Sitecore – it is based on a template and stored in a hierarchy. Fair enough – Sitecore content is called items, what the big deal? The thing is, that when you define a template in Sitecore, you are actually creating an item, based on the “Template”-template via a template relationship. You might ask what template the “Template”-template is based on, and the answer is: Itself.

The “Template”-template. Notice how the template is based on itself.

Field definitions

Until now we have looked at the template relationship: Items using other items as templates. But items are also related via a parent-child relationship, meaning that an item can be a child of another item, which is what enables Sitecore to construct the content tree.

The definition of individual fields for a template is based on this parent-child relation: Each template (which is an item based on the “Templates”-template) contains a number of children. The fields definitions are all grandchild based on the “Template field”-template and sorted in one or more items based on the “Template section” template.

A “Article”-template with two field definitions, Title and Body.
Notice that field definitions are sorted in template sections.

The template field template

The “Template field”-template itself contains a number of field definitions, including the shared field definition and the unversioned field definition. These two definitions will become important when we dive into Sitecore’s databases in a moment.

The “Shared” field definition under the “Template field” template. Notice how each field definitions are based on the “Template field” template, so that e.g. the Shared field definition contains a Type and Title field, and even a Shared field not visible here.

A quick recap: Everything bit of content in Sitecore is an item, both actual content and the templates it is based on. Items are related via the template relationship and the parent-child relationship. Templates are just items based on the “Template”-template and field definitions are just items based on the “Template field”-template and placed in template section beneath a template. The “Template field”-template defines a number of fields, including a shared field definition and a unversioned field definition.

The Sitecore databases

A vanilla Sitecore uses three main databases, one for Sitecore’s core items, and two for content – unpublished (master) and published (web). The databases are structured in the same way, so in this post we will look at that master database to see how items and field values are stored.

All items are defined by an entry in the Items table and uniquely identified by a guid (the items ID): Each entry in the Items table contains only the basic scaffolding needed for the Sitecore content tree. Besides from the ID, in this post we will only use the item’s TemplateID and ParentID, which relates any item to two other entries in the Items table (its template and parent). Notice that the Items table contains no content and have no notion of language or versions.

If we want to get all templates in our database, we simply query the database after all entries in the Items table where the TemplateID is ‘{AB86861A-6030-46C5-B394-E8F99E8B87DB}’ which is the ID of the “Template”-template. Included in this set is also the “Template”-template because it is based on itself.

If we want to get all field definitions, we simply query the Items table after items based on the “Template field”-template, which is ‘{455A3E98-A627-4B40-8035-E683A0331AC7}’.

Notice that these ids are the same for all Sitecore installations, as the items exists in the Sitecore databases upon installation.

Field values

When you create a new item in Sitecore, you will create an entry in the Items table. However, the actual content, that is the value of each field is not stored in the Items table, but in three fields tables. Each fields table is structured in the same way: Besides from the actual field value, it contains the ID of the item containing the field, and of the ID of field definition to which this value belongs.

The reason why Sitecore uses three different fields tables, are that a field value can be shared (on field value per item), unversioned (one field value per item language), or versioned (one field value per version and per language). The three tables are named SharedFields, UnversionedFields and VersionedFields.

To figure out whether to put a field value in the SharedFields, UnversionedFields or VersionedFields, Sitecore looks at the field definition, and hang on here because this is rather convoluted: Remember that the “Template field”-template included a number of fields definitions, including the “Shared” and “Unversioned” field definitions. Both of these field definition is in turn based in the “Template field”-template, and hence contains Shared and Unversioned fields. And in both definitions, the Shared field is selected. So whenever you create a field definition and select whether a field should be shared or unversioned, you will create an entry in the SharedFields table where the ItemId is set to that particular field definition, and the FieldId is set to ‘{BE351A73-FCB0-4213-93FA-C302D8AB4F51}’, the ID of the Shared field definition.

In that way Sitecore can construct any item by looking at the item, its template (all found in the Items table) and then retrieve the actual field values from the three field tables.

What could possibly go wrong?

This item construction is, however somewhat forgiving. When Sitecore looks for a field value, you might expect that Sitecore only looks in the right field table because Sitecore already know whether a field is shared, unversioned or versioned. That however, is not the case – Sitecore always looks in the SharedFields table first, then the UnversionedFields and finally the VersionedFields, even if you run Item.Field.ReadAll() before retrieving the value.

There is probably a good reason for this approach as constructing the template (the field definitions) from the TemplateManager, involves resolving the template inheritance tree, which is labor intensive and not needed is all field values are placed in the right tables.

However, if e.g. a field value for a versioned field has somehow been misplaced in the shared table, this field value is actually used. And, if you edit this field in the Content Editor, Sitecore will update the VersionedFields table, yet reverting to the shared field value upon saving, making the field unupdatable.

No wonder Dilithium have a hard time handling these fields: What value should actually be updated – the misplaced value or the correctly placed value that are never used?

The bottom line is that we need to find these misplaced field values and get rid of them!

Finding wrongly placed field values

As we have three kind of field values and three field tables, we have nine combinations, where the six of them are misplaced – e.g. shared field values in the VersionedFields table, shared field values in the UnversionedFields table and so on. We need six SQL queries to extract these field.

Let’s start by finding all field definitions in the Items table. This is quite simple, as we already know which template to look after – the “Template field”-template:

[FieldIdsQuery] = SELECT ID FROM Items WHERE TemplateID = {455A3E98-A627-4B40-8035-E683A0331AC7}'

Notice that I will start introducing tokens for these queries, as I will use them as building blocks later on.
These query return the id of all field definitions. Notice that there might exists field values for non-defined field in the Sitecore database – this kind of corruption is cleared by the build in database clean up, and we will ignore these field values going forward.

Now, let’s see how we can differentiate between shared, unversioned and versioned field definitions. You might remember that the “Template field”-template include two field definitions, one for a shared-field checkbox and one for a unversioned-field checkbox. Both of these field definitions are shared themselves, meaning that we should look for any field values in the SharedFields table. The id of these fields can be found by looking at the field definitions in the Content Editor.

Let’s start by getting all shared field definitions. These definition should have a field value in the SharedFields table with the value ‘1’. Remember that field definitions are simply items, and the shared checkbox is simply a shared field belonging the field definition item – so we ‘simply’ going to get the items id for all field values having the shared checkbox definition id as field id and the value set to ‘1’. As with the build in templates, the ids used in these queries are the same for all Sitecore installations:

[SharedFieldIdsQuery] = SELECT ItemId FROM SharedFields WHERE FieldId = '{BE351A73-FCB0-4213-93FA-C302D8AB4F51} AND [Value] = '1'

The same approach can be used for unversioned field id and remember that the unversioned checkbox is a shared field, so the field values are still stored in the Shared table. But there is a catch: A field can be both shared and unversioned. If that happens, the field is by definition shared, so to be exact we have to exclude all shared fields from out query:

[UnversionedFieldIdsQuery] = SELECT ItemId FROM SharedFields WHERE FieldId = '{39847666-389D-409B-95BD-F2016F11EED5}' AND [Value] = '1' AND ItemId NOT IN [SharedFieldIdsQuery]

We now have the basic building blocks for the six queries we need. I am going to use some tokens defined above to make the queries easier to read.

The six queries

It is actually now pretty easy to find misplaced field. There are however some subtleties that I need to mention. First of all, orphan field values (field values for undefined fields) can exist in Sitecore, and as the field definition is missing, there is no way to know whether such a field is shared, unversioned. So, when look for field values, we should always only look for fields where the field id is in the [FieldIdsQuery], meaning that there actually is a field definition.

Secondly note that we cannot reverse the shared and unversioned field id queries by looking for the value ‘0’, as this value is the default value for a checkbox and will not be stored in the database. We have to look at all field not in the shared and the unversioned queries.

With that in mind, here are the six queries:

[SharedInUnversioned] = SELECT * FROM UnversionedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId IN ({SharedFieldIdsQuery})

[SharedInVersioned] = SELECT * FROM VersionedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId IN ({SharedFieldIdsQuery})

[UnversionedInShared] = SELECT * FROM SharedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId IN ({UnversionedFieldIdsQuery})

[UnversionedInVersioned] = SELECT * FROM VersionedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId IN ({UnversionedFieldIdsQuery})

[VersionedInShared] = SELECT * FROM SharedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId NOT IN ({SharedFieldIdsQuery}) AND FieldId NOT IN ({UnversionedFieldIdsQuery})

[VersionedInUnversioned] = SELECT * FROM UnversionedFields WHERE FieldId IN ({FieldIdsQuery}) AND FieldId NOT IN ({SharedFieldIdsQuery}) AND FieldId NOT IN ({UnversionedFieldIdsQuery})

With these queries it is possible to find all misplaced field values in the three fields tables.

So, what to do?

The easiest way to get rid of any misplaced field values is simply deleting the misplaced entries. But as described above, misplaced field values are not necessary ignored by Sitecore, but might actually be used instead of actually correctly placed values! Especially values placed in the SharedFields and UnversionedFields table should be carefully inspected before deleting, as these values are the values retrieved by Sitecore– even is the field is in fact versioned. If you work with multiple language this only add an additional layer of complexity to this operation.

The bottom line is that you might have to resolve these issues by hand, which is what I did: I created a tool that list all misplaced field values and compared their values with the correctly placed value and the value retrieved by Sitecore. If these values match, it is safe to delete the misplaced field – otherwise some kind of migration has to take place. The tool consists of a service, a field entry class wrapping the field value, and an enum to differentiate between the three field types. You will find the complete code below.

On a final note

I hope this this tour de force into Sitecore’s field tables have been informative and will help you resolve not also this type of field corruption, but also other kind of issues with field. While interacting with Sitecore’s database directly might make you feel uncomfortable, in some cases it is actually necessary, and an understanding of how Sitecore stores it data will definably help you in other scenarios as well.

But on question remains: How can a field value be misplaced in the first place? One way is that if you switch an exist field definition from e.g. shared to versioned, Sitecore starts a background job to move any field values from the to the SharedFields to the VersionedFields table. If this jobs fails or is interrupted, you might end up with this kind of corruption. Also, as far as I can tell, a number of items provided with Sitecore is actually born with this have this kind of corruption.