Track
Replatforming from a legacy CMS to a Content Operation System

Refactoring content for migration

Lesson
3

Content normalization

Migration is an opportunity to move not only your content to Sanity, but also your content strategy to structured content.

As mentioned in the Re-platforming to Sanity course, migrating your content to Sanity is a golden opportunity to mature your content model and bring structure to increase the reusability of your legacy content. In data and database parlance, this is akin to data normalization. In fact, it is precisely that.

Here are some typical cases where content normalization can be rewarding:

  • Translating "page templates" into a structured content model, sometimes splitting content for a template into separate dedicated types.
  • Stripping HTML out of string values or converting it to Sanity’s presentation-agnostic Portable Text format.
  • Keeping only the best resolution of duplicated images (because some CMSes require you to upload different resolutions of the same image).

Identify opportunities for content normalization in the content you are to migrate.
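
As a rough illustration of the image case in the list above, the following sketch groups image variants and keeps only the largest one per group. The `LegacyImage` shape, the `variantKey` and `keepBestResolution` helpers, and the `-<width>x<height>` filename suffix are all assumptions made for this example; your CMS may name and describe its image variants differently.

```typescript
// Hypothetical shape of an image record exported from the legacy CMS.
interface LegacyImage {
  url: string;
  width: number;
  height: number;
}

// Assumption: variants of the same image differ only by a
// "-<width>x<height>" suffix before the file extension.
function variantKey(url: string): string {
  return url.replace(/-\d+x\d+(?=\.[a-z0-9]+$)/i, "");
}

// Keep one image per variant group: the one with the most pixels.
function keepBestResolution(images: LegacyImage[]): LegacyImage[] {
  const best = new Map<string, LegacyImage>();
  for (const image of images) {
    const key = variantKey(image.url);
    const current = best.get(key);
    const isLarger =
      !current ||
      image.width * image.height > current.width * current.height;
    if (isLarger) {
      best.set(key, image);
    }
  }
  return Array.from(best.values());
}

const exported: LegacyImage[] = [
  { url: "/uploads/team-150x150.jpg", width: 150, height: 150 },
  { url: "/uploads/team-1024x768.jpg", width: 1024, height: 768 },
  { url: "/uploads/logo.png", width: 400, height: 400 },
];

console.log(keepBestResolution(exported));
```

Comparing by pixel count is one possible choice; if your export includes a reference to the original upload, preferring that reference is likely more reliable than inferring it from filenames.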

Your existing website-centric CMS has likely stored what would be considered structured content as web pages, resulting in content that looks like this:

{
  "type": "page",
  "id": 4014,
  "template": "staff-profile-page",
  "title": "Emkay Petersen"
}

This content is not a web page. It's a person! Storing this as structured content doesn't prevent it from being queried into a web page. This same document would be better remodeled into a Sanity document like this:

{
  "_type": "person",
  "_id": "person-4014",
  "name": "Emkay Petersen"
}

Note that the change from title to name in this example is subtle but meaningful: a page has a title, whereas a person has a name.
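
The remodeling above could be expressed in a migration script roughly as follows. This is a minimal sketch: the `LegacyPage` and `PersonDocument` types and the `toPersonDocument` function are hypothetical names for this example, and the rule that only pages with the `staff-profile-page` template become people is an assumption based on the sample data.

```typescript
// Shape of the legacy record, taken from the example above.
interface LegacyPage {
  type: string;
  id: number;
  template: string;
  title: string;
}

// Target shape of the remodeled document.
interface PersonDocument {
  _type: "person";
  _id: string;
  name: string;
}

// Returns a person document for staff profile pages and null for
// every other record, so the caller can route those elsewhere.
function toPersonDocument(page: LegacyPage): PersonDocument | null {
  if (page.type !== "page" || page.template !== "staff-profile-page") {
    return null;
  }
  return {
    _type: "person",
    // Prefixing the legacy ID keeps the new ID stable across re-runs.
    _id: `person-${page.id}`,
    name: page.title.trim(),
  };
}

const legacyRecord: LegacyPage = {
  type: "page",
  id: 4014,
  template: "staff-profile-page",
  title: "Emkay Petersen",
};

console.log(toPersonDocument(legacyRecord));
```

Deriving `_id` from the legacy ID means running the script twice produces the same document IDs, which makes the migration easier to repeat.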

You can learn more about structured content modeling in the Hello, Structured Content course.

Don't get us wrong: We love HTML! But it works best as a rendering language in a browser, not as the storage format for your content. The same goes for HTML-like formats such as Markdown and MDX.

Depending on the content, you might want to get rid of HTML altogether. Typically, your old web-centric CMS allowed rich text editing in fields that you would rather keep as plain text, so that you control the rendering wherever this content is displayed:

{
  "type": "post",
  "id": 4014,
  "title": "<em>Disarm <span style=\"font-color: red\">you</span> with a <strong>smile</strong>.</em>"
}

By including an HTML-stripping step in your migration script, you can get this clean content:

{
  "_type": "post",
  "_id": "post-4014",
  "title": "Disarm you with a smile."
}
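
One way such a stripping step might look is sketched below. The `stripHtml` helper and the `ENTITIES` table are hypothetical names for this example, and the regex-based approach assumes short, well-formed fragments such as titles. For arbitrary or malformed HTML, a proper HTML parser is the safer choice.

```typescript
// A small table of common entities; extend it to match your content.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

// Regex-based sketch: removes tags, decodes the entities listed above,
// and collapses runs of whitespace into single spaces.
function stripHtml(html: string): string {
  return html
    .replace(/<[^>]*>/g, "")
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (match) => ENTITIES[match] ?? match)
    .replace(/\s+/g, " ")
    .trim();
}

const legacyTitle =
  '<em>Disarm <span style="font-color: red">you</span> with a <strong>smile</strong>.</em>';

console.log(stripHtml(legacyTitle));
```

Tags are removed before entities are decoded so that an encoded `&lt;` in the text is not mistaken for the start of a tag.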

For cases where you want to keep the information about the semantic rich-text formatting, embeds, and such, go to the Migrating to block content course.

Think about how you want to deal with HTML content in your migration script(s).