Track
Replatforming from a legacy CMS to a Content Operation System

Refactoring content for migration

Lesson
9

Migrating to block content

Convert HTML to presentation-agnostic Portable Text, even handling complex block content from WordPress' Gutenberg editor.


Portable Text is a presentation-agnostic, open-source format for block content: a mix of rich-text paragraphs and specialized content blocks such as images, videos, and call-to-action objects. Portable Text also lets you define editable object data inline or as text annotations. It can be rendered in different ways, and for most front-end frameworks, it's a matter of natively mapping its data to components' props (instead of awkwardly injecting HTML).
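To make the format concrete, here is a minimal sketch of what Portable Text data might look like: an array of typed objects in which text blocks hold spans and custom blocks sit alongside them. The type definitions, the `image` block with `url` and `altText` fields, and the `toPlainText` helper are illustrative assumptions written for this lesson, not part of any Sanity package.

```typescript
// Simplified, illustrative types: real Portable Text blocks carry more fields
// (for example `_key`, `markDefs`, `listItem`, and `level`).
type Span = {_type: 'span'; text: string; marks?: string[]}
type TextBlock = {_type: 'block'; style?: string; children: Span[]}
// Hypothetical custom block type, used only for illustration.
type ImageBlock = {_type: 'image'; url: string; altText?: string}
type PortableTextBlock = TextBlock | ImageBlock

const content: PortableTextBlock[] = [
  {_type: 'block', style: 'h2', children: [{_type: 'span', text: 'Hello'}]},
  {
    _type: 'block',
    style: 'normal',
    children: [
      {_type: 'span', text: 'Portable Text is '},
      {_type: 'span', text: 'structured', marks: ['strong']},
      {_type: 'span', text: ' data.'},
    ],
  },
  {_type: 'image', url: 'https://example.com/cat.png', altText: 'A cat'},
]

// Because the content is data rather than markup, each consumer decides how
// to present it. This one renders plain text and skips non-text blocks.
function toPlainText(blocks: PortableTextBlock[]): string {
  return blocks
    .filter((block): block is TextBlock => block._type === 'block')
    .map((block) => block.children.map((span) => span.text).join(''))
    .join('\n\n')
}

console.log(toPlainText(content))
```

A front-end renderer would follow the same idea, mapping each `_type` to a component instead of to a string.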

The Portable Text Editor makes editing block content in Sanity Studio simple and relieves content teams of learning specialized syntax, custom tags, or dealing with HTML embed code.

Scripting the conversion of HTML content into block content requires a little more finesse, but it will be worth it!

Migrating from HTML-formatted rich text and block content will be more difficult the more presentation-focused your source content is. For example, if you’re migrating from WordPress and have content stored in the Classic Editor, converting basic HTML into rich text and block content should be reasonably straightforward.

However, if you’re using WordPress’s block editor (aka Gutenberg), your documents likely have complex HTML structures, which will take more effort to recreate and refactor into Portable Text.

Fortunately, you can use @sanity/block-tools to simplify the deserialization of an HTML string to Portable Text and can leverage schema types from your Sanity Studio. The readme provides simplified examples of migrating an HTML string to block content.

This tool handles the basics of rich text formatting, such as headings, paragraphs, and lists, without configuration. More complex objects, like images, must be parsed from the HTML and turned into block content by your own rules. Block Tools exposes the incoming HTML through the DOM Node API (not to be confused with Node.js), which lets you access elements as JavaScript objects.

Below is a simplified example of intercepting a <figure> element in the HTML, retrieving the URL and alt text from an <img> tag inside, and creating a new block with that URL and alt text.

Install JSDOM: npm install -D jsdom
Install Block Tools: npm install -D @sanity/block-tools

The deserialize function is synchronous, so it cannot wait for an image upload to finish. Instead, you must post-process the returned blocks to upload any images found in the content.

Again, optimize your upload script by leveraging an in-memory cache to avoid re-uploading the same image every time the migration script is run. Also, rate limits can be avoided by throttling the number of concurrent uploads in a parallelized operation.
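The caching and throttling advice can be sketched with two small helpers. This is a minimal illustration rather than a definitive implementation: `Uploader` is a hypothetical stand-in for your real upload call, and `withCache` and `mapWithLimit` are names invented here. The cache lives in memory, so it only lasts for the lifetime of the process.

```typescript
// `Uploader` is a hypothetical stand-in for your real upload call: it takes
// an image URL and resolves to the ID of the uploaded asset.
type Uploader = (url: string) => Promise<string>

// Wrap an uploader so each URL is uploaded at most once. Caching the promise
// (not the result) also dedupes uploads that are still in flight.
function withCache(upload: Uploader): Uploader {
  const cache = new Map<string, Promise<string>>()
  return (url) => {
    let pending = cache.get(url)
    if (!pending) {
      pending = upload(url).catch((error) => {
        cache.delete(url) // allow a retry after a failed upload
        throw error
      })
      cache.set(url, pending)
    }
    return pending
  }
}

// Run `task` over every item with at most `limit` tasks in flight at once.
// Results are returned in the same order as the input.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let nextIndex = 0
  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++
      results[index] = await task(items[index])
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({length: workerCount}, () => worker()))
  return results
}
```

With these in place, something like `mapWithLimit(urls, 3, withCache(upload))` would upload each distinct URL once while keeping at most three uploads running; the right limit depends on the rate limits you are working against.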

import {JSDOM} from 'jsdom'
import {htmlToBlocks} from '@sanity/block-tools'

// `blockContentSchema` is assumed to be defined elsewhere: the compiled block
// content type from your Studio schema (see the @sanity/block-tools readme).

export async function htmlToBlockContent(html: string) {
  let blocks = htmlToBlocks(html, blockContentSchema, {
    // Parse the HTML string into a DOM document with JSDOM
    parseHtml: (html) => new JSDOM(html).window.document,
    rules: [
      {
        deserialize(node, next, block) {
          const el = node as HTMLElement

          // Turn <figure> elements that contain an <img> into image blocks
          if (node.nodeName.toLowerCase() === 'figure') {
            const img = el.querySelector('img')
            const imgSrc = img?.getAttribute('src')

            // No usable image: fall through to the default rules
            if (!img || !imgSrc) {
              return undefined
            }

            const altText = img.getAttribute('alt')

            return block({
              _type: 'image',
              url: imgSrc,
              altText,
            })
          }

          return undefined
        },
      },
    ],
  })

  // Insert your own logic to upload any blocks
  // where block._type == "image" and change
  // them to an asset reference!

  return blocks
}
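One way to fill in the placeholder comment at the end of the function is sketched below. The `url` and `altText` fields match the blocks created in the rule above. The `uploadImage` parameter is a hypothetical function you would supply (for example, one that fetches the image and uploads it with your Sanity client), and the `alt` field name on the resulting block is an assumption that should match your own image schema.

```typescript
// Minimal block shape for this sketch; real blocks carry more fields.
type Block = {_type: string; [key: string]: unknown}

// Hypothetical uploader: takes an image URL, resolves to the asset document ID.
type UploadImage = (url: string) => Promise<string>

async function replaceImageUrls(
  blocks: Block[],
  uploadImage: UploadImage,
): Promise<Block[]> {
  const result: Block[] = []
  // Sequential for clarity; a real migration would likely throttle and cache
  // uploads as the lesson advises.
  for (const block of blocks) {
    const {url, altText, ...rest} = block
    if (block._type !== 'image' || typeof url !== 'string') {
      result.push(block) // leave text and other blocks untouched
      continue
    }
    const assetId = await uploadImage(url)
    result.push({
      ...rest,
      _type: 'image',
      // Swap the temporary `url` for a reference to the uploaded asset
      asset: {_type: 'reference', _ref: assetId},
      // `alt` is an assumed field name; rename it to match your schema.
      ...(typeof altText === 'string' ? {alt: altText} : {}),
    })
  }
  return result
}
```

Inside `htmlToBlockContent`, this could be called as `blocks = await replaceImageUrls(blocks, uploadImage)` just before the `return`, which is also why that function is declared `async`.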
