Track
Replatforming from a legacy CMS to a Content Operation System

Migrating content from WordPress to Sanity

Lesson
8

Converting HTML to Portable Text

Migrate HTML content from WordPress to Portable Text in Sanity, gaining presentation-agnostic block content and rich querying and filtering capabilities for your structured content.

Your migration script now handles individual content fields as well as image uploads. This is essential groundwork, but we've danced around what will undoubtedly be the most complicated part—migrating from HTML into Portable Text.

This can be difficult because WordPress stores HTML as a string that could contain literally anything. It is unstructured content, and it is the mess you're trying to get out of. While Sanity provides tooling to smooth this process, there are bound to be rough edges depending on the quality of your existing source content.

In this lesson, you'll import post-processed HTML from the content.rendered attribute in the WordPress REST API response. This string of HTML has already been run through all the functions in your WordPress installation, so that, for example, "shortcodes" are translated into markup.
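
As a rough illustration of where this string lives, the sketch below models only the content part of a post returned by the WordPress REST API and reads content.rendered from it. The WpPostContent type, the getRenderedHtml helper, and the sample data are hypothetical names made up for this example; the lesson's own scripts use the WP_REST_API_Post type from wp-types instead.

```typescript
// Minimal, hand-written shape of the `content` field on a WordPress REST API post.
// Illustrative only: the real response has many more fields.
type WpPostContent = {
  content: {
    rendered: string // post-processed HTML, shortcodes already expanded
    protected: boolean
  }
}

// Hypothetical helper: return the rendered HTML, or an empty string if it is missing.
function getRenderedHtml(post: Partial<WpPostContent>): string {
  return post.content?.rendered ?? ''
}

// Made-up sample data standing in for a fetched post.
const samplePost: WpPostContent = {
  content: {rendered: '<p>Hello <strong>world</strong></p>', protected: false},
}

console.log(getRenderedHtml(samplePost))
```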

Once complete, you'll have content stored in Portable Text, a standard allowing you to:

  • Work in a fully customizable, real-time collaborative editor with custom blocks, marks, styles, comments, and so on
  • Render block content directly as props in front end frameworks (on web and mobile)
  • Query documents based on specific block content and apply filters to even the most complex structured block content and rich text
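
To make "structured" a little more concrete, here is a small sketch of what Portable Text looks like as data and why it is easy to filter. The types are simplified, hand-written stand-ins rather than the official ones, and the sample content, the asset reference, and the helper names (toPlainText, headings) are invented for this example.

```typescript
// Simplified Portable Text shapes, written by hand for illustration.
type Span = {_type: 'span'; _key: string; text: string; marks: string[]}
type TextBlock = {_type: 'block'; _key: string; style: string; markDefs: unknown[]; children: Span[]}
type ImageBlock = {_type: 'image'; _key: string; asset: {_type: 'reference'; _ref: string}}
type PortableTextValue = Array<TextBlock | ImageBlock>

// Made-up sample content: a heading, a paragraph with a bold word, and an image.
const content: PortableTextValue = [
  {
    _type: 'block',
    _key: 'a1',
    style: 'h2',
    markDefs: [],
    children: [{_type: 'span', _key: 'a1s1', text: 'Getting started', marks: []}],
  },
  {
    _type: 'block',
    _key: 'b2',
    style: 'normal',
    markDefs: [],
    children: [
      {_type: 'span', _key: 'b2s1', text: 'Portable Text is ', marks: []},
      {_type: 'span', _key: 'b2s2', text: 'structured', marks: ['strong']},
      {_type: 'span', _key: 'b2s3', text: '.', marks: []},
    ],
  },
  {_type: 'image', _key: 'c3', asset: {_type: 'reference', _ref: 'image-example-800x600-jpg'}},
]

const isTextBlock = (block: TextBlock | ImageBlock): block is TextBlock => block._type === 'block'
const blockText = (block: TextBlock): string => block.children.map((child) => child.text).join('')

// Join the text of every text block, skipping images and other custom blocks.
function toPlainText(blocks: PortableTextValue): string {
  return blocks.filter(isTextBlock).map(blockText).join('\n\n')
}

// Because each block carries a style, picking out headings is a plain array filter.
function headings(blocks: PortableTextValue): string[] {
  return blocks
    .filter(isTextBlock)
    .filter((block) => /^h[1-6]$/.test(block.style))
    .map(blockText)
}

console.log(headings(content))
console.log(toPlainText(content))
```

Doing the same against an HTML string would require parsing markup first; here the structure is already part of the data.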

In the next lesson, you'll send an authenticated request to get access to pre-processed HTML in content.raw – preferable for handling content written in the Block Editor (known as Gutenberg).

If you proceed with this lesson and import the content as-is, your final HTML markup will still be imported into Portable Text; however, you will lose much of the control over how to serialize those blocks, forcing you to do icky things like rendering content with dangerouslySetInnerHTML.

Unfortunately, if you're using a page builder such as Elementor, Divi, or Beaver Builder, you will have a bad time with the migration. While the Block tools package below can extract content from these, the HTML output can be fairly messy and tricky to navigate.

These plugins lack a way to extract serialized content, unlike the built-in WordPress block editor. Their output is more challenging to translate into structured content and almost impossible for us to reason about in a lesson. The good thing about migrating to Sanity and a modern stack is that you won't have to deal with this content lock-in again 🤞.

Earlier, you installed @sanity/block-tools into the project. This package contains a function, htmlToBlocks(), to convert an HTML string into Portable Text.

By default, it will extract some formatting, such as headings, lists, and paragraphs, into corresponding Portable Text blocks. However, if you need to take some of the existing HTML and turn it into custom objects—like taking an image, uploading it, and creating a reference—that will require some customization.

The helper function below wraps htmlToBlocks and contains logic to extract the URL of any <img> tag found inside a <figure> tag. If your content field does not have images inside figure tags, you must update the script to find them.

Because the deserialize method is synchronous, it cannot upload anything itself, so the image URL is first stored in a block of type externalImage.

In the next section, we map over each block to find an external image and use the URL to attempt to search for the image in the WordPress database. Because the image is just a string in the HTML markup, this process is not guaranteed to work. While unfortunate, this is an excellent demonstration of why structured content and referential integrity are so important!
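
The lookup relies on turning an image URL back into the "slug" WordPress usually gives a media item. The sketch below isolates that step so you can try it against your own URLs. The filenameToSlug name and the example URLs are made up, and the logic is a variant of what the script does (it strips only the last file extension and ignores any query string), not a copy of it.

```typescript
// Hypothetical helper: derive the likely WordPress media slug from an image URL.
function filenameToSlug(url: string): string | undefined {
  // Ignore any query string, then take the last path segment
  const path = url.split('?')[0] ?? ''
  const filename = path.split('/').pop()
  if (!filename) {
    return undefined
  }

  const withoutExtension = filename.replace(/\.[^.]+$/, '')
  // Resized images are suffixed with dimensions like "-1024x683"
  const slug = withoutExtension.replace(/-\d+x\d+$/, '').toLowerCase()

  return slug || undefined
}

console.log(filenameToSlug('https://example.com/wp-content/uploads/2024/01/Team-Photo-1024x683.jpg'))
console.log(filenameToSlug('https://example.com/wp-content/uploads/logo.png'))
```

If an editor renamed the media item in WordPress after uploading it, the slug no longer matches the filename and the search comes back empty, which is one of the ways this process can fail.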

In the script below, you'll find that htmlToBlocks takes an options argument with rules that describe how to deserialize the incoming HTML into structured content in Portable Text. A rule exposes each HTML element as a DOM Node, letting you write fairly fine-grained conditional checks against its structure. This is a low-level API, so be prepared for some troubleshooting.
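
Since rules can be fiddly to debug, it may help to look at the conditional logic of a rule on its own, outside of htmlToBlocks. The sketch below is an assumption-laden stand-in: ElementLike models only the two DOM members the check needs, figureToExternalImage and fakeElement are hypothetical names, and a real rule would wrap the returned object with the block() helper passed to deserialize.

```typescript
// Structural stand-in for the small part of the DOM API this check touches.
type ElementLike = {
  nodeName: string
  querySelector(selector: string): {getAttribute(name: string): string | null} | null
}

type ExternalImage = {_type: 'externalImage'; url: string}

// Hypothetical standalone version of the rule's logic:
// only <figure> elements containing an <img> with a src produce a block.
function figureToExternalImage(node: ElementLike): ExternalImage | undefined {
  if (node.nodeName.toLowerCase() !== 'figure') {
    return undefined // not ours: let other rules (or the defaults) handle it
  }
  const url = node.querySelector('img')?.getAttribute('src')
  return url ? {_type: 'externalImage', url} : undefined
}

// Tiny fake element so the check can be exercised without a DOM implementation.
// Pass null as imgSrc to simulate an element with no <img> inside.
function fakeElement(nodeName: string, imgSrc: string | null): ElementLike {
  return {
    nodeName,
    querySelector: (selector) =>
      selector === 'img' && imgSrc !== null
        ? {getAttribute: (name) => (name === 'src' ? imgSrc : null)}
        : null,
  }
}

console.log(figureToExternalImage(fakeElement('FIGURE', 'https://example.com/a.jpg')))
console.log(figureToExternalImage(fakeElement('P', 'https://example.com/a.jpg')))
```

Returning undefined is what tells htmlToBlocks that a rule does not apply, which is why every early exit in the real rule returns it.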

The script below does the following things:

  1. Accepts a string of HTML
  2. Converts it into Portable Text
  3. If it finds a figure tag containing an image, stores the image URL in an externalImage block
  4. In a throttled array of async functions, searches the WP REST API for that image based on its filename
  5. If found, either uses an existing image from the in-memory cache or uploads the image
  6. Eliminates empty blocks
  7. Returns the Portable Text
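
The throttle-and-cache part of these steps can be sketched without any dependencies. Everything here is an assumption for illustration: createLimit is a simplified stand-in for p-limit (not its implementation), resolveImages and the injected upload function are hypothetical, and the cache is keyed by URL rather than by WordPress image ID as in the script.

```typescript
// Simplified concurrency limiter: at most `concurrency` tasks run at once.
function createLimit(concurrency: number) {
  let active = 0
  const queue: Array<() => void> = []

  const next = () => {
    active--
    queue.shift()?.()
  }

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++
        Promise.resolve().then(task).then(resolve, reject).finally(next)
      }
      if (active < concurrency) run()
      else queue.push(run)
    })
  }
}

type Block = {_key: string; _type: string; url?: string}
type ImageRef = {_key: string; _type: 'image'; asset: {_type: 'reference'; _ref: string}}

// Replace externalImage blocks with image references, reusing cached asset IDs.
async function resolveImages(
  blocks: Block[],
  cache: Map<string, string>,
  upload: (url: string) => Promise<string>, // hypothetical uploader returning an asset ID
  concurrency = 2,
): Promise<Array<Block | ImageRef>> {
  const limit = createLimit(concurrency)

  return Promise.all(
    blocks.map((block) =>
      limit(async (): Promise<Block | ImageRef> => {
        if (block._type !== 'externalImage' || !block.url) {
          return block // leave text and other blocks untouched
        }
        let assetId = cache.get(block.url)
        if (!assetId) {
          assetId = await upload(block.url)
          cache.set(block.url, assetId)
        }
        return {_key: block._key, _type: 'image', asset: {_type: 'reference', _ref: assetId}}
      }),
    ),
  )
}
```

Promise.all preserves the order of the input array, so blocks come back in document order even though uploads finish at different times. Note that two tasks running at the same moment can both miss the cache for the same image and upload it twice; the cache only helps once the first upload has completed.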
Create the new wrapper function to turn an HTML string into Portable Text:
./migrations/import-wp/lib/htmlToBlockContent.ts
import {htmlToBlocks} from '@sanity/block-tools'
import {Schema} from '@sanity/schema'
import {uuid} from '@sanity/uuid'
import {JSDOM} from 'jsdom'
import pLimit from 'p-limit'
import type {FieldDefinition, SanityClient} from 'sanity'

import type {Post} from '../../../sanity.types'
import {schemaTypes} from '../../../schemaTypes'
import {BASE_URL} from '../constants'
import {sanityIdToImageReference} from './sanityIdToImageReference'
import {sanityUploadFromUrl} from './sanityUploadFromUrl'
import {wpImageFetch} from './wpImageFetch'

// Compile the Studio schema and pick out the block content type of the post's "content" field
const defaultSchema = Schema.compile({types: schemaTypes})
const blockContentSchema = defaultSchema
  .get('post')
  .fields.find((field: FieldDefinition) => field.name === 'content').type

// https://github.com/sanity-io/sanity/blob/next/packages/%40sanity/block-tools/README.md
export async function htmlToBlockContent(
  html: string,
  client: SanityClient,
  imageCache: Record<number, string>,
): Promise<Post['content']> {
  // Convert HTML to Sanity's Portable Text
  let blocks = htmlToBlocks(html, blockContentSchema, {
    parseHtml: (html) => new JSDOM(html).window.document,
    rules: [
      {
        deserialize(node, next, block) {
          const el = node as HTMLElement

          if (node.nodeName.toLowerCase() === 'figure') {
            const url = el.querySelector('img')?.getAttribute('src')

            if (!url) {
              return undefined
            }

            return block({
              // these attributes may be overwritten by the image upload below
              _type: 'externalImage',
              url,
            })
          }

          return undefined
        },
      },
    ],
  })

  // Note: Multiple documents may be running this same function concurrently
  const limit = pLimit(2)

  const blocksWithUploads = blocks.map((block) =>
    limit(async () => {
      if (block._type !== 'externalImage' || !('url' in block)) {
        return block
      }

      // The filename is usually stored as the "slug" in WordPress media documents
      // Filename may be appended with dimensions like "-1024x683", remove with regex
      const dimensions = /-\d+x\d+$/
      const slug = (block.url as string)
        .split('/')
        .pop()
        ?.split('.')
        ?.shift()
        ?.replace(dimensions, '')
        .toLocaleLowerCase()

      // Without a filename there is nothing to search for
      if (!slug) {
        return block
      }

      const imageId = await fetch(`${BASE_URL}/media?slug=${encodeURIComponent(slug)}`)
        .then((res) => (res.ok ? res.json() : null))
        .then((data) => (Array.isArray(data) && data.length ? data[0].id : null))

      if (typeof imageId !== 'number' || !imageId) {
        return block
      }

      // Re-use an image that was already uploaded during this run
      if (imageCache[imageId]) {
        return {
          _key: block._key,
          ...sanityIdToImageReference(imageCache[imageId]),
        } as Extract<Post['content'], {_type: 'image'}>
      }

      const imageMetadata = await wpImageFetch(imageId)

      if (imageMetadata?.source?.url) {
        const imageDocument = await sanityUploadFromUrl(
          imageMetadata.source.url,
          client,
          imageMetadata,
        )

        if (imageDocument) {
          // Add to in-memory cache if re-used in other documents
          imageCache[imageId] = imageDocument._id

          return {
            _key: block._key,
            ...sanityIdToImageReference(imageCache[imageId]),
          } as Extract<Post['content'], {_type: 'image'}>
        } else {
          return block
        }
      }

      return block
    }),
  )

  blocks = await Promise.all(blocksWithUploads)

  // Eliminate empty blocks
  blocks = blocks.filter((block) => {
    if (!block) {
      return false
    } else if (!('children' in block)) {
      return true
    }

    return block.children.map((c) => (c.text as string).trim()).join('').length > 0
  })

  blocks = blocks.map((block) => (block._key ? block : {...block, _key: uuid()}))

  // TS complains there's no _key in these blocks, but this is corrected in the map above
  // @ts-expect-error
  return blocks
}
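
The final clean-up steps of the function above can also be tried in isolation. This is a sketch under assumptions rather than the lesson's code: AnyBlock is a simplified block shape, removeEmptyBlocks and ensureKeys are hypothetical names, keys come from an injected generator instead of @sanity/uuid, and children without a text property (such as inline objects) are treated as content instead of being read as strings.

```typescript
// Simplified block shape for illustration; real Portable Text blocks carry more fields.
type Child = {_type: string; text?: string}
type AnyBlock = {_type: string; _key?: string; children?: Child[]}

// Keep blocks without children (such as images) and drop text blocks
// whose children contain nothing but whitespace.
function removeEmptyBlocks<T extends AnyBlock>(blocks: Array<T | null | undefined>): T[] {
  return blocks.filter((block): block is T => {
    if (!block) return false
    if (!block.children) return true
    return block.children.some(
      (child) => typeof child.text !== 'string' || child.text.trim().length > 0,
    )
  })
}

// Give every block a _key, using an injected generator so the example has no dependencies.
function ensureKeys<T extends AnyBlock>(
  blocks: T[],
  makeKey: () => string,
): Array<T & {_key: string}> {
  return blocks.map((block) => ({...block, _key: block._key || makeKey()}))
}

// Made-up input: a whitespace-only paragraph, a missing block, real text, and an image.
const input: Array<AnyBlock | null> = [
  {_type: 'block', children: [{_type: 'span', text: '   '}]},
  null,
  {_type: 'block', _key: 'k1', children: [{_type: 'span', text: 'Hello'}]},
  {_type: 'image'},
]

let counter = 0
const cleaned = ensureKeys(removeEmptyBlocks(input), () => `key-${++counter}`)
console.log(cleaned.map((block) => `${block._type}:${block._key}`))
```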
Update your transformToPost.ts script to convert HTML to Portable Text and write the result to the content field:
./migrations/import-wp/lib/transformToPost.ts
import {uuid} from '@sanity/uuid'
import {decode} from 'html-entities'
import type {SanityClient} from 'sanity'
import type {WP_REST_API_Post} from 'wp-types'

import type {Post} from '../../../sanity.types'
import {htmlToBlockContent} from './htmlToBlockContent'
import {sanityIdToImageReference} from './sanityIdToImageReference'
import {sanityUploadFromUrl} from './sanityUploadFromUrl'
import {wpImageFetch} from './wpImageFetch'

// Remove these keys because they'll be created by Content Lake
type StagedPost = Omit<Post, '_createdAt' | '_updatedAt' | '_rev'>

export async function transformToPost(
  wpDoc: WP_REST_API_Post,
  client: SanityClient,
  existingImages: Record<string, string> = {},
): Promise<StagedPost> {
  const doc: StagedPost = {
    _id: `post-${wpDoc.id}`,
    _type: 'post',
  }

  // ...all your other attributes

  if (wpDoc.content) {
    doc.content = await htmlToBlockContent(wpDoc.content.rendered, client, existingImages)
  }

  return doc
}

You can now run the migrations again.

npx sanity@latest migration run import-wp --no-dry-run --type=posts

Once the transactions are committed, you should see documents appear with populated content fields.

You might notice that HTML which was once rendered in columns or other specific layouts is now rendered as a single column of block content.

If preserving presentation is essential, more work is required. We'll cover this in the next lesson by working with raw content from WordPress.