Get the Text-to-Speech from Articulate Storyline 360
I needed a process to extract the TTS content from an Articulate Storyline file, and make it reproducible. Here's my process and outcome.
The Spec
I started by investigating the format and, to my surprise, it uses the same approach as MS Office: a zip file containing many files that hold the relevant data.
The story/story.xml file inside the zip file lists all slides, in order, in two places:
- the sceneLst tag contains the list of scenes, with their slides referenced by ID
- the toc tag contains the list of slides referenced by the refG attribute, whose value is present in each slide
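To look around inside the archive before parsing anything, a small listing helper is enough. This is a minimal sketch: list_story_files is a name I made up, and the demo archive is built in memory with illustrative member names rather than taken from a real project.

```python
import io
from zipfile import ZipFile


def list_story_files(zf, prefix="story/"):
    """Return the sorted names of archive members under `prefix`."""
    return sorted(name for name in zf.namelist() if name.startswith(prefix))


# Build a tiny in-memory archive that mimics the layout described above.
# The member names are illustrative; a real .story file has many more.
buf = io.BytesIO()
with ZipFile(buf, "w") as demo:
    demo.writestr("story/story.xml", "<story/>")
    demo.writestr("story/_rels/story.xml.rels", "<Relationships/>")

with ZipFile(buf) as zf:
    members = list_story_files(zf)
    print(members)
```

With a real file, replace the in-memory buffer with the path to the .story file.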
sceneLst
The structure of the sceneLst node (story/sceneLst) is like:
<sceneLst>
  <scene g="86678ffa-72f4-4a33-87ba-06660cb7d6a5"
         verG="e50b1569-154f-4faa-9556-1f054857ff8b"
         name="An example"
         desc=""
         primaryId="00000000-0000-0000-0000-000000000000"
         sceneType="scene"
         collapse="false">
    <sldIdLst>
      <sldId>R6NvRGVHRMwC</sldId>
      <sldId>R6TQMcrHEncM</sldId>
      ...
    </sldIdLst>
  </scene>
  <scene ...>
    <sldIdLst>
      ...
    </sldIdLst>
  </scene>
</sceneLst>
It is a list of scene nodes, each one holding a list of sldId nodes.
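As a quick illustration of that structure, here is a sketch that turns a sceneLst into a scene name -> slide IDs dictionary with minidom. The slides_by_scene name and the trimmed sample (only the name attribute is kept) are mine, for illustration only.

```python
from xml.dom.minidom import parseString

# Trimmed-down sample based on the structure shown above.
SAMPLE_STORY = """<story><sceneLst>
  <scene name="An example">
    <sldIdLst><sldId>R6NvRGVHRMwC</sldId><sldId>R6TQMcrHEncM</sldId></sldIdLst>
  </scene>
</sceneLst></story>"""


def slides_by_scene(story_xml_text):
    """Map each scene name to the ordered list of its slide IDs."""
    dom = parseString(story_xml_text)
    result = {}
    for scene in dom.getElementsByTagName("scene"):
        # Each sldId node holds the ID as its only text child
        ids = [n.firstChild.data for n in scene.getElementsByTagName("sldId")]
        result[scene.getAttribute("name")] = ids
    return result


scene_map = slides_by_scene(SAMPLE_STORY)
print(scene_map)
```

Note that two scenes with the same name would overwrite each other in this sketch; keying on the g attribute would avoid that.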
toc
The structure of the toc node is somewhat similar:
<toc g="b3cb855a-dbd8-4bef-8990-0c0871574584"
     verG="a2c27a8d-8d11-4e37-af79-eb5ee2d6989a"
     projectId="e1fb1df7-a022-4b46-8188-6f033e4b343c">
  <entryLst>
    <tocSceneEntry g="9aabbd20-fd20-474c-bacf-8d27f6350dde"
                   verG="86341bf7-ed25-4811-a60c-3931a34ca671"
                   refG="86678ffa-72f4-4a33-87ba-06660cb7d6a5"
                   corG="898b081a-cb37-4156-b7f8-8f3548fd0eff"
                   expanded="true">
      <entryLst>
        <tocSlideEntry g="af15fa61-7c9c-44e7-8004-b6ae34492dd5"
                       verG="ee293cb0-4225-4b30-907b-aa6e5d9b3c73"
                       refG="84bb9348-7232-4a61-9fc3-0799a8b5a8e5"
                       corG="10d347fc-fe2a-41ae-a77c-2c2950ba77a7"
                       expanded="true">
          <entryLst/>
        </tocSlideEntry>
        <tocSlideEntry ...>
          <entryLst/>
        </tocSlideEntry>
      </entryLst>
    </tocSceneEntry>
  </entryLst>
  ...
</toc>
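The refG attribute of a tocSlideEntry points to the slide; the same value is stored in the slide itself. In the example above, the refG of the tocSceneEntry matches the g attribute of the scene shown earlier.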
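If you prefer the toc order over the sceneLst order, the refG values can be collected in document order. A minimal sketch, assuming a trimmed sample with only the refG attributes kept; toc_slide_refs is a hypothetical helper name.

```python
from xml.dom.minidom import parseString

# Trimmed-down sample based on the toc structure shown above.
SAMPLE_TOC = """<toc><entryLst>
  <tocSceneEntry refG="86678ffa-72f4-4a33-87ba-06660cb7d6a5"><entryLst>
    <tocSlideEntry refG="84bb9348-7232-4a61-9fc3-0799a8b5a8e5"><entryLst/></tocSlideEntry>
  </entryLst></tocSceneEntry>
</entryLst></toc>"""


def toc_slide_refs(toc_xml_text):
    """Return the refG of every tocSlideEntry, in document order."""
    dom = parseString(toc_xml_text)
    return [e.getAttribute("refG") for e in dom.getElementsByTagName("tocSlideEntry")]


slide_refs = toc_slide_refs(SAMPLE_TOC)
print(slide_refs)
```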
Get the text to speech
The story/_rels/story.xml.rels file has a list of (sldId, slide XML file name) pairs.
To get the text-to-speech content, we look for the ttsPrps tag in the slide XML file (sld/shapeList/tts/ttsPrps) and read its synthTxt attribute.
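Here is a small sketch of that lookup on its own. The sample slide is made up: I only kept the path mentioned above and the two attributes the script relies on (name and synthTxt), so a real slide file will contain a lot more. tts_texts is a hypothetical helper name.

```python
from xml.dom.minidom import parseString

# Made-up, minimal slide; real slide files contain many more nodes.
SAMPLE_SLIDE = (
    '<sld name="Glossary"><shapeList><tts>'
    '<ttsPrps synthTxt="ABC. The Alphabet"/>'
    '</tts></shapeList></sld>'
)


def tts_texts(slide_xml_text):
    """Return the synthTxt of every ttsPrps node in the slide."""
    dom = parseString(slide_xml_text)
    return [n.getAttribute("synthTxt") for n in dom.getElementsByTagName("ttsPrps")]


texts = tts_texts(SAMPLE_SLIDE)
print(texts)
```

getElementsByTagName searches the whole document, so the sketch does not depend on the exact nesting above ttsPrps.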
Closed captioning
Resources for each slide are stored in slides/_rels/(slide name).xml.rels, which looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Type="media" Target="/story/media/R6aYHe1Ciu1x.png" Id="Rf6f47d320aa74f1c"/>
<Relationship Type="media" Target="/story/media/R5glk1klqvrM.mp3" Id="Rcb2472a5c22347e7"/>
<Relationship Type="media" Target="/story/media/R5xtk5W3yuYF.vtt" Id="Rb4d0d34c283547a2"/>
<Relationship Type="media" Target="/story/media/R6ZuoNYZStjK.jpg" Id="R588c9b3261194ae4"/>
</Relationships>
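To pick out one kind of resource from such a file, filtering on the Target extension is enough. A minimal sketch, using the sample above as input; media_targets is a name I made up.

```python
from xml.dom.minidom import parseString

# The relationships sample from above, as bytes (as read from the zip).
RELS = b"""<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Type="media" Target="/story/media/R6aYHe1Ciu1x.png" Id="Rf6f47d320aa74f1c"/>
<Relationship Type="media" Target="/story/media/R5glk1klqvrM.mp3" Id="Rcb2472a5c22347e7"/>
<Relationship Type="media" Target="/story/media/R5xtk5W3yuYF.vtt" Id="Rb4d0d34c283547a2"/>
<Relationship Type="media" Target="/story/media/R6ZuoNYZStjK.jpg" Id="R588c9b3261194ae4"/>
</Relationships>"""


def media_targets(rels_bytes, extension):
    """Return the Target of every media relationship ending in `extension`."""
    dom = parseString(rels_bytes)
    return [
        rel.getAttribute("Target")
        for rel in dom.getElementsByTagName("Relationship")
        if rel.getAttribute("Type") == "media"
        and rel.getAttribute("Target").lower().endswith(extension)
    ]


vtt_files = media_targets(RELS, ".vtt")
print(vtt_files)
```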
The .vtt files referenced in the relationships file contain the closed captions for the slide. The format of a VTT file is:
WEBVTT
Kind: captions
Source: Articulate Closed Captions Editor
Source Version: 3.80.31058.0

00:00:00.150 --> 00:00:04.992
This is an example!

00:00:05.142 --> 00:00:09.292
This is another subtitle!
In principle, we can leave the closed captions as they are, because the audio lengths should not vary too much between the different voices.
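Should the captions ever need adjusting, they are easy to read. Below is a rough sketch of a cue reader, not a full WebVTT parser: it assumes hh:mm:ss.mmm timestamps like the ones in the sample, and parse_vtt is a hypothetical name.

```python
import re

# Assumes hh:mm:ss.mmm timestamps, as in the sample above.
CUE_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")

SAMPLE_VTT = """WEBVTT
Kind: captions

00:00:00.150 --> 00:00:04.992
This is an example!

00:00:05.142 --> 00:00:09.292
This is another subtitle!
"""


def parse_vtt(text):
    """Return a list of (start, end, caption) tuples."""
    cues = []
    current = None
    for line in text.splitlines():
        match = CUE_TIMING.match(line.strip())
        if match:
            # A timing line starts a new cue
            current = [match.group(1), match.group(2), []]
            cues.append(current)
        elif current is not None:
            if line.strip():
                current[2].append(line.strip())
            else:
                # A blank line ends the current cue
                current = None
    return [(start, end, "\n".join(body)) for start, end, body in cues]


cues = parse_vtt(SAMPLE_VTT)
print(cues)
```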
The Code
I wrote a bit of code to do this automatically, and export a CSV file with a File name and a Content field. The code I wrote is in Python and it's fast enough to retrieve the data from 60+ slides.
Helper functions
Given that the data is formatted in XML (barring the .vtt files), I wrote a few helper functions.
The first function extracts the attributes of an XML node (xml.dom.minidom):
def get_attributes(node):
    return dict(node.attributes.items())
The second function is to extract the text from a node:
def get_text(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
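To see what the two helpers return, here they are applied to a one-line sample. The helpers are repeated (get_text in a slightly condensed form) so the snippet runs on its own; the sample XML is a cut-down scene node.

```python
from xml.dom.minidom import parseString


def get_attributes(node):
    return dict(node.attributes.items())


def get_text(nodelist):
    # Condensed version of the helper above
    return ''.join(n.data for n in nodelist if n.nodeType == n.TEXT_NODE)


doc = parseString('<scene name="An example"><sldId>R6NvRGVHRMwC</sldId></scene>')
scene = doc.documentElement

attrs = get_attributes(scene)
text = get_text(scene.getElementsByTagName("sldId")[0].childNodes)
print(attrs, text)
```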
Get the ID -> Filename Mapping
This function opens the rels file described above and creates a dictionary where the keys are the IDs and the values are the file names associated with those IDs:
from xml.dom.minidom import parseString

def get_id_file(zf):
    '''
    Params:
    - zf - The PyZipFile object created for the .story file
    Returns: a dict containing the slide ID (key) and slide file name (value)
    '''
    with zf.open("story/_rels/story.xml.rels") as rels:
        rels_xml = parseString(rels.read())
    result = {}
    for item in rels_xml.getElementsByTagName("Relationship"):
        attrs = get_attributes(item)
        # Keep only the relationships that point to slides
        if attrs['Type'] == 'slide':
            result[attrs['Id']] = attrs['Target']
    return result
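To try the mapping without a real project at hand, an in-memory archive works. This is a sketch under assumptions: the Target path and the media entry are made up, and the function is a condensed copy of the one above so the snippet runs on its own.

```python
import io
from xml.dom.minidom import parseString
from zipfile import ZipFile

# Made-up rels content: one slide entry and one non-slide entry.
RELS = (
    '<Relationships>'
    '<Relationship Type="slide" Target="/story/slides/slide1.xml" Id="R6NvRGVHRMwC"/>'
    '<Relationship Type="media" Target="/story/media/a.png" Id="Rf6f47d320aa74f1c"/>'
    '</Relationships>'
)


def get_id_file(zf):
    """Condensed copy of the function above."""
    with zf.open("story/_rels/story.xml.rels") as rels:
        rels_xml = parseString(rels.read())
    result = {}
    for item in rels_xml.getElementsByTagName("Relationship"):
        if item.getAttribute("Type") == "slide":
            result[item.getAttribute("Id")] = item.getAttribute("Target")
    return result


buf = io.BytesIO()
with ZipFile(buf, "w") as demo:
    demo.writestr("story/_rels/story.xml.rels", RELS)

with ZipFile(buf) as zf:
    id_to_file = get_id_file(zf)
print(id_to_file)
```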
Processing all scenes
Once we have the mapping, we can start processing the scenes. The function below gets the list of scenes from the story/story.xml file described above.
def process_scenes(zf, id_to_file):
    # Read the story
    with zf.open("story/story.xml") as story:
        story_xml = parseString(story.read())
    # There is only one sceneLst node; take its scene elements
    scene_list = story_xml.getElementsByTagName("sceneLst")[0]
    for scene in scene_list.getElementsByTagName("scene"):
        process_scene(zf, scene, id_to_file)
As you can see, it just iterates through the scenes and calls process_scene, the function to process one scene, described just below.
Processing one scene
The single scene processor is as follows:
def process_scene(zf, scene, id_to_file):
    '''
    Params:
    - zf - The PyZipFile object created for the .story file
    - scene - the `scene` XML node
    - id_to_file - the slide ID/slide file name dict
    '''
    scene_properties = get_attributes(scene)
    # Get the slide IDs, numbering the slides from 1
    slide_ids = scene.getElementsByTagName("sldId")
    for idx, slide_id in enumerate(slide_ids, start=1):
        file_name = id_to_file[get_text(slide_id.childNodes)]
        # Targets start with "/", which zip member names do not
        process_slide(zf, idx, file_name[1:], scene_properties['name'])
The function does the following:
- Gets the scene name (from scene_properties)
- Gets the list of slide IDs
- For each ID, retrieves the file name and calls process_slide to, well, process the slide :)
For convenience, we also keep a counter (idx), to help with slides that have the same name.
Processing one slide
The slide processor function does everything needed to extract the text-to-speech (TTS) string(s):
def process_slide(zf, slide_idx, file_name, scene_name):
    '''
    Params:
    - zf - The PyZipFile object created for the .story file
    - slide_idx - the slide order number
    - file_name - the file name for the slide
    - scene_name - the scene name - used for the output file name CSV column
    '''
    # Read the slide
    with zf.open(file_name) as slide:
        buf = slide.read().decode("utf-8")
    # Put each tag on its own line before handing the text to minidom
    buf = ">\n<".join(buf.split("><"))
    slide_xml = parseString(buf)
    slide_attrs = get_attributes(slide_xml.getElementsByTagName("sld")[0])
    tts_items = slide_xml.getElementsByTagName("ttsPrps")
    if tts_items:
        for i, tts in enumerate(tts_items):
            tts_attrs = get_attributes(tts)
            # Double any quotes so the CSV field stays valid
            text = tts_attrs['synthTxt'].replace('"', '""')
            print(f"\"{scene_name}/{slide_idx:02d} - {slide_attrs['name']}.{i:02d}.mp3\",\"{text}\"")
    else:
        print(f"\"{scene_name}/{slide_attrs['name']}\",\"NO TTS FOR THIS SLIDE\"")
A few things happen here:
- We open and read the slide's XML file
- We create the minidom representation of the file
- We identify all TTS components in tts_items
- For each component, we write a CSV line containing a representation of an output file name and the TTS content
A few notes:
- We add a \n in between > and <, because otherwise minidom barfs
- The output file name has a path component (the scene name), and the file name component containing the order of the slide and the slide name. This is to avoid the case where multiple slides have the same name
- The TTS string may have line breaks
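Since the TTS strings may contain line breaks (and possibly quotes), an alternative to printing the lines by hand is to let Python's csv module do the quoting. This is a sketch of that alternative, not what the script above does; to_csv and the sample row are made up for illustration.

```python
import csv
import io


def to_csv(rows):
    """Serialize (file name, content) rows, quoting every field."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["File name", "Content"])
    writer.writerows(rows)
    return out.getvalue()


# A made-up row with both a quote and a line break in the content
rows = [("Slide 1/01 - Glossary.00.mp3", 'He said "hi".\nSecond line')]
csv_text = to_csv(rows)

# Reading it back gives the original values, line break included
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

The round trip shows that a reader which follows the same quoting rules gets the multi-line content back in a single field.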
All together now!
Now that we have everything, the main part of the script is:
from zipfile import PyZipFile

# Open the zip file
zf = PyZipFile("storyline.story")
id_to_slide = get_id_file(zf)
print("File name, Content")
process_scenes(zf, id_to_slide)
This will allow us to get a CSV like this:
File name, Content
"Slide 1/01 - Glossary.00.mp3","ABC. The Alphabet"
"Slide 1/02 - Glossary.00.mp3","BEL. Belgium"
"Slide 1/03 - Glossary.00.mp3","RO. Romania"
You can import the file in Excel following this StackOverflow question.
HTH,