Skip to main content

Subtitle

The SubRip file format is described on the Matroska multimedia container format website as β€œperhaps the most basic of all subtitle formats.” SubRip (SubRip Text) files are named with the extension .srt, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator used is the comma, since the program was written in France.

How to load data from subtitle (.srt) files

Please, download the example .srt file from here.

%pip install --upgrade --quiet  pysrt
from langchain_community.document_loaders import SRTLoader
loader = SRTLoader(
"example_data/Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt"
)
docs = loader.load()
docs[0].page_content[:100]
'<i>Corruption discovered\nat the core of the Banking Clan!</i> <i>Reunited, Rush Clovis\nand Senator A'

Help us out by providing feedback on this documentation page: