MV

Thursday, November 26, 2009

Python File Manipulation

Last week I ripped one of my DVD movies to have a backup. Unfortunately I forgot to add the Dutch subtitles, and I was too lazy to start all over again. A simple Google search gave me the Dutch subtitle files, but they were not in sync with my DVD rip.
The subs appeared 1 min and something too early, so my first idea was to edit the file manually... to no avail: after 2 or 3 lines I had already given up. This was going to take too much time: 1536 subtitles at approx. 10s per subtitle is over 4 hours of work.
OK, what could I do to get the subs synced? I think it's fairly easy with bash scripting in combination with sed and awk, but I have no experience in bash, so that solution was out of the question.
What else could I do to get this file updated? That's where Python came to the foreground: the only programming language I know a little, but enough to give it a try. The indentation of the code in this blog may get lost, so if you want to use this code, check that the indentation is correct.
The structure of the file looks like:

1518
02:26:23,091 --> 02:26:24,978
subtitle here

The first thing was to detect the lines that contain the start and end time of the subtitle. In the example above it's line 2, but of course I cannot select the time lines by line number. The detection, however, is fairly easy with a regular expression.
This is the regex I use: ^(0[0-2])
It matches every line that starts with 00, 01 or 02, which is the hour part of a time stamp as long as the movie is shorter than 3 hours.
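
To see what the pattern picks out, here is a small check against the sample block above. It is only a sketch: the `sample` list and the `is_time_line` helper are names made up for this illustration. Keep in mind that the pattern only looks at the first two characters, so a subtitle text line that happens to start with 00, 01 or 02 would match as well.

```python
import re

regex = "^(0[0-2])"

def is_time_line(line):
    # hypothetical helper: True when the line starts with 00, 01 or 02
    return re.match(regex, line) is not None

sample = [
    "1518",
    "02:26:23,091 --> 02:26:24,978",
    "subtitle here",
    "",
]

# keep only the lines the regex recognises as time lines
matches = [line for line in sample if is_time_line(line)]
print(matches)
```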

The mechanism I use to update the file is as follows:
1. Read the source file
2. Check if the line matches the regular expression
2.1. if it matches: update the line by adding time to both time stamps and write it to an output file
2.2. if it doesn't match: write the line unchanged to the same output file.

Fairly straight-forward, isn't it?

When a line matches the regex, I need to extract the start and end time. Luckily these times are always in the same positions, so I slice the line to get them: characters 0 to 11 hold the start time and characters 17 to 28 hold the end time.
Once I have those times I calculate the new times and then I write the updated line to the new output file.
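
As a quick illustration of those positions, this sketch slices the sample time line from above. The names `middle` and `rest` are mine, added only to show that the four pieces together make up the whole line again.

```python
line = "02:26:23,091 --> 02:26:24,978\n"

start = line[0:12]    # first 12 characters: the start time
middle = line[12:17]  # the 5 characters " --> "
end = line[17:29]     # next 12 characters: the end time
rest = line[29:]      # whatever follows, here just the line ending

print(start)
print(end)
```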

import re

regex = "^(0[0-2])"
finput = open("/home/input", 'r')
foutput = open("/home/newsubtitles", 'w') #new empty file
for line in finput:
    if re.match(regex, line):
        start = line[0:12]
        end = line[17:29]
        #addTime is defined further down
        newstart = addTime(start)
        newend = addTime(end)
        #strings are immutable, so build the updated line from the new times
        #and the unchanged parts (" --> " and the line ending)
        line = newstart + line[12:17] + newend + line[29:]
        #write updated line to output file
        foutput.write(line)
    else:
        #in case of no match, just write line to output
        foutput.write(line)
#when all lines are copied, close both files
finput.close()
foutput.close()


So in this section I use the addTime function. This function expects a string and returns a new string, always in the format hh:mm:ss,xxx where xxx are the milliseconds. Unfortunately the separator isn't always the colon (there is a comma before the milliseconds), so I slice up the input string to get the hours, minutes, seconds and milliseconds. Then I add the necessary time to the milliseconds, seconds and minutes and check that the results stay within the normal clock ranges, since 64s does not exist in the real world. For example, if the number of seconds exceeds 59, I add 1 to the number of minutes and subtract 60 from the number of seconds. Each check carries over at most once, so this works as long as the constants are not negative and each one is smaller than its unit (addmilli below 1000, addss and addmm below 60).
In the code below the "add*" variables are constants, defined somewhere else, so you can adjust them to your own needs.

def addTime(time):
    hh = int(time[0:2])
    mm = int(time[3:5])
    ss = int(time[6:8])
    milli = int(time[9:12])
    milli = milli + addmilli
    ss = ss + addss
    mm = mm + addmm
    if milli > 999:
        milli = milli - 1000
        ss = ss + 1
    if ss > 59:
        ss = ss - 60
        mm = mm + 1
    if mm > 59:
        mm = mm - 60
        hh = hh + 1
    newtime = updateTime(hh, mm, ss, milli)
    return newtime
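
Another way to handle the carry-over, sketched here under my own assumptions, is to convert the whole time stamp to milliseconds, add the offset, and split the result back up with divmod. The function name `shift_timestamp` and the `offset_ms` parameter are hypothetical and not part of the script above. This variant should also cope with offsets larger than one unit (say 90 seconds) and with negative offsets, as long as the result does not drop below zero.

```python
def shift_timestamp(time, offset_ms):
    # time has the format "hh:mm:ss,xxx"; offset_ms is the shift in
    # milliseconds (hypothetical parameter, may be negative)
    hh = int(time[0:2])
    mm = int(time[3:5])
    ss = int(time[6:8])
    milli = int(time[9:12])

    # everything expressed in milliseconds, plus the offset
    total = ((hh * 60 + mm) * 60 + ss) * 1000 + milli + offset_ms
    if total < 0:
        raise ValueError("shift would make the time negative")

    # split back into milliseconds, seconds, minutes and hours
    total, milli = divmod(total, 1000)
    total, ss = divmod(total, 60)
    hh, mm = divmod(total, 60)
    return "%02d:%02d:%02d,%03d" % (hh, mm, ss, milli)

# 1 min 7.5 s, the same shift as addmm = 1, addss = 7, addmilli = 500
offset = (1 * 60 + 7) * 1000 + 500
print(shift_timestamp("02:26:23,091", offset))
```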

The updateTime function makes sure that hh, mm and ss are always 2 digits and milli is always 3 digits, by padding with leading zeros where needed.

def updateTime(hh, mm, ss, milli):
    milli = str(milli)
    if len(milli) == 1:
        milli = "00%s" %milli
    elif len(milli) == 2:
        milli = "0%s" %milli
    ss = str(ss)
    if len(ss) == 1:
        ss = "0%s" %ss
    mm = str(mm)
    if len(mm) == 1:
        mm = "0%s" %mm
    hh = str(hh)
    if len(hh) == 1:
        hh = "0%s" %hh
    newtime = "%s:%s:%s,%s" %(hh, mm, ss, milli)
    return newtime
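
The same padding can also be written with format specifiers, which is shorter. This is just a sketch and `pad_time` is a name I made up; it expects the four values as integers, like updateTime does.

```python
def pad_time(hh, mm, ss, milli):
    # %02d pads a number with zeros to 2 digits, %03d to 3 digits
    return "%02d:%02d:%02d,%03d" % (hh, mm, ss, milli)

print(pad_time(2, 7, 5, 91))
```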


We're there. I'm quite sure that an experienced programmer, which I'm not, could do this in less code. But at least it worked for me, and it took me less than 1 hour to get it working. That even left me time to write this blog; doing it manually, I would still be changing the file, and probably with more mistakes.

Final code:

import re

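regex = "^(0[0-2])" #time lines start with 00, 01 or 02
input = "/home/dewolfth/input.txt" #source subtitle file
subs = "/home/dewolfth/mymovie.txt" #new subtitle file
addhour = 1 #defined, but addTime below never applies it
addmm = 1 #minutes to add
addss = 7 #seconds to add
addmilli = 500 #milliseconds to add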

def updateTime(hh, mm, ss, milli):
    milli = str(milli)
    if len(milli) == 1:
        milli = "00%s" %milli
    elif len(milli) == 2:
        milli = "0%s" %milli
    ss = str(ss)
    if len(ss) == 1:
        ss = "0%s" %ss
    mm = str(mm)
    if len(mm) == 1:
        mm = "0%s" %mm
    hh = str(hh)
    if len(hh) == 1:
        hh = "0%s" %hh
    newtime = "%s:%s:%s,%s" %(hh, mm, ss, milli)
    return newtime

def addTime(time):        
    hh = int(time[0:2])
    mm = int(time[3:5])
    ss = int(time[6:8])
    milli = int(time[9:12])
    
    milli = milli + addmilli
    ss = ss + addss
    mm = mm + addmm
    
    if milli > 999:
        milli = milli - 1000
        ss = ss + 1
    if ss > 59:
        ss = ss - 60
        mm = mm + 1
    if mm > 59:
        mm = mm - 60
        hh = hh + 1
                
    newtime = updateTime(hh, mm, ss, milli)
    return newtime    

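def main():
    #the with statement closes both files, even when an error occurs
    with open(input, 'r') as finput, open(subs, 'w') as foutput:
        for line in finput:
            if re.match(regex,line):
                start = line[0:12]
                newstart = addTime(start)
                end = line[17:29]
                newend = addTime(end)
                #rebuild the line from the new times and the unchanged parts,
                #so a new start time that equals the old end time can't be
                #replaced twice
                line = newstart + line[12:17] + newend + line[29:]
            #matching lines are written updated, all other lines unchanged
            foutput.write(line)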
            
if __name__ == "__main__":
    main()


1 comment:

  1. That could probably be done in vim or emacs as a oneline regex search and replace using groups.
