open
. It takes one argument – a string which contains the name of the file – and returns a file object:
my_file = open("dna.txt")
A file object is a new type of data. With strings and numbers it was easy to understand what they represented – a single bit of text, or a single number. A file object, in contrast, represents something a bit less tangible – it represents a file on your computer's hard drive.
The first thing we need to be able to do is to read the contents of the file. The file type has a read
method which does this. It doesn't take any arguments, and the return value is a string, which we can store in a variable. Once we've read the file contents into a variable, we can treat them just like any other string – for example, we can print them:
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)
my_file_name = "dna.txt"
my_file = open(my_file_name)
my_file_contents = my_file.read()
What's going on here? On line 1, we store the string dna.txt in the variable my_file_name
. On line 2, we use the variable my_file_name
as the argument to the open
function, and store the resulting file object in the variable my_file
. On line 3, we call the read
method on the variable my_file
, and store the resulting string in the variable my_file_contents
.
The important thing to understand about this code is that there are three separate variables which have different types and which are storing three very different things. my_file_name
is a string, and it stores the name of a file on disk. my_file
is a file object, and it represents the file itself. my_file_contents
is a string, and it stores the text that is in the file.
Remember that variable names are arbitrary – the computer doesn't care what you call your variables. So this piece of code is exactly the same as the previous example:
apple = "dna.txt"
banana = open(apple)
grape = banana.read()
except it is harder to read! In contrast, the file name (dna.txt) is not arbitrary – it must correspond to the name of a file on the hard drive of your computer.
A common error is to try to use the read
method on the wrong thing. Recall that read
is a method that only works on file objects. If we try to use the read
method on the file name:
my_file_name = "dna.txt"
my_contents = my_file_name.read()
we'll get an AttributeError
– Python will complain that strings don't have a read
method:
AttributeError: 'str' object has no attribute 'read'
Another common error is to use the file object when we meant to use the file contents. If we try to print the file object:
my_file_name = "dna.txt"
my_file = open(my_file_name)
print(my_file)
we won't get an error, but we'll get an odd-looking line of output:
<open file 'dna.txt', mode 'r' at 0x7fc5ff7784b0>
We won't discuss the meaning of this line now: just remember that if you try to print the contents of a file but instead you get some output that looks like the above, you have almost definitely printed the file object rather than the file contents.
Let's take a look at the output we get when we try to print some information from a file. We'll use the file dna.txt from the repository. This file contains a single line with a short DNA sequence. Open the file up in a text editor and take a look at it.We're going to write a simple program to read the DNA sequence from the file and print it out along with its length. Putting together the file functions and methods from this session, and the printing / string manipulation stuff from the preperation chapter, we get the following code:
# open the file
my_file = open("dna.txt")
# read the contents
my_dna = my_file.read()
# calculate the length
dna_length = len(my_dna)
# print the output
print("sequence is " + my_dna + " and length is " + str(dna_length))
When we look at the output, we can see that the program is working almost perfectly – but there is something strange: the output has been split over two lines:
sequence is ACTGTACGTGCACTGATC
and length is 19
The explanation is simple once you know it: Python has included the new line character at the end of the dna.txt file as part of the contents. In other words, the variable my_dna
has a new line character at the end of it. If we could view the my_dna
variable directly (In fact, we can do this – there's a function called repr
that returns a representation of a variable) , we would see that it looks like this:
'ACTGTACGTGCACTGATC\n'
The solution is also simple. Because this is such a common problem, strings have a method for removing new lines from the end of them. The method is called rstrip
, and it takes one string argument which is the character that you want to remove. In this case, we want to remove the newline character (\n
). Here's a modified version of the code – note that the argument to rstrip
is itself a string so needs to be enclosed in quotes:
my_file = open("dna.txt")
my_file_contents = my_file.read()
# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
and now the output looks just as we expected:
sequence is ACTGTACGTGCACTGATC and length is 18
In the code above, we first read the file contents and then removed the newline, in two separate steps:
my_file_contents = my_file.read()
my_dna = my_file_contents.rstrip("\n")
but it's more common to read the contents and remove the newline all in one go, like this:
my_dna = my_file.read().rstrip("\n")
This is a bit tricky to read at first as we are using two different methods (read
and rstrip
) in the same statement. The key is to read it from left to right – we take the my_file
variable and use the read
method on it, then we take the output of that method (which we know is a string) and use the rstrip
method on it. The result of the rstrip
method is then stored in the my_dna
variable.
If you find it difficult write the whole thing as one statement like this, just break it up and do the two things separately – your programs will run just as well.
What happens if we try to read a file that doesn't exist?my_file = open("nonexistent.txt")
We get a new type of error that we've not seen before:
IOError: [Errno 2] No such file or directory: 'nonexistent.txt'
Printing output to the screen only really works well when there isn't very much of it (Linux users may be aware that we can redirect terminal output to a file using shell redirection, which can get around some of these problems). It's great for short programs and status messages, but quickly becomes cumbersome for large amounts of output. Some terminals struggle with large amounts of text, or worse, have a limited scrollback capability which can cause the first bit of your output to disappear. It's not easy to search in output that's being displayed at the terminal, and long lines tend to get wrapped. Also, for many programs we want to send different bits of output to different files, rather than having it all dumped in the same place.
Most importantly, terminal output vanishes when you close your terminal program. For small programs like the examples in this book, that's not a problem – if you want to see the output again you can just re-run the program. If you have a program that requires a few hours to run, that's not such a great option.
In the previous section, we saw how to open a file and read its contents. We can also open a file and write some data to it, but we have to use theopen
function in a slightly different way. To open a file for writing, we use a two-argument version of the open
function, where the second argument is a specially-formatted string describing what we want to do to the file (We call this the mode of the file) . This second argument can be "r" for reading, "w" for writing, or "a" for appending (These are the most commonly-used options – there are a few others) . If we leave out the second argument (like we did for all the examples above), Python uses the default, which is "r" for reading.
The difference between "w" and "a" is subtle, but important. If we open a file that already exists using the mode "w", then we will overwrite the current contents with whatever data we write to it. If we open an existing file with the mode "a", it will add new data onto the end of the file, but will not remove any existing content. If there doesn't already exist a file with the specified name, then "w" and "a" behave identically – they will both create a new file to hold the output.
Quite a lot of Python functions and methods have these optional arguments. For the purposes of this tutorial, we will only mention them when they are directly relevant to what we're doing. If you want to see all the optional arguments for a particular method or function, the best place to look is the official Python documentation.
Once we've opened a file for writing, we can use the file write
method to write some text to it. write
works a lot like print
– it takes a single string argument - but instead of printing the string to the screen it writes it to the file.
Here's how we use open
with a second argument to open a file and write a single line of text to it:
my_file = open("out.txt", "w")
my_file.write("Hello world")
Because the output is being written to the file in this example, you won't see any output on the screen if you run it. To check that the code has worked, you'll have to run it, then open up the file out.txt in your text editor and check that its contents are what you expect (.txt is the standard file name extension for a plain text file. Later when we generate output files with a particular format, we'll use different file name extensions.) .
Remember that with write
, just like with print
, we can use any string as the argument. This also means that we can use any method or function that returns a string. The following are all perfectly OK:
# write "abcdef"
my_file.write("abc" + "def")
# write "8"
my_file.write(str(len('AGTGCTAG')))
# write "TTGC"
my_file.write("ATGC".replace('A', 'T'))
# write "atgc"
my_file.write("ATGC".lower())
# write contents of my_variable
my_file.write(my_variable)
close
. Unsurprisingly, this is the opposite of open
(but note that it's a method, whereas open
is a function). We should call close
after we're done reading or writing to a file – we won't go into the details here, but it's a good habit to get into as it avoids some types of bugs that can be tricky to track down. close
is an unusual method as it takes no arguments (so it's called with an empty pair of parentheses) and doesn't return any useful value:
my_file = open("out.txt", "w")
my_file.write("Hello world")
# remember to close the file
my_file.close()
The open
function is quite happy to deal with files from anywhere on your computer, as long as you give the full path (i.e. the sequence of folder names that tells you the location of the file). Just give a file path as the argument rather than a file name. The format of the file path looks different depending on your operating system. If you're on Linux, it will look like this:
my_file = open("/home/martin/myfolder/myfile.txt")
if you're on Windows, like this (The extra r character before the string is necessary to prevent Python from trying to interpret the backslash in the file path) :
my_file = open(r"c:\windows\Desktop\myfolder\myfile.txt")
and if you're on a Mac, like this:
my_file = open("/Users/martin/Desktop/myfolder/myfile.txt")
>sequence_name
ATCGACTGATCGATCGTACGAT
Where sequence_name is a header that describes the sequence (the greater-than symbol indicates the start of the header line). Often, the header contains an accession number that relates to the record for the sequence in a public sequence database. A single FASTA file can contain multiple sequences, like this:
>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
>sequence_three
ACTGCATCGATCGTACCT
Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.
Sequence header |
DNA sequence |
---|---|
|
|
|
|
|
|