Exercise 3: String Methods and Dictionaries - Introduction to Python For Bioinformatics

A line in a file has this content: QR57613 1.3 Serpentes Pythonidae. Using Python, how could you count the number of words in the line?
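One way to approach this, sketched below, is to call `split()` with no arguments, which splits on any run of whitespace. The variable names `line`, `words` and `word_count` are illustrative and not part of the exercise.

```python
# The line from the exercise, stored as a string (names are illustrative)
line = "QR57613 1.3 Serpentes Pythonidae"

# split() with no arguments splits on runs of whitespace and drops empty strings
words = line.split()
word_count = len(words)
print(word_count)  # 4
```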
Another line has this content: QR57613\t1.3\tSerpentes\tPythonidae\tPython regius. You are told that it comes from a tab-separated values (TSV) file. How many fields (separated by tabs) are there in the line?
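A possible sketch for the TSV question: pass the tab character explicitly to `split()`, so that the space inside the last field is not treated as a separator. The names `tsv_line` and `fields` are illustrative.

```python
# The TSV line from the exercise; \t inside a normal string literal is a tab
tsv_line = "QR57613\t1.3\tSerpentes\tPythonidae\tPython regius"

# Split on tabs only, so "Python regius" stays together as one field
fields = tsv_line.split("\t")
field_count = len(fields)
print(field_count)  # 5

# For comparison: splitting on any whitespace breaks the last field in two
print(len(tsv_line.split()))  # 6
```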

You have the following output from the FINDHOM tool:

findhom_result = """#FINDHOM v 1.2:
Search results:
Query\tMatch fraction\tScore\tSubject
SMPL001\t0.7\t12331\tAQ10213 Phlebotomus perniciosus
SMPL003\t0.5\t6032\tBZ102363 Phlebotomus papatasi
SMPL004\t0.8\t13123\tRD178237 Sergentomyia dubia
SMPL007\t0.6\t10610\tBQ187981 Phlebotomus papatasi"""

How would you split the string in findhom_result into multiple lines?

The .startswith() method of a string tests whether the string begins with a given prefix. E.g. mystring.startswith('Name') tests if mystring starts with 'Name'. How many of the lines from findhom_result start with SMPL?
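The two questions above can be approached together. The sketch below uses `splitlines()` to break the string into lines (`split('\n')` would also work for this data) and then counts the lines for which `startswith('SMPL')` is true. The names `lines` and `smpl_count` are illustrative.

```python
findhom_result = """#FINDHOM v 1.2:
Search results:
Query\tMatch fraction\tScore\tSubject
SMPL001\t0.7\t12331\tAQ10213 Phlebotomus perniciosus
SMPL003\t0.5\t6032\tBZ102363 Phlebotomus papatasi
SMPL004\t0.8\t13123\tRD178237 Sergentomyia dubia
SMPL007\t0.6\t10610\tBQ187981 Phlebotomus papatasi"""

# Break the multi-line string into a list with one entry per line
lines = findhom_result.splitlines()
print(len(lines))  # 7

# Count the lines that begin with the prefix 'SMPL'
smpl_count = sum(1 for line in lines if line.startswith("SMPL"))
print(smpl_count)  # 4
```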
The FINDHOM output consists of a prelude, then a header line and multiple lines of tab-separated data. Given the findhom_result data, write Python code to count the number of result lines in the findhom_result. Do not make any assumptions about the sample naming, i.e. do not assume that each result line starts with SMPL or similar.
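A minimal sketch of one approach: locate the header line, then count the non-blank lines after it. It assumes the header line can be recognised because it starts with `Query`, as it does in the data above; the names `header_index`, `result_lines` and `result_count` are illustrative.

```python
findhom_result = """#FINDHOM v 1.2:
Search results:
Query\tMatch fraction\tScore\tSubject
SMPL001\t0.7\t12331\tAQ10213 Phlebotomus perniciosus
SMPL003\t0.5\t6032\tBZ102363 Phlebotomus papatasi
SMPL004\t0.8\t13123\tRD178237 Sergentomyia dubia
SMPL007\t0.6\t10610\tBQ187981 Phlebotomus papatasi"""

lines = findhom_result.splitlines()

# Find the position of the header line (assumed to start with 'Query')
header_index = None
for i, line in enumerate(lines):
    if line.startswith("Query"):
        header_index = i
        break

if header_index is None:
    # No header found, so there are no result lines to count
    result_lines = []
else:
    # Everything after the header is a result line; ignore blank lines
    result_lines = [line for line in lines[header_index + 1:] if line.strip()]

result_count = len(result_lines)
print(result_count)  # 4
```

Nothing here depends on how the samples are named; only the header line is used as a landmark.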
Using the data in findhom_result, write a function process_findhom that takes a string like findhom_result and creates a query_to_subject dictionary, which associates the value in the Query field with the contents of the Subject field (the accession followed by the species name). Here is one implementation of process_findhom, followed by an example call and its result:
```
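def process_findhom(fh_results):
    """Map each Query identifier to the contents of its Subject field."""
    query_to_subject = {}
    in_results = False  # becomes True once the header line has been seen
    for line in fh_results.splitlines():
        if line.startswith('Query'):
            # The header line marks the start of the result lines
            in_results = True
            continue
        if not in_results or not line.strip():
            # Skip the prelude and any blank lines
            continue
        fields = line.split('\t')
        # Column 0 is the Query, column 3 is the Subject
        query_to_subject[fields[0]] = fields[3]
    return query_to_subject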
```
```
process_findhom(findhom_result)
```
{'SMPL001': 'AQ10213 Phlebotomus perniciosus', 'SMPL003': 'BZ102363 Phlebotomus papatasi', 'SMPL004': 'RD178237 Sergentomyia dubia', 'SMPL007': 'BQ187981 Phlebotomus papatasi'}
In process_findhom, what would happen if two queries had the same identifier (e.g. if SMPL003 appeared on more than one result line)? In a real-world example, how would you want this situation to be dealt with?
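Because assigning to an existing dictionary key replaces its value, the last line seen for a repeated identifier would silently win. One possible alternative, sketched below, keeps every subject in a list per query. The function name `process_findhom_multi` and the input string `duplicated` are hypothetical: the input is invented for illustration by reusing rows from the data above and is not real FINDHOM output. Raising an error on a duplicate would be another reasonable choice.

```python
def process_findhom_multi(fh_results):
    """Map each Query identifier to a list of all its Subject fields."""
    query_to_subjects = {}
    in_results = False  # becomes True once the header line has been seen
    for line in fh_results.splitlines():
        if line.startswith("Query"):
            in_results = True
            continue
        if not in_results or not line.strip():
            # Skip the prelude and any blank lines
            continue
        fields = line.split("\t")
        # setdefault creates an empty list the first time a query is seen
        query_to_subjects.setdefault(fields[0], []).append(fields[3])
    return query_to_subjects


# Hypothetical input in which SMPL003 appears twice (invented for illustration)
duplicated = (
    "#FINDHOM v 1.2:\n"
    "Search results:\n"
    "Query\tMatch fraction\tScore\tSubject\n"
    "SMPL003\t0.5\t6032\tBZ102363 Phlebotomus papatasi\n"
    "SMPL003\t0.8\t13123\tRD178237 Sergentomyia dubia\n"
)

result = process_findhom_multi(duplicated)
print(result)
# {'SMPL003': ['BZ102363 Phlebotomus papatasi', 'RD178237 Sergentomyia dubia']}
```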