Shootout/Knucleotide
< Shootout
This is a ShootoutEntry for [1].
[This page is completely unreadable. Would it make sense to put the programs into their own sub-pages?] [ Personally, I agree - Perhaps the original writers would care to do so? BrettGiles 02:24, 15 February 2007 (UTC)]
About the k-nucleotide benchmark
Each program should
- read line-by-line a redirected FASTA format file from stdin
- extract DNA sequence THREE
- define a procedure/function to update a hashtable of k-nucleotide keys and count values, for a particular reading-frame - even though we'll combine k-nucleotide counts for all reading-frames (grow the hashtable from a small default size)
- write the code and percentage frequency, for all the 1-nucleotide and 2-nucleotide sequences, sorted by descending frequency and then ascending k-nucleotide key
- count all the 3- 4- 6- 12- and 18-nucleotide sequences, and write the count and code for specific sequences
We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.
Correct output for this 100KB input file (generated with the fasta program N = 10000), is
A 30.284 T 29.796 C 20.312 G 19.608
AA 9.212 AT 8.950 TT 8.948 TA 8.936 CA 6.166 CT 6.100 AC 6.086 TC 6.042 AG 6.036 GA 5.968 TG 5.868 GT 5.798 CC 4.140 GC 4.044 CG 3.906 GG 3.798
562 GGT 152 GGTA 15 GGTATT 0 GGTATTTTAATT 0 GGTATTTTAATTTATAGT
In practice, less brute-force would be used to calculate k-nucleotide frequencies, for example Virus Classification using k-nucleotide Frequencies and A Fast Algorithm for the Exhaustive Analysis of 12-Nucleotide-Long DNA Sequences. Applications to Human Genomics (105KB pdf).
This shootout benchmark has really nailed a deficiency in the standard libraries. We could work around not having fast ascii strings with a few extra functions, but not having a fast associative data structure is another matter. We cannot win this one on speed; this entry's approach cannot win this on lines of code; and memory usage is dominated by the data. That is why I feel the main goal is making a new entry that reads well and avoids the memory leak of the old entry. -- ChrisKuklewicz
Contents |
1 Plan
Ok, where do we go from here:
* Can we write a legal Map-based version of Ketil's Trie somehow?
-- DonStewart
I converted it naively and you need to benchmark it, see sections 5.1 and 5.2 for map and association list version. -- ChrisKuklewicz
2 Benchmarks
We must test all entries on N=250k, as that is what they use in the shootout
N=250,000. Debian Linux/x86.
| Brian Sniffen's Map #1 | 30.7s | 86M | -funbox-strict-fields | |
| Einar's original | 24.8s | 260M | ||
| Generalised Trie #1 | 23.8s | 93M | +RTS -K32m | removed |
| Brian Sniffen's Map #2 | 21.7s | 86M | -funbox-strict-fields | |
| Chris's current entry | 18.5s | 65M | ||
| Generalised Trie #2 | 13.8s | 100M | +RTS -K32m | removed |
| Generalised Trie #2 | 4.1s | 100M | +RTS -K32m -H100m | removed |
| Ketil's Trie | 3.6s | 76M | -funbox-strict-fields | |
| Generalised Trie w/ Assoc List #2 | 3.5s | 100M | -funbox-strict-fields |
2.1 More benchmarks
On a powerbook G4:
It is important to add "+RTS -A100m -RTS" to all the entries. This would help the entry in the shootout, as well.
| PROGRAM | Normal (GC%) | -A (GC%) |
| Data.Hashtable | 59.28s (41.6%) | 44.58s ( 9.7%) |
| Brain Sniffen's Map, v2 | 66.67s (62.2%) | 30.48s (16.0%) |
| Ketil's Trie | 15.85s (84.0%) | 5.04s (43.7%) |
| Ketil's Trie -A800m | - | 3.33s ( 0.0%) |
3 On mutable data structures in Haskell
This benchmark is completely bottlenecked by Data.Hashtable (in GHC 6.4.1) and this discussion is based on the assumption that I am using Hashtable optimally. I downloaded the GHD 0.17 compiler and the DMD entry to benchmark on my machine. The DMD entry uses the "associative arrays" built into the language: "int[char[]] frequencies" and places second (runs in 3.0s). The winning entry is interesting, since the c-code does not have a hash table, and so it uses #include "../../Include/simple_hash.h" which pulls in a dead simple, inline, string to int hashtable and runs in 2s.
The entry below runs 16 slower than the DMD entry on my powerbook G4. Profiling shows the bottleneck. I downloaded simple_hash.h and shamelessly optimized it to replace Data.Hashtable for exactly this benchmark code. This sped up the proposed entry by a factor of 4.1 so that it is now about a factor of 4 slower than the DMD entry. This shows that Data.Hashtable is doing *at least* four times more work that is needed for this usage pattern. And even with my over specialized hash table, Haskell is almost 50% slower than OCaml's "module H = Hashtbl.Make(S)" (note that I my code uses the same hash function as OCaml). Unfortunately I cannot submit this optimized hash table entry to the shootout.
The only mutable data structure that come with GHC besides arrays is Data.Hashtable, which is not comptetitive with OCaml Hashtbl or DMD's associative arrays (unless there is something great hidden under Data.Graph). Is there any hope for GHC 6.6? Does anyone have pointers to an existing library at all? Perl and Python and Lua also have excellent built in hashtable capabilities. Where is a good library for Haskell?
4 Custom HashTable + ByteStrings
Wipes the other HashTable in Haskell off the map.
{-# OPTIONS -fglasgow-exts -fbang-patterns -funbox-strict-fields #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- Contributed by Don Stewart -- Uses a port of the simple hashtable from the Clean entry -- import GHC.Exts import GHC.IOBase import Foreign import Char import List import Maybe import Text.Printf import Data.ByteString.Base import qualified Data.ByteString.Char8 as S import Data.Array.Base import qualified Data.Array.IO as A main = do (PS fp o l) <- get (S.pack ">TH") withForeignPtr fp $ \p -> do let sec = p `plusPtr` o mapM_ (writeFreqs l sec) [1,2] mapM_ (writeFrame l sec) =<< mapM toseq strs strs = ["GGT","GGTA","GGTATT","GGTATTTTAATT","GGTATTTTAATTTATAGT"] get p = do s <- S.getContents let Just n = S.findSubstring p s return $! S.map toUpper -- array fusion! . S.filter ((/=) '\n') . S.dropWhile ((/=) '\n') . S.copy . S.drop n $ s writeFreqs size p n = do h <- htNew n size htInsert size p n h let vs = htNodes h mapM_ draw (sortBy kf vs) putChar '\n' where draw (Node p f) = printf "%s %.3f\n" (ppr n p) pct where pct = (100 * (fromIntegral f) / total) :: Double total = fromIntegral (size - n + 1) kf (Node k x) (Node j y) = case compare y x of EQ -> compare (ppr n k) (ppr n j); x -> x writeFrame size p (n,k) = do h <- htNew n size htInsert size p n h Node k v <- htFind k h putStrLn $ (show v) ++ ('\t' : ppr n k) ppr n p = inlinePerformIO (map w2c `fmap` peekArray n p) toseq s = fmap ((,) (length s)) (newArray0 0 (map c2w s)) ------------------------------------------------------------------------ -- -- An implementation of simpl_hash.c in Haskell -- data Hash = HT !Int !Int !(A.IOArray Int Buckets) data Buckets = Empty | Bucket !(Ptr Word8) !Int | Buckets [Node] data Node = Node !(Ptr Word8) !Int htNew :: Int -> Int -> IO Hash htNew !fl !sz = HT fl nprime `fmap` A.newArray (0,nprime-1) Empty where n = htSize fl sz nprime = head (dropWhile (< n) primes) htSize :: Int -> Int -> Int htSize !fl !buflen = min lim (go (fl-1) 4) where lim = (buflen - fl) `div` 1024 go !n !m | n > 0 && m < lim = go (n-1) (m*4) | otherwise = m htInsert :: Int -> Ptr Word8 -> Int -> Hash -> IO () htInsert !s !p n !h = mapM_ (htInc h . plusPtr p) [0..s-n] htInc :: Hash -> Ptr Word8 -> IO () htInc ht@(HT n size arr) k = case htHash size n k of !i -> do b <- unsafeRead arr i unsafeWrite arr i $! inc b where equal = eq n inc :: Buckets -> Buckets inc (Bucket !k' !v) | k' `equal` k = Bucket k' (v+1) | otherwise = Buckets $ Node k' v : [Node k 1] inc (Buckets b) = Buckets $ incL b inc Empty = Bucket k 1 incL :: [Node] -> [Node] incL (!i@(Node k' v):ls) | k' `equal` k = Node k' (v+1) : ls | otherwise = i : incL ls incL [] = [Node k 1] htNodes :: Hash -> [Node] htNodes ht@(HT n size arr) = items 0 where read i = inlinePerformIO $! unsafeRead arr i items !i | i >= size = [] | otherwise = items_bucket (read i) (i+1) items_bucket !(Bucket !k' !v) i = Node k' v : items i items_bucket !(Buckets !b) i = items_list b i items_bucket Empty !i = items i items_list (!e:l) !i = e : items_list l i items_list [] !i = items i htFind :: Ptr Word8 -> Hash -> IO Node htFind !k !h@(HT n size arr) = do let !i = htHash size n k v <- unsafeRead arr i return $! find v where equal = eq n find (Bucket k' v) | k' `equal` k = Node k' v | otherwise = Node k 0 find (Buckets l) = find' l find Empty = Node k 0 find' (i@(Node !k' v):ls) | k' `equal` k = i | otherwise = find' ls find' [] = Node k 0 htHash :: Int -> Int -> Ptr Word8 -> Int htHash (I# max) (I# size) ptr@(Ptr p) = abs . inlinePerformIO . IO $ go p 0# where lim = p `plusAddr#` size go p acc !s | p `geAddr#` lim = (# s, I# (acc `remInt#` max) #) | otherwise = case readInt8OffAddr# p 0# s of (# s, i #) -> go (p `plusAddr#` 1#) (5# *# acc +# i) s -- A fast Ptr comparison for Hash keys eq !n p q = inlinePerformIO $ do a <- peek p :: IO Word8 b <- peek q :: IO Word8 if a /= b then return False else go n p q where go !n !p !q | n == 0 = return True | otherwise = do a <- peek p :: IO Word8 b <- peek q :: IO Word8 if a /= b then return False else go (n-1) (p `plusPtr` 1) (q `plusPtr` 1) primes = [ 53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 196613, 93241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 50331653, 100663319, 201326611, 402653189, 805306457 ]
5 HashTable + ByteStrings
{-# OPTIONS -fbang-patterns #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- Chris Kuklewicz and Don Stewart -- import Char import Foreign import List import Maybe import Text.Printf import GHC.Exts import GHC.Int import GHC.IOBase import Data.ByteString.Base import qualified Data.ByteString.Char8 as S import qualified Data.HashTable as T main = do (PS fp o l) <- get (S.pack ">TH") withForeignPtr fp $ \p -> do let sec = (l, p `plusPtr` o) mapM_ (writeFreqs sec) [1,2] mapM_ (writeFrame sec) =<< mapM toseq strs strs = ["GGT","GGTA","GGTATT","GGTATTTTAATT","GGTATTTTAATTTATAGT"] get p = do s <- S.getContents let Just n = S.findSubstring p s return $! S.map toUpper -- array fusion! . S.filter ((/=) '\n') . S.dropWhile ((/=) '\n') . S.copy . S.drop n $ s count :: Int -> Ptr Word8 -> Int -> Hash -> IO () count s !p n !h = mapM_ (inc . plusPtr p) [0..s-n] where inc !k = do mold <- T.lookup h k case mold of Nothing -> T.insert h k 1 Just n -> do T.update h k $! n+1 ; return () writeFreqs (size,p) n = do h <- newH n count size p n h mapM_ draw . sortBy kf =<< T.toList h putChar '\n' where draw (p,f) = printf "%s %.3f\n" (ppr n p) pct where pct = (100 * (fromIntegral f) / total) :: Double total = fromIntegral (size - n + 1) kf (k,x) (j,y) = case compare y x of EQ -> compare (ppr n k) (ppr n j); x -> x writeFrame (size,p) (n,k) = do h <- newH n count size p n h v <- T.lookup h k putStrLn $ (show $ fromMaybe 0 v) ++ ('\t' : ppr n k) toseq s = fmap ((,) (length s)) (newArray0 0 (map c2w s)) ppr n p = inlinePerformIO (map w2c `fmap` peekArray n p) ------------------------------------------------------------------------ type Hash = T.HashTable (Ptr Word8) Int newH :: Int -> IO Hash newH n = T.new (eq n) (hash n) hash n (Ptr p) = inlinePerformIO $ IO $ go n 0# p where go !n acc p s | n == 0 = (# s, I32# acc #) | otherwise = case readInt8OffAddr# p 0# s of (# s, i #) -> go (n-1) (5# *# acc +# i) (plusAddr# p 1#) s -- Faster than a memcmp! eq !n (Ptr p) (Ptr q) = inlinePerformIO $ IO $ go n p q where go !n p q s | n == 0 = (# s , True #) | otherwise = case readInt8OffAddr# p 0# s of (# s, a #) -> case readInt8OffAddr# q 0# s of (# s, b #) | a /=# b -> (# s, False #) | otherwise -> go (n-1) (plusAddr# p 1#) (plusAddr# q 1#) s
6 Data.Map #1
I built the following first using naive IO, and very closely following the OCaml implementation. It ran it about 85 seconds. I then lifted the Seq and advancePtr code from Chris and Don's "Haskell #2" entry. It now runs in under 4 seconds on my machine, faster than anything but the C and D entries. Note that "updateCount" is there to satisfy the shootout's demand for an updater function. --BrianSniffen
Note: don't submit entries with {- -} style comments. They aren't lexed correctly on the shootout, and end up contributing towards our line count score. Use -- comments only -- DonStewart
-- new-knucleotide.hs -- -- The Great Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- By Brian Sniffen, Chris Kuklewicz, and Don Stewart -- import Data.Map as Map (Map, empty, insertWith, fold, foldWithKey, findWithDefault) import Control.Monad import System.IO import Data.List (map,sort,isPrefixOf) import Data.Char (ord,chr,toUpper) import Foreign import Text.Printf (printf) import GHC.Exts import GHC.IOBase type Base = Word8 c2b :: Char -> Base = fromIntegral . ord . toUpper b2c :: Base -> Char = chr . fromIntegral putWord8s = putStr . map b2c -- The ptr are usually into the main fasta data, which is read-only data Seq = Seq !Int !(Ptr Base) deriving Eq instance Ord Seq where compare (Seq size1 ptr1) (Seq size2 ptr2) = case compare size1 size2 of EQ -> inlinePerformIO $ cmpmem size1 ptr1 ptr2 z -> z {-# INLINE cmpmem #-} cmpmem i ptr1 ptr2 = if i == 0 then return EQ else do cmp <- liftM2 compare (peek ptr1) (peek ptr2) case cmp of EQ -> cmpmem (pred i) (ptr1 `advancePtr` 1) (ptr2 `advancePtr` 1) z -> return z seqStrings = ["GGT","GGTA","GGTATT","GGTATTTTAATT","GGTATTTTAATTTATAGT"] substring s start len = take len . drop start $ s -- [counts count k dna] fills the hashtable [count] of -- k-nucleotide keys and count values for a particular reading-frame -- of length [k] of the string [dna]. counts :: Int -> Seq -> Map Seq Int counts = updateCounts Map.empty updateCounts base k (Seq size dna) = foldl' countFrame base [0..(size - k)] where countFrame j i = increment (Seq k (advancePtr dna i)) j increment k = Map.insertWith (+) k 1 -- [writeFrequencies count k dna] writes the frequencies for a -- reading-frame of length [k] sorted by descending frequency and then -- ascending k-nucleotide key. -} percentConcat :: Int -> Seq -> Int -> [(Float,Seq)] -> [(Float, Seq)] percentConcat tot k n l = (100 * (fromIntegral n) / (fromIntegral tot), k) : l writeFrequencies :: Int -> Seq -> IO () writeFrequencies k dna = do let count = counts k dna tot = Map.fold (+) 0 count frq = Map.foldWithKey (percentConcat tot) ([] :: [(Float,Seq)]) count mapM_ writeFreqRecord . reverse . sort $ frq putStrLn "" writeFreqRecord :: (Float,Seq) -> IO () writeFreqRecord (f,k) = do printf "%s %.3f\n" (showSeq k) f writeCount dna seq@(Seq len bases) = do mapM_ putStr [(show c), "\t", showSeq seq] putStrLn "" where c = Map.findWithDefault 0 seq (counts len dna) isThree = isPrefixOf ">THREE " readUntil pred = do l <- getLine if pred l then return () else readUntil pred skipComment = do line <- getLine case line of ';':_ -> skipComment _ -> return (map toUpper line) readSequence acc = do eof <- isEOF if eof then return (reverse acc) else do line <- getLine case line of '>':_ -> return (reverse acc) _ -> readSequence (map toUpper line : acc) -- Routines to convert strings to Seq and back stringToSeq str = do b <- newArray0 0 (map c2b str) s <- lengthArray0 0 b return (Seq s b) showSeq (Seq size ptr) = inlinePerformIO $ do peekArray size ptr >>= return . (map b2c) {-# INLINE inlinePerformIO #-} inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r -- Extract DNA sequence "THREE" from stdin -} dnaThree = do readUntil isThree firstLine <- skipComment otherLines <- readSequence [] let blob = concatMap (map c2b) $ firstLine : otherLines baseArray <- newArray0 0 blob size <- lengthArray0 0 baseArray return (Seq size baseArray) main = do contents <- dnaThree writeFrequencies 1 contents writeFrequencies 2 contents tests <- mapM stringToSeq seqStrings mapM_ (writeCount contents) tests
6.1 Data.Map #2
The problem with version 1 was the foldr in updateCounts. Changing to foldl' solves that. But this replacement for updateCounts makes it faster.
Compiling with "ghc --make -O3 -optc-O3 -fglasgow-exts -funbox-strict-fields -fasm" on G4.
-- Tweak to Brian Sniffen's Data.Map version by Chris Kuklewicz import Control.Monad.ST import Data.STRef updateCounts :: Map Seq Int -> Int -> Seq -> Map Seq Int updateCounts base k seq@(Seq size dna) = runST (do let toRef mapST (key,value) = do ref <- newSTRef value return $ Map.insert key ref mapST fromRef (key,ref) = readSTRef ref >>= (\value -> return (key,value)) baseST <- foldM toRef Map.empty (Map.toAscList base) let countframe :: Map Seq (STRef s Int) -> Int -> ST s (Map Seq (STRef s Int)) countframe mapST offset = do let key = Seq k (advancePtr dna offset) case Map.lookup key mapST of Nothing -> newSTRef 1 >>= (\ref -> return $ Map.insert key ref mapST) Just ref -> modifySTRef ref succ >> return mapST mapST <- foldM countframe baseST [0..size-k] mapM fromRef (Map.toAscList mapST) >>= return . (Map.fromAscList) )
6.2 Data.Map #3 (ByteString)
I cobbled this together based on some other entries. It could be a little clearer and certainly a little faster, but is probably a good starting point. It runs faster on the file that I have than any of the other versions I benchmarked.
import Data.Char import qualified Data.Map as Map import Data.List import Text.Printf(printf) import qualified Data.ByteString.Char8 as B import Data.Ord (comparing) type FreqMap = Map.Map B.ByteString Int loadMap :: Int -> [B.ByteString] -> FreqMap loadMap i s = foldl' (\m w -> Map.insertWith (+) w 1 m) Map.empty snips where snips = filter (not . B.null ) $ map (B.take i) s writeFrequencies i dna = let mp = loadMap i dna total = fromIntegral (Map.fold (+) 0 mp ) / 100 :: Double res = map (\(a,b) -> (a, fromIntegral b / total)) (Map.toAscList mp) in mapM_ showFun . sortBy (flip (comparing snd)) $ res showFun :: (B.ByteString, Double) -> IO () showFun (k,f) = printf "%s %.3f\n" (B.unpack k) f writeCount dna sq = printf "%d\t%s\n" cnt (B.unpack sq) where cnt = length $ filter (B.isPrefixOf sq) dna dnaThree :: IO [B.ByteString] dnaThree = process =<< B.getContents where process = return . B.tails . ul . takeNorm . tail . dropComment . dropOther . B.lines dropOther = dropWhile (\str -> not ((B.pack ">THREE") `B.isPrefixOf` str)) dropComment = dropWhile ((';' ==) . B.head) takeNorm = takeWhile (('>' /=) . B.head) ul = B.map toUpper . B.concat main = do three <- dnaThree writeFrequencies 1 three >> putStrLn "" writeFrequencies 2 three >> putStrLn "" mapM_ (writeCount three . B.pack) ["GGT", "GGTA", "GGTATT", "GGTATTTTAATT", "GGTATTTTAATTTATAGT"]
7 KetilMalde : Trie-based entry
This entry uses a lazy suffix trie (loosely based on Kurtz/Giegerich: "A comparison of imperative and purely functional suffix tree constructions").
I guess this is considered cheating, since it is the easy and natural way to do it :-) While the specifications say "hashtable", I interpret it to mean "standard associative data structure" - so using Data.Map would be okay, and perhaps even the true Haskell way. The reason I still consider this entry cheating, is that it relies on lazyness to compute only the requested frequencies, instead of, as the benchmark specifies calculation of -- and subsequent discarding -- all frequencies of the given word lengths.
If I were a mean spirited person (and I am not, no, really), I would point this out as yet another benchmark artificially penalizing a language where it is easy and natural to avoid unnecessary computation. As it is, this can perhaps go in the "Interesting Alternatives" category (as Chris points out).
Note that most of the time is now spent parsing input, if somebody wanted to further improve it, using FPS or similar would be the way to go.
In my opinion, we can always exploit lazyness. If a spec requires us to compute something that isn't used, we should just let our language take care of this. The spec can always be changed if they want to explicitly penalise lazy languages (which I don't think they do -- they just don't think of the alternatives). So, lazyness is definitely ok. -- Don
-- By KetilMalde and ChrisKuklewicz import System.IO import Text.Printf(printf) import Control.Monad(unless) import Data.List hiding (insert) import Data.Maybe(maybe,fromMaybe) import Data.Char(ord,chr,toUpper) seqStrings = ["GGT","GGTA","GGTATT","GGTATTTTAATT","GGTATTTTAATTTATAGT"] data Trie = T { as, cs, gs, ts, ns :: Trie, acount, ccount, gcount, tcount, ncount :: !Int } | Nil deriving (Eq,Show) total (T _ _ _ _ _ ac cc gc tc nc) = ac+cc+gc+tc+nc emptyT = T Nil Nil Nil Nil Nil 0 0 0 0 0 main = do all <- getContents let three = getSection ">THREE" all t = foldl' insert emptyT (tails three) showFrequencies 1 t showFrequencies 2 t mapM_ putStrLn $ map (\(f,s)->(show f)++('\t':s)) $ zip (map (lookupFrequency t) seqStrings) seqStrings insert :: Trie -> String -> Trie insert t [] = t insert Nil (s:ss) = case s of 'A' -> emptyT {as = insert emptyT ss, acount = 1} 'C' -> emptyT {cs = insert emptyT ss, ccount = 1} 'G' -> emptyT {gs = insert emptyT ss, gcount = 1} 'T' -> emptyT {ts = insert emptyT ss, tcount = 1} _ -> emptyT {ns = insert emptyT ss, ncount = 1} insert t (s:ss) = case s of 'A' -> t { as = insert (as t) ss, acount = 1+(acount t) } 'C' -> t { cs = insert (cs t) ss, ccount = 1+(ccount t) } 'G' -> t { gs = insert (gs t) ss, gcount = 1+(gcount t) } 'T' -> t { ts = insert (ts t) ss, tcount = 1+(tcount t) } _ -> t { ns = insert (ns t) ss, ncount = 1+(ncount t) } showFrequencies k t = let printBSF (bs,f) = printf "%s %.3f\n" bs f asPercent = map convert hist where tot = fromIntegral (total t) :: Double hist = writeFrequencies k t convert (s,f) = (s,100 * fromIntegral f / tot) mySort = sortBy freqAndKey where freqAndKey (k1,x) (k2,y) = case compare y x of EQ -> compare k1 k2 z -> z in do mapM_ printBSF (mySort asPercent) putChar '\n' writeFrequencies :: Int -> Trie -> [(String,Int)] writeFrequencies _ Nil = [] writeFrequencies 1 (T _ _ _ _ _ ac cc gc tc nc) = zip (map (:[]) "ACGT") [ac,cc,gc,tc] writeFrequencies k (T as cs gs ts _ ac cc gc tc _) = concatMap freq "ACGT" where k' = k-1 mapc c = map (\(s,f)->(c:s,f)) freq :: Char -> [(String,Int)] freq 'A' = if ac>0 then mapc ('A') (writeFrequencies k' as) else [] freq 'C' = if cc>0 then mapc ('C') (writeFrequencies k' cs) else [] freq 'G' = if gc>0 then mapc ('G') (writeFrequencies k' gs) else [] freq 'T' = if tc>0 then mapc ('T') (writeFrequencies k' ts) else [] lookupFrequency :: Trie -> String -> Int lookupFrequency _ [] = 0 lookupFrequency Nil _ = 0 lookupFrequency (T _
