TODO: Peachies

I am planning to update the code for my “peachies” newspaper mining script so that it uses only the highest-circulation newspapers listed on this page:

http://en.wikipedia.org/wiki/List_of_newspapers_in_the_world_by_circulation

I am only going to host the code (now on GitHub as jcrow), not a full-fledged web app as I did previously: hosting is expensive, and I have been hacked a couple of times on privately served web hosting services and lost data. I am planning to use the Google Translate API in my code for the first time in order to mine newspapers in foreign languages. Also, I am initially only seeking term frequency across the lexicon, and I will post my word counts here on this WordPress site. If successful, I may try the natural language processing of:

www.nltk.org

for the first time, in order to do the same with proper names. Also, I am using the Python library:

http://www.crummy.com/software/BeautifulSoup/

to strip HTML tags and get each page's lang attribute.
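Here is a rough sketch of the strip-and-count step described above, not the actual peachies code. To keep it runnable without extra installs, it uses Python's built-in html.parser as a stand-in for BeautifulSoup. The names PageText and term_frequencies and the sample page are made up for illustration, and reading lang from the html tag is my assumption about where the attribute lives.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect visible text and the lang attribute of the <html> tag."""

    SKIP = {"script", "style"}  # contents of these tags are not article text

    def __init__(self):
        super().__init__()
        self.lang = None       # stays None if the page declares no language
        self.chunks = []       # pieces of text found between tags
        self._skip_depth = 0   # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            # Assumption: the page language is declared on the <html> tag.
            self.lang = dict(attrs).get("lang")
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def term_frequencies(html):
    """Return (lang, Counter of lowercased words) for one HTML page."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Runs of letters only; digits, punctuation and underscores split words.
    words = re.findall(r"[^\W\d_]+", text)
    return parser.lang, Counter(words)


# Hypothetical sample page, not taken from any real newspaper.
sample = (
    '<html lang="en"><head><style>p {color: red}</style></head>'
    "<body><h1>Peach harvest</h1>"
    "<p>The peach crop is up; peach prices are down.</p></body></html>"
)
lang, counts = term_frequencies(sample)
print(lang, counts["peach"])
```

The lang value is what would decide whether a page needs to go through translation before counting. Splitting on runs of letters is a simplification that only suits languages that separate words with spaces.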
