Eamonn O'Brien-Strain



date: '2009-08-17 16:45:43'
layout: post
slug: how-a-scala-script-can-specify-what-jars-should-be-added-to-its-classpath
status: publish
title: How a Scala script can specify what JARs should be added to its CLASSPATH
wordpress_id: '425'
categories: Programming


The reason the Scala programming language has such a name is that it is meant to be scalable, i.e. good for both small scripts and large software systems. I am already convinced that Scala is as good as, if not better than, Java for large-scale development, but what about the small scale?

Well, it turns out that on Linux you can make a Scala script executable with chmod +x and a bit of shebang magic, as shown by this complete, runnable script:

#!/bin/sh
exec scala $0 $@
!#
println("Hello World!")

(You do have to make sure that “scala” is in your executable path.)

But what if you are using a scala script to control an existing Scala or Java system? It would be nice to express the CLASSPATH requirements in the script itself rather than depending on the calling environment to be configured correctly. Well, I came up with this convenient idiom to collect all the JAR files in a lib directory and add them to the CLASSPATH:

#!/bin/sh
L=`dirname $0`/../lib
cp=`echo $L/*.jar | sed 's/ /:/g'`
exec scala -classpath $cp $0 $@
!#
import com.my.special.World
val world = new World
world.hello

This assumes that all the required JAR files are in directory ../lib relative to the script file.


I have the misfortune of being both very interested in languages and very bad at learning them.

In preparation for trips to various countries, I have attempted to learn many, but have seldom got beyond politeness and survival basics like “please”, “thank you”, “hello”, “I would like that”, and “two beers please”. Such rudimentary smatterings were useful in traveling around those countries and possibly showed some respect to the locals, even if they were insufficient for carrying on a proper conversation.

Now I am preparing for my first trip to China and so attempting to learn at least a few words of Chinese, because, of all the countries I have been to, China is probably the place where knowledge of the local language will be most useful. Unfortunately, Chinese is also the language that I am finding most difficult to learn, even at a rudimentary level.

Of course Chinese is not inherently more difficult than English. According to linguists, all widely spoken languages are equally complicated. However, that complexity can manifest itself differently in different languages. European languages have all kinds of ways in which the spelling and sound of words change according to their use in sentences. For example, German has so much information encoded in word endings that you can switch around the order of words in a sentence without affecting the meaning much. Chinese words, however, generally sound the same no matter what their context or their role in a sentence; instead, the exact sequence in which the words appear determines their role.

English is somewhere between Chinese and German in that both word ordering and word modifications encode role information. English has a fixed basic pattern of subject-verb-object for declarative sentences but has some flexibility in placement of some words such as adverbs, which in contrast have a fixed position in Chinese sentences. English verbs do change their endings depending on tense and person (“I walk”, “he walks”, “he walked”) and its nouns and adjectives change endings when used in plural or possessive contexts (“dog, dogs, dog’s”). However, it is missing many of the other word ending complexities of gender and case that appear in many other European languages.

It was a welcome relief in looking at Chinese to discover that I only have to learn one form of every word. For someone who has previously tackled the grammar of European languages, it almost seems like Chinese has no grammar. For example, all the forms of the verb “to be”, such as “am”, “is”, and “are”, are translated by the single word 是. However, as you can see from this example, there are some particular challenges for an English speaker. The first and most obvious one is that written Chinese does not use a phonetic alphabet. It turns out that this word 是 in the Mandarin dialect has a sound something like “shi!” said with a falling tone.

And that tone is important, which raises the second challenge for English speakers. In European languages, we use tone to indicate that a sentence is a question (rising tone) or an exclamation (falling tone). In Chinese, these tones are part of the meaning of each individual word itself. So, for example, if I say the same sound “shi?” with a rising tone, then it means “ten” and has a different character, 十.

You might wonder why the Chinese do not just switch to a phonetic alphabet. Well, there is one such phonetic alphabet used in mainland China called pinyin, in which 是 is rendered as shì and 十 as shí (note the different tone marks on the vowel). However, this brings up a third challenge of Chinese, which is that, even considering the variations due to tones, there is actually quite a limited range of sound combinations allowed in the language. This means that many words that sound the same, even with the same tone, have different meanings. For example, an online dictionary shows 112 meanings for falling-tone shì – each of these different meanings having a different character. Of course, just as in English, when you have homophones like this you can usually figure out the meaning by context. However, because there are so many more homophones in Chinese, a lot more information is lost when transcribing into phonetic pinyin, possibly rendering some texts unclear and ambiguous, especially those written in a formal terse style. This is probably why pinyin has not replaced Chinese characters but plays a minor role as a teaching aid and as a way to express Chinese words in foreign languages.

Of course, spoken Chinese has all the ambiguities of pinyin because of these homophones. However, in practice, people add extra words when they speak to remove the ambiguities – for example using a combination of two different words for the same concept. In theory, this more verbose speaking style could be used in written Chinese so that it would be clear when written in phonetic pinyin, but it seems the Chinese people like the more terse conventional style of writing.

Written Chinese also has another advantage: all the Chinese dialects share the same written form even though their spoken forms are sometimes different enough to be like different languages, and in fact non-Chinese languages like Japanese and Korean also use Chinese characters as part of their writing systems. This mutual comprehensibility in writing despite mutual incomprehensibility in speech is only possible because of the non-phonetic nature of the writing system. It also helps bridge the gap to historical writings: modern Chinese readers can understand old texts even though the spoken language has changed a lot.

So, there are a lot of challenges in learning Chinese for an English speaker. As well as the tones, the homophones, and the non-phonetic writing system, there is an almost complete lack of cognates – the similar-sounding words that mean the same in related languages – that are such a help for English speakers learning another European language. There is also the difficulty that some of the basic consonant and vowel sounds in Chinese are different from, or sometimes have no equivalent among, the sounds of English.

I did discover one small advantage I have as someone who learned the Irish language in school when growing up in Ireland. It turns out that Chinese shares a grammatical feature with Irish in not, strictly speaking, having words for “yes” and “no”. In both Irish and Chinese when responding to a question you answer with the verb in the question or its negation, rather than with “yes” or “no”.

Well, let’s see how it goes when I visit China. Luckily, all my business colleagues there speak excellent English. But outside the office will I be able to use the few score of Chinese words I have learned to effect some basic communication? Will anyone understand my mangled pronunciation? Will I be able to understand any of the replies?


After being given advance hands-on access to Wolfram Alpha, I did some testing, and my conclusions about the current state of this tool are:

  1. It would be a good educational tool, especially at the secondary/high-school level and perhaps in some areas at the college level.

  2. It is an entertaining exploration tool for those with an interest in the sciences.

  3. It is not a replacement for Google because it returns nothing useful from most typical Google queries.

  4. In areas of the sciences where it should be strong, its coverage is pretty spotty and once you get past the basics most specialized data seems to be missing.

  5. The single-box Google-style interface works only for very simple queries. Its natural language understanding quickly fails for any kind of complex query, and it also fails to recognize many queries written like mathematical formulas.

  6. It is not easy to drill down to find out more information about a topic. I was expecting that the results returned would contain links to more detailed information, but that was mostly not the case.

  7. It is also not easy to find information about related topics. The results returned by a query do not have links to allow you to explore “sideways” into related areas.

I wish Wolfram well in developing this tool into something that is truly useful, but I fear that the expectations for it are so high that when it is finally released the disappointed reaction to it could be very damaging to its reputation.

More details of my explorations follow.

Stephen Wolfram has described Wolfram Alpha as the third big endeavor of his life, after the Mathematica symbolic computation system and the New Kind of Science book. I have been a big fan of the first two and so I was thinking that his latest work could possibly be a game-changer for the Internet.

As a computer scientist, I have had Mathematica in my tool chest for several years. Every so often, I encounter problems that require some hairy algebra that Mathematica can perform much more accurately and quickly than I could with pen and paper. Or I sometimes prototype some algorithm using Mathematica’s extensive library of mathematical operations. Or I use its graphing and visualization to get insight into my problem. Or I use its symbolic manipulation to simplify formulae that I then use in my own code.

When Wolfram published the New Kind of Science, I devoured it. I read the huge tome from cover to cover, including the extensive footnotes. He built up a persuasive argument, sparing no detail, from chaos-generating cellular automata, to general theories of computability, to hints of how the physical world might be best described as computing the laws of physics. It was an amazing intellectual achievement, though it remains to be seen if it is indeed the great breakthrough that Wolfram considers it to be.

So, along with many people, I was intrigued by the build-up to the release of Wolfram Alpha, a web application with natural language processing backed by a large corpus of “curated” data and a large grid of computers running Mathematica code. Some of the hype on the Internet was calling this a Google-killer. I thought that it could realize the vision of the Semantic Web as championed by Tim Berners-Lee, but done in a proprietary closed way rather than as an open network of interacting semantic web services.

I was very happy then to get access to a preview version of Alpha, ten days before the general release.

The first thing I tried to test was whether it was indeed a “Google killer”. Like many people, I use Google a lot, both as part of my work and for personal use. I went back through my Google search history over the last few weeks and tried some of the queries I had made to Google. I tried searching for a particular play, “The Beauty Queen of Leenane”, but apparently the current set of Alpha data does not include plays, even ones that have been on Broadway, and I got a “Wolfram|Alpha isn't sure what to do with your input” message with which I was about to become very familiar. I tried searching for the playwright “Martin McDonagh” and got an unhelpful comparison between the surname “Martin” and the erroneously-corrected surname “Mcdonald”. I tried some more queries such as “java deprecated annotation argument”, which I had used successfully on Google to figure out some detailed usage information about a highly technical aspect of programming in the Java programming language. However, Alpha was of no value. In fact, as far as I can tell it knows nothing about any programming languages, except presumably Mathematica itself.

And so I slogged on through my Google query history, but everything I tried returned nothing from Alpha. Eventually, going through weeks of Google history I found one query that it got correct: “speed of sound”. (I had been trying to estimate how far my house was from the Golden Gate Bridge fog horns, given it takes the sound 10 seconds to get here.) And, when I typed in “speed of sound * 10 seconds” I got the correct answer in both Google and Alpha.
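That fog-horn estimate is easy to check by hand. Here is a sketch in Scala script style; the 343 m/s figure for the speed of sound in air is my assumed value, not necessarily the one Alpha or Google uses:

```scala
// Distance covered by sound during a 10-second delay.
val speedOfSound = 343.0 // m/s in air at about 20 °C (assumed value)
val delaySeconds = 10.0
val distanceKm = speedOfSound * delaySeconds / 1000.0
println(f"$distanceKm%.2f km") // about 3.4 km
```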

I was disappointed that almost none of my Google queries returned anything useful. At one point last week I had been trying to find a formula to convert colours from CMYK (as used by printers) to RGB (as used by displays), and through various Google queries I discovered that this was a very complex issue because the conversion depended on printer characteristics, but I did find an approximate formula that was good enough for my purposes. None of these queries returned anything in Alpha, despite the fact that this kind of conversion would be the kind of thing it would do well.
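For reference, the simplest version of such a conversion looks like the following sketch. This naive formula is my illustration, not necessarily the one I found then, and it ignores the printer-dependent effects mentioned above:

```scala
// Naive CMYK -> RGB approximation; real conversions depend on device profiles.
def cmykToRgb(c: Double, m: Double, y: Double, k: Double): (Int, Int, Int) = {
  def channel(x: Double) = math.round(255 * (1 - x) * (1 - k)).toInt
  (channel(c), channel(m), channel(y))
}

println(cmykToRgb(0.0, 1.0, 1.0, 0.0)) // (255,0,0): pure red
```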

So Wolfram Alpha is definitely not a Google killer. Its corpus of information is undoubtedly of much higher quality than the Internet information available to Google. However, despite comprising a lot of data, that corpus is still a tiny fraction of the information on the Internet, so it is unlikely to have information on some random topic that you are interested in.

So, to be fair to Wolfram, I gave up on comparing to Google and instead tried creating queries based on the examples on the Wolfram web site.

And I was able to create some interesting queries. For example, “volume of ocean / amazon average discharge in years” revealed to me that it would take 200,000 years to fill the ocean, if it started empty, at the rate at which water flows from the Amazon river into the sea. However, I really wanted to find the average flow of all rivers into the ocean, but I could not craft a query for that, even though I could find the information for each river individually.
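The arithmetic behind that 200,000-year figure can be sketched with rough reference values; both constants below are my own assumptions, not Alpha's exact data:

```scala
val oceanVolume     = 1.332e18 // m³, approximate total volume of the oceans (assumed)
val amazonDischarge = 2.09e5   // m³/s, approximate average Amazon discharge (assumed)
val secondsPerYear  = 3.156e7
val years = oceanVolume / amazonDischarge / secondsPerYear
println(f"$years%.0f years") // on the order of 200,000 years
```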

Then I tried something fun (at least to me). I had heard the expression “boil the ocean” describing something so overly ambitious as to be effectively impossible. So how much energy would it require to boil the oceans, and how does that compare to, say, the energy output of the Sun? I tried the query “latent heat of vaporization of water * volume of ocean * density of water”, which I hoped would give me an answer in units of energy (joules), but it insisted on expressing the latent heat of vaporization as the molar heat of vaporization, which meant I ended up with a value in the wrong units (gram-kilojoules/mole). Finally, after some playing around, I got the correct result of 3.124x10^24 kJ (kilojoules) using the query “latent heat of vaporization of water * volume of ocean * density of water / molecular weight of water”. However, I could not find the power output of the Sun, or even the solar flux on the Earth, using Wolfram Alpha – even in this very scientific area its corpus of data seemed lacking. So I cheated: using Google I was easily able to find that the power output of the Sun is about 4x10^26 Watts, so I tried dividing the above query by this value, but I could not get it to work no matter what kinds of parentheses I used. For example, “latent heat of vaporization of water * volume of ocean * density of water / ((molecular weight of water) * (4x10^26 W))” would not parse correctly, nor would “((latent heat of vaporization of water) * (volume of ocean) * (density of water)) / ((molecular weight of water) * (4x10^26 W))”. Finally I just copied and pasted values to enter the query “3.124x10^24 kJ / 4x10^26 W” and discovered that if the entire energy output of the Sun were applied to boiling the Earth’s oceans it would take 7.81 seconds to do so.
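That final division is trivial once the numbers are in hand; reproducing it with the two figures quoted above:

```scala
// Both inputs are the values quoted in the text, not re-derived here.
val boilEnergyJoules = 3.124e24 * 1000.0 // 3.124x10^24 kJ converted to J
val sunPowerWatts    = 4e26              // approximate total solar output, via Google
val seconds = boilEnergyJoules / sunPowerWatts
println(f"$seconds%.2f s") // 7.81 s
```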

So, in its domain of strength the Alpha system can produce some useful results, but I found that it was not easy to “drill down” into the results and combine them in interesting ways without a lot of manual cut-and-paste.

Could Alpha be developed to overcome all these limitations?

Perhaps the querying interface could be improved. Maybe instead of just having a single box for typing free-form “natural language” there could be an advanced query mode that allows more complex structured queries.

Perhaps the user interface could be improved to allow more exploration from the results of a query to explore related concepts. For example from the results of my query about the Amazon there could be links to related queries for Brazil, or for other major world rivers.

Perhaps more curated data could be added to fill in the gaps and increase the depth of the knowledge base. However, this seems like an economically hopeless task given the current Wolfram model of using their experts to validate all data. Even if they increased their effort ten-fold, they could still only manage to curate a small fraction of all data. The only systems that seem to work are those like Wikipedia or Google, which rely on the open cooperative accumulation of large numbers of Internet users – but that does not seem to fit into the envisioned Wolfram business model.


So far, there have been two main approaches to the problem of organizing the world’s information: (1) throw machine-learning techniques at vast quantities of unstructured information, as Google does or (2) create complex networks of interrelated ontologies and apply inference techniques, as the Semantic Web community does.

As a fan of both the tool Mathematica and the book New Kind of Science, I am extremely interested in seeing how the upcoming Wolfram Alpha tool is going to approach the problem.


While scanning through The IIIP Innovation Confidence Index 2008 Report, published today by the Institute for Innovation & Information Productivity, I noticed one surprising finding that is illustrated in Figure 6 of the report.

Figure 6 Relationship between national community values and Innovation Confidence for 22 nations

They found that people in countries whose values are more “traditional” are more open to innovation than people from countries whose values are more “rational” and “secular”.

(Openness to innovation was measured by people's responses to questions on whether they would buy products or services that are new to the market or that use new technologies, and whether they expect those products and services to improve their lives.)

This finding is so surprising to me that I wonder whether there is some independent confirmation from another study.


One thing that confused me when I learned science in high school was the connection between the spectrum of colors as seen in a rainbow, and the three primary colors. In case anyone else is confused here is my simple explanation.

White light, as you probably know, is composed of a mixture of colors, as seen in a rainbow or when light refracts through a glass prism. There is a continuous range of these spectral colors, corresponding to a range of wavelengths of light from 0.00038 millimeters to 0.00075 millimeters. When white light falls on some surface and bounces off, the surface reflects different wavelengths by different amounts, and we perceive the net result as some color. Therefore, to fully characterize such a color you would have to measure how much each wavelength is reflected, which, depending on how accurate you want to be, would take many numbers.

So how do we go from characterizing colors by many numbers to characterizing them by just three? It turns out that the primary colors are not some property of light. Rather, they are a result of how our eyes work. We have three types of color-sensing cells in our eyes, each of which responds to a range of wavelengths in, respectively, the red, green, and blue areas of the spectrum. Our brains then combine these three basic signals to form our perception of colors.

This is clearest in how red mixes with green to form the color that is in between them on the spectrum, namely yellow.

Similarly green and blue form cyan, though it is a little hard to tell that it is in between the two primaries on the spectrum.

The really odd one is what you get when you mix blue and red. These colors are not next to each other, so the result magenta is not a color that is on the spectrum.
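These three pairwise mixes can be sketched as additive combination of full-intensity RGB triples. The per-channel max below is my illustrative simplification, not a model of how the cone signals actually combine:

```scala
type RGB = (Int, Int, Int)

// Additive mix of two full-intensity primaries, channel by channel.
def mix(a: RGB, b: RGB): RGB =
  (a._1 max b._1, a._2 max b._2, a._3 max b._3)

val red   = (255, 0, 0)
val green = (0, 255, 0)
val blue  = (0, 0, 255)

println(mix(red, green))  // (255,255,0): yellow
println(mix(green, blue)) // (0,255,255): cyan
println(mix(blue, red))   // (255,0,255): magenta
```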

Note that the human primary color model would not work for many other animals. While most primates, like us, have three types of color cells in their eyes, most other mammals have only two types, while many birds, reptiles, and fish have four (or maybe more) types of color cells.


Did the aftermath of 9/11 attacks result in fewer people being imprisoned for public order offenses?

I happened across an interesting graph on a page on the US Department of Justice web site. It is of course interesting for many reasons that are already well known, such as the huge increase in prison population and the particularly large percentage increase in people imprisoned for drug offenses in the late 80's and early 90's.

But there was one unexpected thing I noticed: a precipitous decline from 2001 to 2002 in the number of prisoners imprisoned for public order offenses, which include “weapons, drunk driving, escape/flight to avoid prosecution, court offenses, obstruction, commercialized vice, morals and decency charges, liquor law violations”. This is more apparent in the graph showing the year-over-year change:

So what happened between 2001 and 2002?

My first thought was that it was something to do with the 9/11 attacks and their aftermath. Perhaps the surge of patriotism caused people to be better behaved, resulting in fewer public order offenses. Or perhaps the increased military recruiting had swept up people who committed, or would have committed, public order offenses.

But then I showed the first graph to my wife, and she immediately proposed that the first thing to be suspicious of was whether there had been a change in the way offenses were classified. She sees that kind of thing all the time in the data she works with.

Eventually I found a Q&A on Ask MetaFilter discussing this issue. Several of the postings there also proposed that the classification had changed, and indeed the third graph below shows that the count of people in prison for other offenses had a significant increase in that same year.

So, it seems likely that this is not such an interesting phenomenon after all, just a statistical artifact.


I am temporarily putting on a marketing hat and creating a product requirement document (PRD). My first step was to create a template based on a skeleton in a Wikipedia article, together with some valuable details from my colleague Chris. I attempted to make it a bit more “agile” by using “user stories” instead of “features” for the functional specification.

As I fill in this template, I suspect I will need to make some changes. But here is what I have so far.

[product name] Product Requirements

This document describes the requirements of [...] without regard to implementation.

Purpose and scope

[what is the problem we are solving?]

[product concept]

Technical

Business Objective

[how much money/profit will we make]

[how much resources will this take]

[how does this fit into strategic roadmap/company goals]

Stakeholder identification

[partner interactions]

Market assessment and target demographics

[market and/or product problem]

[target market]

[Internationalization requirements]

[user profile][customer profile]

[branding]

[what is competition doing in this space]

[how can we differentiate/compete ... what is the future] [is it “me too”]

[Actual text to be used in marketing]

[How to sell ... marketing plan]

Product overview and user scenarios

[how the user is going to interact and why]

Assumptions and External Dependencies

Requirements

functional requirements

[what product should do]

[prioritized user stories ... best bang for the buck ... difficulty vs. quality vs. time to implement]

ID  | Priority | User Story (Card) | Conversation     | Confirmation
FOO | MUST     | ...               | FooConversation? | FooTest?
BAR | SHOULD   | ...               | BarConversation? | BarTest?
BAZ | MAY      | ...               | BazConversation? | BazTest?

usability requirements

[Actual help text to be provided]

technical requirements

[technology problem] [what technology do we require]

[what technology is best in class to solve the problem — how to measure — is it best for our user in ease of use, time of execution, repeatability stability, quality]

[security] [privacy]

[network — HTTP, HTTPS?, SMTP?]

[platform] [coding language]

[integration] [related features/site interactions]

[client — which versions of what browsers? screen size? mobile?]

[future upgrades] [extensibility]

environmental requirements

support requirements

[customer service]

interaction requirements

how the software should work with other systems

[legal impact]

[actual text of terms of service]

[actual text of privacy policy]

Constraints

Workflow plans, timelines and milestones

[Steps/phases and rollout plan]

[undo plan]

[operations impact]

[financial impact]

Evaluation plan and performance metrics

[Define the goal. How will this be measured? What is success?]

[performance]

Revision History

Date Revision Change By Whom

... 1 Template copied Eamonn



date: '2009-01-05 23:05:38'
layout: post
slug: spash-iceberg-calving-off-glacier
status: publish
ref: http://www.flickr.com/photos/eob/3142879795/
title: Splash! — Iceberg calving off glacier
wordpress_id: '343'
categories: Travel
image: http://farm4.static.flickr.com/3201/3142879795e64f32dcbd_m.jpg
image-text: Flickr photo


Splash! — Iceberg calving off glacier, originally uploaded by Tolka Rover.

The splash of a new iceberg falling off a glacier in Glacier Bay, Alaska.



date: '2009-01-02 23:18:02'
layout: post
slug: where-the-freeway-tentacles-have-withdrawn-from-san-francisco
status: publish
ref: http://www.flickr.com/photos/eob/3161098473/
title: Where the freeway tentacles have withdrawn from San Francisco
wordpress_id: '340'
categories: SF
image: http://farm4.static.flickr.com/3111/3161098473059de6341f_m.jpg
image-text: Flickr photo


Where the freeway tentacles have withdrawn from San Francisco, originally uploaded by Tolka Rover.

San Francisco in 1991, showing the Embarcadero Skyway and several other stretches of elevated freeway that have since been torn down.

I love maps, and tend to be a bit of a pack rat. But today I decided that in this age of maps on phones and GPS satnav devices there was no longer any justification for hanging on to the maps I previously kept in my car for navigation. So I tossed them all into the recycling bin.

But, before doing so I noticed this map from 1991, before I lived in San Francisco. It is interesting to see what has changed. In particular I noticed how there were three separate stretches of elevated freeway that are now torn down. In all three cases there are nice wide boulevards that are much more appealing for walking around than the shadowed underside of a looming highway structure.

Original map is © 1991 Thomas Bros Maps.