Leiden Leechbook and other glosses – Breton, Cornish or SW British?

There was a debate yesterday in the Celtic Linguistics group on Facebook about whether the ninth-century Leiden Leechbook, long considered Old Breton, can be considered Breton at all, rather than South West Brythonic, or perhaps Breton only in a dialectal sense. It all comes down to whether one is a lumper or a splitter, as I noted, given that we know too little about just how dissimilar these languages were in the tenth century and earlier (e.g. in their morphology and syntax), apart from a few potentially minor phonetic differences that we could as easily ascribe to dialect as to language. There is famously no scientifically agreed distinction between a dialect and a language, so all such terms, at their boundaries at least, are a matter of academic convenience.

I also made the following minor emendation to the text in this Facebook comment:

I note that hobæbl is probably an error for lobæbl, which would then be another word with lob, lub (characteristic for this text) and mean Sambucus Ebulus (Dwarf Elder), fitting nicely with Stokes’ guess. Perhaps Falileyev & Owen already commented on this, as I don’t have access to a copy right now, but the text seems to have at least three different etymons related to the elder, e.g. hobæbl-lobæbl, scau, trom, in which perhaps the glossator had an interest…? The only thing that immediately strikes my eye as having a potentially Breton flavour is <e> (mostly) for inherited short /i/, but that could be just an orthographical matter and not necessarily diagnostic on its own so early anyway…(?)

Since it appears to be out of print, I have not been able to get hold of a copy of Falileyev A., Owen M. E. The Leiden Leechbook: A Study of the Earliest Neo-Brittonic Medical Compilation. Innsbruck: Institut für Sprachen und Literaturen der Universität; 2005 (ISBN: 3851242157).

Since I specialize in Cornish as well as in Brythonic historical linguistics, I would be fascinated to see if others have found themselves more able to ascribe the text specifically to Breton, as Whitley Stokes obviously was in his edition. However, I don’t think that many academics today would be prepared to do so purely on paleographical and orthographical grounds, as was formerly more acceptable, since that risks creating an artificial distinction between language polities that may or may not have existed at the date in question. Just as colloquial Hindustani is separated into colloquial Urdu and colloquial Hindi by digraphia and religio-political identities more than by diglossia, we should not jump to the conclusion that Breton and Cornish were separate languages purely on the basis that one is written in a style influenced more by Frankish scriptorial traditions and the other by Anglo-Saxon.

Equally, who is to say that the 9th-century phrase ud rocashaas [1] should not be considered South West Brythonic (and thus equally “Breton” as “Cornish”, or even “Dumnonian”, “Somerset Brythonic”, “Dorset Brythonic” and so on)? We don’t know whether or not the glossator had even been to what was later called Cornwall. Given that much more of the South West of Britain was probably speaking Brythonic at this date than merely Cornwall, is it not anachronistic, even if we separate Breton from this language, to call it “Old Cornish”? I would go so far as to say that this could even hold true for the 12th-century Vocabularium Cornicum, [2] especially since one of the main diagnostic features of Cornish at that date, assibilation, is a phonetically trivial feature that could have been merely dialectal. One might reasonably compare /kw/ > /p/ in certain varieties of Celtic, which is no longer seen as a diagnostic marker of linguistic relatedness as it was in former scholarship.

This all goes to show that language classification is extremely challenging at the margins of our knowledge, and that ultimately such boundaries of convenience may come down to perspective in the absence of better morphological or syntactic data.

[1] Sims-Williams, Patrick ‘A New Brittonic Gloss on Boethius: ud rocashaas’, Cambrian Medieval Celtic Studies 50 (Geurey 2005), 77-86.
[2] Mills, Jon, The Vocabularium Cornicum: a Cornish vocabulary? http://ora.ox.ac.uk/objects/uuid%3A479f80db-d8f3-4a5e-ae64-06f8cf9b65d1


Installing Greenstone 2.85 & 3.05

Yesterday I installed Greenstone 2 (v2.85) and Greenstone 3 (v3.05) on my server. At the time of writing, the test instance of Greenstone 2 is still available here. Just in case I manage to solve the security problems with my test instance of Greenstone 3 and put it back on line, here is the link where it may (or may not) be found.

I notice, first of all, that development of Greenstone 3 appears to have ceased last year, unless things have gone only temporarily quiet. I have spoken about it in the past to George Buchanan, who was responsible for migrating it from the old C++ code of version 2 to the Java code of version 3. It is quite strange, given the popularity of Greenstone up until recently, that there is no evidence of more recent interest. It is also quite hard to find documentation and good how-tos on the Web.

It was relatively easy to install Greenstone 2. The main complication was working out how to make it cooperate with a pre-existing Apache 2 installation, because (1) although this was anticipated to some extent, the instructions assumed that Apache would be installed alongside Greenstone, which was impossible for me since another instance was already installed on the server and listening on the standard ports; and (2) the instructions were evidently written only for RedHat/Fedora, where Apache is packaged as httpd rather than as apache2 as on Debian (in my case Ubuntu). This could be confusing for someone with less experience of Apache than I have.

When it came to installing Greenstone 3 (admittedly this is advertised explicitly as not being a stable version), there were several problems. It does not seem simple, despite what is expected and promised, to change the location of the collections to the one used by the previous version of Greenstone, and the test collections provided are empty; none of this was reflected on the library front end, nor was it possible to make new collections appear. It did, however, look like a more modern web application, both in terms of the web interface and much of what is under the hood.

I was trying to use a reverse proxy on Apache, first with mod_proxy and later with mod_proxy_ajp, in order to serve the site running on Tomcat via a different port, since I obviously cannot have both Apache and Tomcat listening on port 80. The reason I was unable to leave it online is that the application is structured in such a way that I could not point the URL on my server at {$GSDLHOME}/greenstone3/library without losing access to some of the files it requires, and thus to the CSS and images. This meant that my URL had to point at the root of my Tomcat installation, which I do not wish to expose publicly for security reasons. My server does not see enough demand to justify leaving Nginx running as a reverse proxy in front of Apache (it being much faster for static content), so I did not test whether that would have avoided the problem, but in principle the same issue would presumably have arisen.
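For anyone trying the same thing, the Apache side of the reverse proxy is simple enough in itself. This is only a sketch of the sort of configuration I was experimenting with, not a working recipe: enable the modules with sudo a2enmod proxy proxy_http (or proxy_ajp), then add directives along these lines to the existing port-80 virtual host, adjusting the context path and the Tomcat port to your own installation:

ProxyRequests Off
ProxyPass        /greenstone3 http://localhost:8080/greenstone3
ProxyPassReverse /greenstone3 http://localhost:8080/greenstone3

The difficulty described above was not with the proxying itself, but with the fact that the application did not behave properly unless it was served from the root of the servlet container.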

My personal opinion of Greenstone is that it is very much designed as a top-down, library-oriented system where the user does not interact with the site apart from consulting it like any other catalogue. However, unlike Integrated Library Systems (ILS), it does not allow for easy cataloguing of physical resources such as books, at least as far as I have been able to tell so far, and, as a result, seems to have nothing in place for item locations and borrowing. In short, it seems to have little to recommend it over a repository system such as DSpace or EPrints on the one hand, or an ILS on the other. I hope that my fairly superficial analysis has simply failed to see the real extent of its functionality, but my initial experience of Greenstone was not as immediately positive as I had hoped and expected from what I had previously heard.


Installing CKAN on Ubuntu 13.04 with Tomcat7

I recently installed a test instance of CKAN on my server, which you can find here. However, I had to do this a little differently from the default installation instructions (Ed.: page now from Wayback Machine) that you can find on their site. Firstly, because they are for Ubuntu 12.04 64 bit server and because I have recently upgraded mine to 13.04 (Ed.: upgraded to 13.10, March 2014), I had to install from source (Ed.: page now from Wayback Machine). Then, because I have been using Tomcat7 for some time, and because Jetty on Ubuntu 13.04 still has some dependencies on Tomcat6 (Ed.: still true in Ubuntu 13.10?), I was unable to install Jetty. So I asked a friend who is a Java developer. His advice was that I didn’t need Jetty anyway and could just use Tomcat. He was right, of course. Why do I need yet another HTTP server and servlet container running anyway?

The following instructions are not a complete walk-through, but are intended to show where I departed from the instructions for installing from source, in the above link, and to clarify the things that I found unclear and that took me a long time to figure out.

Here is what I did (omitting Jetty):

sudo apt-get install python-dev postgresql libpq-dev python-pip python-virtualenv git-core openjdk-6-jdk

I shall pretend, for the sake of people reading this, that I didn’t already have many of those installed, so I am leaving everything that you will need in these instructions.

You do not need to add the following to that list unless you can and want to use Jetty, which I couldn’t, for the reasons given above:

solr-jetty

However, that means you need to download and install Solr separately. You will need to have previously installed Java and Tomcat7. There are various instructions on the Web for doing those things, so I won’t repeat the whole process here. One thing to note, though, is that you may see some errors in the Solr logs and in the logging interface. Never fear: Solr is not broken! They will look similar to this one:

WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /var/lib/solr/collection1/../../../contrib/velocity/lib).

It turns out that these are just some lines in the default /etc/solr/conf/solrconfig.xml that ought to have been commented out. So do that, if you are concerned:

<!-- A 'dir' option by itself adds any files found in the directory
to the classpath, this is useful for including all jars in a
directory.

When a 'regex' is specified in addition to a 'dir', only the
files in that directory which completely match the regex
(anchored on both ends) will be included.

If a 'dir' option (with or without a regex) is used and nothing
is found that matches, a warning will be logged.

The examples below can be used to load some solr-contribs along
with their external dependencies.
-->
<!--lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />

<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />

<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />

<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" /-->

OK, so apart from having some of these old errors stuck in the logs on the logging page, Solr is working perfectly, although you will still need to follow the instructions on how to modify it for CKAN. I did that exactly as directed, so again I do not need to repeat any of it here.
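For anyone who does not have the CKAN page to hand, the modification essentially amounts to swapping Solr’s stock schema for the one shipped in the CKAN source tree, roughly as follows. The schema file name varies between CKAN versions, so treat these paths as assumptions to check against your own checkout:

sudo mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.bak
sudo ln -s /usr/lib/ckan/default/src/ckan/ckan/config/solr/schema-2.0.xml /etc/solr/conf/schema.xml
sudo service tomcat7 restart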

I personally ignored the TIP section in (2) Install CKAN into a Python virtual environment because it’s unnecessary and I don’t want those symlinks cluttering things up in my home folder. It then tells you this:

sudo mkdir -p /usr/lib/ckan/default
sudo chown `whoami` /usr/lib/ckan/default
virtualenv --no-site-packages /usr/lib/ckan/default
. /usr/lib/ckan/default/bin/activate

This was confusing, and the chown command failed for me. Type the whoami command separately and see what it does: it simply prints your own username. I didn’t want to run CKAN as my own personal user, so you may want to consider creating a dedicated user, or perhaps running everything as the user tomcat7. I’m not sure what is best here, but my installation works. So replace whoami (and the backticks around it) with whatever user works for you, as in the sketch below.
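If, like me, you would rather not run CKAN as your own login, one option is a dedicated system user. This is only a sketch of what I mean, with a made-up user name, so adapt it to taste:

sudo useradd -r -m -d /usr/lib/ckan ckan
sudo mkdir -p /usr/lib/ckan/default
sudo chown ckan /usr/lib/ckan/default
sudo -u ckan virtualenv --no-site-packages /usr/lib/ckan/default
. /usr/lib/ckan/default/bin/activate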

Continue with the instructions. I found that everything worked until I got to the section on Solr. Here, because I was not using Jetty, it was hard to know what to do. Although it says “The following instructions deploy Solr on the Jetty server, but CKAN does not require it, you can use Tomcat if that is more convenient on your distribution”, in actual fact there are no instructions on what to do without Jetty. You will also notice that on Ubuntu <= 13.04 tomcat6 is a required dependency of Jetty anyway, so if you are using Jetty you already have Tomcat installed by now in any case! Again, why not just use Tomcat? I think the CKAN people could provide clearer instructions for Tomcat. Could someone explain why you need another HTTP server and servlet container when you must already have Tomcat installed anyway? What in particular is special about Jetty that makes CKAN work better with it?

Anyway, follow some other instructions for Solr on Tomcat that you find on the Web, as I did. But don’t panic about this section: you can ignore everything it says about Jetty. Do remember for later, though, that if Tomcat is already serving other things on port 8080 and you don’t want to move it to 8983 just for Solr, you don’t have to; you will simply need to change the port in the Solr URL in the CKAN config to 8080, or else the two will obviously fail to talk to each other as expected. I don’t see why we should need to use the port that Jetty is expecting, so this could be made clearer if there were a specific Tomcat guide.
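To give a rough idea of what those other instructions boil down to, here is a sketch of the sort of thing involved. The version number and paths are illustrative only (they were plausible at the time and will almost certainly differ on your system), and Solr 4.x on Tomcat also needs a couple of logging jars copied into Tomcat’s lib directory, which I am glossing over here:

# deploy the Solr web application into Tomcat (paths and version are examples only)
sudo cp solr-4.3.1/dist/solr-4.3.1.war /var/lib/tomcat7/webapps/solr.war
sudo cp -r solr-4.3.1/example/solr /var/lib/solr
sudo chown -R tomcat7:tomcat7 /var/lib/solr
# tell Tomcat where the Solr home lives, e.g. in /etc/default/tomcat7:
#   JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/var/lib/solr"
sudo service tomcat7 restart
# and in the CKAN .ini, point solr_url at Tomcat's port rather than Jetty's 8983:
#   solr_url = http://127.0.0.1:8080/solr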

Just to add to this, when I mistakenly thought that my Tomcat/Solr installation was broken, I tried to use the multiple cores instructions. This did break my Solr installation and consequently CKAN as well. I couldn’t get this to work at all because the XML given doesn’t look anything like the <solrcloud>…</solrcloud> section in the default Solr config. If you simply replace it, the whole thing will break. Anyway, here is another section where the CKAN instructions need to be much clearer, whether you are using Jetty or Tomcat. If anyone knows what to do to make multiple cores work, please feel free to add a comment to this post. What I did learn (the hard way) was that Solr was not broken!

You should set up the DataStore. By and large, these instructions do work. However, there is a very confusing part that breaks part of CKAN if you get it wrong: if you do, you will notice that you cannot go to Explore > Preview when looking at a dataset, and it will give you a server error. You must get the permissions set correctly. I found that the first method, setting the permissions from the virtual environment, simply wouldn’t work, so I could not use it; I don’t know why. For the second method, I could not even find datastore_setup.py, and there is no indication in the instructions of where it actually is. It really does seem to be completely missing…

So I gave up hunting through folders and had to use the third method, using SQL instructions. This, in turn, was confusing because it was unclear who the users had to be (probably also a problem if you use the second method). Again, there is no indication in the instructions of where to find set_permissions.sql. Fortunately, I was able to find this one. If you are using the recommended /usr/lib/ as the base folder, it will be at /usr/lib/ckan/default/src/ckan/ckanext/datastore/bin (you may want to substitute /opt/ or wherever you are choosing to put it, but I’m not an expert on recommended *nix file system locations). Copy this file somewhere before you edit it.
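If your copy has ended up somewhere else, it is quicker to let the system find it than to hunt through folders by hand:

find /usr/lib/ckan -name set_permissions.sql
cp /usr/lib/ckan/default/src/ckan/ckanext/datastore/bin/set_permissions.sql .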

You have to edit the relevant part of the set_permissions.sql file yourself:

-- name of the main CKAN database
\set maindb "ckan_default"
-- the name of the datastore database
\set datastoredb "datastore_default"
-- username of the ckan postgres user
\set ckanuser "ckan_default"
-- username of the datastore user that can write
\set wuser "ckan_default"
-- username of the datastore user who has only read permissions
\set rouser "datastore_default"

You will notice that I use the default usernames given in the original instructions, for clarity. Although it’s made clear who the read-only user should be, it was not altogether clear who the write user should be, so I kept the default CKAN user for this, and it works fine. I hope that was the right thing to do!

Unfortunately the next instruction is also very confusing if you aren’t familiar with PostgreSQL: up until now I’ve used MySQL, which has different syntax, so I stupidly managed not to realise that the name of the database in the instructions is wrong. Don’t use the default postgres database! Use this (or whatever the name of your database is) instead:

sudo -u postgres psql datastore_default -f set_permissions.sql

Note that, if you did make this mistake, you’ll need to clean up the permissions that you’ve just granted on your default postgres database. One of them still seems to be stuck…

After this, everything worked. I then went on to the instructions Deploy a Source Install, using Apache2, which worked well. It’s slightly odd to recommend that most people install Postfix, I must say. If, like me, you are working on a home server, consider that running an email server is a massive operation that is very vulnerable to exploitation by spammers unless you really know what you are doing and have a lot of time to invest in it. (Frankly, installing Postfix is a nightmare and, when I did it some years ago, I was never confident enough that it worked properly to open up my firewall and use it for real.) Just use the details of whatever server you use for email, even just GMail as I did. If, on the other hand, you are in a larger institution, you will already have an email server. Use those details. (Use secure email servers!) Unless, that is, you are a god among sysadmins and/or a masochist prepared to inflict Postfix administration on yourself.
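For what it is worth, pointing CKAN at an existing mail server rather than a local Postfix comes down to a handful of settings in the CKAN .ini file. These are the option names as I understand them, so check them against the configuration documentation for your CKAN version, and obviously substitute your own server and credentials:

smtp.server = smtp.gmail.com:587
smtp.starttls = True
smtp.user = myaccount@gmail.com
smtp.password = mypassword
smtp.mail_from = myaccount@gmail.com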

Note that under (5) Create the Apache Config File, there are no instructions for SSL. Duplicate the file in /etc/apache2/sites-available. For instance, mine are ckan.talatchaudhri.com and ckan.talatchaudhri.com-ssl, because using the domain names and appending -ssl to the appropriate entries is a naming convention that will always tell you what is what. Also change the VirtualHost directive to <VirtualHost *:443> as appropriate. There are guides on the Web about how to make SSL work with Apache.

Note that this will fail unless you make one change, because you have duplicated the name of the daemon ckan_default and Apache will fall over:

# Deploy as a daemon (avoids conflicts between CKAN instances).
	WSGIDaemonProcess ckan_defaultSSL display-name=ckan_defaultSSL processes=2 threads=15

Obviously, call the duplicate daemon whatever you like; I just added SSL to the end of the name. Actually, you really ought to consider not serving the password and data submission pages (which include email addresses and other personal details) over plain HTTP on port 80 at all, since these could be sniffed. If you are worried about man-in-the-middle attacks, then perhaps you should consider not having a mixed HTTP/HTTPS site with only certain secure pages being redirected, which according to some people is a security risk; it seems to be fine for lots of major web services like WordPress, though. Anyway, whatever you do, you’ll need to redirect at least those pages for security. If you are using Apache, you’ll be doing that either in the site configs or (slower but convenient) in .htaccess, as sketched below; if you are using Nginx, there is a new and funky way to do it, which you can google yourself. (I’ve played with Nginx as a reverse proxy in front of Apache, but it’s not currently serving my web pages and the load on my server is minimal, so I really only did it for general coolness and a vague concern about Apache being a memory hog.)
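By way of illustration, the redirect for the sensitive pages can be as small as the following in the port-80 virtual host, using mod_rewrite (enable it first with sudo a2enmod rewrite). The paths are the ones a stock CKAN uses for logging in, registering and editing your account, so adjust them if yours differ:

RewriteEngine On
RewriteRule ^/user/(login|register|edit.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]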

That is how I did it, to the best of my recollection. Apologies in advance if I have skipped any steps, but I hope that I have concentrated on the steps that were unclearly described or unexpected, so that anybody reading this will not have to spend two frustrating days setting up CKAN with Solr as I did. Without meaning to be over-critical of the considerable work that OKFN have put into this documentation, it does contain some glaring omissions and, in a few places, gives misleading directions. If they would like to use my comments to add to or improve their documentation, I’d be only too pleased for them to do so. In the meantime, I hope my experience helps somebody.


Email and Skype anti-spam obfuscation with jQuery

I have recently revisited the issue of email address obfuscation to defeat the spam that results when web crawlers harvest addresses. I have also addressed the issue for Skype contact details, about which I have seen considerably less discussion. Techies can jump to the scripts for email and Skype below. Non-techies who can write basic HTML should not be put off by any apparent complexity in the full instructions below: a simple demonstrator is available here on my web site, and it really only needs a few lines of code and some standard JavaScript files that you can download from my site for free (see below).


Is it worth the effort?

Quoting your email address as firstname [dot] surname [at] subdomain [dot] domain [dot] net (with your own details substituted) is a tried and tested method of security through obscurity, but it also defeats the purpose of having user-friendly contact details. It is sometimes argued that this method is no longer worth the effort, but here are some statistics. Busy people are unlikely to copy your details manually unless they really, really need to, and you will probably lose out on all sorts of genuine, useful contacts, not to mention annoying visitors to your site who see messy text whose purpose is unclear to many ordinary, non-technical users of the Web.

Another tried and tested method is using JavaScript to obfuscate your address for web crawlers, which typically won’t analyse JavaScript, while making it perfectly readable for browsers that do. Here, for example, is a method from 2003. It is possible to go further and actually encrypt the email address. However, since it has to be visible to human users and must therefore be decrypted again by the JavaScript, this is really rather pointless.

For those who might argue that this method will break the page for users who turn off JavaScript, I would point out that the argument is basically empty: the numbers are extremely low, and those few individuals who do so are actively choosing to break the functionality of the vast majority of modern web sites that use JavaScript, so presumably they know what they are doing. I have seen estimates that fewer than 1-2% of people turn it off, since doing so makes most web sites nearly unusable, which is why it has been enabled by default in virtually all modern browsers for many years. Here is a recent analysis with some statistics.

On balance, I would suggest that it is still worth using obfuscation, although in reality you will receive some spam and will need to rely on spam filters too. But that doesn’t mean it’s a good idea to compound the problem by advertising your details freely. It’s one thing to let your friends on a social network see them, but quite another to allow web crawlers to do the same without even a basic attempt to slow them down.


Email address obfuscation

My method is adapted from a script that I found which uses the standard jQuery library for JavaScript. However, it didn’t allow for dots in either the user name or the domain part of the address, which made it relatively obvious to machine harvesters that, at the very least, a domain was being obfuscated, and that is a clue that it might be an email address. For a little more security, I modified the script email.js as follows:

function createMailtoLinks(){
    $('a[data-u][href=""]').each(function(){
        var i = $(this);
        //replace # character with . if present in username
        $(this).attr('data-u', $(this).attr('data-u').replace('#', '.'));
        //replace # character with . if subdomain rather than domain before TLD suffix
        $(this).attr('data-d', $(this).attr('data-d').replace('#', '.'));
        //replace # character with . if present in TLD suffix
        $(this).attr('data-t', $(this).attr('data-t').replace('#', '.'));
        i.attr('href', 'mai'+'lto:'+i.data('u')+'@'+i.data('d')+'.'+i.data('t'));
        if (i.html()==''){ i.html(i.data('u')+'@'+i.data('d')+'.'+i.data('t')); }
    });
}
$(function(){
    createMailtoLinks();
});

You need to add the following mark-up to the <head>…</head> section of your web page (it can technically go anywhere in <body>…</body>, but some browsers may not like that, so it’s best to keep things neat and tidy). Note that it is better to keep your JavaScript in external files (as you will need to do for jQuery anyway) rather than writing it in-line, since in-line scripts may otherwise invalidate your mark-up, for example in XHTML if you don’t use messy <script><![CDATA[ … ]]></script> tags; and, in any case, it makes for tidier mark-up if people don’t have to wade through JavaScript when viewing the (X)HTML.

<script src="jquery.js" type="text/javascript"></script>
<script src="email.js" type="text/javascript"></script>

You can download jQuery from their web site, noting that version 1.x will work in any browser but version 2.x will not work in Internet Explorer version 8 (2009) and earlier. You then need to add the following (X)HTML to your page, adapted as required, with the # character in place of any dot in the address:

<a href="" data-u="name#surname" data-d="my#example" data-t="net"></a>

Don’t insert a line break between the <a></a> tags or it won’t work, and nothing will appear on the page. I have modified the original script so that any dot needs to be replaced with # in the (X)HTML mark-up and is turned back into a dot by the JavaScript with jQuery. I have also separated out the TLD part of the domain name (e.g. .com or .net etc.).

The web crawler will see only an empty link and some rubbish data attributes, but you will see something like name.surname@my.example.net if your browser is using JavaScript. If you want some text other than the email address to appear, put it between the <a></a> tags, which will appear as e.g. some text instead, using the following code:

<a href="" data-u="name#surname" data-d="my#example" data-t="net">some text</a>
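To see the whole thing in one place, a complete minimal page is not much more than this (the file names and the address are placeholders, of course):

<!DOCTYPE html>
<html>
<head>
<title>Email obfuscation demo</title>
<script src="jquery.js" type="text/javascript"></script>
<script src="email.js" type="text/javascript"></script>
</head>
<body>
<p>Contact me: <a href="" data-u="name#surname" data-d="my#example" data-t="net"></a></p>
</body>
</html>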


Skype address obfuscation

Lastly, I realised that Skype spam does also exist, i.e. harvesting Skype contact details in order to send spam instant messages to (or potentially even initiate automated voice calls with) users who choose to allow messages and calls, perhaps for business reasons, from other users who are not on their contact list. It is rare, but may increase; so I think it is worth being careful and adapting the method for Skype contact details too, using this script in a file called skype.js or similar:

function createSkypeLinks(){
    $('a[data-s][href=""]').each(function(){
        var i = $(this);
        //replace # character with . if present in Skype name
        $(this).attr('data-s', $(this).attr('data-s').replace('#', '.'));
        i.attr('href', 'sky'+'pe:'+i.data('s'));
        if (i.html()==''){ i.html(i.data('s')); }
    });
}
$(function(){
    createSkypeLinks();
});

Remember to add the following into your <head>…</head> section as before:

<script src="skype.js" type="text/javascript"></script>

Then one of these example lines of code is needed in your page:

<a href="" data-s="yours#truly"></a>
<a href="" data-s="myskypename101"></a>

These will appear as yours.truly or myskypename101 in your page. Again, putting anything between the <a></a> tags will result in something like some text instead.


A note on PGP encryption

Not many people use PGP in their email for identity verification or for encrypting personal details (or for other purposes, such as verifying the origin of a file being downloaded), although perhaps that’s a subject slightly tangential to this one. If you send personal details in an unencrypted email, you are asking for trouble, particularly with passwords. Even if you connect securely to your email server using TLS/SSL, you have no clear idea which intermediate servers will transmit that email to its destination, nor whether they will encrypt it as you did, which they are under no obligation to do. It is surprisingly easy to use PGP in Thunderbird, for example, and I would recommend making the effort to set it up. It’s a shame that email clients and web mail services don’t do this for you automatically when you set up an email account, just as browsers use SSL certificates without ordinary users having to know exactly how they work in order to be protected. It isn’t just for spies! 😉

Anyway, the relevant point here is that, if you do use PGP, you may well have your public key, and thus your email address, on public key servers. It is, of course, perfectly possible for spam harvesters to check these. However, since the maintainers of key servers also use code to prevent automatic harvesting, your address is likely to remain reasonably well hidden, as we have done above, from all but the most determined spam harvester. Also, people who use PGP are generally (at the moment, at least) techies, who will statistically be poor targets for spam attacks, so why would they bother?


Conclusion

Spammers are basically lazy, so any serious attempt to make things difficult for them will generally defeat them, because they go for the easiest targets, of which there will always be many. It is simply not economically worth their while to put in the coding effort to work out how each individual JavaScript obfuscation method works, of which there are many different ones. At the end of the day, if you need your contact details to be public, they are available for misuse. This means that such methods, relying on security through obscurity, are still worthwhile additions to spam filters and other spam killers, as a first layer of defence to reduce the harvesting of email addresses.


Teaching “fy” in Welsh

“Why does fy sound like ’yn? It doesn’t look like ’yn at all!”

The word fy is not easy to teach to adults in every dialect. In some dialects it is pronounced differently in speech from the way one would expect from the formal register. As a rule, there are three groups of dialects as far as the pronunciation of this word is concerned:

(1) Those which lose fy entirely, or which keep only the vowel ’y before consonants, whether these are nasally mutated or not. This is fairly simple to teach, although there is no obvious cause for the nasal mutation in writing.

fy enw i –> enw i

fy llong i –> (f)y llong i –> (y) llong i

fy nhad i –> (f)y nhad i –> (y) nhad i

(2) Those which lose fy entirely before consonants, or keep the vowel ’y in the same way as above, but which keep only f’, dropping the vowel of the word, before a following vowel, as is sometimes done in the formal register. This is likewise fairly simple to teach.

fy enw i –> f’enw i

fy llong i –> (f)y llong i –> (y) llong i

fy nhad i –> (f)y nhad i –> (y) nhad i

(3) Those which still treat the word as fyn, from Celtic *men(e), in that they pronounce ’yn /ǝn/ before vowels and before consonants that are not mutated, but which lose the word entirely, or keep only the vowel ’y, where a following consonant is nasally mutated.

fyn enw i –> (f)yn enw i

fyn llong i –> (f)yn llong i

fy nhad i –> (f)y nhad i –> (y) nhad i

It must be admitted from the outset that a great many speakers mistakenly fail to mutate here, and that speakers of some dialects naturally say fi rather than i in such constructions (and also fe < efe where others say e < ef). Compare also the Cornish phrase ow thas vy “my father”. Failing to mutate is an error that has developed relatively recently, but fi (and fe) following a consonant is historical in some dialects (and therefore dialectally correct), though not in those on which the formal language is based. That is to say, for example, nhad fi is acceptable in speech (but not in writing), whereas tad fi is never acceptable, because it is an error.

It is clear that the original form is *myn, mutated to fyn, which is what has caused the nasal mutation in the case of some consonants. Before other sounds, some dialects have lost the final /n/ and others have kept it.

Compare the way that yn [+ nasal mutation] is taught to learners. Here a sound is mutated up to twice over: first to show that the final /n/ of the word yn has become /m/ before /p/ or /b/ (and also before unmutated /m/), and /ŋ/ before /k/ or /g/; and secondly to show the initial changes /p/ > /mh/, /b/ > /m/, /k/ > /ŋh/, /g/ > /ŋ/, /t/ > /nh/ and /d/ > /n/ brought about at the start of the following word. Thus ym and yng are the standard written forms. This is good practice, because it shows that the /n/ and the following consonant change by assimilation, i.e. that they become closer to one another as sounds.

It is easy to teach the nasal mutation after yn, because the cause, namely the /n/, is visible in the word. It is hard to teach it after fy, because it cannot be seen at all. Harder still is teaching why fy sounds like ’yn before sounds that are not mutated. Students can be lost for less than this.

In Y Geiriadur Mawr, the forms fyn, fym and fyng are listed with asterisks to show that they are old forms. They occur commonly in the historical language. Since they are part of the history of the Welsh language, and since they are so easy to understand and to recognise, I suggest that we restore them, so that one could choose between fy and fyn according to the dialect of the area.

fy enw i, f’enw i neu fyn enw i –> enw i, f’enw i, (f)yn enw i

fy llong i neu fyn llong i –> (f)(y) llong i, (f)yn llong i

fy nhad i neu fyn nhad i –> (y)(n) nhad i

fy mrawd i neu fym mrawd i –> (y)(m) mrawd i

fy nghath i neu fyng nghath i –> (y)(ng) nghath i

There is no need to change our preference for the form fy before non-mutating consonants in the most formal register, while allowing a choice before those that are mutated. Fyn alone could be used before vowels, or even not at all, at the writer’s discretion.

I teach my learners that fy enw i is pronounced like ’yn enw i, following a large proportion of the dialects. If I could write fyn enw i, this would be far easier. If I could also explain fyn + cath = fyng nghath (just as yn + Caerdydd = yng Nghaerdydd at present), teaching these points of Welsh grammar would be easier still.


Instant Messaging: Past, Present and Future

This post was originally published on the Technical Foundations web site at UKOLN.

A brief history…

Instant messaging has been around on the Internet for longer than the World Wide Web. In its earliest, purest (and, it’s probably fair to say, crudest) form, it was possible to use the Unix command line tool write to output a message to another user’s terminal, provided that they had previously typed mesg y (i.e. messaging yes), or indeed to directly echo or even cat the contents of a file to another terminal. Surprisingly, there was also a tool for real-time typing to the other terminal, which eventually settled on a split-screen approach. (Far more recently, this was one of the supposed “killer” features of Google Wave before its development was abandoned, yet it had existed in a simpler form many years before.) While the original write and talk utilities have been gradually improved so that they can reach users on different servers, and, for example, provide security over the Secure Sockets Layer (SSL), they were never a user-friendly tool for the non-technical user. They are still installed by default on some Unix/Linux distributions but are little used even by developers, given the huge variety of more modern, scalable technologies.

What was not provided by these early utilities was anything but the crudest control over who could chat with you. Once you had typed mesg y, anybody on that server could contact you until you typed mesg n (i.e. messaging no). In addition, giving somebody else control over your terminal was a major security issue. The modern concepts of contact lists (i.e. friends) and presence information (e.g. available, busy, offline etc.) were also missing. You can still tell, although only by user name, who is logged into the same Unix/Linux server as you by typing simply who into the terminal.
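For the curious, the whole exchange looked something like this on a shared machine, and the commands still exist on most Unix-like systems (the user name and terminal are of course examples):

mesg y              # allow other users to write to your terminal
who                 # see who is logged in, and on which terminal
write alice pts/3   # type your message to alice's terminal; finish with Ctrl-D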

The next great step in the history of instant messaging was Internet Relay Chat (IRC). This essentially provides command-line chat rooms, which can be made at least somewhat more user-friendly through graphical user interface (GUI) tools such as mIRC, as well as private messages to individuals. While it is not particularly obvious how to indicate presence amongst the myriad other commands that are available, all of the functions that one would expect in modern instant messaging are there. It was later made available over SSL, which provides basic security from snooping. However, IRC remains susceptible to netsplits, takeover wars from hackers and denial-of-service attacks, and one is not automatically guaranteed a unique identifier or nick if it has already been used on that server by another user, or if the server does not provide a nickname registration service (NickServ). Despite all these failings and its consequent decline in popularity, IRC remains popular with developer communities because of its relative simplicity, in addition to a certain retro chic.

The MSN era and beyond

Instant messaging came to the ordinary user through a myriad of mutually non-interoperable commercial protocols, each with its own GUI provided by the company in question, and many spin-off open source replacements. The underlying technology behind these protocols was not published, but they effectively supplied what would now be called an Application Programming Interface (API), stating how developers could write tools that could communicate with their servers. One could not simply run one’s own server, because the underlying technology was proprietary. Many of these are still in use, for example AIM and Yahoo Instant Messaging, and perhaps the largest, Microsoft’s MSN Messenger (later Windows Live Messenger), has only just been retired through a merger with Skype. (This has, as a side effect, removed the ability of MSN users to chat with users of Yahoo Instant Messaging, as this is not possible in Skype.) For most users, all that has changed since those days is the gradual migration to new tools such as Skype, which adds voice and video chat, and Facebook chat, which is merely convenient because of the critical mass of contacts who are already on Facebook. Similarly, Google Talk offers IM services to anybody who already has a Google account and uses GMail for web-based email. Both Facebook and Google Talk have since added audio and video chat. Together, these dominate the market because they are attached to the most widely used Internet services and are accessible to ordinary, non-technical users. In the case of Facebook and Google Talk, there is the added advantage of access via the Web without downloading any dedicated software.

Open Standards

Both Google Talk and Facebook chat are particularly interesting because, unknown to the bulk of their users, they implement the open standard XMPP (also known as Jabber), although Facebook is not fully compliant with the roster system that enables one to have contacts across different XMPP servers. The reason for this is, of course, that it only wants Facebook users to chat with other Facebook users rather than enable chat with other XMPP users, which would naturally include competitors such as Google Talk that also implement the protocol. However, the competition for instant messaging does not seem to be as fierce as it was, and the competitors have formed agreements: Facebook chat is now integrated into Skype despite Facebook offering competing audio and video chat tools from its own web site. This may be because the free service is effectively a loss leader: it does not provide commercial income directly, since the service is free. Instead, Skype markets additional paid services such as Skype Out (calling landline or mobile telephones), external telephone numbers, voicemail, group video chat and so on; similarly, Facebook makes its revenue through advertising on its site, which is attractive because of the free social networking tools, including instant messaging, audio and video chat. It appears to be in everybody’s interest to cooperate to some degree.

Google stands out among the other commercial players in allowing its users to chat to other XMPP users who have accounts on different servers, either commercial, free or privately operated. One can talk to Google Talk users (or any other XMPP users) using a free account with jabber.org or even run one’s own server (as one could with IRC) using ejabberd or similar open source XMPP server software. However, audio and video chat is limited to users with Google accounts, providing the incentive to prefer their all-in-one, one-stop shop approach to Internet services, which is convenient for most users. The development of open source extensions to XMPP has been slow. It is still difficult to find XMPP servers that deploy Jingle, the extension for audio and video chat, which is considerably harder to do effectively than merely installing an XMPP server, which is the work of an hour or two.
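To give a sense of how little that hour or two involves, here is a rough sketch of a basic ejabberd setup on a Debian-based system (the domain and account below are placeholders, and a real deployment would also need DNS SRV records and proper certificates):

sudo apt-get install ejabberd
sudo dpkg-reconfigure ejabberd                                  # set the served domain and an admin user
sudo ejabberdctl register bob myserver.example mypassword      # create an account bob@myserver.example
sudo service ejabberd restart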

While XMPP is the de facto standard for modern IM, both for open source and increasingly for commercial services, it is not without criticism. It is verbose, relying on XML, which can be a problem where bandwidth is limited. This is a small issue for IM services but a much larger one when, for example, audio and video streams are added: it does not support binary data streams natively. It is designed as a federated network run on a number of servers, and its network vulnerability, while not as high as that of IRC, remains a structural issue. It relies on massive unicasting and does not support multicasting, which is a minor efficiency issue in chat rooms but becomes much more of a problem for group audio and video streaming. It is possible to substitute a newer, though relatively little-known, protocol called PSYC, an inter-server protocol that supports XMPP and IRC natively and alleviates most of these problems. It takes about an hour or two to set up the psyced server, about the same as a basic IRC or XMPP server. This does, however, retain the federated approach: in future iterations of the protocol, an entirely re-engineered peer-to-peer (P2P) approach is under development, although, being an open source project of interest mostly to technical users, development has been relatively slow. PSYC allows XMPP and IRC to interoperate seamlessly, in addition to enabling fine control over notifications to and from other systems, friendcasting, multicasting, news federation, interoperability with microblogging systems such as Twitter and so on, via programmable chatrooms.

Voice Over IP

Coming to the same market from a diametrically opposed perspective is the SIP standard for Voice over Internet Protocol (VoIP), which began as an audio service and later added both IM and video. It is widely used in the commercial sector: for example Vonage in the UK. There are open source implementations that can be deployed by anybody, albeit with some technical difficulty, such as Asterisk and FreeSWITCH. These only cost money where they connect to the Public Switched Telephone Network (PSTN) that provides ordinary landline telephony, but they also enable low-cost, in-house management of telephone extensions, voicemail and related services, as well as making telephony available through computer terminals as well as telephones. One can manage distributed calling, effectively running a call centre, using this free technology, which can be installed even on a home server. While most people would not have a particular reason to go to such effort, the entry costs of setting up complex systems have been radically reduced, to the point where they would now be affordable for small organisations that can rely either on voluntary contributions of development effort or on outsourcing the work cheaply.

Why is this important?

Technologies such as XMPP may not be of immediate interest to the average Internet user, either in the HE sector or more widely. However, they underlie so many of the Internet services that we use on a daily basis that issues such as the interoperability of services via open standards are worth knowing about, at the very least in order to gain an understanding of the relative difficulty of providing such services and the costs involved. Given that vast numbers of ordinary users are placing more and more reliance on an increasingly small group of major providers of Internet services, the consequences for privacy and the management of personal information are potentially immense. The intense debate over whether a federated approach, relying on a network of servers, or a peer-to-peer approach is the better (or even a feasible) way to mitigate these risks is relevant to many other technologies, of which instant messaging is only one: the most significant of these may be social networking. For most people, social networking is vastly more important than, for example, darknet services and file sharing, which currently account for the large bulk of peer-to-peer services in widespread use. Indeed, it is social networking, which typically gathers a number of pre-existing technologies together for convenience around a core microblogging service, that best highlights the widely differing approaches to the future of the architecture of Internet services.


URIs and URLs: Quick Reference

This post was originally published on the Technical Foundations web site at UKOLN.

It has been explained elsewhere what the difference between URIs and URLs is. The type of URL that one generally sees is an HTTP URL. You can think of the family of URIs as in the diagram below, which shows some of the most commonly encountered URI schemes (by no means a complete list).

All of the URL protocols are associated with commonly used Internet services, of which the World Wide Web, using the HTTP(S) scheme, is only one. The secure variants are mostly provided using the Secure Sockets Layer (SSL) or its successor, Transport Layer Security (TLS), except in the case of Secure Shell (SSH), which has its own built-in encryption protocol. Unless only non-sensitive data is being transmitted (e.g. ordinary web pages), it is almost always a good idea to use the secure varieties of the URL protocols, except where the data is being sent via an SSH tunnel or some other security measure protects it from both external and local users: a Virtual Private Network (VPN), for example, will not prevent snooping by other users of that private network. The default ports (a colon followed by a number) are generally assumed where omitted, although technically there is nothing preventing the use of a non-standard port except the inevitable confusion with other services using those ports. There are well-known alternative ports used by developers for testing and similar purposes, e.g. port 8080 instead of the usual 80 for web servers.

While most users will not need to know the protocol syntax for the majority of the URL schemes apart from HTTP(S), and will almost certainly never need to know about the syntax of URN schemes, nevertheless these services underlie the functionality of the Internet services that they use every day. It is at least a good idea to understand the basic pattern, which most schemes share, together with an understanding of when and how to use SSL/TLS. Most users will know roughly how an HTTP(S) URL works, which follows the basic pattern used in most of the other URL schemes. Some protocol schemes are rarely seen expressed as an address in actual software implementations, even though they are widely used: this depends on the purpose and nature of the protocol, and to some extent on whether or not it is ever directly accessed from a command line tool in practice. Most non-technical users never do this except in the case of typing Web addresses into the address bar of a browser, which is why only the HTTP(S) protocol is ubiquitous to the general public.

You will notice that the URL scheme (for locating resources) has a large number of very commonly used protocols, whereas the URN scheme (for naming resources) is not as well known but remains technically important for more complex naming schemes, where more specific semantics are required. These are typically used by libraries and in developing Internet services that require access to large data sets about electronic and real-world resources, but are not seen by the average Internet user. Officially, all of these schemes are URI schemes, including both URLs and URNs, but here they are separated by those that locate resources and those that do not.

    URI
     |
     +--- URL
     |     |
     |     +--- HTTP e.g. http://www.google.com/ (using default port 80, equivalent to http://www.google.com:80/)
     |     |     |
     |     |     +--- HTTPS (secure/encrypted) e.g. https://accounts.google.com/ServiceLogin (using default port 443, equivalent to https://accounts.google.com:443/ServiceLogin)
     |     |
     |     +--- SMTP e.g. smtp://bob.fisher@mymailservice.com:25 (also mailto: bob.fisher@mymailservice.com)
     |     |     |
     |     |     +--- SMTPS (secure/encrypted) e.g. smtps://bob.fisher@mymailservice.com:465 (also mailto: bob.fisher@mymailservice.com)
     |     |  
     |     +--- POP3 e.g. pop://bob.fisher@mymailservice.com:110 (for downloading email from a remote server)
     |     |     |
     |     |     +--- POP3S (secure/encrypted) e.g. pops://bob.fisher@mymailservice.com:995
     |     |
     |     +--- IMAP4 e.g. imap://bob.fisher@mymailservice.com:143 (for synchronising email with a remote server)
     |     |     |
     |     |     +--- IMAP4S (secure/encrypted) e.g. imaps://bob.fisher@mymailservice.com:993
     |     |
     |     +--- FTP e.g. ftp://bob.fisher@myserver.com/my_folder_path/my_file.example (or ftp://bob.fisher@myserver.com:21/my_folder_path/my_file.example)
     |     |     |
     |     |     +--- FTPS (secure/encrypted) e.g. ftps://bob.fisher@myserver.com:990/my_folder_path/my_file.example (it is now more normal to use SFTP via SSH instead)
     |     |
     |     +--- XMPP e.g. xmpp://bob.fisher@mychatservice.com:5222 (e.g. for GTalk, Facebook, jabber.org or other open protocol instant messaging)
     |     |     |
     |     |     +--- XMPPS (secure/encrypted) e.g. xmpps://bob.fisher@mychatservice.com:5222 (over the same default port or the legacy 5223)
     |     |
     |     +--- IRC e.g. irc://myircserver.org:6667/#mychatchannel
     |     |     |
     |     |     +--- IRCS (secure/encrypted) e.g. ircs://myircserver.org:6697/#mychatchannel
     |     |
     |     +--- TELNET e.g. telnet://bob:mypassword@myserver:23 (highly insecure for command line access but occasionally used for other purposes)
     |           |
     |           +--- TELNET (secure/encrypted), as above but using SSL and either the same port or the SSH port 22, usually abandoned in favour of SSH
     |           |
     |           +--- SSH (secure/encrypted) e.g. ssh://bob:mypassword@myserver:22 (for command line access and related purposes)
     |           |
     |           +--- SFTP (secure/encrypted) e.g. sftp://bob:mypassword@myserver:22 (for file downloads, with the related UNIX/LINUX/POSIX scp command)
     |
     +--- URN (these examples were taken from Wikipedia)
     |           |
     |           +--- International Standard Book Number (ISBN) e.g. urn:isbn:0451450523 (the book The Last Unicorn, by Peter S. Beagle, 1968)
     |           |
     |           +--- International Standard Audiovisual Number (ISAN) e.g. urn:isan:0000-0000-9E59-0000-O-0000-0000-2 (the film Spider-Man, 2002)
     |           |
     |           +--- International Standard Serial Number (ISSN) e.g. urn:issn:0167-6423 (the scientific journal Science of Computer Programming)
     |           |
     |           +--- Request For Comments (RFC) for memoranda of the Internet Engineering Task Force (IETF) on internet standards and protocols, e.g. urn:ietf:rfc:2648
     |           |
     |           +--- MPEG7 e.g. urn:mpeg:mpeg7:schema:2001 (the default namespace rules for MPEG-7 video metadata)
     |           |
     |           +---  Object Identifier (OID), e.g. urn:oid:2.16.840 (the United States of America)
     |           |
     |           +--- UUID e.g. urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66 (a type of unique identifier that is mathematically improbable to duplicate, version 1)
     |           |
     |           +--- National Bibliography Number (NBN) e.g. urn:nbn:de:bvb:19-146642 (a document in the Bibliotheksverbund Bayern, Germany, with library and document number)
     |           |
     |           +--- European Union Directive e.g. urn:lex:eu:council:directive:2010-03-09;2010-19-UE (using the Lex URN namespace for legislation)
     |     
     +--- URC (internet standard proposal never developed, largely replaced by XML, RDF, JSON etc in providing metadata)

As noted, Uniform Resource Characteristics (URC) were abandoned in the early history of the internet. Numerous anomalies have developed, such as that noted above where both FTPS and SFTP perform similar functions in a different way, or where some services use a different port for SSL/TLS but XMPP usually does not. The Digital Object Identifier (DOI) scheme is effectively a URN scheme but has never been registered as such and performs the same function whilst officially remaining a URL scheme.


i-affection in Brythonic

Jackson (LHEB) notes that final i-affection in Brythonic often results in either <ei> /ei/ or <y> /ı/ in Welsh, whereas it always results in <e> in Cornish or Breton. I’m not sure whether this is always true, as <y> is often found in Cornish (this difference would have been eliminated in Breton anyway). Anyway, an example is W. eleirch/elyrch “swans”. In my opinion, the consonant was palatalised and the vowel raised to /e/, effectively giving /ei/, which could be called i-affection according to Jackson’s terminology. The /ei/ would then be caused by what I would call i-epenthesis, as a result of the following palatal consonant. The distinction, as far as I know, has never been made before. As an alternative, the /e/ was sometimes raised to /ı/ in W. elyrch etc.
