I recently installed a test instance of CKAN on my server, which you can find here. However, I had to do this a little differently from the default installation instructions (Ed.: page now from Wayback Machine) that you can find on their site. Firstly, because those instructions are for Ubuntu 12.04 64-bit server and I have recently upgraded mine to 13.04 (Ed.: upgraded to 13.10, March 2014), I had to install from source (Ed.: page now from Wayback Machine). Secondly, because I have been using Tomcat7 for some time, and because Jetty on Ubuntu 13.04 still has some dependencies on Tomcat6 (Ed.: still true in Ubuntu 13.10?), I was unable to install Jetty. So I asked a friend who is a Java developer. His advice was that I didn’t need Jetty anyway and could just use Tomcat. He was right, of course. Why would I need yet another HTTP server and servlet container running anyway?
The following instructions are not a complete walk-through but are intended to show where I departed from the instructions to install from source, in the above link, and to clarify things that I found were not all that obvious or clear and which took me a long time to figure out.
Here is what I did (omitting Jetty):
sudo apt-get install python-dev postgresql libpq-dev python-pip python-virtualenv git-core openjdk-6-jdk
I shall pretend, for the sake of people reading this, that I didn’t have many of those installed already, so this list includes everything that you will need.
You do not need to add the following to that list unless you want to use Jetty and are able to, which I wasn’t, for the reasons given above:
solr-jetty
However, that means you need to download and install Solr separately. You will need to have previously installed Java and Tomcat7. There are various instructions on the Web for doing those things, so I won’t repeat the whole process here. One thing to note, though, is that you may see some errors in the Solr logs and in the logging interface. Never fear: Solr is not broken! They will look similar to this one:
WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /var/lib/solr/collection1/../../../contrib/velocity/lib).
It turns out that these warnings come from some <lib> lines in the default /etc/solr/conf/solrconfig.xml that ought to have been commented out. So do that, if you are concerned:
<!-- A 'dir' option by itself adds any files found in the directory
to the classpath, this is useful for including all jars in a
directory.
When a 'regex' is specified in addition to a 'dir', only the
files in that directory which completely match the regex
(anchored on both ends) will be included.
If a 'dir' option (with or without a regex) is used and nothing
is found that matches, a warning will be logged.
The examples below can be used to load some solr-contribs along
with their external dependencies.
-->
<!--lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" /-->
Ok, apart from having some of these old errors stuck in the logs on the logging page, Solr is working perfectly, although you will need to follow the instructions about how to modify it for CKAN. I did that exactly as directed, so again I do not need to repeat any of that.
I personally ignored the TIP section in (2) Install CKAN into a Python virtual environment because it’s unnecessary and I don’t want those symlinks cluttering things up in my home folder. It then tells you this:
sudo mkdir -p /usr/lib/ckan/default
sudo chown `whoami` /usr/lib/ckan/default
virtualenv --no-site-packages /usr/lib/ckan/default
. /usr/lib/ckan/default/bin/activate
I found this confusing, and the chown line failed for me. The backticks run the whoami command and substitute its output (your current username) into the chown command; type whoami on its own to see what it prints. I didn’t want to run CKAN as my own personal user, so you may want to consider creating a dedicated user or perhaps running everything as the user tomcat7. I’m not sure what is best here, but my installation works. So, replace the whoami and the backticks around it with whatever user works for you.
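If you want to check what will be substituted before running chown for real, a small sketch along these lines may help. CKAN_OWNER is a hypothetical variable name used only for illustration, not something from the CKAN instructions; if it is unset, the sketch falls back to whatever whoami prints.

```shell
# CKAN_OWNER is a hypothetical variable for this sketch: set it to a dedicated
# user if you have created one; otherwise it falls back to your current username.
ckan_owner="${CKAN_OWNER:-$(whoami)}"

# Build the command and print it rather than running it, so you can check it first.
chown_cmd="sudo chown ${ckan_owner} /usr/lib/ckan/default"
echo "${chown_cmd}"
```

Once the printed line names the user you intended, you can copy it and run it yourself.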
Continue with the instructions. I found that everything worked until I got to the section on Solr. Here, because I was not using Jetty, it was hard to know what to do. Although it says “The following instructions deploy Solr on the Jetty server, but CKAN does not require it, you can use Tomcat if that is more convenient on your distribution”, in actual fact there are no instructions on what to do without Jetty, and you will notice that tomcat6 is a required dependency of Jetty anyway on Ubuntu <= 13.04, so you have it installed anyway by now if you are using Jetty! Again, why not just use Tomcat? I think that the CKAN people could make clearer instructions for Tomcat. Could someone explain why you need another HTTP server and servlet container when you must already have Tomcat installed anyway? What in particular is special about Jetty that CKAN works better with?
Anyway, follow some other instructions for Solr on Tomcat that you find on the Web, as I did. But don’t panic about this section: you can ignore everything it says about Jetty. Do remember for later, though, that if you are running other services on port 8080 and don’t want to change the Tomcat port to 8983 just for Solr, you don’t have to. You will, however, need to change the port in the Solr URL in the CKAN config to 8080, or else the two will obviously fail to talk to each other. I don’t see why we should need to use the port that Jetty is expecting, so this could be made clearer if there were a specific Tomcat guide.
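As a rough sketch of that port change, the function below rewrites only the port in the solr_url line of a CKAN config file. The function name set_solr_port is hypothetical, and I am assuming the setting looks like solr_url = http://127.0.0.1:8983/solr, possibly commented out with a leading #, in a file such as /etc/ckan/default/development.ini; check your own file before relying on it.

```shell
# Rewrite the port in the solr_url line of a CKAN config file.
# Usage: set_solr_port <ini-file> <port>
# Assumption: the line looks like "solr_url = http://host:PORT/solr",
# optionally commented out with a leading '#'. The '#' is removed so that
# the setting takes effect. sed keeps a backup copy as <ini-file>.bak.
set_solr_port() {
    ini_file="$1"
    port="$2"
    sed -i.bak -E \
        "s|^#?[[:space:]]*(solr_url[[:space:]]*=[[:space:]]*https?://[^:/]+):[0-9]+|\1:${port}|" \
        "$ini_file"
}
```

It might be called as set_solr_port /etc/ckan/default/development.ini 8080, after which the original file is still available with a .bak suffix if anything looks wrong.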
Just to add to this, when I mistakenly thought that my Tomcat/Solr installation was broken, I tried to use the multiple cores instructions. This did break my Solr installation and consequently CKAN as well. I couldn’t get this to work at all because the XML given doesn’t look anything like the <solrcloud>…</solrcloud> section in the default Solr config. If you simply replace it, the whole thing will break. Anyway, here is another section where the CKAN instructions need to be much clearer, whether you are using Jetty or Tomcat. If anyone knows what to do to make multiple cores work, please feel free to add a comment to this post. What I did learn (the hard way) was that Solr was not broken!
You should set up the DataStore. By and large, these instructions do work. However, there is a very confusing part that breaks part of CKAN if you get it wrong. If you do get it wrong, you will notice that you cannot go to Explore > Preview when looking at a dataset: it will give you a server error. You must get the permissions set correctly. I found that the way that permissions are set up using the virtual environment simply wouldn’t work, so I could not use the first method. I don’t know why. For the second method, I could not even find datastore_setup.py, and there is no indication in the instructions of where it actually is. It really does seem to be completely missing…
So I gave up hunting through folders and had to use the third method, using SQL instructions. This, in turn, was confusing because it was unclear who the users had to be (probably also a problem if you use the second method). Again, there is no indication about where to find set_permissions.sql in the instructions. Fortunately, I was able to find this one. If you are using the recommended /usr/lib/ as the base folder it will be at /usr/lib/ckan/default/src/ckan/ckanext/datastore/bin (you may want to substitute /opt/ or wherever you are choosing to put it, but I’m not an expert on recommended *nix file system locations). Copy this file somewhere before you edit it.
You have to edit the relevant part of the set_permissions.sql file yourself:
-- name of the main CKAN database
\set maindb "ckan_default"
-- the name of the datastore database
\set datastoredb "datastore_default"
-- username of the ckan postgres user
\set ckanuser "ckan_default"
-- username of the datastore user that can write
\set wuser "ckan_default"
-- username of the datastore user who has only read permissions
\set rouser "datastore_default"
You will notice that I use the default usernames given in the original instructions, for clarity. Although it’s made clear who the read-only user should be, it was not altogether clear who the write user should be, so I kept the default CKAN user for this, and it works fine. I hope that was the right thing to do!
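Before running the edited file, it can be worth printing just the settings you have changed, so that a typo in a database or user name is caught early. This is only a sketch: show_sql_settings is a hypothetical helper name, and it assumes each setting sits on its own line beginning with \set, as in the excerpt above.

```shell
# Print the psql variable settings from a copy of set_permissions.sql.
# Usage: show_sql_settings <path-to-sql-file>
show_sql_settings() {
    # Keep only lines that start with a literal backslash followed by "set".
    grep -E '^\\set[[:space:]]' "$1"
}
```

Running show_sql_settings set_permissions.sql on the edited copy should list the five names and nothing else.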
Unfortunately, the next instruction is also very confusing if you aren’t familiar with PostgreSQL. Up until now I’ve used MySQL, which has different syntax, so I stupidly failed to realise that the database named in the instructions is the wrong one. Don’t use the default postgres database! Use this (or whatever the name of your datastore database is) instead:
sudo -u postgres psql datastore_default -f set_permissions.sql
Note that, if you did make this mistake, you’ll need to clean up the permissions that you’ve just allowed on your default postgres database. One still seems to be stuck…
After this, everything worked. I then went on to the instructions Deploy a Source Install, using Apache2, which worked well. It’s slightly odd to recommend that most people install Postfix, I must say. If, like me, you are working on a home server, consider that running an email server is a massive operation that is very vulnerable to exploitation by spammers unless you really know what you are doing and have a lot of time to invest in it. (Frankly, installing Postfix is a nightmare and, when I did it some years ago, I was never confident enough that it worked properly to open up my firewall and use it for real.) Just use the details of whatever server you use for email, even just Gmail as I did. If, on the other hand, you are in a larger institution, you will already have an email server, so use those details. (Use secure email servers!) Unless, that is, you are a god among sysadmins and/or a masochist prepared to inflict Postfix administration on yourself.
Note that under (5) Create the Apache Config File, there are no instructions for SSL. Duplicate the file in /etc/apache2/sites-available. For instance, mine are ckan.talatchaudhri.com and ckan.talatchaudhri.com-ssl because using the domain names and appending -ssl to the appropriate entries is a naming convention that will always tell you what is what. Also change the virtual hosts directive to <VirtualHost *:443> as appropriate. There are guides on the Web about how to make SSL work with Apache.
Note that this will fail unless you make one change, because you have duplicated the name of the daemon ckan_default and Apache will fall over:
# Deploy as a daemon (avoids conflicts between CKAN instances).
WSGIDaemonProcess ckan_defaultSSL display-name=ckan_defaultSSL processes=2 threads=15
Obviously, call the duplicate daemon whatever you like, but I just added SSL to the end of the name. Actually, you really ought to consider not serving the passwords and data submissions pages (which include email addresses and other personal details) over HTTP on port 80 at all, since these could be sniffed. If you are worried about man-in-the-middle attacks, then perhaps you should consider not having a mixed HTTP/HTTPS site with only certain secure pages being re-directed, which according to some people is a security risk. It seems to be fine for lots of major web services like WordPress though. Anyway, whatever you do, you’ll need to redirect at least those pages for security: if you are using Apache then you’ll be doing that either in the site configs or (slower but convenient) in .htaccess. If you are using Nginx then there is a new and funky way to do it, which you can google yourself. (I’ve played with Nginx as a reverse proxy over Apache, but it’s not currently serving my web pages and the load on my server is minimal, so I really only did it for general coolness and a vague concern about Apache being a memory hog.)
That is how I did it to the best of my recollection. Apologies in advance if I have skipped any steps, but I hope that I have concentrated on the steps that were unclearly described or unexpected, so that anybody reading this will not have to spend two frustrating days setting up CKAN with Solr as I did. Without meaning to be over-critical of the considerable work that has been put into this documentation by OKFN, these instructions do contain some glaring omissions and, in a few places, give misleading instructions. If they would like to use my comments to add to or improve their documentation, I’d be only too pleased for them to do so. In the meantime, I hope my experience helps somebody.



