Wednesday, 28 September 2016

Let's Implement "Login with Github" button

Recently, we delivered a simple workshop at the Spring User Group Singapore on implementing a "Login with Github" button using Spring Boot, Spring Security, OAuth2 and Angular.

Saturday, 16 April 2016

Spring Oauth2 with JWT Sample

Some time ago, we published an article sharing a custom approach to implementing stateless sessions in a cloud environment. Today, let's explore another popular use case: setting up OAuth 2 authentication for a Spring Boot application. In this example, we will use JSON Web Token (JWT) as the format of the OAuth 2 token.

This sample was developed partly based on the official Spring Security OAuth 2 sample. However, we will focus on understanding the principles behind the OAuth 2 requests. The source code is available at


OAuth 2 and JWT 

We will not go into detail about when you may want to use OAuth 2 and JWT. In general, OAuth 2 is useful when you need to allow other people to build front-end apps for your services. We focus on OAuth 2 and JWT because they are the most popular authorization framework and token format on the market.

Spring Security OAuth 2

Spring Security OAuth 2 is an implementation of OAuth 2 built on top of Spring Security, which itself is a very extensible authentication framework.

Overall, Spring Security authentication involves two steps: creating an authentication object for each request, and applying authorization checks based on that authentication.

The first step is done in a multi-layer security filter chain. Depending on the configuration, each layer can create the authentication object for a web request using basic authentication, digest authentication, form authentication or any custom method of authentication. The client-side session we built in the previous article is effectively a layer of custom authentication, and Spring Security OAuth 2 is built on the same mechanism.
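To make the mechanism concrete, here is a bare-bones, illustrative sketch (not taken from the sample) of what such a custom authentication layer does: it reads a hypothetical X-Auth-Token header, builds an Authentication object and stores it in the SecurityContext for the rest of the filter chain.

public class TokenHeaderAuthenticationFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        String token = request.getHeader("X-Auth-Token");   // hypothetical custom credential
        if (token != null) {
            // In a real filter the token would be validated and resolved to a user.
            Authentication auth = new UsernamePasswordAuthenticationToken(
                    "user", null, AuthorityUtils.createAuthorityList("ROLE_USER"));
            SecurityContextHolder.getContext().setAuthentication(auth);
        }
        chain.doFilter(request, response);
    }
}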

Because our application both provides and consumes tokens in this example, Spring Security OAuth 2 should not be the sole authentication layer for the application. We need another authentication mechanism to protect the token provider endpoint so that the resource owner can authenticate himself before getting the JWT token.

In a cluster environment, the token or the secret used to sign the token (for JWT) is supposed to be persisted so that the token can be recognized by any resource server, but we skip this step to simplify the example. Similarly, the user authentication and client identities are all hard-coded.

System Design


In our application, we need to set up 3 components:

  • Authorization Endpoint and Token Endpoint to provide the OAuth 2 token.
  • A WebSecurityConfigurerAdapter, which is an authentication layer with a hard-coded order of 3 (according to Dave Syer). This authentication layer sets up the authentication and principal for any request that contains an OAuth 2 token.
  • Another authentication mechanism to protect the Token Endpoint and other resources when the token is missing. In this sample, we choose basic authentication for its simplicity when writing tests. As we do not specify an order, it takes the default value of 100. With Spring Security, the lower the order, the higher the priority, so we should expect OAuth 2 to come before basic authentication in the FilterChainProxy. Inspecting it in the IDE proves that our setup is correct.

In the above picture, OAuth2AuthenticationProcessingFilter appears in front of BasicAuthenticationFilter.

Authorization Server Configuration

Here is our configuration for the Authorization and Token Endpoints (abridged; the client ids, secret and resource id below are illustrative):

@Configuration
@EnableAuthorizationServer
public class AuthorizationServerConfiguration extends AuthorizationServerConfigurerAdapter {
    private String resourceId = "spring-boot-application";
    private int accessTokenValiditySeconds = 3600;
    @Autowired
    private AuthenticationManager authenticationManager;

    @Bean
    public JwtAccessTokenConverter accessTokenConverter() {
        return new JwtAccessTokenConverter();   // no signing key set, so a random one is generated
    }

    @Override
    public void configure(AuthorizationServerEndpointsConfigurer endpoints) throws Exception {
        endpoints.authenticationManager(authenticationManager).accessTokenConverter(accessTokenConverter());
    }

    @Override
    public void configure(AuthorizationServerSecurityConfigurer oauthServer) throws Exception {
        oauthServer.tokenKeyAccess("isAnonymous() || hasAuthority('ROLE_TRUSTED_CLIENT')");
    }

    @Override
    public void configure(ClientDetailsServiceConfigurer clients) throws Exception {
        clients.inMemory()
                .withClient("normal-client")
                .authorizedGrantTypes("authorization_code", "implicit")
                .scopes("read", "write")
                .resourceIds(resourceId)
                .accessTokenValiditySeconds(accessTokenValiditySeconds)
                .and()
                .withClient("trusted-client").secret("secret").authorities("ROLE_TRUSTED_CLIENT")
                .authorizedGrantTypes("client_credentials", "password")
                .scopes("read", "write")
                .resourceIds(resourceId)
                .accessTokenValiditySeconds(accessTokenValiditySeconds);
    }
}

There are a few things worth noticing about this implementation.

  • Setting up JWT tokens is as simple as using JwtAccessTokenConverter. Because we never specify the signing key, it is randomly generated. If we intend to deploy our application to a cloud environment, syncing the signing key across all authorization servers is a must.
  • Instead of creating a new authentication manager, we choose to inject the existing authentication manager from the Spring container. With this step, we can share the authentication manager with the basic authentication filter.
  • It is possible to have trusted and non-trusted applications. A trusted application can have its own secret; this is necessary for the client credentials grant. Apart from client credentials, the other three grant types all require the resource owner's credentials (see the token request sketch after this list).
  • We allow anonymous access to the token key endpoint. With this configuration, the token key is accessible without basic authentication or an OAuth 2 token.
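To make the flow concrete, here is a minimal, hedged sketch (not part of the sample) of requesting a token with the Resource Owner Password Credentials grant. The client "trusted-client"/"secret", the user "user"/"password" and the localhost:8080 address are illustrative placeholders for whatever is configured above; only the /oauth/token path is the Spring Security OAuth 2 default.

public String fetchTokenWithPasswordGrant() {
    RestTemplate rest = new RestTemplate();
    HttpHeaders headers = new HttpHeaders();
    String clientAuth = Base64.getEncoder()
            .encodeToString("trusted-client:secret".getBytes(StandardCharsets.UTF_8));
    headers.set("Authorization", "Basic " + clientAuth);            // the client authenticates itself
    headers.setContentType(MediaType.APPLICATION_FORM_URLENCODED);

    MultiValueMap<String, String> form = new LinkedMultiValueMap<>();
    form.add("grant_type", "password");
    form.add("username", "user");                                   // resource owner's credentials
    form.add("password", "password");

    ResponseEntity<String> response = rest.postForEntity(
            "http://localhost:8080/oauth/token", new HttpEntity<>(form, headers), String.class);
    // The JSON response contains access_token (a JWT), token_type, expires_in and scope.
    return response.getBody();
}

The same request with grant_type=client_credentials and no username/password parameters illustrates the client credentials grant.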

Resource Server Configuration

Here is our configuration for the Resource Server (abridged; the resource id is illustrative):

@Configuration
@EnableResourceServer
public class ResourceServerConfiguration extends ResourceServerConfigurerAdapter {
    private String resourceId = "spring-boot-application";

    @Override
    public void configure(ResourceServerSecurityConfigurer resources) {
        resources.resourceId(resourceId);
    }

    @Override
    public void configure(HttpSecurity http) throws Exception {
        http.requestMatcher(new OAuthRequestedMatcher())
                .authorizeRequests().anyRequest().authenticated();
    }

    private static class OAuthRequestedMatcher implements RequestMatcher {
        public boolean matches(HttpServletRequest request) {
            // Determine if the client request carried an OAuth 2 token
            String auth = request.getHeader("Authorization");
            boolean haveOauth2Token = (auth != null) && auth.startsWith("Bearer");
            boolean haveAccessToken = request.getParameter("access_token") != null;
            return haveOauth2Token || haveAccessToken;
        }
    }
}


Here are a few things to take note of:
  • The OAuthRequestedMatcher is added so that the OAuth filter only processes OAuth 2 requests. We added it so that an unauthorized request is denied at the basic authentication layer instead of the OAuth 2 layer. This may not make any difference in terms of functionality, but we added it for usability. The client will receive a 401 HTTP status with the new header instead of the old header:
    • WWW-Authenticate:Basic realm="Realm"
    • WWW-Authenticate:Bearer realm="spring-boot-application", error="unauthorized", error_description="Full authentication is required to access this resource"
  • With the new response header, a browser will automatically prompt the user for a username and password. If you do not want the resource to be accessible by any other authentication mechanism, this step is not necessary.
  • Some browsers, like Chrome, like to send an OPTIONS request to look for CORS headers before making an AJAX call. Therefore, it is better to always allow OPTIONS requests (see the sketch below).
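For instance, a hedged sketch of how the resource server's HttpSecurity rules could let pre-flight requests through (the OPTIONS matcher is an addition, not part of the snippet above):

@Override
public void configure(HttpSecurity http) throws Exception {
    http.requestMatcher(new OAuthRequestedMatcher())
            .authorizeRequests()
            // let CORS pre-flight OPTIONS requests pass without authentication
            .antMatchers(HttpMethod.OPTIONS).permitAll()
            .anyRequest().authenticated();
}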

Basic Authentication Security Configuration

As mentioned earlier, we need another authentication mechanism to protect the token provider endpoint. Here is the basic authentication security configuration (abridged; the in-memory user name is illustrative):

@Configuration
@EnableGlobalMethodSecurity(prePostEnabled = true)
public class SecurityConfiguration extends WebSecurityConfigurerAdapter {
    @Autowired
    public void globalUserDetails(AuthenticationManagerBuilder auth) throws Exception {
        auth.inMemoryAuthentication().withUser("user")     // hard-coded resource owner
                .password("password").roles("USER", "ADMIN");
    }

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.csrf().disable().authorizeRequests().anyRequest().authenticated().and().httpBasic();
    }

    @Override
    @Bean
    public AuthenticationManager authenticationManagerBean() throws Exception {
        return super.authenticationManagerBean();
    }
}

There are a few things to take note of:

  • We expose the AuthenticationManager bean so that our two authentication security adapters can share a single authentication manager.
  • Spring Security CSRF protection works seamlessly with JSP but is a hassle for a REST API. Because we want this sample app to be used as a base for users to develop their own applications, we turned CSRF off and added a CORS filter so that it can be used right away (a minimal CORS filter sketch follows).
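As an illustration, a permissive CORS filter could look like the sketch below; the actual filter in the sample may differ, and the wildcard origin is only suitable for a demo.

@Component
public class SimpleCorsFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Demo-only CORS headers; tighten the allowed origin in a real application.
        response.setHeader("Access-Control-Allow-Origin", "*");
        response.setHeader("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS");
        response.setHeader("Access-Control-Allow-Headers", "Authorization, Content-Type");
        if ("OPTIONS".equalsIgnoreCase(request.getMethod())) {
            response.setStatus(HttpServletResponse.SC_OK);   // answer pre-flight requests directly
        } else {
            chain.doFilter(request, response);
        }
    }
}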


We wrote one test scenario for each authorization grant type, following the OAuth 2 specification closely. Because Spring Security OAuth 2 is an implementation based on the Spring Security framework, our interest veered toward seeing how the underlying authentication and principal are constructed.

Before summarizing the outcome of the experiment, let's take a quick look at a few notes.

  • Most of the requests to the token provider endpoint were sent as POST requests, but they include the user credentials as parameters. Even though we put these credentials in the URL for convenience, never do this in your OAuth 2 client.
  • We created 2 endpoints, /resources/principal and /resources/roles, to capture the principal and authorities for OAuth 2 authentication.
Here is our setup:

[table: clients and resource owners used in the tests]

Here is what we found out for each grant type: Authorization Code, Client Credentials (for which the token carried no authority) and Resource Owner Password Credentials.
This result is pretty much as expected, except for Client Credentials. Interestingly, even though the client retrieves the OAuth 2 token with its client credentials, the approved request still does not carry any of the client's authorities, only the client credentials. I think this makes sense because the token from the Implicit grant type cannot be reused.

Monday, 11 January 2016

Should you mind your own business?

In a recent Lean Coffee retrospective, each member of our team was asked to raise one question or concern about our working environment. For me, the burning question was how much we should mind other people's business. After voicing my concern, I got a subtle response from the team, as people did not feel very comfortable expressing their thoughts on this controversial topic. Even setting aside the possibility of hiding one's real opinion to be politically correct, it is quite likely that many of us still do not know which attitude is more desirable in our working environment. Talking about personal preference, I have met people who truly believe in opposite sides of this question. Apparently, if a workplace mixes people with opposite mindsets, conflicts tend to happen more often.

However, this topic is rarely discussed in the workplace. Therefore, there is often a lack of consensus in the corporate environment regarding how much we should care about other people's work. As a consequence, some people may silently suffer while others may feel frustrated with the lack of communication and cooperation.

So, is there a right mindset that we should adopt in our working environment, or should we just let each person have it their own way and hope that they will slowly adapt to each other? Let's discuss whether we have any viable solution.

Mind your own business

Micro Management

Usually, the culprits who meddle with other people's work are the ones who do the managing. It is obvious that no one likes to be micro-managed, not even the micro-manager himself. Being told in detail what to do gives us limited space for innovation and self-learning, plus a bitter taste of not being trusted.

In real life, micro-management is rarely the optimal solution. Even in the best scenario, micro-management can only help to deliver the project with average quality and create a bunch of unhappy developers. Following instructions rather than relying on one's own thinking rarely creates excellent work. To be fair, it can help to rescue a project if the progress falls below expectation, but it will not help much in achieving excellence. It is even worse that a manager who accomplishes jobs through micro-management can become addicted to it and find it harder to place trust in his subordinates in the future.

To avoid misunderstanding

It is even harder to interfere with someone else's work if you do not happen to be a senior or the supervisor. In reality, an act of goodwill can be interpreted as an act of arrogance unless one has managed to earn a reputation in the workplace. Even if the contribution brings an obvious outcome, not every working environment encourages heroism or rock-star programmers. Moreover, some introverted folks may not feel comfortable with the attention that comes as a side effect.

Similar to the above, this heroic act is normally more welcome in a time of crisis but may not be very well received when the project has already stabilized.

Or mind other people's business as well

Reaching team goal

To be fair, no one bothers to care about someone else's work if it is not for the sake of the project. Of course, there are many poisonous managers who want to act busy by creating artificial pressure, but in this article, let's focus on the people who want to see the project succeed. These people sometimes walk out of their role profiles and just do whatever is necessary to get the job done.

Eventually, because project work is still teamwork, it may be more beneficial for each member of the team to think about and focus on the common goal rather than their individual mission. It is pretty hard to keep the information flowing and the components integrating smoothly if each member only focuses on fulfilling their own role. No matter how good the plan is, there will likely be some missing pieces, and those missing pieces need to be addressed as soon as possible to keep the project moving forward.

The necessary evil to get job done

Many successful entrepreneurs claim that the secret of their success is delegating tasks to capable staff and placing trust in them. It is definitely the best solution when you have capable staff. However, in reality, most of us are not entrepreneurs, and the people working on a project are chosen from the available resources rather than the best resources. There may be a time when a project is prioritized and granted the best resources available: when it is in a deep crisis. However, we would not want our project to go through that.

Therefore, from my point of view, the necessary evil here is a task-oriented attitude over a people-oriented attitude. We should value people and give everyone a chance for self-improvement, but still, task completion is the first priority. If the result of the work is not the first priority, it is hard to measure and to improve performance. I feel it is even more awkward to hamper performance for the sake of human well-being when the project is failing.

So, what is the best compromise?

I think the best answer is balance. Knowing that meddling with other people's work is risky but that project success is the ultimate priority, the best choice is to define a minimal acceptable performance and be ready to step in if the project is failing. Eventually, we do not work to fulfill our own tasks; we work together to make the project succeed. From the corporate point of view, personal success provides very little benefit other than your own well-being. However, don't let this consume you or become addicted to the feeling of being people's hero. Also, don't raise the bar too high; otherwise the environment will become stressful and people will feel less encouraged.

Monday, 6 July 2015

Designing database

Database design has evolved greatly over the last 10 years. In the past, it used to be the database analyst's job to fine-tune the SQL queries and database indexes to ensure performance. Nowadays, developers play a much more crucial role in ensuring the scalability of data.

The database design task, which used to be monotonous, has become an exciting task that requires a lot of creativity. In this short article, let's walk through a real-life example to see how database design has changed.


In this example, our business requirement is to build a database to store property information for the country. At first, we need to store any property and its landlords. If a property is being leased, the system needs to store the tenants' information as well. The system should also record activities on properties, including buying, selling and renting.

As in a typical database system, the user should be able to query properties by any information, like address, owner name, district, age, etc. The system needs to serve data for both real-time queries and reporting purposes.


It is pretty obvious that there are a few entities in this domain, like landlord, tenant, transaction and property. Landlord and tenant can be further analysed as persons acting in different roles. For example, one person can rent out his house and rent another house to live in, which means he can be the landlord of one property and the tenant of another. That leaves us with 3 major entities: person, property and transaction. The person and property entities have a many-to-many relationship with each other. A transaction links to one property and at least one person.

If we group some common attributes like occupation, district and building, it is possible to introduce some other sub-entities that may help to reduce redundancy in the information.

The era of relational database

If you are a developer trapped in the relational database era, the only viable choice for persistence is a relational database. Naturally, each entity would be stored in a table. If there is a relationship between 2 entities, they are likely to refer to each other by foreign keys.

With this setup, there is zero redundancy and every piece of information has a single source of truth. Obviously, it is the most efficient way in terms of storage.

There may be an issue here, as it is not easy to implement text searching. Whether 10 years ago or today, text search has never been supported well by relational databases. The SQL language provides some wildcard matching, but it is still very far from full-text search.

Assuming that you have completed the task of defining the database schema, the fine-tuning task is normally the job of database analysts; they will look into every individual query, view and index to increase performance as much as possible.

However, if the reader has spent years working on relational databases, it is quite easy to see the limits of this approach. A typical query may involve joining several tables. While it works well for a low number of records, the solution seems less feasible as the number of tables or the number of records in each table increases. All kinds of tweaks like data sharding, scaling vertically or adding indexes only help to increase performance up to a certain level. No magic can help if we are going to deal with hundreds of millions of records or join more than 10 tables.

Extension of relational database

To solve the issue, developers have tried several techniques that may scale better than a traditional relational database. Here are some of them:

Database explosion

This technique reverses the process of normalizing data in a relational database. For example, instead of joining the property table with the building, district or country tables, we simply copy all the columns of the relevant records into the main table. As a consequence, duplication and redundancy appear, and there is no longer a single source of truth for sub-entities like building, district and country. In exchange, the joining part of the SQL query is simplified.

Explosion is an expensive process that may take hours or even days to run. It sacrifices space and freshness of data in order to increase real-time query performance.

Adding document database

In this technique, the relational database is still the source of truth. However, to provide text search, important fields are extracted and stored in a document database. For example, knowing that users will search for people by age, gender and name, we can create a document containing this information plus the record id and store it in a Solr or Elasticsearch server.

Real-time queries to the system are first answered by searching in the document database. The answer, which includes a bunch of record ids, is then used by the relational database to load the records. In this approach, the document database acts like an external index that provides the text search capability.
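As an illustration, the two-step lookup could look like the sketch below, assuming a hypothetical documentIndex wrapper around the search engine client and a JPA EntityManager on the relational side:

public List<Person> findPeopleByProfile(String searchExpression) {
    // 1. Ask the document store (Solr/Elasticsearch) for the ids of matching records.
    //    documentIndex is a hypothetical wrapper around the search engine's client API.
    List<Long> ids = documentIndex.searchPersonIds(searchExpression);

    // 2. Load the full records from the relational database, the source of truth.
    return entityManager
            .createQuery("select p from Person p where p.id in :ids", Person.class)
            .setParameter("ids", ids)
            .getResultList();
}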

Storing the whole data to a noSQL database

The final choice is storing the data in an object-oriented or document database. However, this approach may add a lot of complexity to data maintenance.

To visualize this, we can store the whole property or person in the database. The property object contains its owner objects. In reverse, the owner object may include several property objects. In this case, it is quite a hassle to maintain the set of related objects when the data changes.

For example, if a person purchases a property, we need to go to the property object to update the owner information and go to that person's object to update the property information.

Combining relational database and noSQL database

The limits of existing methods

After scanning through the approaches mentioned above, let's try to find the limits of each approach.
  • A relational database normalizes data before storing it to avoid duplication and redundancy. However, by optimizing storage, it creates additional effort when retrieving the data. Considering that a database is normally limited by querying time, not storage, this doesn't seem to be a good trade-off.
  • Explosion reverses the normalizing process, but it cannot offer fresh data, as an explosion normally takes a long time to run. Comparing running an explosion with storing the whole entity object in an object-oriented database, the latter option may be easier to maintain.
  • Adding a document database offers text search, but I feel the roles should be reversed to improve scalability. A document database is faster for retrieval, while a relational database is better at describing relationships. Hence, it doesn't make sense to send record ids from the document database back to the relational database to retrieve the records. What happens if there are millions of record ids to be found? Retrieving those records from a noSQL database is typically faster than from a relational database.
  • As mentioned above, when these entities are inter-linked, there is no easy way to separate them out for storage in an object-oriented database.
Proposing a combination of relational and noSQL databases to store data

Thinking about these limits, I feel that the best way to store the data is to combine a relational and a document database. Both of them will act as sources of truth, each storing what it handles best. Here is an explanation of how we should split the data.

We store the data similarly to a traditional relational database but split the columns into 2 types:
  • Columns that store ids or foreign keys to other entity ids ("property_id", "owner_id", ...) or unique fields
  • Columns that store data ("name", "age", ...)
After this, we can remove any data column from the relational database schema. It is possible to keep some simple fields like "name" or "gender" if they help to give us some clues when looking at records. After that, we store the full entities in a document database. We should try to avoid making cross-references in the stored documents.

Explaining the approach by example

Let's try to visualize the approach by describing how we would implement some simple tasks:
  • Storing a new property owned by a user
    • Configure JPA to store only the name and id of each main entity, like person and property. Ignore data fields and sub-entities like building, district and country.
    • Store the property object in the relational database. As a result of the earlier step, only the id and name of the property are persisted.
    • Update the property object with the persisted ids.
    • Store the property object in the document database.
    • Store the owner object in the document database.
  • Querying a property directly
    • Send the query to the document database and retrieve the records.
  • Querying properties based on owner information
    • Send a query to the relational database to find all properties that belong to the owner.
    • Send a query to the document database to find these properties by their ids.
In the above steps, we want to store records in the relational database first because of the automatic id generation. With this approach, we have a very thin relational database that only captures the relationships among entities rather than the entities themselves. A minimal sketch of this flow follows.
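Here is a minimal, illustrative sketch of the split, assuming a Spring Data JPA style repository on the relational side and a hypothetical documentStore client for the document database:

@Entity
public class Property {
    @Id @GeneratedValue
    private Long id;
    private String name;            // simple hint column kept in the relational database

    @ManyToMany
    private Set<Person> owners;     // the relationship stays relational (a join table of ids)

    @Transient
    private Address address;        // data fields live only in the document database
    @Transient
    private int builtYear;
    // getters and setters omitted
}

// Storing a new property owned by a user (steps 1-5 above); propertyRepository and
// documentStore are hypothetical collaborators.
public void storeNewProperty(Property property, Person owner) {
    Property saved = propertyRepository.save(property);      // the relational insert generates the id
    documentStore.save("property", saved.getId(), saved);    // the full entity goes to the document DB
    documentStore.save("person", owner.getId(), owner);      // the owner document is updated as well
}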

Summary of the approach

Finally, let's summarize the new approach:
  • Treating main entities as independent records.
  • Treating sub-entities as complex properties, to be included as part of main entities.
  • Storing the id, name and foreign keys of main entities inside the relational database. The relational database serves as a bridge, linking independent objects in the noSQL database.
  • Storing main entities with updated ids to noSQL database.
  • Any CRUD operation will require committing to the 2 databases at the same time.
  • Off-loading the data storage task from the relational database while letting it do what it does best: storing relationships.
  • Best of both worlds: the text search and scalability of a noSQL database and the relational searching of a relational database.
  • Maintaining 2 databases.
  • No single source of truth. Any corruption in one of the two databases will cause data loss.
  • Code complexity.
Possible alternative
  • Storing data in a graph database that offers text search capability. This is quite promising as well, but I have not done any benchmark to prove its feasibility.


The solution is pretty complex, but I find it interesting because the scalability issue is solved at the code level rather than the database level. By splitting the data, we may tackle the root cause of the issue and be able to find a balance between performance and maintenance effort.

The complexity of this implementation is very high but there is no simple implementation for big data.

Thursday, 11 June 2015

Can java optimize empty array allocation?

Yesterday I came across a simple optimization case. Here is the original method:

public String[] getSomeArray() {
    if (nothing) {
        return new String[0];
    }
    // normal processing ignored for brevity
}

At first sight the allocation looks quite wasteful, and I was tempted to carry out some micro-optimization like:

private static final String[] EMPTY = new String[0];

public String[] getSomeArray() {
    if (nothing) {
        return EMPTY;
    }
    // normal processing ignored for brevity
}

However, another developer, Pedro, rang the bell: maybe Java can JIT away the allocation altogether, and this does look like a very reasonable JIT target.

Let's find out!

JMH to the rescue

@Benchmark
public void test1() {
    for (int i = 0; i < 10000; i++) {
        get1();
    }
}

public String[] get1() {
    return new String[0];
}

private static final String[] CONST = {};

@Benchmark
public void test2() {
    for (int i = 0; i < 10000; i++) {
        get2();
    }
}

public String[] get2() {
    return CONST;
}

A benchmark run gave the following result, which showed that the two methods ran at pretty much the same speed; therefore the actual allocation could indeed have been optimized away.

test$ java -jar target/benchmarks.jar -f 1
# JMH 1.9.3 (released 28 days ago)
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.test1

# Run progress: 0.00% complete, ETA 00:01:20
# Fork: 1 of 1
# Warmup Iteration   1: 3177146862.839 ops/s
# Warmup Iteration   2: 2969126090.532 ops/s
# Warmup Iteration  19: 3904120378.974 ops/s
# Warmup Iteration  20: 3368973982.889 ops/s
Iteration   1: 3273016452.646 ops/s
Iteration   2: 3720653112.375 ops/s
Iteration  19: 2940755393.888 ops/s
Iteration  20: 3490675218.425 ops/s

Result "test1":
  3150112425.866 ±(99.9%) 346620443.427 ops/s [Average]
  (min, avg, max) = (2526859466.365, 3150112425.866, 3790445537.196), stdev = 399168618.122
  CI (99.9%): [2803491982.439, 3496732869.293] (assumes normal distribution)

# JMH 1.9.3 (released 28 days ago)
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.test2

# Run progress: 50.00% complete, ETA 00:00:40
# Fork: 1 of 1
# Warmup Iteration   1: 2646209214.510 ops/s
# Warmup Iteration   2: 3014719359.164 ops/s
# Warmup Iteration  19: 3639571958.173 ops/s
# Warmup Iteration  20: 3127621392.815 ops/s
Iteration   1: 3464961418.737 ops/s
Iteration   2: 2827541432.787 ops/s
Iteration  19: 2888880315.543 ops/s
Iteration  20: 3109114933.979 ops/s

Result "test2":
  3048325924.714 ±(99.9%) 269904767.209 ops/s [Average]
  (min, avg, max) = (2523324876.886, 3048325924.714, 3573386254.596), stdev = 310822731.303
  CI (99.9%): [2778421157.505, 3318230691.923] (assumes normal distribution)

# Run complete. Total time: 00:01:20

Benchmark           Mode  Cnt           Score           Error  Units
MyBenchmark.test1  thrpt   20  3150112425.866 ± 346620443.427  ops/s
MyBenchmark.test2  thrpt   20  3048325924.714 ± 269904767.209  ops/s

To be super conservative, let's check the assembly (printed, for example, with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly):

  0x00000001051ffa99: movabs $0x11e65a2c8,%rbx  ;   {metadata({method} {0x000000011e65a2c8} 'get1' '()[Ljava/lang/String;' in 'org/sample/MyBenchmark')}

  0x00000001051ffaa3: and    $0x7ffff8,%edx
  0x00000001051ffaa9: cmp    $0x0,%edx

  0x0000000109ae1f51: movabs $0x122f3d440,%rbx  ;   {metadata({method} {0x0000000122f3d440} 'get2' '()[Ljava/lang/String;' in 'org/sample/MyBenchmark')}
  0x0000000109ae1f5b: and    $0x7ffff8,%eax
  0x0000000109ae1f61: cmp    $0x0,%eax

Now we see that the exact same native code was generated.

Case closed.

Java does optimize empty array allocation.

Happy Coding!

by Dapeng

Monday, 25 May 2015

Rethinking database schema with RDF and Ontology

When I joined the industry 10 years ago, my first project used a relational database. After that, my next project also used a relational database. And as you may guess, my next few projects also used relational databases. This went on for so long that I almost forgot that the table is just one format for storing data.

I only became interested in other kinds of databases 4 years ago, when my company slowly moved into Big Data analysis and knowledge management. Over these years, my exposure to RDF and Ontology has given me an urge to revisit and rethink the approach and principles behind building databases.

In the scope of this article, I will focus solely on the role of the database schema. Please do not worry if the terms RDF and Ontology sound unfamiliar to you; I will try my best to introduce these concepts in this article.


Graph as knowledge database

As everyone knows, creating a relational database starts with defining the database schema. It doesn't matter which database we choose, we still need to define the tables, columns and foreign keys before inserting data. Apparently, we need to decide what the data will look like before creating it.

However, this is not a suitable approach for building a knowledge database. Because the system is supposed to store future knowledge rather than current knowledge, it is impossible to figure out in advance what the data will look like. Therefore, we turned our focus away from relational databases and looked for other solutions.

There are many NoSQL databases that can support schema-less data, but most of them do not fit our requirement because we want to store linked data. Hence, a graph database seems to be the most sensible choice for us.

Resource Description Framework

A round of shopping in the market gave us an uneasy feeling, as there is no widely adopted standard for the query language. If we chose a graph database, we might end up writing a vendor-specific implementation, which doesn't seem to be a good strategy to start with.

In an attempt to find some common standard, we came across the Resource Description Framework (RDF), a W3C specification for modelling information in web resources. It seems to be the best choice for us because RDF comes with a very simple mechanism to describe links between resources. The only side effect is that every resource needs to be identified by a URI.

For example, to describe that Apple produces the iPhone 5, which is sold at 600 USD, we need 2 triples like the ones below (the URIs are illustrative):

<http://example.com/Apple> <http://example.com/produces> <http://example.com/iPhone_5>
<http://example.com/iPhone_5> <http://example.com/price> '600 USD'@en

We have no interest in using URIs as resource identifiers, but we still need to do so in order to comply with the standard. However, we cannot blame RDF, because it was invented for the web. The authors had a nice dream in which we could follow the URI of a resource to retrieve it. Obviously, this idea did not quite come true.

Leaving the original idea of RDF to one side, the major benefit for us is the query language, SPARQL. For example, to figure out the price of any Apple phone, the query would be (again with illustrative URIs):

select ?phone ?price where {
       <http://example.com/Apple> <http://example.com/produces> ?phone .
       ?phone <http://example.com/price> ?price
}

However, RDF is just a concept. In order to use RDF and SPARQL, we still need to find a decent API or implementation, preferably in Java. This led us to 2 popular choices, Sesame and Apache Jena. Both APIs are widely accepted and have several vendor implementations. Even better, some vendors provide implementations for both of them.
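For illustration, here is a small, hedged sketch using the Apache Jena API (assuming a recent org.apache.jena release); it builds the two example triples in an in-memory model and runs the SPARQL query above:

public void queryApplePhones() {
    Model model = ModelFactory.createDefaultModel();
    Resource apple  = model.createResource("http://example.com/Apple");
    Resource iphone = model.createResource("http://example.com/iPhone_5");
    Property produces = model.createProperty("http://example.com/produces");
    Property price    = model.createProperty("http://example.com/price");
    model.add(apple, produces, iphone);
    model.add(iphone, price, "600 USD");

    String sparql = "SELECT ?phone ?price WHERE { "
            + "<http://example.com/Apple> <http://example.com/produces> ?phone . "
            + "?phone <http://example.com/price> ?price }";
    try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
        ResultSet results = qe.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.next();
            System.out.println(row.get("phone") + " costs " + row.get("price"));
        }
    }
}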


In the example above, it is easy to see that to make a meaningful query, we need to know what the data looks like. Therefore, we still end up needing some kind of data schema. However, this schema should act more like meta-data than a data definition. Generally, we do not need the schema to be pre-defined before inserting data.

To address this issue, there are two strategies. People started by defining common vocabularies for RDF. Because most of the time we already know the relationships among resources but do not know the resources themselves, a vocabulary should be good enough to help form a SPARQL query. However, due to the wide scope of describing the world, no single vocabulary is enough to fully express every domain. Therefore, there are many domain-specific vocabularies.

While the first approach only tackles describing the possible relationships among resources, the second approach attempts to describe the resources as well. For example, to describe the resources in the previous example, we can define a class named smart phone and a class named manufacturer. After this, we can specify that a manufacturer can produce a phone (simplified, illustrative triples):

<SmartPhone> <type> <Class>
<Manufacturer> <type> <Class>
<Manufacturer> <canProduce> <SmartPhone>

The triples above form an Ontology. Compared with a vocabulary, we find an Ontology a more descriptive way of describing the data schema, because it can tell us which kind of relationship is applicable to which kind of resource. Therefore, we will not waste time figuring out whether iPhone 5 produces Apple or Apple produces iPhone 5.

Ontology plus RDF is a good choice for building a knowledge database. While the repository can be an RDF store that takes in any triple, we can build an ontology in a separate space to model the knowledge in the main repository.

From our point of view, it is better to use the ontology to form queries rather than for data validation, because this helps to slightly decouple data and schema. In practice, it is perfectly acceptable to insert data that does not fit the Ontology, as long as it does not contradict it.

For example, with the Ontology defined earlier, we should allow inserting knowledge like the following (illustrative triples for relationships the Ontology does not yet model):

<Apple> <foundedIn> '1976'
<Steve Jobs> <founderOf> <Apple>

but we can reject any knowledge that contradicts it, like:

<iPhone 5> <canProduce> <Apple>

With this approach, we can allow importing the data first and modelling it later.

Database Schema

A refreshing thought

Relating our approach to building a knowledge database back to relational databases makes me feel that the requirement to define the data schema before data insertion is driven by implementation. If we could temporarily forget practical concerns like data indexing, it would be possible to insert data first and define the data schema later.

This also brought me another thought: is it feasible to have more than one data schema for the same piece of data? The idea may sound awkward at first, but it is quite realistic if we look at the daunting task of modelling the world. For example, if we think of Obama as the US president, we may care about when he started serving and when he will leave office; but if we think of Obama as any other US citizen, then we care more about his date of birth, residential area, social security number, and so on. In this way, the schema serves as a perspective through which we inspect and modify a resource.

So, if I could travel back to the time when people were discussing a common query language for databases, I would suggest adding some features to enrich SQL, or introducing a new kind of query language that is less strict:

  • Allow insertion without a table definition. Automatically create the table definition from the insertion command parameters.
  • Make the id of a record unique per database rather than unique per table. A record can appear in many tables with the same id. An insertion command needs to specify which field is the id of the record.
  • The data definition is more generic, without size constraints (varchar and int instead of varchar(255) or int(11)).
  • The select command must comply with the table definition.
  • It is possible to share a field between two tables for the same record.
Before wrapping up this article, let's try a quick exercise of building a database that implements these extended features. The underlying storage system could be any schema-less engine, but we can use an RDF store for simplicity.

  • Insert Obama into the US citizen table with name, age and gender. The identifier field is name.
  • Insert Obama into the US president table with name, age and elected year. The identifier field is name.
  • Define the US citizen table with the fields name and age.
  • Define the US president table with name, age and elected year.
  • Selecting records from the US citizen table will only show name and age, as gender is not defined in the table definition.
  • Updating the Obama record in the US president table with a new age will affect the age in the US citizen table because it is a shared field.

Step 1
  • SQL: insert into citizen(name, gender, age) values ('Barack Obama', 'Male', 53)
  • Triples:
    • <Barack Obama> <type> <citizen>
    • <Barack Obama> <name> 'Barack Obama'
    • <Barack Obama> <gender> 'Male'
    • <Barack Obama> <age> 53
Step 2
  • SQL: insert into president(name, elected_year) values ('Barack Obama', 2009)
  • Triples:
    • <Barack Obama> <type> <president>
    • <Barack Obama> <elected_year> 2009
Step 3
  • SQL: create table citizen ('name' varchar, 'age' int, primary key ('name') )
  • Triples:
    • <citizen> <field> <name>
    • <citizen> <field> <age>
    • <citizen> <primary_key> <name>
Step 4
  • SQL: create table president ('name' varchar, 'elected_year' int, primary key ('name') )
  • Triples:
    • <president> <field> <name>
    • <president> <field> <elected_year>
    • <president> <primary_key> <name>
Step 5
  • SQL: select * from citizen
  • SPARQL: select ?record ?field_name ?field_value where {
    • ?record <type> <citizen>
    • <citizen> <field> ?field_name
    • ?record ?field_name ?field_value
  • }
 Step 6
  • SQL: update president set age=54 where name='Barack Obama'
  • Overwrite <Barack Obama> <age> 53 with <Barack Obama> <age> 54


I think some of the ideas above are a bit hard to digest if the audience is not familiar with RDF or Ontology. Still, I hope they can raise some more thoughts on bridging the gap between relational databases and knowledge databases.

Tuesday, 12 May 2015

First Agile impression

Last year, we had a mass recruitment of Java developers with various levels of experience. Unfortunately, in this part of the world (Asia) Agile has not been very widely adopted. Therefore, we ended up spending extra effort getting the new members familiar with Agile and XP.

Over the team forming process, which lasted about 6 months, most of the new team members provided positive feedback about working in an Agile environment. In this article, I would like to share one piece of that feedback with the audience. Enjoy!

It has been a quickfire six months working in an agile environment. This is my first foray into agile software development in nearly a  decade of working in software. Hearing and reading about it could not have prepared me for the actual experience of working in this peculiar environment. However, the overwhelming sentiment for me is - what a breath of fresh air!

The morning standups were unsettling, to put it mildly. Suddenly, you stopped having excuses for not having done anything the day before :). They take some getting used to, but slowly I’m learning that it is more a sharing session than a status update. Easier said than done, but I’m getting there.  Not sure what you're going to do today? Tell everyone - there’s always something to do.

I like the inherent openness that Agile brings to the table. Seemingly mundane things like outstanding tasks become more explicit and we are all the better for it. Finish a task at hand? No one's stopping you from going to the board and picking a new one. Stuck with a sticky issue? Bring it up in the standup and more often than not, offers of help can be expected. Oftentimes though, a quick holler is all that will be required.

Whole days dedicated to planning and retrospectives demand concentration and focus and more often than not, creativity. I have found that they mean less disruption during the iteration for the most important work of all - writing good code. Worthwhile, no?

We practice pair programming. For someone new to this, the intensity is unexpected as you try to  align yourself to your pair's thought process in a continuous back-and-forth cycle. Overall, I have found it to be quite draining, but I believe the upside cannot be underestimated. Each pairing session, even with the same colleague, seems to involve a whole new dynamic and the constant adjustment needed can be likened to a skill. Working in such close proximity can be a double-edged sword, though. Some friction is inevitable and I have often experienced a whole range of emotions whilst working through a pairing session. I try to manage these emotions and reflect upon them afterwards to understand why I’d felt the way I’d felt and how I could have done better. It is undeniably rewarding though - I have learnt a great deal about myself and my pairs during those sessions.

Test-driven development is standard practice here. Mastering, or rather, adhering to the red-green-refactor pattern seemed counter-intuitive at first, but it starts to make sense after a while. In my limited experience, writing tests becomes more fluid with practice. Since there is almost never an excuse for not providing test coverage for any code that will see the light of day (much less production code!), buckling down and writing that test will be a good habit to develop and one which I am convinced will prove to be an invaluable skill and an integral part in my journey to become a better developer.

Building software has never been quite so engaging and, dare I say, fun. Having a close-knit team definitely helps. Hopefully, we can keep the good momentum and spirit going as we welcome new members into our fold and the workload ramps up. It will not be easy, but nobody said it would be. Whatever the future holds, I await with bated breath. Onwards.

Zhi Liang