World Modelers INDRA service stack

Using the services

Below, SERVICE_HOST should be replaced by the address of the server on which the services are running.

Check that the service is running

This is a simple health endpoint that can be pinged to check that the service is running.

URL: http://SERVICE_HOST:8001/health
Method: GET
Output: {"state": "healthy", "version": "1.0.0"}
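
A minimal Python sketch of this check, using only the standard library. The helper names (health_url, parse_health, is_healthy) and the 5-second timeout are illustrative assumptions and are not part of the service itself:

```python
import json
import urllib.request


def health_url(service_host, port=8001):
    """Build the health endpoint URL for the given host."""
    return f"http://{service_host}:{port}/health"


def parse_health(body):
    """Return True if a response body reports the "healthy" state."""
    try:
        parsed = json.loads(body)
    except ValueError:
        # Not valid JSON (or not decodable text): treat as unhealthy.
        return False
    return isinstance(parsed, dict) and parsed.get("state") == "healthy"


def is_healthy(service_host, port=8001, timeout=5.0):
    """Ping the health endpoint; any connection problem counts as unhealthy."""
    try:
        with urllib.request.urlopen(health_url(service_host, port),
                                    timeout=timeout) as response:
            return parse_health(response.read())
    except OSError:
        # Covers refused connections, timeouts and HTTP error responses.
        return False
```

For example, is_healthy("localhost") would be expected to return True when the service is running locally on port 8001.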

Read text to produce INDRA Statements

Read a given text with a reader and return INDRA Statements (below, <reader> can be eidos, sofia or cwms). Note that for eidos specifically, a webservice parameter should also be passed which points to the address on which the Eidos web service is running (see “Setting up the Eidos service” below):

URL: http://SERVICE_HOST:8000/<reader>/process_text
Method: POST with JSON content header
Input parameters: {"text": "rainfall causes floods"}
Output: <indra statements json>
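
The request described above can be sketched in Python with the standard library. The function names are hypothetical, and the example value for the webservice parameter (an Eidos service on port 9000) is an assumption about the local setup:

```python
import json
import urllib.request

READERS = ("eidos", "sofia", "cwms")


def build_process_text_request(service_host, reader, text, webservice=None):
    """Build a JSON POST request for the <reader>/process_text endpoint."""
    if reader not in READERS:
        raise ValueError(f"reader must be one of {READERS}")
    payload = {"text": text}
    if webservice is not None:
        # Only needed for eidos: the address of the Eidos web service.
        payload["webservice"] = webservice
    return urllib.request.Request(
        f"http://{service_host}:8000/{reader}/process_text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def process_text(service_host, reader, text, webservice=None, timeout=600):
    """Send the request and return the decoded INDRA Statements JSON."""
    request = build_process_text_request(service_host, reader, text, webservice)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as process_text("localhost", "eidos", "rainfall causes floods", webservice="http://localhost:9000") would then return the statements JSON, assuming both services are reachable at those addresses.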

Submit curations

This endpoint takes a single curations parameter, which is a list of curations in the CauseMos JSON format for representing curations. This representation contains a corpus_id and a project_id as part of each curation entry, so these do not need to be specified separately.

URL: http://SERVICE_HOST:8001/submit_curations
Method: POST with JSON content header
Input parameters: {"curations": <list of curations>}
Output: {}

Persist curations for a given corpus on S3

The service caches curations locally but does not push curations submitted during runtime to S3 (having them on S3 can be useful if someone wants to access them as a file independent of the service). This endpoint allows pushing all the curations for a given corpus_id to S3.

URL: http://SERVICE_HOST:8001/save_curations
Method: POST with JSON content header
Input parameters: {"corpus_id": "<corpus id>"}
Output: {}

Update beliefs

This endpoint performs a lightweight belief re-calculation based on curations obtained so far. It takes a required corpus_id argument and an optional project_id argument. If a project_id is provided, beliefs are calculated based on project-specific curations; otherwise, all the curations for the given corpus are taken into account.

URL: http://SERVICE_HOST:8001/update_beliefs
Method: POST with JSON content header
Input parameters: {"corpus_id": "<corpus id>",
                   "project_id": "<project id>"}
Output: {"38ce0c14-2c7e-4df8-bd53-3006afeaa193": 0,
 "6f2b2d69-16af-40ea-aa03-9b3a9a1d2ac3": 0.6979166666666666,
 "727adb95-4890-4bbc-a985-fd985c355215": 0.6979166666666666}

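The optional project_id and the shape of the update_beliefs output can be sketched in Python as follows. The function names are hypothetical, and treating the response as a mapping from statement identifiers to belief scores is an assumption based on the example output above:

```python
def update_beliefs_payload(corpus_id, project_id=None):
    """Build the JSON body for update_beliefs; project_id is optional."""
    if not corpus_id:
        raise ValueError("corpus_id is required")
    payload = {"corpus_id": corpus_id}
    if project_id is not None:
        # Restrict the calculation to project-specific curations.
        payload["project_id"] = project_id
    return payload


def rank_beliefs(beliefs):
    """Return (key, belief) pairs sorted from highest to lowest belief."""
    return sorted(beliefs.items(), key=lambda item: item[1], reverse=True)


# Example output copied from the endpoint description above.
example_output = {
    "38ce0c14-2c7e-4df8-bd53-3006afeaa193": 0,
    "6f2b2d69-16af-40ea-aa03-9b3a9a1d2ac3": 0.6979166666666666,
    "727adb95-4890-4bbc-a985-fd985c355215": 0.6979166666666666,
}
ranked = rank_beliefs(example_output)
```

Sorting the output this way makes entries whose belief dropped to 0 easy to spot at the end of the list.
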
Re-assemble corpus

This endpoint runs a new assembly for a given corpus_id and project_id based on curations and dumps the results on S3. The project-specific statement dump appears as a sub-key under the corpus key base, as indra_models/<corpus id>/<project id>/statements.json.

URL: http://SERVICE_HOST:8001/run_assembly
Method: POST with JSON content header
Input parameters: {"corpus_id": "<corpus id>",
                   "project_id": "<project id>"}
Output: {}

Download curations

This endpoint allows downloading curations and the corresponding curated statements for a corpus. If a reader name is provided, the results are filtered to curations for statements that have the provided reader among their sources; otherwise, all curations and their corresponding statements are returned.

URL: http://SERVICE_HOST:8001/download_curations
Method: POST with JSON content header
Input parameters: {"corpus_id": "<corpus id>",
                   "reader": "<reader name>"}
Output: {"curations": <list of curations>,
         "statements": {"38ce0c14-2c7e-4df8-bd53-3006afeaa193": <stmt json>}}

Notify INDRA of a new reader output in DART

URL: http://SERVICE_HOST:8001/notify
Method: POST with JSON content header
Input parameters: {"identity": "eidos",
                   "version": "3.1.4",
                   "document_id": "38ce0c14-2c7e-4df8-bd53-3006afeaa193",
                   "storage_key": "uuid.ext"}
Output: {}

INDRA assemblies on S3

Access to the INDRA-assembled corpora requires credentials to the shared World Modelers S3 bucket “world-modelers”. Each INDRA-assembled corpus is available within this bucket, under the “indra_models” key base. Each corpus is identified by a string identifier (“corpus_id” in the requests above).

The corpus index

The list of corpora can be obtained either using S3’s list objects function or by reading the index.csv file which is maintained by INDRA. This index is a comma-separated values text file which contains one row for each corpus. Each row’s first element is a corpus identifier, and the second element is the UTC date-time at which the corpus was uploaded to S3. An example row in this file looks as follows:

test1_newlines,2020-05-08-22-34-29

where test1_newlines is the corpus identifier and 2020-05-08-22-34-29 is the upload date-time.
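
A minimal Python sketch of parsing the index, given the file content as text. The function name is hypothetical, the date-time format string is inferred from the example row, and the sketch assumes the file has no header row:

```python
import csv
import io
from datetime import datetime, timezone

# Inferred from the example row, e.g. 2020-05-08-22-34-29.
INDEX_TIME_FORMAT = "%Y-%m-%d-%H-%M-%S"


def parse_corpus_index(index_text):
    """Map each corpus identifier to its UTC upload date-time."""
    corpora = {}
    for row in csv.reader(io.StringIO(index_text)):
        if len(row) < 2:
            # Skip blank or incomplete rows.
            continue
        corpus_id, uploaded = row[0], row[1]
        corpora[corpus_id] = datetime.strptime(
            uploaded, INDEX_TIME_FORMAT).replace(tzinfo=timezone.utc)
    return corpora


index = parse_corpus_index("test1_newlines,2020-05-08-22-34-29\n")
```

Here index["test1_newlines"] is a timezone-aware datetime for 8 May 2020, 22:34:29 UTC.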

Structure of each corpus

Within the world-modelers bucket, under the indra_models key base, files for each corpus are organized under a subkey equivalent to the corpus identifier. For instance, all the files for the test1_newlines corpus are under the indra_models/test1_newlines/ key base. The files for each corpus are as follows:

  • statements.json: a JSON dump of assembled INDRA Statements. As of May 2020, each statement’s JSON representation is on a separate line in this file. Any corpus uploaded before that has a standard JSON structure. This is the main file that CauseMos needs to ingest for UI interaction.

  • raw_statements.json: a JSON dump of raw INDRA Statements. This file is typically not needed in downstream usage; however, the INDRA curation service needs to have access to it for internal assembly tasks.

  • metadata.json: a JSON file containing key-value pairs that describe the corpus. The standard keys in this file are as follows:

    • corpus_id: the ID of the corpus (redundant with the corresponding entry in the index).

    • description: a human-readable description of how the corpus was obtained.

    • display_name: a human-readable display name for the corpus.

    • readers: a list of the names of the reading systems from which statements were obtained in the corpus.

    • assembly: a dictionary identifying attributes of the assembly process with the following keys:

      • level: the level of resolution used to assemble the corpus (e.g., “location_and_time”).

      • grounding_threshold: the threshold (if any) which was used to filter statements by grounding score (e.g., 0.7).

    • num_statements: the number of assembled INDRA Statements in the corpus (i.e., in statements.json).

    • num_documents: the number of documents that were read by readers to produce the statements that were assembled.

    Note that any of these keys may be missing if unavailable, for instance, in the case of old uploads.

  • curations.json: a JSON file which persists curations as collected by INDRA. This is the basis of surfacing reader-specific curations in the download_curations endpoint (see above).
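
The file layout above can be handled with a few small Python helpers once the file contents have been downloaded. All function names here are hypothetical, downloading from S3 is left out, and the sketch assumes that a pre-May-2020 statements.json holds a single JSON list of statements:

```python
import json

KEY_BASE = "indra_models"


def corpus_key(corpus_id, filename):
    """Build the S3 key of a corpus file, e.g. statements.json."""
    return f"{KEY_BASE}/{corpus_id}/{filename}"


def load_statements(text):
    """Parse statements.json in either the old or the line-based layout."""
    text = text.strip()
    if not text:
        return []
    try:
        # Old layout (assumed): the whole file is one JSON document.
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Newer layout: one statement JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A line-based file with a single statement parses as one object.
    return parsed if isinstance(parsed, list) else [parsed]


def summarize_metadata(metadata):
    """Read the standard metadata keys, tolerating any that are missing."""
    assembly = metadata.get("assembly") or {}
    return {
        "corpus_id": metadata.get("corpus_id"),
        "display_name": metadata.get("display_name"),
        "readers": metadata.get("readers", []),
        "level": assembly.get("level"),
        "grounding_threshold": assembly.get("grounding_threshold"),
        "num_statements": metadata.get("num_statements"),
        "num_documents": metadata.get("num_documents"),
    }
```

Using .get for every metadata key reflects the note above that keys may be missing for old uploads.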

Setting up the services locally

These instructions describe setting up and using the INDRA service stack for World Modelers applications, in particular, as a back-end for the CauseMos UI.

The instructions below run each Docker container with the -d option which will run containers in the background. You can list running containers with their ids using docker ps and stop a container with docker stop <container id>.

Setting up the Eidos service

Clone the Eidos repo and cd to the Docker folder

git clone https://github.com/clulab/eidos.git
cd eidos/Docker

Build the Eidos docker image

docker build -f DockerfileRunProd . -t eidos-webservice

Run the Eidos web service and expose it on port 9000

docker run -id -p 9000:9000 eidos-webservice

Setting up the general INDRA service

Pull the INDRA docker image from DockerHub

docker pull labsyspharm/indra

Run the INDRA web service and expose it on port 8000

docker run -id -p 8000:8000 --entrypoint gunicorn labsyspharm/indra:latest \
    -w 1 -b :8000 -t 600 rest_api.api:app

Note that the -w 1 parameter specifies one service worker which can be set to a higher number if needed.

Setting up the INDRA live curation service

Assuming you already have the INDRA docker image, run the INDRA live curation service with the following parameters:

docker run -id -p 8001:8001 --env-file docker_variables --entrypoint \
python labsyspharm/indra /sw/indra/indra/tools/live_curation/live_curation.py

Here we use the --env-file option to provide a file containing environment variables to the container. In this case, we need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to allow the curation service to access World Modelers corpora on S3. The file content should look like this:

AWS_ACCESS_KEY_ID=<aws_access_key_id>
AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>

Replace <aws_access_key_id> and <aws_secret_access_key> with your AWS access key ID and secret access key.
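
If the two keys are already set as environment variables in your shell, the file can also be generated rather than written by hand. The following Python sketch does this; the helper names are hypothetical, and docker_variables is simply the file name used in the command above:

```python
import os

REQUIRED_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def render_env_file(environ=None):
    """Return the env-file content as NAME=value lines."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return "".join(f"{name}={environ[name]}\n" for name in REQUIRED_VARIABLES)


def write_env_file(path="docker_variables", environ=None):
    """Write the env file and restrict it to the current user."""
    content = render_env_file(environ)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    # The file holds secrets, so make it readable by the owner only.
    os.chmod(path, 0o600)
```

Calling write_env_file() in the directory from which docker run is invoked produces a docker_variables file in the format shown above.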